translating foreign data

Steven D'Aprano steve+comp.lang.python at pearwood.info
Fri Jun 22 06:45:27 EDT 2018


On Fri, 22 Jun 2018 11:14:59 +0100, Ben Bacarisse wrote:

>>> The code page remark is curious.  Will some "code pages" have digits
>>> that are not ASCII digits?
>>
>> Good question.  I have no idea.
> 
> It's much more of an open question than I thought.

Nah, Python already solves that for you:


py> import unicodedata
py> s = "১২৩৪৫.৬৭৮৯০"
py> for c in s:
...     print(unicodedata.name(c))
...
BENGALI DIGIT ONE
BENGALI DIGIT TWO
BENGALI DIGIT THREE
BENGALI DIGIT FOUR
BENGALI DIGIT FIVE
FULL STOP
BENGALI DIGIT SIX
BENGALI DIGIT SEVEN
BENGALI DIGIT EIGHT
BENGALI DIGIT NINE
BENGALI DIGIT ZERO
py> float(s)
12345.6789
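
This is not special to Bengali. As far as I know, float() and int() accept any character in the Unicode "decimal digit" category (Nd). Here is a minimal sketch to check that for a few scripts. The helper name describe_digits is made up for this illustration, and the sample strings are my own, not from Ethan's data:

```python
import unicodedata

def describe_digits(text):
    """Return (Unicode name, decimal value) pairs for each decimal digit in text."""
    return [
        (unicodedata.name(c), unicodedata.decimal(c))
        for c in text
        if c.isdecimal()  # True for characters in the Nd category
    ]

# Sample strings chosen for illustration only.
samples = {
    "Bengali": "\u09E7\u09E8\u09E9",       # digits one, two, three
    "Arabic-Indic": "\u0661\u0662\u0663",  # digits one, two, three
    "Devanagari": "\u096A\u096B\u096C",    # digits four, five, six
}

for script, text in samples.items():
    # int() converts the non-ASCII digits without any extra work.
    print(script, int(text), describe_digits(text))
```

Note that this covers the digits only. The decimal separator still has to be a plain ASCII full stop for float() to accept the string.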



Further to my earlier post, if you call:

for sep in ",\u00B7\u066B":
    mystring = mystring.replace(sep, '.')

before passing it to float, that ought to cover just about anything you 
will find in real-world data regardless of language. If Ethan finds 
something that isn't covered by those three cases (comma, middle dot and 
Arabic decimal separator) he'll likely need to consult an expert on that 
language.

That is provided Ethan doesn't have to deal with thousands separators as 
well. If he does, it gets complicated.
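
Wrapped up as a function, the replace-then-convert idea might look like the sketch below. The names to_float and DECIMAL_SEPARATORS are hypothetical, and the function assumes the input has no thousands separators at all, so a string like "1,234,567.89" is rejected rather than guessed at:

```python
# Comma, middle dot and Arabic decimal separator, as discussed above.
DECIMAL_SEPARATORS = ",\u00B7\u066B"

def to_float(text):
    """Convert text to a float, treating each known separator as a decimal point.

    Assumes there are no thousands separators in the input. If replacing the
    separators leaves more than one decimal point, float() raises ValueError.
    """
    for sep in DECIMAL_SEPARATORS:
        text = text.replace(sep, ".")
    return float(text)

print(to_float("3,14"))                      # comma as decimal separator
print(to_float("2\u00B75"))                  # middle dot as decimal separator
print(to_float("\u0663\u066B\u0661\u0664"))  # Arabic-Indic digits 3, 1, 4 with U+066B

try:
    to_float("1,234,567.89")
except ValueError as err:
    # Every comma became a decimal point, so the string no longer parses.
    print("rejected:", err)
```

Failing loudly on ambiguous input seems safer than silently guessing which comma is the decimal point.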


-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson



