[Tutor] What kind of number is this
Mark Tolonen
metolone+gmane at gmail.com
Sat Jul 25 18:09:33 CEST 2009
"Emad Nawfal (عماد نوفل)" <emadnawfal at gmail.com> wrote in message
news:652641e90907250514m1566287aq75f675fd6336057f at mail.gmail.com...
> On 7/25/09, Dave Angel <davea at ieee.org> wrote:
>> Emad Nawfal (عماد نوفل) wrote:
>>> Hi Tutors,
>>> I have a bunch of text files that have many occurrences like the
>>> following
>>> which I believe, given the context, are numbers:
>>>
>>> &#1633;&#1640;&#1639;&#1634;
>>>
>>> &#1637;&#1639;
>>>
>>> &#1634;&#1632;&#1632;&#1640;
>>>
>>> etc.
>>>
>>> So, can somebody please explain what kind of numbers these are, and how
>>> I can get the original numbers back. The files are in Arabic and were
>>> downloaded from an Arabic website.
>>> I'm running python2.6 on Ubuntu 9.04
>> Those are standard html encodings for some Unicode characters. [snip]
You might find re.sub() useful for processing your text files. Given a
pattern that matches the HTML numeric character references and a replacement
function, it swaps each reference for the actual Unicode character:
>>> import re
>>> data = u"&#1633;&#1640;&#1639;&#1634;&#1637;&#1639;&#1634;&#1632;&#1632;&#1640;"
>>> s = re.sub(r'&#(\d+);',lambda m: unichr(int(m.group(1))),data)
>>> s
u'\u0661\u0668\u0667\u0662\u0665\u0667\u0662\u0660\u0660\u0668'
>>> print s
١٨٧٢٥٧٢٠٠٨
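For anyone reading this on Python 3, where unichr() is gone and print is a function, the same substitution might look like the sketch below. The helper name decode_numeric_refs is made up for illustration and is not from any library; it also accepts hexadecimal references such as &#x661;, which the pattern above does not match. The standard library's html.unescape() should give the same result for input like this and handles named entities as well.

```python
import html
import re

# Matches decimal (&#1633;) and hexadecimal (&#x661;) character references.
_NUMERIC_REF = re.compile(r'&#(?:[xX]([0-9a-fA-F]+)|(\d+));')

def decode_numeric_refs(text):
    """Replace numeric character references with the characters they name.

    Hypothetical helper for illustration; named entities such as &amp;
    are left untouched.
    """
    def _replace(match):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return chr(code_point)
    return _NUMERIC_REF.sub(_replace, text)

data = "&#1633;&#1640;&#1639;&#1634;"
s = decode_numeric_refs(data)
print(ascii(s))                    # shows the escaped code points
print(s == html.unescape(data))    # compare with the stdlib approach
```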
And this can be helpful for identifying Unicode characters:
>>> import unicodedata
>>> for c in s:
...     print unicodedata.name(c)
...
ARABIC-INDIC DIGIT ONE
ARABIC-INDIC DIGIT EIGHT
ARABIC-INDIC DIGIT SEVEN
ARABIC-INDIC DIGIT TWO
ARABIC-INDIC DIGIT FIVE
ARABIC-INDIC DIGIT SEVEN
ARABIC-INDIC DIGIT TWO
ARABIC-INDIC DIGIT ZERO
ARABIC-INDIC DIGIT ZERO
ARABIC-INDIC DIGIT EIGHT
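Since the original question was how to get the numbers back: these characters all carry the Unicode decimal-digit property, so once the text is decoded, int() should accept them directly. Below is a short sketch in Python 3 syntax; the function name to_ascii_digits is only illustrative, and it assumes you want every decimal digit, from any script, mapped to its ASCII counterpart.

```python
import unicodedata

def to_ascii_digits(text):
    """Map every Unicode decimal digit in text to its ASCII counterpart.

    Illustrative helper; characters that are not decimal digits pass
    through unchanged.
    """
    out = []
    for ch in text:
        value = unicodedata.decimal(ch, None)  # None for non-digits
        out.append(ch if value is None else str(value))
    return ''.join(out)

year = "\u0662\u0660\u0660\u0668"   # ARABIC-INDIC DIGITs TWO, ZERO, ZERO, EIGHT
print(int(year))                    # int() understands Unicode decimal digits
print(to_ascii_digits(year))        # same digits as an ASCII string
```

Use int() when the whole string is one number, and the mapping helper when the digits sit inside running text that you want to keep.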
-Mark