[Tutor] What kind of number is this

Mark Tolonen metolone+gmane at gmail.com
Sat Jul 25 18:09:33 CEST 2009


"Emad Nawfal (عماد نوفل)" <emadnawfal at gmail.com> wrote in message 
news:652641e90907250514m1566287aq75f675fd6336057f at mail.gmail.com...
> On 7/25/09, Dave Angel <davea at ieee.org> wrote:
>> Emad Nawfal (عماد نوفل) wrote:
>>> Hi Tutors,
>>> I have a bunch of text files that have many occurrences like the 
>>> following
>>> which I believe, given the context,  are numbers:
>>>
>>> &#1633;&#1640;&#1639;&#1634;
>>>
>>> &#1637;&#1639;
>>>
>>>  &#1634;&#1632;&#1632;&#1640;
>>>
>>> etc.
>>>
>>> So, can somebody please explain what kind of numbers these are, and how 
>>> I
>>> can get the original numbers back. The files are in Arabic and were
>>> downloaded from an Arabic website.
>>> I'm running python2.6 on Ubuntu 9.04

>> Those are standard html encodings for some Unicode characters. [snip]

You might find re.sub() useful for processing your text files.  It can replace 
each HTML numeric character reference with the Unicode character it encodes.

>>> import re
>>> data = u"&#1633;&#1640;&#1639;&#1634;&#1637;&#1639;&#1634;&#1632;&#1632;&#1640;"
>>> s = re.sub(r'&#(\d+);',lambda m: unichr(int(m.group(1))),data)
>>> s
u'\u0661\u0668\u0667\u0662\u0665\u0667\u0662\u0660\u0660\u0668'
>>> print s
١٨٧٢٥٧٢٠٠٨
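
For anyone on Python 3 rather than the 2.6 in the question, the same idea 
might look like the sketch below.  It assumes Python 3.4 or later: unichr() 
no longer exists (chr() does the job), and html.unescape() from the standard 
library decodes the references directly.  int() accepts Unicode decimal 
digits, which is one way to get an ordinary number back.  The helper name 
decode_refs is hypothetical, not something from this thread.

```python
import html
import re

def decode_refs(text):
    """Replace decimal references such as &#1633; with the character they encode."""
    return re.sub(r'&#(\d+);', lambda m: chr(int(m.group(1))), text)

data = "&#1633;&#1640;&#1639;&#1634;"

s = decode_refs(data)        # four ARABIC-INDIC DIGIT characters
same = html.unescape(data)   # standard-library route to the same string
number = int(s)              # int() understands Unicode decimal digits
```

html.unescape() also handles named entities such as &amp;amp;, so it is 
probably the simpler choice when the files contain more than numeric 
references.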

And this can be helpful for identifying Unicode characters:

>>> import unicodedata
>>> for c in s:
...  print unicodedata.name(c)
...
ARABIC-INDIC DIGIT ONE
ARABIC-INDIC DIGIT EIGHT
ARABIC-INDIC DIGIT SEVEN
ARABIC-INDIC DIGIT TWO
ARABIC-INDIC DIGIT FIVE
ARABIC-INDIC DIGIT SEVEN
ARABIC-INDIC DIGIT TWO
ARABIC-INDIC DIGIT ZERO
ARABIC-INDIC DIGIT ZERO
ARABIC-INDIC DIGIT EIGHT
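
If the goal is to rewrite the digits as ASCII while leaving the rest of the 
text untouched, unicodedata.decimal() gives the numeric value of each digit 
character.  A minimal sketch in Python 3 syntax, where to_ascii_digits is a 
hypothetical helper name and the sample string is made up for illustration:

```python
import unicodedata

def to_ascii_digits(text):
    """Map every Unicode decimal digit to its ASCII equivalent; keep other characters."""
    out = []
    for ch in text:
        value = unicodedata.decimal(ch, None)  # None when ch is not a decimal digit
        out.append(ch if value is None else str(value))
    return ''.join(out)

# Arabic-Indic digits for 2008, a space, then an Arabic word
mixed = '\u0662\u0660\u0660\u0668 \u0639\u0627\u0645'
converted = to_ascii_digits(mixed)
```

Only characters in the Unicode decimal-digit category are changed, so Arabic 
letters and punctuation pass through as they are.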

-Mark



