decode Numeric Character References to unicode

7stud bbxx789_05ss at yahoo.com
Mon Feb 18 06:53:16 EST 2008


On Feb 18, 3:20 am, William Heymann <k... at aesaeion.com> wrote:
> How do I decode a string back to useful unicode that has xml numeric character
> references in it?
>
> Things like &#21344;

BeautifulSoup can handle two of the three formats for HTML entities.
For instance, an 'o' with an umlaut can be written in three different
ways:

&ouml;
&#246;
&#xf6;

BeautifulSoup can convert the first two formats to unicode:

from BeautifulSoup import BeautifulStoneSoup as BSS

my_string = '&#21344;'
soup = BSS(my_string, convertEntities=BSS.XML_ENTITIES)
print soup.contents[0].encode('utf-8')
print soup.contents[0]

--output:---
<some asian looking character>

Traceback (most recent call last):
  File "test1.py", line 6, in ?
    print soup.contents[0]
UnicodeEncodeError: 'ascii' codec can't encode character u'\u5360' in
position 0: ordinal not in range(128)

The error message shows you the unicode string that BeautifulSoup
produced: u'\u5360'
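
To see why the first print works and the second fails, here is a small
sketch. It is written for Python 3 (an assumption; the code above is
Python 2), where chr() plays the role of unichr(). The first print
encodes to UTF-8 explicitly; the second presumably fails because
Python 2 falls back to the ascii codec when stdout has no other
encoding, and U+5360 is outside the ASCII range.

```python
# Python 3 sketch; names here are illustrative, not from the original post.
ch = chr(21344)            # the character referenced by &#21344;
assert ch == '\u5360'      # 21344 decimal == 0x5360 hex

# Explicit UTF-8 encoding always succeeds for this character.
utf8_bytes = ch.encode('utf-8')

# Encoding with the ascii codec fails, which is the error in the traceback.
try:
    ch.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```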

If that won't work for you, it's not hard to write your own conversion
function to handle all three formats:

1) Create a regex that will match any of the formats
2) Convert the first format using htmlentitydefs.name2codepoint
3) Convert the second format using unichr()
4) Convert the third format using int('0' + match, 16) (where match is
the text after '&#', e.g. 'xf6') and then unichr()
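
The four steps could be sketched roughly as follows. This is only one
possible version, written for Python 3 (an assumption): there
htmlentitydefs is named html.entities and unichr() is chr(). The names
ENTITY_RE, _to_char, _replace and decode_entities are made up for this
example. The regex captures the hex digits without the leading 'x', so
it calls int(digits, 16) directly instead of int('0' + match, 16).

```python
import re
from html.entities import name2codepoint  # htmlentitydefs in Python 2

# Step 1: one pattern for all three formats: &#xHEX; &#DECIMAL; &name;
ENTITY_RE = re.compile(
    r'&(?:#[xX]([0-9a-fA-F]+)|#([0-9]+)|([A-Za-z][A-Za-z0-9]*));'
)

def _to_char(codepoint, original):
    """Return the character for codepoint, or the original text if invalid."""
    try:
        return chr(codepoint)
    except (ValueError, OverflowError):
        return original

def _replace(match):
    hex_digits, dec_digits, name = match.groups()
    if hex_digits is not None:
        # Step 4: hexadecimal reference such as &#xf6;
        return _to_char(int(hex_digits, 16), match.group(0))
    if dec_digits is not None:
        # Step 3: decimal reference such as &#246;
        return _to_char(int(dec_digits), match.group(0))
    if name in name2codepoint:
        # Step 2: named entity such as &ouml;
        return chr(name2codepoint[name])
    # Unknown name: leave the text untouched.
    return match.group(0)

def decode_entities(text):
    """Replace every recognised entity in text with its character."""
    return ENTITY_RE.sub(_replace, text)
```

Calling decode_entities('&#21344;') should give the same one-character
string that BeautifulSoup produced above, and unknown names such as
'&bogus;' are passed through unchanged.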


