decode Numeric Character References to unicode

7stud bbxx789_05ss at
Mon Feb 18 06:53:16 EST 2008

On Feb 18, 3:20 am, William Heymann <k... at> wrote:
> How do I decode a string back to useful unicode that has xml numeric character
> references in it?
> Things like &#21344;

BeautifulSoup can handle two of the three formats for html entities.
For instance, an 'o' with umlaut can be represented in three different
ways:

&ouml;
&#246;
&#xf6;

BeautifulSoup can convert the first two formats to unicode:

from BeautifulSoup import BeautifulStoneSoup as BSS

my_string = '&#21344;'
soup = BSS(my_string, convertEntities=BSS.XML_ENTITIES)
print soup.contents[0].encode('utf-8')  # explicit encoding: prints fine
print soup.contents[0]                  # implicit ascii encoding: fails

<some asian looking character>

Traceback (most recent call last):
  File "", line 6, in ?
    print soup.contents[0]
UnicodeEncodeError: 'ascii' codec can't encode character u'\u5360' in
position 0: ordinal not in range(128)

The error message shows you the unicode string that BeautifulSoup
produced: u'\u5360'
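
To see why the first print works and the second one blows up, here is a
minimal sketch of the two encodings side by side. The variable names
(ch, utf8_bytes, ascii_ok) are only illustrative, and it is written so
that it also runs on newer Python versions:

```python
ch = u'\u5360'  # the character BeautifulSoup produced (decimal 21344)

# Encoding explicitly to UTF-8 always succeeds and gives three bytes here.
utf8_bytes = ch.encode('utf-8')

# The ascii codec only covers code points 0-127, so it rejects the character.
try:
    ch.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

So the conversion itself succeeded; only the implicit ascii encoding done
by print failed.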

If that won't work for you, it's not hard to write your own conversion
function to handle all three formats:

1) Create a regex that will match any of the formats
2) Convert the first format using htmlentitydefs.name2codepoint
3) Convert the second format using unichr()
4) Convert the third format using int('0' + match, 16) and then unichr()
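
The steps above could be sketched roughly as follows. This is only one
way to do it, not a tested library routine: the names ENTITY_RE,
_replace and decode_entities are made up for the example, and the
try/except import is an assumption added so the same code also runs on
Python 3, where htmlentitydefs is html.entities and unichr() is chr().
The regex captures just the hex digits, so it calls int(digits, 16)
instead of int('0' + match, 16); the result is the same.

```python
import re

try:
    from htmlentitydefs import name2codepoint   # Python 2
except ImportError:
    from html.entities import name2codepoint    # Python 3
    unichr = chr

# 1) One regex for all three formats: &#xf6;  &#246;  &ouml;
ENTITY_RE = re.compile(
    r'&(?:#[xX]([0-9a-fA-F]+)|#([0-9]+)|([A-Za-z][A-Za-z0-9]*));')

def _replace(match):
    hex_digits, dec_digits, name = match.groups()
    try:
        if hex_digits:
            # 4) hexadecimal reference
            return unichr(int(hex_digits, 16))
        if dec_digits:
            # 3) decimal reference
            return unichr(int(dec_digits))
        if name in name2codepoint:
            # 2) named entity
            return unichr(name2codepoint[name])
    except (ValueError, OverflowError):
        pass  # code point out of range: fall through and keep the text
    # Unknown name or unusable number: leave the reference untouched.
    return match.group(0)

def decode_entities(text):
    """Return text with character references replaced by characters."""
    return ENTITY_RE.sub(_replace, text)
```

For example, decode_entities('&ouml; &#246; &#xf6;') gives the same 'o'
with umlaut three times. On Python 2 you would want to pass in a unicode
string so the result is unicode throughout.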
