decode Numeric Character References to unicode
Duncan Booth
duncan.booth at invalid.invalid
Mon Feb 18 06:16:51 EST 2008
William Heymann <kosh at aesaeion.com> wrote:
> How do I decode a string back to useful unicode that has xml numeric
> character references in it?
>
> Things like 占
>
Try something like this:
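import re
from htmlentitydefs import name2codepoint

# Work on a copy so the module-level table is not modified.
name2codepoint = name2codepoint.copy()
# &apos; is defined by XML but is missing from the HTML entity table.
name2codepoint['apos'] = ord("'")

# group 1: decimal reference, group 2: hex reference, group 3: named entity.
# Entity names may contain digits (e.g. frac12, sup2).
EntityPattern = re.compile(
    r'&(?:#(\d+)|(?:#x([\da-fA-F]+))|([a-zA-Z][a-zA-Z0-9]*));')

def decodeEntities(s, encoding='utf-8'):
    def unescape(match):
        code = match.group(1)
        if code:
            return unichr(int(code, 10))
        else:
            code = match.group(2)
            if code:
                return unichr(int(code, 16))
            else:
                code = match.group(3)
                if code in name2codepoint:
                    return unichr(name2codepoint[code])
        # Unknown entity name: leave the text as it was.
        return match.group(0)

    # Decode the byte string first, then substitute on the unicode result.
    return EntityPattern.sub(unescape, s.decode(encoding))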
Obviously if you really do only want numeric references you can take out
the lines using name2codepoint and simplify the regex.
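
As a rough sketch of that simplification, here is a numeric-only version. It is not from the original post: it assumes Python 3 (where chr replaces unichr and the input is already a str, so there is no encoding argument), and the names NUMERIC_REF and decode_numeric_refs are made up for illustration.

```python
import re

# Matches decimal (&#21344;) or hexadecimal (&#x5360;) references only.
# Named entities such as &amp; are deliberately not matched.
NUMERIC_REF = re.compile(r'&#(?:(\d+)|[xX]([0-9a-fA-F]+));')

def decode_numeric_refs(text):
    """Replace numeric character references in a str with their characters."""
    def unescape(match):
        dec, hexa = match.groups()
        try:
            return chr(int(dec, 10) if dec else int(hexa, 16))
        except (ValueError, OverflowError):
            # Code point outside the Unicode range: leave the text alone.
            return match.group(0)
    return NUMERIC_REF.sub(unescape, text)

print(decode_numeric_refs('&#21344; &#x5360; &amp;'))
```

Both references decode to the same character, and &amp; passes through unchanged. On Python 3.4 and later, html.unescape from the standard library handles numeric and named references together, so a hand-written regex is probably only worth it when named entities must be left alone.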