decode Numeric Character References to unicode

Mon Feb 18 06:16:51 EST 2008

William Heymann <kosh at aesaeion.com> wrote:

> How do I decode a string back to useful unicode that has xml numeric
> character references in it?
> 
> Things like &#21344;
> 
Try something like this:

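import re
from htmlentitydefs import name2codepoint

# Work on a copy so that adding 'apos' (an XML entity that is missing from
# the HTML table) does not modify the shared module-level dictionary.
name2codepoint = name2codepoint.copy()
name2codepoint['apos'] = ord("'")

# group 1: decimal reference, group 2: hex reference, group 3: entity name.
# Names may contain digits (frac12, sup2), so allow them after the first letter.
EntityPattern = re.compile(
    r'&(?:#(\d+)|#[xX]([\da-fA-F]+)|([a-zA-Z][a-zA-Z0-9]*));')

def decodeEntities(s, encoding='utf-8'):
    """Decode the byte string s, then replace entity references in it."""
    def unescape(match):
        decimal, hexadecimal, name = match.groups()
        if decimal:
            return unichr(int(decimal, 10))
        if hexadecimal:
            return unichr(int(hexadecimal, 16))
        if name in name2codepoint:
            return unichr(name2codepoint[name])
        # Unknown entity name: leave the text unchanged.
        return match.group(0)

    return EntityPattern.sub(unescape, s.decode(encoding))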
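
For anyone on Python 3, the same approach might look roughly like the sketch below. It is only an illustration under a few assumptions: the names ENTITIES, ENTITY_PATTERN and decode_entities are made up for this example, the input is already a str (so there is no decode step), and out-of-range code points are left untouched rather than raising.

```python
import re
from html.entities import name2codepoint  # Python 3 home of htmlentitydefs

# Hypothetical names for this sketch. Copy the table so that adding 'apos'
# does not modify the shared module-level dictionary.
ENTITIES = dict(name2codepoint)
ENTITIES['apos'] = ord("'")

# group 1: decimal digits, group 2: hex digits, group 3: entity name.
ENTITY_PATTERN = re.compile(
    r'&(?:#([0-9]+)|#[xX]([0-9a-fA-F]+)|([a-zA-Z][a-zA-Z0-9]*));')

def decode_entities(text):
    """Replace numeric and named references in a str with their characters."""
    def unescape(match):
        decimal, hexadecimal, name = match.groups()
        try:
            if decimal:
                return chr(int(decimal, 10))
            if hexadecimal:
                return chr(int(hexadecimal, 16))
        except (ValueError, OverflowError):
            # Code point outside the Unicode range: keep the reference as is.
            return match.group(0)
        if name in ENTITIES:
            return chr(ENTITIES[name])
        # Unknown entity name: keep the original text.
        return match.group(0)

    return ENTITY_PATTERN.sub(unescape, text)

print(decode_entities('&lt;b&gt; &amp; &#65;&#x42;'))  # prints: <b> & AB
```

With this sketch, both '&#21344;' and '&#x5360;' come back as the single character U+5360, and an unknown name such as '&bogus;' passes through unchanged.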

If you really do want only numeric references, you can take out the
lines that use name2codepoint and simplify the regex accordingly.
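
A numeric-only variant could be sketched as follows. This is written for Python 3, and the names NUMERIC_REFERENCE and decode_numeric_references are invented for the example. Treat it as one possible simplification, not the only one.

```python
import re

# Matches decimal (&#21344;) or hexadecimal (&#x5360;) references only.
# group 1: decimal digits, group 2: hex digits.
NUMERIC_REFERENCE = re.compile(r'&#(?:([0-9]+)|[xX]([0-9a-fA-F]+));')

def decode_numeric_references(text):
    """Replace numeric character references in a str; leave named ones alone."""
    def unescape(match):
        decimal, hexadecimal = match.groups()
        try:
            if decimal:
                return chr(int(decimal, 10))
            return chr(int(hexadecimal, 16))
        except (ValueError, OverflowError):
            # Not a valid Unicode code point: keep the reference as written.
            return match.group(0)

    return NUMERIC_REFERENCE.sub(unescape, text)
```

Named entities such as '&amp;' are deliberately left as they are here, which matters if the result is going to be fed to an XML parser afterwards.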