decode Numeric Character References to unicode

Duncan Booth duncan.booth at invalid.invalid
Mon Feb 18 06:16:51 EST 2008


William Heymann <kosh at aesaeion.com> wrote:

> How do I decode a string back to useful unicode that has xml numeric
> character references in it?
> 
> Things like 占
> 
Try something like this:

import re
from htmlentitydefs import name2codepoint

name2codepoint = name2codepoint.copy()
name2codepoint['apos']=ord("'")

EntityPattern = re.compile('&(?:#(\d+)|(?:#x([\da-fA-F]+))|([a-zA-Z]+));')

def decodeEntities(s, encoding='utf-8'):
    def unescape(match):
        code = match.group(1)
        if code:
            return unichr(int(code, 10))
        else:
            code = match.group(2)
            if code:
                return unichr(int(code, 16))
            else:
                code = match.group(3)
                if code in name2codepoint:
                    return unichr(name2codepoint[code])
        return match.group(0)

    return EntityPattern.sub(unescape, s.decode(encoding))

Obviously if you really do only want numeric references you can take out 
the lines using name2codepoint and simplify the regex.



More information about the Python-list mailing list