unescaping xml escape codes

Bengt Richter bokr at oz.net
Sun Aug 10 20:09:42 EDT 2003


On Sun, 10 Aug 2003 10:08:46 -0700, Daniel <dl-notspam-rubin at yahoo.com> wrote:

>I'm working with strings that contain xml escape codes, such as '&#48;'
>and need a way in python to unescape these back to their ascii
>representation, such as '&' but can't seem to find a python method for
>this. I tried xml.sax.saxutils.unescape(s), but while it works with
>'&amp;', it doesn't work with '&#48;' and other numeric codes. Any
>suggestions on how to decode the numeric xml escape codes such as this?
>Thanks.
>
Maybe just a regex sub function would do it for you? Do you just need the decimal
forms like above, or also the hex? If your coded entities are &#0; to &#255; or
&#x00; to &#xff; this might work. Entities with codes above 255 are converted to '?'.

If you want to do this properly, I think you have to parse the html a little and see
what the encoding is, and convert to unicode, and then do the conversions.
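
As a rough illustration of that "proper" route, here is a minimal sketch that decodes the bytes to unicode first and only then converts the numeric references. It assumes Python 3 (where str is unicode) and assumes the encoding is already known rather than sniffed from the document. The names NUMERIC_REF, _ref_to_char and unescape_numeric are made up for this example.

```python
import re

# Decimal (&#246;) or hex (&#xf6;) numeric character references.
NUMERIC_REF = re.compile(r'&#(x[0-9a-fA-F]+|[0-9]+);')

def _ref_to_char(m):
    body = m.group(1)
    code = int(body[1:], 16) if body.startswith('x') else int(body)
    try:
        return chr(code)
    except (ValueError, OverflowError):
        # Code point outside the Unicode range: leave the reference as-is.
        return m.group(0)

def unescape_numeric(raw, encoding='latin-1'):
    """Decode raw bytes (encoding is an assumption), then expand references."""
    text = raw.decode(encoding)
    return NUMERIC_REF.sub(_ref_to_char, text)

result = unescape_numeric(b'[&#48;] [&#246;] [&#x31;&#x32;&#x33;] &#x3c9;')
```

Because the substitution runs on unicode text, codes above 255 such as &#x3c9; come out as real characters instead of '?'.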

Very little tested!!
====< cvthtmlent.py >======================================
import re
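# Match decimal (&#246;) or hex (&#xf6;) numeric character references.
# Hex digits are only accepted after the 'x' prefix, so '&#ff;' is left alone.
rxo = re.compile(r'&#(x[0-9a-fA-F]+|[0-9]+);')
def ent2chr(m):
    code = m.group(1)
    if code.isdigit(): code = int(code)
    else: code = int(code[1:], 16)   # strip the leading 'x'
    if code<256: return chr(code)
    else: return '?' #XXX unichr(code).encode('utf-16le') ??

def cvthtmlent(s): return rxo.sub(ent2chr, s)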

if __name__ == '__main__':
    import sys; args = sys.argv[1:]
    if args:
        arg = args.pop(0)
        if arg == '-test':
            print(cvthtmlent(
                'blah [&#48;] blah [&#246;] blah [&#x31;&#x32;&#x33;] &#x3c9;'))
        else:
            if arg == '-': fi = sys.stdin
            else: fi = open(arg)
            for line in fi:
                sys.stdout.write(cvthtmlent(line))
===========================================================
If you run this in IDLE, you can see the umlaut, but not the omega (code 0x3c9, above 255), which becomes a '?'.

Martin can tell you the real scoop ;-)

 >>> from cvthtmlent import cvthtmlent as cvt
 >>> print cvt('blah [&#48;] blah [&#246;] blah [&#x31;&#x32;&#x33;] &#x3c9;')
 blah [0] blah [ö] blah [123] ?
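
For comparison, a short sketch using only the standard library, assuming Python 3.4 or later (html.unescape does not exist in the Python of this thread). It shows why xml.sax.saxutils.unescape leaves numeric codes alone, and that html.unescape expands them, including the omega.

```python
import html
from xml.sax.saxutils import unescape as sax_unescape

s = 'blah [&#48;] blah [&#246;] blah [&#x31;&#x32;&#x33;] &#x3c9;'

# saxutils.unescape only replaces &amp;, &lt; and &gt; by default,
# so the numeric reference passes through unchanged.
sax_result = sax_unescape('&amp; &#48;')

# html.unescape handles named and numeric (decimal or hex) references.
html_result = html.unescape(s)
```

Note that html.unescape also expands named HTML entities, which may be more than you want for plain XML.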

Regards,
Bengt Richter



