Html character entity conversion

Duncan Booth duncan.booth at invalid.invalid
Tue Aug 1 09:06:01 EDT 2006


pak.andrei at gmail.com wrote:

> How can I convert encoded string
> 
> sEncodedHtmlText = '&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;'
> 
> into human readable:
> 
> sDecodedHtmlText  == 'привет питон'

How about:

>>> sEncodedHtmlText = 'text: &#1087;&#1088;&#1080;&#1074;&#1077;&#1090;&#1087;&#1080;&#1090;&#1086;&#1085;'
>>> import re
>>> def unescape(m):
    # m.group(0) looks like '&#1087;': drop '&#' and ';', then convert the number
    return unichr(int(m.group(0)[2:-1]))

>>> print re.sub('&#[0-9]+;', unescape, sEncodedHtmlText)
text: ???????????

I'm afraid my newsreader couldn't cope with either your original text or my
output, but I think this gives the string you wanted. You probably also
ought to decode sEncodedHtmlText to unicode first; otherwise anything that
isn't an entity escape will be converted to unicode using the default ASCII
encoding.
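
For anyone on Python 3, here is a minimal sketch of the same idea, with the
"decode first" step made explicit. It is an adaptation, not the code from the
session above: unichr becomes chr, and the names unescape_numeric_entities and
raw, as well as the choice of UTF-8 as the input encoding, are assumptions
made for illustration.

```python
import re


def unescape_numeric_entities(text):
    """Replace decimal references such as '&#1087;' with their characters."""
    def _replace(match):
        # group(1) holds only the digits between '&#' and ';'
        return chr(int(match.group(1)))
    return re.sub(r'&#([0-9]+);', _replace, text)


# Hypothetical input: bytes as they might arrive from a file or a socket.
raw = (b'text: &#1087;&#1088;&#1080;&#1074;&#1077;&#1090;'
       b'&#1087;&#1080;&#1090;&#1086;&#1085;')

# Decode to text first with an explicit encoding (UTF-8 is assumed here),
# then expand the numeric references.
decoded = unescape_numeric_entities(raw.decode('utf-8'))
print(decoded)
```

This only handles decimal references. Named entities such as &amp; and
hexadecimal ones such as &#x43F; are left untouched; the standard library's
html.unescape (Python 3.4 and later) covers those cases as well.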


