How do you htmlentities in Python

Thomas Jollans thomas at jollans.NOSPAM.com
Mon Jun 4 12:14:56 EDT 2007


"Adam Atlas" <adam at atlas.st> wrote in message 
news:1180965792.757685.132580 at q75g2000hsh.googlegroups.com...
> As far as I know, there isn't a standard idiom to do this, but it's
> still a one-liner. Untested, but I think this should work:
>
> import re
> from htmlentitydefs import name2codepoint
> def htmlentitydecode(s):
>    return re.sub('&(%s);' % '|'.join(name2codepoint), lambda m:
>         unichr(name2codepoint[m.group(1)]), s)
>

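'&(%s);' won't quite work: HTML (and, I assume, SGML, though not XHTML, 
since that is XML) allows you to omit the semicolon after an entity if it's 
followed by whitespace (IIRC). If that should be respected, the pattern 
looks more like this: 
r'&(%s)([;\s]|$)'
Note that the second group then also swallows the whitespace character, so 
the replacement has to put it back.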
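
For illustration, here is a minimal sketch of a decoder built around that 
idea. It is an assumption-laden variation, not a tested drop-in: it is 
written for Python 3 (where htmlentitydefs is called html.entities and 
unichr is chr), and it uses a lookahead for the whitespace case instead of 
the character class above, so the whitespace stays in the output.

```python
import re
from html.entities import name2codepoint  # Python 3 name of htmlentitydefs

# An entity name may be terminated by ';', by whitespace (which the
# lookahead leaves in place), or by the end of the string.
_entity_re = re.compile(r'&(%s)(?:;|(?=\s)|$)' % '|'.join(name2codepoint))

def htmlentitydecode(s):
    # name2codepoint maps names to integers, so convert with chr().
    return _entity_re.sub(lambda m: chr(name2codepoint[m.group(1)]), s)
```

With this, 'x &lt y' keeps its space after the decoded '<', while 
something like '&ampfoo' is left untouched because nothing terminates the 
entity name.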

Also, this completely ignores numeric character references, which are also 
found in XML (e.g. &#x20; or &#32; for ' '). Maybe some part of the 
HTMLParser module is useful, I wouldn't know. IMHO, these particular 
batteries aren't too commonly needed.
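
A rough sketch of how numeric references could be handled, again assuming 
Python 3; decode_numeric is a hypothetical helper name, not something from 
the standard library. It only accepts references that end in a semicolon.

```python
import re

# Matches hexadecimal (&#x20;) and decimal (&#32;) character references.
_numeric_re = re.compile(r'&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));')

def decode_numeric(s):
    def repl(m):
        hex_digits, dec_digits = m.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return chr(code)
        except (ValueError, OverflowError):
            # Not a valid code point: leave the reference as it was.
            return m.group(0)
    return _numeric_re.sub(repl, s)
```

For what it's worth, on Python 3.4 and later the standard library's 
html.unescape() decodes both named and numeric references, which is 
probably the simpler route where it is available.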

Regards,
Thomas Jollans 




