Easy way to remove HTML entities from an HTML document?

Christopher T King squirrel at WPI.EDU
Sun Jul 25 17:30:22 EDT 2004


On Sun, 25 Jul 2004, Robert Oschler wrote:

> Is there a module/function to remove all the HTML entities from an HTML
> document (e.g. - &nbsp, &amp, &apos, etc.)?

htmllib has this capability, but if you're not doing any other HTML 
parsing, a regex, coupled with htmllib's helper module, htmlentitydefs, 
does nicely:

 import re
 import htmlentitydefs

 def convertentity(m):
     if m.group(1)=='#':
         try:
             return chr(int(m.group(2)))
         except ValueError:
             return '&#%s;' % m.group(2)
     try:
         return htmlentitydefs.entitydefs[m.group(2)]
     except KeyError:
         return '&%s;' % m.group(2)

 def converthtml(s):
     return re.sub(r'&(#?)(.+?);',convertentity,s)

 converthtml('Some &lt;html&gt; string.')  # --> 'Some <html> string.'

Unknown or invalid entities are left in &xxx; format.  Numeric entities
that chr() cannot handle (code points above 255) are likewise left in
&#nnn; format.  If you want a Unicode string to be returned (and Unicode
entities interpreted), replace 'chr' with 'unichr', and
'htmlentitydefs.entitydefs[m.group(2)]' with
'unichr(htmlentitydefs.name2codepoint[m.group(2)])'.
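
For reference, here is a minimal sketch of the Unicode variant described
above, with both substitutions applied.  The names convertentity_unicode
and converthtml_unicode are made up for illustration, and the import
fallback is an assumption added so that the same code also runs on
Python 3, where htmlentitydefs became html.entities and chr() covers the
full Unicode range:

```python
import re

try:
    # Python 3: htmlentitydefs was renamed to html.entities, and chr()
    # already accepts any Unicode code point (assumption for portability).
    from html.entities import name2codepoint
    unichr = chr
except ImportError:
    # Python 2: the module name used in the message above.
    from htmlentitydefs import name2codepoint

def convertentity_unicode(m):
    # group(1) is '#' for numeric entities, '' for named ones.
    if m.group(1) == '#':
        try:
            return unichr(int(m.group(2)))
        except (ValueError, OverflowError):
            # Not a decimal number, or out of range: leave it untouched.
            return u'&#%s;' % m.group(2)
    try:
        return unichr(name2codepoint[m.group(2)])
    except KeyError:
        # Unknown entity name: leave it untouched.
        return u'&%s;' % m.group(2)

def converthtml_unicode(s):
    return re.sub(r'&(#?)(.+?);', convertentity_unicode, s)
```

With this version, converthtml_unicode(u'&#8364;5') should give
u'\u20ac5' (the euro sign), while something like u'&bogus;' comes back
unchanged.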

Hope this helps.



