Easy way to remove HTML entities from an HTML document?
Christopher T King
squirrel at WPI.EDU
Sun Jul 25 17:30:22 EDT 2004
On Sun, 25 Jul 2004, Robert Oschler wrote:
> Is there a module/function to remove all the HTML entities from an HTML
> document (e.g. - &nbsp;, &amp;, &apos, etc.)?
htmllib has this capability, but if you're not doing any other HTML
parsing, a regex, coupled with htmllib's helper module, htmlentitydefs,
does nicely:
import re
import htmlentitydefs

def convertentity(m):
    # m.group(1) is '#' for numeric entities and '' for named ones
    if m.group(1)=='#':
        try:
            return chr(int(m.group(2)))
        except ValueError:
            # not a number, or outside chr()'s 0-255 range
            return '&#%s;' % m.group(2)
    try:
        return htmlentitydefs.entitydefs[m.group(2)]
    except KeyError:
        return '&%s;' % m.group(2)

def converthtml(s):
    return re.sub(r'&(#?)(.+?);',convertentity,s)

converthtml('Some &lt;html&gt; string.') # --> 'Some <html> string.'
Unknown or invalid entities are left in &xxx; format, and numeric entities
above 255 (which chr() cannot represent) are left in &#nnn; format. If you
want a Unicode string to be
returned (and Unicode entities interpreted), replace 'chr' with 'unichr',
and 'htmlentitydefs.entitydefs[m.group(2)]' with
'unichr(htmlentitydefs.name2codepoint[m.group(2)])'.
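For readers on Python 3, the Unicode variant described above might look
roughly like the sketch below. It assumes Python 3, where htmlentitydefs
is named html.entities and chr() covers the role of unichr(); returning
m.group(0) for anything unrecognized is a simplification of mine, not
part of the code above.

```python
import re
from html.entities import name2codepoint

def convertentity(m):
    """Return the character for one matched entity, or the entity unchanged."""
    try:
        if m.group(1) == '#':
            # decimal character reference, e.g. &#233;
            return chr(int(m.group(2)))
        # named entity, e.g. &lt;
        return chr(name2codepoint[m.group(2)])
    except (KeyError, ValueError, OverflowError):
        # unknown name, non-numeric reference, or out-of-range code point
        return m.group(0)

def converthtml(s):
    return re.sub(r'&(#?)(.+?);', convertentity, s)
```

On Python 3.4 and later, html.unescape() in the standard library does
this kind of conversion directly and is probably the simpler choice if
you don't need control over what happens to unknown entities.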
Hope this helps.