Easy way to remove HTML entities from an HTML document?
Robert Oschler
no_replies at fake_email_address.invalid
Mon Jul 26 18:08:44 EDT 2004
"Christopher T King" <squirrel at WPI.EDU> wrote in message
news:Pine.LNX.4.44.0407251706010.20890-100000 at ccc6.wpi.edu...
>
> htmllib has this capability, but if you're not doing any other HTML
> parsing, a regex, coupled with htmllib's helper module, htmlentitydefs,
> does nicely:
>
> import re
> import htmlentitydefs
>
> def convertentity(m):
>     if m.group(1)=='#':
>         try:
>             return chr(int(m.group(2)))
>         except ValueError:
>             return '&#%s;' % m.group(2)
>     try:
>         return htmlentitydefs.entitydefs[m.group(2)]
>     except KeyError:
>         return '&%s;' % m.group(2)
>
> def converthtml(s):
>     return re.sub(r'&(#?)(.+?);',convert,s)
>
> converthtml('Some &lt;html&gt; string.') # --> 'Some <html> string.'
>
> Unknown or invalid entities are left in &xxx; format, and Unicode
> entities are left in &#nnn; format. If you want a Unicode string to be
> returned (and Unicode entities interpreted), replace 'chr' with 'unichr',
> and 'htmlentitydefs.entitydefs[m.group(2)]' with
> 'unichr(htmlentitydefs.name2codepoint[m.group(2)])'.
>
> Hope this helps.
>
Chris,
I believe the line that reads:
    def converthtml(s):
        return re.sub(r'&(#?)(.+?);',convert,s)
should read:
    def converthtml(s):
        return re.sub(r'&(#?)(.+?);',convertentity,s)
Once I made that change it worked like a charm. I'm showing the correction
for future Usenet searchers.
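For searchers on Python 3, here is a rough sketch of the corrected code with the Unicode variant Chris describes applied. It assumes Python 3, where htmlentitydefs is available as html.entities and chr() already returns Unicode characters, so it is an adaptation and not the code exactly as posted.

```python
import re
from html.entities import name2codepoint  # Python 3 home of htmlentitydefs (assumption)

def convertentity(m):
    """Return the character for one matched entity, or the entity unchanged."""
    if m.group(1) == '#':
        # Numeric entity such as &#65;
        try:
            return chr(int(m.group(2)))
        except (ValueError, OverflowError):
            # Not a decimal number, or outside the valid code point range
            return '&#%s;' % m.group(2)
    # Named entity such as &lt;
    try:
        return chr(name2codepoint[m.group(2)])
    except KeyError:
        return '&%s;' % m.group(2)

def converthtml(s):
    # re.sub calls convertentity once per match and substitutes its return value
    return re.sub(r'&(#?)(.+?);', convertentity, s)

print(converthtml('Some &lt;html&gt; string.'))
```

Unknown names and malformed numbers fall through unchanged, as in the original. If no custom handling is needed, Python 3.4 and later also ship html.unescape(), which covers similar ground.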
So you can pass a function to re.sub() as the replacement pattern? Very
cool, I didn't know that. I think you could spend a year just learning
regular expressions and still miss something.
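To show the function-as-replacement idea on its own, here is a minimal sketch. The function name double and the sample string are made up for illustration; the only real API used is re.sub(), which calls the function with a match object for each match and uses the returned string as the replacement.

```python
import re

def double(m):
    # m is the match object for one occurrence of the pattern;
    # whatever string is returned replaces that occurrence.
    return str(int(m.group(0)) * 2)

result = re.sub(r'\d+', double, '3 apples and 10 pears')
print(result)
```

Each run of digits is passed to double() in turn, so the replacement can depend on what was matched, which a plain replacement string cannot do.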
Thanks,
Robert.