Easy way to remove HTML entities from an HTML document?

Robert Oschler no_replies at fake_email_address.invalid
Mon Jul 26 18:08:44 EDT 2004


"Christopher T King" <squirrel at WPI.EDU> wrote in message
news:Pine.LNX.4.44.0407251706010.20890-100000 at ccc6.wpi.edu...
>
> htmllib has this capability, but if you're not doing any other HTML
> parsing, a regex, coupled with htmllib's helper module, htmlentitydefs,
> does nicely:
>
>  import re
>  import htmlentitydefs
>
>  def convertentity(m):
>      if m.group(1)=='#':
>          try:
>              return chr(int(m.group(2)))
>          except ValueError:
>              return '&#%s;' % m.group(2)
>      try:
>          return htmlentitydefs.entitydefs[m.group(2)]
>      except KeyError:
>          return '&%s;' % m.group(2)
>
>  def converthtml(s):
>      return re.sub(r'&(#?)(.+?);',convert,s)
>
>  converthtml('Some &lt;html&gt; string.')  # --> 'Some <html> string.'
>
> Unknown or invalid entities are left in &xxx; format, and Unicode entities
> are likewise left in &#nnn; format.  If you want a Unicode string to be
> returned (and Unicode entities interpreted), replace 'chr' with 'unichr',
> and 'htmlentitydefs.entitydefs[m.group(2)]' with
> 'unichr(htmlentitydefs.name2codepoint[m.group(2)])'.
>
> Hope this helps.
>

Chris,

I believe the line that reads:

def converthtml(s):
      return re.sub(r'&(#?)(.+?);',convert,s)

should read:

def converthtml(s):
      return re.sub(r'&(#?)(.+?);',convertentity,s)

Once I made that change it worked like a charm.  I'm showing the correction
for future Usenet searchers.
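
For reference, here is a sketch of the whole thing with that fix applied.
It assumes Python 3, where htmlentitydefs is named html.entities and chr()
covers all of Unicode, so it follows the name2codepoint variant Chris
describes.  The extra OverflowError catch is my own addition for absurdly
large numbers; treat this as a sketch, not as Chris's original code.

```python
import re
from html.entities import name2codepoint  # Python 3 name for htmlentitydefs


def convertentity(m):
    """Return the character for one matched entity, or the entity unchanged."""
    if m.group(1) == '#':
        # Numeric entity such as &#65;
        try:
            return chr(int(m.group(2)))
        except (ValueError, OverflowError):
            # Not a number, or outside the Unicode range: leave it alone
            return '&#%s;' % m.group(2)
    # Named entity such as &lt;
    try:
        return chr(name2codepoint[m.group(2)])
    except KeyError:
        # Unknown name: leave it alone
        return '&%s;' % m.group(2)


def converthtml(s):
    # re.sub() calls convertentity once per match and uses its return value
    return re.sub(r'&(#?)(.+?);', convertentity, s)
```

With this, converthtml('Some &lt;html&gt; string.') gives
'Some <html> string.', and an unknown entity such as &bogus; passes
through unchanged.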

So you can pass a function to re.sub() as the replacement argument?  Very
cool, I didn't know that.  I think you could spend a year just learning
regular expressions and still miss something.
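
A minimal sketch of that feature on its own, separate from the entity
code: re.sub() hands the function a match object for each occurrence and
substitutes whatever string it returns.  The function name double and the
sample text are made up for illustration.

```python
import re


def double(m):
    # m is the match object for one occurrence of the pattern;
    # the string returned here replaces that occurrence
    return str(int(m.group(0)) * 2)


result = re.sub(r'\d+', double, '3 apples and 12 pears')
print(result)  # 6 apples and 24 pears
```

This is handy whenever the replacement has to be computed from the matched
text rather than written as a fixed string.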


Thanks,
Robert.




