Stripping HTML tags from a string

Alex Martelli aleaxit at yahoo.com
Thu May 3 06:18:22 EDT 2001


"William Park" <parkw at better.net> wrote in message
news:mailman.988837024.29166.python-list at python.org...
    [snip]
> Since others gave solutions using 'sgmllib', here is a solution using
> 're' as requested:
>
> >>> pat = re.compile(r'<P\b|<BR\b', re.I)

Parsing HTML with regular expressions is always a rather unpleasant
task, since there are SO many durned 'irregularities' that may be
validly present and that you still have to account for.  Here, for
example, whitespace _might_ validly be present after the leading
'<' and before the tagname, so one needs to put in \s* (any run of
whitespace, not just a single character) to cover for that.  Using
sgmllib means reusing all the needed regular expressions, _already_
developed, refined, tested, and made very solid by LOTS of long-term
reuse by others.
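
To make the point concrete, here is a minimal sketch, not a definitive
implementation.  It takes the premise above (whitespace may follow the
'<') at face value and compares the quoted pattern with a \s* variant.
It then shows a parser-based stripper.  Note the swap: sgmllib is gone
from Python 3, so the sketch assumes the standard html.parser module
instead.  The names strict, lenient, TextOnly and strip_tags are
hypothetical, made up for this illustration.

```python
import re
from html.parser import HTMLParser

# Pattern from the quoted message: finds '<P' or '<BR' only when the
# tag name follows the '<' immediately.
strict = re.compile(r'<P\b|<BR\b', re.I)

# Variant that tolerates any run of whitespace between '<' and the name.
lenient = re.compile(r'<\s*(?:P|BR)\b', re.I)

sample = '< p>one<BR>two'
strict_hits = strict.findall(sample)    # misses the '< p' form
lenient_hits = lenient.findall(sample)  # finds both forms


class TextOnly(HTMLParser):
    """Hypothetical helper: keep character data, discard the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for each run of text between tags.
        self.chunks.append(data)


def strip_tags(html):
    """Return the text of `html` with the markup removed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return ''.join(parser.chunks)
```

Calling strip_tags('<p>Hello <b>world</b><br>bye</p>') leaves only
the text between the tags.  All the tag-recognition rules live inside
the library's parser, so there is no hand-written pattern to keep
patching for each new irregularity.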


Alex





