Stripping HTML tags from a string

William Park parkw at better.net
Thu May 3 10:28:49 EDT 2001


On Thu, May 03, 2001 at 12:18:22PM +0200, Alex Martelli wrote:
> "William Park" <parkw at better.net> wrote in message
> news:mailman.988837024.29166.python-list at python.org...
>     [snip]
> > Since others gave solutions using 'sgmllib', here is a solution using
> > 're' as requested:
> >
> > >>> pat = re.compile(r'<P\b|<BR\b', re.I)
> 
> Parsing HTML with regular expressions is always a rather unpleasant
> task since there are SO many durned 'irregularities' that may be
> validly present and that you still have to account for.  Here, for
> example, whitespace _might_ validly be present after the leading
> '<' and before the tagname, so one needs to put in \s? to cover
> for that.  Using sgmllib means reusing all the needed regular
> expressions _already_ developed and refined and tested and made
> very solid by LOTS of other long-term reuse.

Yes, but the original poster apparently didn't know how to use 're', and
asked for a regular expression solution.  The normal progression would be
to move on to a library module for his specific need, after he gets tired
of adjusting the regular expressions. (sigh!)
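
A minimal sketch of that progression, assuming a modern Python 3, where
'sgmllib' is no longer in the standard library and 'html.parser' stands in
for it.  The names TAG_RE, strip_tags_re, TextExtractor and
strip_tags_parser are made up for illustration; this is one way to do it,
not a definitive implementation.

```python
import re
from html.parser import HTMLParser

# Step 1: the 're' approach.  The pattern tolerates optional whitespace
# after '<' (the case raised above) and removes anything that looks like
# an opening or closing tag.  It knows nothing about comments or about
# '>' inside quoted attribute values.
TAG_RE = re.compile(r'<\s*/?\s*[A-Za-z][^>]*>')


def strip_tags_re(text):
    """Remove tag-like spans with a single regular expression."""
    return TAG_RE.sub('', text)


# Step 2: the library approach.  HTMLParser does the tokenizing; the
# subclass only collects the text found between tags.
class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text outside tags and comments.
        self.chunks.append(data)


def strip_tags_parser(text):
    """Remove tags by letting the standard-library parser find them."""
    parser = TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text still buffered
    return ''.join(parser.chunks)


if __name__ == '__main__':
    sample = '<a title="x>y">link</a>'
    print(repr(strip_tags_re(sample)))
    print(repr(strip_tags_parser(sample)))
```

Both functions agree on simple input such as '<p>Hello <b>world</b></p>'.
They part ways on the sample in the last lines: the regular expression
stops at the '>' inside the quoted attribute and leaves debris behind,
which is exactly the kind of adjusting that gets tiresome.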

--William Park, Open Geometry Consulting, Mississauga, Ontario, Canada.
  8 CPUs, Linux, Python, LaTeX, vim, mutt



