[Tutor] Stripping HTML tags.

Danny Yoo dyoo at hkn.eecs.berkeley.edu
Sat Apr 17 19:31:19 EDT 2004



On Sat, 17 Apr 2004, "Jörg Wölke" wrote:

> > Regex parsing of HTML can be slightly subtle, so it might be worth it
> > to invest some time with HTMLParser:
>
> Wrong, it's not "slightly subtle" or "tricky" to parse nested languages
> with regex - it's _impossible_.


I like to understate things.  *grin*


But it's possible to do limited processing of HTML using regular
expressions alone.  I should be careful to say that this processing
isn't really parsing --- it's more like "munging".

    http://www.catb.org/~esr/jargon/html/M/munge.html

Dave's original problem can be solved with just regular expressions, but
only because we're not trying to maintain or reconstruct the tree of
nested tags.  Stripping HTML tags correctly with just regular expressions
is really ugly, but it is doable.


Since Dave's doing the right thing by looking at HTMLParser, I guess it's
ok to now show how we can try to approach it with regexes:

###
"""A small function to try to strip HTML from a string, using just regular
expressions."""

import re
pattern = re.compile("""
    (

       <               ## start character,

                       ## followed by any number of the following
                       ## three cases:
       (
           ([^>"'])    ## Case 1: a non-closing, non-quote character

           |

           (           ## Case 2: a double-quoted string
               "
               [^"]*
               "
           )

           |

           (           ## Case 3: a single-quoted string
               '
               [^']*
               '
           )

       )*
       >               ## end character
    )
       """, re.VERBOSE)

def strip_html(s):
    """Removes HTML from a string s"""
    return pattern.sub('', s)
###

And this sorta works... but it's still not correct!  *grin* It doesn't
handle HTML comments correctly: a '>' inside a comment ends the match
early, so the rest of the comment leaks into the output.  Ooops.
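
To make the comment problem concrete, here is a rough sketch, not a
complete fix.  The names simple_tag, tag_or_comment and
strip_html_and_comments are made up for this example, and simple_tag is
just a compact spelling of the tag pattern above.  I'm assuming a comment
is simply "<!--" up to the nearest "-->", which is a simplification of the
real rules:

```python
import re

# Compact form of the tag pattern above: '<', then any mix of
# non-quote/non-'>' characters and quoted strings, then '>'.
simple_tag = re.compile(r"""<(?:[^>"']|"[^"]*"|'[^']*')*>""")

# Hypothetical extension: try a whole comment first, then fall back
# to an ordinary tag.  DOTALL lets a comment span several lines.
tag_or_comment = re.compile(r"""
    <!--.*?-->                              ## a comment, shortest match
    |
    <                                       ## otherwise a tag, as before
    (?: [^>"'] | "[^"]*" | '[^']*' )*
    >
    """, re.VERBOSE | re.DOTALL)

def strip_html_and_comments(s):
    """Removes tags and simple comments from a string s"""
    return tag_or_comment.sub('', s)

sample = 'x<!-- a > b -->y'
print(repr(simple_tag.sub('', sample)))        # the comment's tail leaks out
print(repr(strip_html_and_comments(sample)))   # the comment is gone
```

This still munges rather than parses, so things like the contents of
script elements are left alone.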


The regex approach above can be extended to work, and Tom Christiansen of
Perl fame wrote a very nice, comprehensive article that shows how to do it
correctly: the regular expressions installment of his "Far More Than
Everything You've Ever Wanted to Know About" series:

    http://www.perl.com/doc/FMTEYEWTK/regexps.html


But rather than go through all that pain, it might just be simpler to use
HTMLParser from the Standard Library.  *grin*
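
For comparison, a minimal sketch of the parser route.  It assumes Python 3,
where the module is spelled html.parser rather than HTMLParser; the names
TextExtractor and strip_html_with_parser are invented for this example.
The idea is just to collect the text between tags and ignore everything
else:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects character data and ignores tags and comments."""

    def __init__(self):
        # convert_charrefs=True turns entities like &amp; into text
        # before handle_data sees them.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_html_with_parser(s):
    """Returns the text content of the HTML string s"""
    parser = TextExtractor()
    parser.feed(s)
    parser.close()          # flush any text still buffered
    return ''.join(parser.chunks)

print(strip_html_with_parser(
    '<p>Hello <!-- a > b --><b title="1 > 0">world</b></p>'))
```

Comments and quoted attribute values are handled by the parser, so there's
no need to spell those cases out by hand.  Note that this sketch keeps the
text inside script and style elements too.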


Hope this helps!



