[Tutor] Stripping HTML tags.

Danny Yoo dyoo at hkn.eecs.berkeley.edu
Fri Apr 16 19:27:02 EDT 2004



On Fri, 16 Apr 2004, Alan Gauld wrote:

> > It works but seems a bit messy. Is there a neater way to do this ?
>
> Yes use the html parser module. There is some sample code that shows how
> to strip all tags to get plain text from an html file. And your code is
> less reliable(tags spanning lines, nested tags etc) than the html parser
> code...


Hi Dave,

Here's a concrete (but still far-fetched) example of why this sort of
stuff is hard to get right the first time:

"""
<h1>testing<h1>
<p>This is a te<div class="The expression '2>3' is True">st.</p>
"""


The regular expression:

###
import re
rematch = re.compile(r'<[^>]*>')   # naive: assumes a tag ends at the first '>'
###

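doesn't account for '>' characters inside quoted attribute values, so the
example above can trick it: the match for the <div> tag ends at the '>' in
'2>3', and the rest of the attribute leaks into the output as if it were
text.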
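
To see the failure concretely, here is a minimal sketch that applies the same
pattern to the second line of the example. The names `sample` and `stripped`
are just for illustration and are not from the original code.

```python
import re

# Same pattern as above: '<', any run of non-'>' characters, then '>'.
rematch = re.compile(r'<[^>]*>')

# Second line of the example; the attribute value contains a '>'.
sample = """<p>This is a te<div class="The expression '2>3' is True">st.</p>"""

# The match for the <div> tag stops at the '>' inside the quoted value,
# so the tail of the attribute survives the substitution.
stripped = rematch.sub('', sample)
print(stripped)   # This is a te3' is True">st.
```

The intended text was "This is a test.", so part of the attribute has ended
up in the result.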

Regex parsing of HTML can be slightly subtle, so it might be worth it to
invest some time with HTMLParser:

    http://www.python.org/doc/lib/module-HTMLParser.html
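
As a rough sketch of what that might look like: the code below assumes a
current Python 3, where the module lives at `html.parser` (it was a top-level
module named `HTMLParser` in the Python 2 of this thread). The class name
`TextExtractor` and the function `strip_tags` are made-up names for
illustration, not part of the library.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text between tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for runs of text outside of tags.
        self.chunks.append(data)


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()   # flush any text still buffered by the parser
    return ''.join(parser.chunks)


sample = """<p>This is a te<div class="The expression '2>3' is True">st.</p>"""
text = strip_tags(sample)
print(text)   # This is a test.
```

The parser understands quoted attribute values, so the '>' inside the class
attribute doesn't end the tag early. Note that this sketch keeps every piece
of text it sees, including the contents of <script> or <style> elements, so a
real page would probably need a bit more filtering.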


Good luck!



