[Tutor] re question

Fri Aug 8 22:17:04 EDT 2003

Jeff Shannon wrote:

> tpc at csua.berkeley.edu wrote:
>
>> hello Jonathan, you should use re.findall as re.match only returns the
>> first instance.  By the way I would recommend the htmllib.HTMLParser
>> module instead of reinventing the wheel.
>>
>
> Indeed, it's not just reinventing the wheel.  Regular expressions, by 
> themselves, are insufficient to do proper HTML parsing, because re's 
> don't remember state and can't deal with nested/branched data 
> structures (which HTML/XML/SGML are).  As someone else pointed out, 
> you're likely to grab too much, or not enough.  Anybody seriously 
> trying to do anything with HTML should be using HTMLParser, *not* re.
>
Hmm...

I looked through the library docs on this, and tried to do it with re's 
because figuring out how to use HTMLParser looked like more work than 
using re's -- three hours of documentation searching to avoid one hour 
of reinventing the wheel.

What I'd like to do is:

1. Find all instances of such-and-such between two tags (for which I've
   received a helpful response).
2. Strip out all (or possibly all-but-whitelist) tags from an HTML page
   (substitute "" for "<.*?>" over multiple lines?).
3. Iterate over links / images and selectively change the targets /
   sources (which would take me a lot of troubleshooting to do with RE).
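
For reference, here is roughly what the re version of the first two items
looks like. This is only a sketch: the sample string and the variable
names are made up for illustration, and it assumes a Python recent enough
that re.sub accepts a flags keyword. re.DOTALL is what lets "." match
across line breaks.

```python
import re

# Hypothetical sample input; note the line break inside the first <b> element.
html = """<p>First <b>bold
text</b> and <b>more</b>.</p>"""

# Item 1: everything between <b> and </b>. The non-greedy ".*?" stops at the
# first closing tag, and re.DOTALL lets "." match newlines as well.
bold = re.findall(r"<b>(.*?)</b>", html, re.DOTALL)

# Item 2: substitute "" for "<.*?>" over multiple lines.
stripped = re.sub(r"<.*?>", "", html, flags=re.DOTALL)

print(bold)
print(stripped)
```

This works on tidy input, but it will grab too much or too little as soon
as a ">" appears inside an attribute value or a comment, which is the
objection raised above.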

Possibly other things; I'm not trying to compete with HTMLParser, just 
to get some basic functionality. I'd welcome suggestions on how to do 
this with HTMLParser.
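
To make the question concrete, this is the sort of thing I have in mind
for items 2 and 3. It is a sketch under assumptions, not a tested recipe:
it assumes a current Python 3, where the parser lives in html.parser (the
thread refers to htmllib.HTMLParser), and the names TagFilter,
filter_html, keep and rewrite are my own inventions, not part of the
library.

```python
from html import escape
from html.parser import HTMLParser


class TagFilter(HTMLParser):
    """Keep whitelisted tags, drop all others, and rewrite link/image URLs."""

    def __init__(self, keep=(), rewrite=None):
        # convert_charrefs=False so entities such as &amp; are passed through
        # to handle_entityref/handle_charref instead of being decoded.
        super().__init__(convert_charrefs=False)
        self.keep = set(keep)                         # tag names to preserve
        self.rewrite = rewrite or (lambda url: url)   # URL -> URL callback
        self.out = []                                 # output fragments

    def handle_starttag(self, tag, attrs):
        if tag not in self.keep:
            return  # tag is not whitelisted: drop it, keep its text content
        parts = [tag]
        for name, value in attrs:
            if value is None:  # attribute without a value
                parts.append(name)
                continue
            if (tag, name) in (("a", "href"), ("img", "src")):
                value = self.rewrite(value)
            # The parser hands over unescaped values, so escape them again.
            parts.append('%s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<%s>" % " ".join(parts))

    def handle_endtag(self, tag):
        if tag in self.keep:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def filter_html(text, keep=(), rewrite=None):
    """Run text through TagFilter and return the filtered markup."""
    parser = TagFilter(keep, rewrite)
    parser.feed(text)
    parser.close()
    return "".join(parser.out)


# Hypothetical page fragment and URL-rewriting rule.
page = ('<p>See <a href="/old/home.html">my <b>home</b> page</a>'
        ' &amp; <img src="/old/pic.png"></p>')


def move(url):
    return url.replace("/old/", "/new/")


print(filter_html(page))                                    # strip every tag
print(filter_html(page, keep=("a", "img"), rewrite=move))   # whitelist + rewrite
```

The first call strips all tags; the second keeps only a and img and runs
their href / src values through move. Comments are dropped because
handle_comment is left at its do-nothing default; the text inside script
or style elements would still come through handle_data, so those would
need extra handling.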

-- 
++ Jonathan Hayward, jonathan.hayward at pobox.com
** To see an award-winning website with stories, essays, artwork,
** games, and a four-dimensional maze, why not visit my home page?
** All of this is waiting for you at http://JonathansCorner.com