[Tutor] re question
Jonathan Hayward http://JonathansCorner.com
jonathan.hayward at pobox.com
Fri Aug 8 22:17:04 EDT 2003
Jeff Shannon wrote:
> tpc at csua.berkeley.edu wrote:
>
>> hello Jonathan, you should use re.findall as re.match only returns the
>> first instance. By the way I would recommend the htmllib.HTMLParser
>> module instead of reinventing the wheel.
>>
>
> Indeed, it's not just reinventing the wheel. Regular expressions, by
> themselves, are insufficient to do proper HTML parsing, because re's
> don't remember state and can't deal with nested/branched data
> structures (which HTML/XML/SGML are). As someone else pointed out,
> you're likely to grab too much, or not enough. Anybody seriously
> trying to do anything with HTML should be using HTMLParser, *not* re.
>
Hmm...
I looked through the library docs on this, and tried to do it with re's
because learning HTMLParser looked like more work than using re's:
three hours of documentation search to avoid one hour of reinventing
the wheel.
What I'd like to do is:
  1. Find all instances of such-and-such between two tags (for which I've
     received a helpful response).
  2. Strip out all (or possibly all-but-whitelist) tags from an HTML page
     (substitute "" for "<.*?>" over multiple lines?).
  3. Iterate over links / images and selectively change the targets /
     sources (which would take me a lot of troubleshooting to do with RE).
Possibly other things as well. I'm not trying to compete with HTMLParser,
just to get some basic functionality. I'd welcome suggestions on how to do
this with HTMLParser.
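For reference, here is a rough sketch of how the tag-stripping and
link/image iteration might look with HTMLParser. It assumes the Python 3
module location html.parser (in Python 2 the module was named HTMLParser).
The class names TagStripper and LinkCollector are made up for
illustration, and this has only been tried on small well-formed snippets,
not real-world pages.

```python
# Assumption: Python 3, where the parser lives in html.parser.
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical helper: keep text, drop every tag not in a whitelist."""

    def __init__(self, whitelist=()):
        super().__init__(convert_charrefs=True)
        self.whitelist = set(whitelist)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.whitelist:
            # get_starttag_text() gives the start tag as written in the input.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>: emit once, with no separate end tag.
        if tag in self.whitelist:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.whitelist:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        self.parts.append(data)

    def text(self):
        return "".join(self.parts)


class LinkCollector(HTMLParser):
    """Hypothetical helper: record href of each <a> and src of each <img>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "img" and "src" in attrs:
            self.images.append(attrs["src"])


page = '<p>Hello <b>bold</b>\n<a href="a.html">link</a> <img src="x.png"></p>'

stripper = TagStripper(whitelist=["b"])
stripper.feed(page)
stripper.close()  # flush any text still buffered by the parser
stripped = stripper.text()

collector = LinkCollector()
collector.feed(page)
collector.close()
```

Because the parser tracks where each tag starts and ends, the multi-line
case needs no special handling, unlike "<.*?>" which needs re.DOTALL.
Actually changing the targets / sources would mean re-emitting each start
tag with the modified attributes; the collector above only covers the
"iterate over" half.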
--
++ Jonathan Hayward, jonathan.hayward at pobox.com
** To see an award-winning website with stories, essays, artwork,
** games, and a four-dimensional maze, why not visit my home page?
** All of this is waiting for you at http://JonathansCorner.com