Parsing HTML - modify URLs

Fuzzyman michael at foord.net
Wed Jul 7 06:35:14 EDT 2004


I am trying to parse an HTML page and only modify URLs within tags -
e.g. inside IMG, A, SCRIPT, FRAME tags etc...

I have built a parser using HTMLParser.HTMLParser and it works
fine.... on good HTML. Having done a Google search, it looks like
HTMLParser choking on dodgy HTML is a common theme.
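
A minimal sketch of the pass-through approach described above, written
against Python 3's html.parser.HTMLParser (the later name of
HTMLParser.HTMLParser). The names URL_ATTRS, URLRewriter and
rewrite_html are hypothetical, and the tag table is an assumption, not
a complete list. Recent Python 3 releases parse malformed markup
leniently instead of raising, which may sidestep the choking problem,
but this is only an illustration under those assumptions.

```python
from html import escape
from html.parser import HTMLParser

# Assumed table of tag -> attributes that carry URLs; extend as needed.
URL_ATTRS = {
    "a": ("href",),
    "img": ("src",),
    "script": ("src",),
    "frame": ("src",),
    "iframe": ("src",),
    "link": ("href",),
    "form": ("action",),
}


class URLRewriter(HTMLParser):
    """Re-emit the page as parsed, passing URL attributes through rewrite()."""

    def __init__(self, rewrite):
        # convert_charrefs=False keeps entities in text as separate events,
        # so they can be written back out unchanged.
        super().__init__(convert_charrefs=False)
        self.rewrite = rewrite
        self.out = []

    def _emit_tag(self, tag, attrs, closer=">"):
        parts = [tag]
        for name, value in attrs:
            if value is None:  # bare attribute such as "checked"
                parts.append(name)
                continue
            if name in URL_ATTRS.get(tag, ()):
                value = self.rewrite(value)
            # The parser unescapes attribute values, so escape them again.
            parts.append('%s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<" + " ".join(parts) + closer)

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, " />")

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.append("<?%s>" % data)


def rewrite_html(html, rewrite):
    """Return html with every known URL attribute replaced by rewrite(url)."""
    parser = URLRewriter(rewrite)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

The output is not byte-for-byte identical to the input: tag and
attribute names come back lowercased and attribute values are always
double-quoted. Text, comments and entities are written back as parsed.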

I would have difficulties using regular expressions as I want to
modify local reference (relative) URLs as well as absolute ones.
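
One way to treat relative and absolute URLs uniformly is to resolve
each one against the address of the page before modifying it. The
sketch below assumes Python 3's urllib.parse; make_rewriter, page_url
and prefix are hypothetical names, and routing every link through a
prefix is only an example of a modification, not necessarily the one
intended here.

```python
from urllib.parse import quote, urljoin


def make_rewriter(page_url, prefix):
    """Build a function that maps any URL found in the page to prefix + URL."""

    def rewrite(url):
        # urljoin resolves relative references against page_url and
        # returns already-absolute URLs unchanged.
        absolute = urljoin(page_url, url)
        # Percent-encode the whole URL so it can sit inside a query string.
        return prefix + quote(absolute, safe="")

    return rewrite


rewrite = make_rewriter("http://example.com/dir/page.html",
                        "/cgi-bin/fetch.py?url=")
print(rewrite("img/pic.png"))
print(rewrite("http://other.org/x"))
```

Both calls go through the same code path, so the rewriting step never
needs to know whether the original reference was local or absolute.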

It would be nice to just override the error handling of HTMLParser -
but short of digging in the source code it's not a documented
technique :-)

Does anyone have any suggestions? This is to go on a server as a CGI,
and I don't have shell access or anything like that, so I'd like to
avoid installing mxTidy. Does anyone know an HTML parsing library that
will allow me to write out most of the page unmodified and just modify
the contents of some of the tags?

Regards,

Fuzzy

http://www.voidspace.org.uk/atlantibots/pythonutils.html


