Html: replacing tags

Lee Harr missive at frontiernet.net
Fri Jun 13 17:55:32 EDT 2003


>> I'm working on an RSS aggregator and I'd like to replace all
>> img-tags in a piece of html with links to the image, thereby
>> using the alt-text of the img as link text (if present). The
>> rest of the html, including tags, should stay as-is. I'm capable
>> of doing this in what feels like the dumb way (parsing it with
>> regexes for example, or plain old string splitting and rejoining),
>> but I have this impression the HTMLParser or htmllib module should
>> be able to help me with this task.
>> 
>> However, I can't figure out how (if?) I can make a parser do this.
> 
> Yes, HTMLParser only parses, but you can subclass it and override its
> behaviour.  What I do is subclass HTMLParser and override all of the
> handler methods so that each one appends its arguments, nearly as-is,
> to a list kept on the instance. Then, when parsing has finished, you
> can retrieve this list and join it to get a string with the original HTML.
> 
> Of course, inside handle_starttag and handle_endtag you can test the tag
> being parsed and either insert it as-is or substitute something else.
> 
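
For what it's worth, the approach quoted above might look roughly like
the sketch below, applied to the original img-to-link question. It
assumes the Python 3 module html.parser (in the Python of this thread
the module was called HTMLParser). The names ImgToLink and
replace_images are made up for illustration, and end tags are written
back in lowercase rather than byte-for-byte.

```python
from html import escape
from html.parser import HTMLParser


class ImgToLink(HTMLParser):
    """Rebuild the input, replacing each <img> with a link to its src."""

    def __init__(self):
        # convert_charrefs=False: entities arrive through handle_entityref /
        # handle_charref, so they can be written back unchanged.
        super().__init__(convert_charrefs=False)
        self.pieces = []

    def _img_link(self, attrs):
        attrs = dict(attrs)
        src = attrs.get("src") or ""
        text = attrs.get("alt") or src  # no alt text: fall back to the URL
        return '<a href="%s">%s</a>' % (escape(src, quote=True), escape(text))

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.pieces.append(self._img_link(attrs))
        else:
            # Original source text of the tag, exactly as it was written.
            self.pieces.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <img ... /> or <br/>.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag != "img":  # a stray </img> is dropped along with its <img>
            self.pieces.append("</%s>" % tag)

    def handle_data(self, data):
        self.pieces.append(data)

    def handle_entityref(self, name):
        self.pieces.append("&%s;" % name)

    def handle_charref(self, name):
        self.pieces.append("&#%s;" % name)

    def handle_comment(self, data):
        self.pieces.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.pieces.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.pieces.append("<?%s>" % data)


def replace_images(html):
    parser = ImgToLink()
    parser.feed(html)
    parser.close()
    return "".join(parser.pieces)
```

Calling replace_images() on a fragment returns the same markup with
every img element turned into an a element; everything else is copied
through from the handler arguments.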


I needed to do something very similar recently. I was making a mirror
of a website to burn onto a CD-ROM, so all links needed to be made
relative instead of absolute.

It seems like this may be a very common thing to do (replacing tags).
If someone makes a general solution, it might be nice if this
functionality were in the standard library.

My solution was to get a list of the tags and then just call
line.replace(old_tag, new_tag)
on each line of the file.

The problem is that it tends to find things that should not be replaced.
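
To make that problem concrete, here is a small sketch that compares a
plain string replace with a parser that only touches href and src
attribute values. It assumes Python 3's html.parser. BASE, the
RelativeLinks class and make_relative() are hypothetical names, and
simply stripping the prefix only gives correct relative paths for pages
that sit at the site root.

```python
from html import escape
from html.parser import HTMLParser

BASE = "http://example.com/"  # hypothetical root of the site being mirrored


class RelativeLinks(HTMLParser):
    """Copy the input through, stripping a base URL from href/src only."""

    def __init__(self, base):
        super().__init__(convert_charrefs=False)  # keep &amp; etc. as written
        self.base = base
        self.pieces = []

    def _start(self, tag, attrs, end):
        urls = [v for k, v in attrs if k in ("href", "src") and v]
        if not any(u.startswith(self.base) for u in urls):
            self.pieces.append(self.get_starttag_text())  # untouched tag
            return
        out = ["<" + tag]
        for name, value in attrs:
            if value is None:  # attribute written without a value
                out.append(" " + name)
                continue
            if name in ("href", "src") and value.startswith(self.base):
                value = value[len(self.base):]
            out.append(' %s="%s"' % (name, escape(value, quote=True)))
        out.append(end)
        self.pieces.append("".join(out))

    def handle_starttag(self, tag, attrs):
        self._start(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        self._start(tag, attrs, " />")

    def handle_endtag(self, tag):
        self.pieces.append("</%s>" % tag)

    def handle_data(self, data):
        self.pieces.append(data)

    def handle_entityref(self, name):
        self.pieces.append("&%s;" % name)

    def handle_charref(self, name):
        self.pieces.append("&#%s;" % name)

    def handle_comment(self, data):
        self.pieces.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.pieces.append("<!%s>" % decl)


def make_relative(html, base=BASE):
    parser = RelativeLinks(base)
    parser.feed(html)
    parser.close()
    return "".join(parser.pieces)


page = ('<p>See <a href="http://example.com/docs/a.html">'
        'http://example.com/docs/a.html</a></p>')
naive = page.replace(BASE, "")   # also rewrites the visible link text
fixed = make_relative(page)      # only the href value changes
```

The naive version changes the text between the tags as well as the
attribute, which is the "finds things that should not be replaced"
effect; the parser-based one leaves character data alone.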




