Buffering HTML as HTMLParser reads it?

chrispwd at gmail.com
Wed Aug 1 14:31:18 EDT 2007


Hello,

I am working on a project where I'm using Python to parse HTML pages,
transforming the data between certain tags. Currently the HTMLParser
class is being used for this. In a nutshell, it's pretty simple -- I'm
feeding the contents of the HTML page to HTMLParser, then I am
overriding the appropriate handle_ method to handle the extracted
data. In that method, I take the found data and transform it into
another string based on some logic.
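
A minimal sketch of that setup, assuming Python 3's html.parser module
(the 2007-era module was named HTMLParser) and a hypothetical
transform() function standing in for the real logic, which is not
shown in this post:

```python
from html.parser import HTMLParser


def transform(text):
    # Hypothetical placeholder for the actual transformation logic.
    return text.upper()


class DataTransformer(HTMLParser):
    def __init__(self):
        # convert_charrefs=False passes text through as it appears in the page.
        super().__init__(convert_charrefs=False)
        self.transformed = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.transformed.append(transform(data))


parser = DataTransformer()
parser.feed("<p>hello</p>")
parser.close()
```

Here the transformed strings only end up in a list; nothing is written
back into the page yet.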

Now, what I would like to do here is take that transformed string and
put it "back into" the HTML document. Has anybody ever implemented
something like this with HTMLParser?

I'm thinking maybe HTMLParser could somehow append everything it
reads, except for the text between tags, to some kind of buffer? This
way the HTML contents are read into a buffer, and in my own handle_
overrides I can append the transformed data to that same buffer. Once
the HTML page is finished parsing, ideally I would be able to print
the contents of the buffer and the HTML would be identical except for
the string transformations.

I also need to make sure that all newlines, tags, spacing, etc. are
kept intact -- this part is a requirement for other reasons.
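
One possible way to build such a buffer, sketched under a few
assumptions: Python 3's html.parser module, the whole page available
as a single string, and a hypothetical transform() in place of the
real logic. Instead of re-emitting every tag from its own handler,
this variant uses getpos() to find where each text run sits in the
source, copies everything before it verbatim, and appends the
transformed text in its place. The names RewritingParser and rewrite
are made up for illustration.

```python
from html.parser import HTMLParser


def transform(text):
    # Hypothetical placeholder for the actual transformation logic.
    return text.upper()


class RewritingParser(HTMLParser):
    def __init__(self, source):
        # convert_charrefs=False keeps each data chunk identical to the
        # source text, so len(data) matches the span it occupies.
        super().__init__(convert_charrefs=False)
        self.source = source
        # Absolute index at which each line starts; getpos() reports
        # 1-based line numbers and 0-based column offsets.
        self.line_starts = [0]
        for i, ch in enumerate(source):
            if ch == "\n":
                self.line_starts.append(i + 1)
        self.buffer = []
        self.copied_to = 0  # how much of the source has been emitted so far

    def handle_data(self, data):
        lineno, col = self.getpos()
        start = self.line_starts[lineno - 1] + col
        # Copy tags, comments, entities, etc. since the last text run verbatim.
        self.buffer.append(self.source[self.copied_to:start])
        # Substitute the transformed text for the original run.
        self.buffer.append(transform(data))
        self.copied_to = start + len(data)

    def result(self):
        # Append whatever follows the last text run.
        return "".join(self.buffer) + self.source[self.copied_to:]


def rewrite(source):
    parser = RewritingParser(source)
    parser.feed(source)
    parser.close()
    return parser.result()


page = '<html>\n  <body class="a">\n    <p>hello <b>world</b></p>\n  </body>\n</html>\n'
output = rewrite(page)
```

Because everything outside the text runs is sliced straight from the
original string, tags, attribute quoting, and newlines come through
unchanged. Two caveats: a text run containing an entity such as &amp;
reaches handle_data in separate pieces, so transform() sees each piece
on its own, and the contents of script and style elements are also
delivered to handle_data.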

Thanks!



