Parsing HTML--looking for info/comparison of HTMLParser vs. htmllib modules.

Fri Jul 7 14:12:18 EDT 2006

I'm writing a program that will parse HTML and (mostly) convert it to 
MediaWiki format. The two Python modules I'm aware of to do this are 
HTMLParser and htmllib. However, I'm currently experiencing either real 
or conceptual difficulty with both, and was wondering if I could get 
some advice.

The problem I'm having with HTMLParser is simple; I don't seem to be 
getting the actual text in the HTML document. I've implemented the 
do_data method of HTMLParser.HTMLParser in my HTMLParser subclass, but 
it never seems to receive any data. Is there another way to access the 
text chunks as they come along?
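
For reference, here is a minimal sketch of receiving text chunks from
HTMLParser. The documented callback for text is handle_data; the
HTMLParser documentation does not list a do_data method, which would
explain why nothing arrives. The class name TextCollector and the
sample input are made up for illustration, and the try/except import
is only there so the sketch runs on both old and new Pythons.

```python
try:
    from html.parser import HTMLParser  # Python 3 location
except ImportError:
    from HTMLParser import HTMLParser  # Python 2 module named in this post


class TextCollector(HTMLParser):
    """Hypothetical subclass that records every text chunk it is handed."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        # The parser calls this for each run of text between tags.
        self.chunks.append(data)


collector = TextCollector()
collector.feed("<p>Hello, <i>world</i>!</p>")
collector.close()  # flush anything the parser is still buffering
print(collector.chunks)
```

Each call to handle_data delivers one run of text, so a single
paragraph can arrive as several chunks, split at the tag boundaries.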

HTMLParser would probably be the way to go if I can figure this out. It 
seems much simpler than htmllib, and satisfies my requirements.

htmllib will write out the text data (using the AbstractFormatter and
AbstractWriter), but my problem here is conceptual. I simply don't
understand why all of these different levels of abstraction are
necessary, nor how to use them. As an example, the HTML <i>text</i>
should be converted to ''text'' (two single quotes at each end) in my
MediaWiki markup output. This would obviously be easy to achieve if I
simply had an HTML parser that called a method for each start tag, text
chunk, and end tag. But htmllib calls the tag functions in HTMLParser,
and then does more things with both a formatter and a writer. To me,
both seem unnecessarily complex (though I suppose I can see the benefit
of a writer before generators made it possible to simply yield chunks
of output to be processed by external code). In any case, I don't
really have a good idea of what I should do with htmllib to get my
converted opening tags, then the content, then the converted closing
tags written out.
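
The "one method per start tag, text chunk, and end tag" model described
above can be sketched with HTMLParser alone, without a formatter or
writer. This is only an illustration under stated assumptions: the
names WikiConverter, MARKUP, and html_to_wiki are hypothetical, the tag
table covers just italic and bold, and unknown tags are silently
dropped while their text is kept.

```python
try:
    from html.parser import HTMLParser  # Python 3 location
except ImportError:
    from HTMLParser import HTMLParser  # Python 2 module named in this post


class WikiConverter(HTMLParser):
    """Hypothetical converter: maps a few HTML tags to MediaWiki markup."""

    # Assumed, deliberately tiny mapping; MediaWiki uses the same
    # delimiter to open and close italic ('') and bold (''').
    MARKUP = {"i": "''", "em": "''", "b": "'''", "strong": "'''"}

    def __init__(self):
        HTMLParser.__init__(self)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; unmapped tags emit nothing.
        self.out.append(self.MARKUP.get(tag, ""))

    def handle_endtag(self, tag):
        self.out.append(self.MARKUP.get(tag, ""))

    def handle_data(self, data):
        # Text between tags is passed through unchanged.
        self.out.append(data)


def html_to_wiki(html):
    """Feed a complete HTML string through WikiConverter and join the output."""
    parser = WikiConverter()
    parser.feed(html)
    parser.close()  # flush any text the parser is still buffering
    return "".join(parser.out)


print(html_to_wiki("some <i>text</i> here"))
```

Collecting the pieces in a list and joining them at the end keeps the
three callbacks trivial; a real converter would also need to handle
links, headings, lists, and entity references, which this sketch does
not attempt.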

Please feel free to point to examples, code, etc. Probably the simplest 
solution would be a way to process text content in HTMLParser.HTMLParser.

Thanks,
Ken