Parsing HTML--looking for info/comparison of HTMLParser vs. htmllib modules.

wes weston oweston at earthlink.net
Fri Jul 7 18:43:08 EDT 2006


import urllib

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    """Collect the non-blank text chunks of an HTML document."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.TokenList = []

    def handle_data(self, data):
        # HTMLParser calls handle_data (not do_data) for each run of text.
        data = data.strip()
        if data:
            self.TokenList.append(data)

    def GetTokenList(self):
        return self.TokenList


try:
    url = "http://....your url here.............."
    f = urllib.urlopen(url)
    res = f.read()
    f.close()
except IOError:
    # Nothing to parse if the page could not be fetched, so stop here.
    raise SystemExit("bad read")

h = MyHTMLParser()
h.feed(res)
tokensList = h.GetTokenList()
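
For the <i>text</i> to ''text'' conversion you describe, the same approach can probably be extended by also overriding handle_starttag and handle_endtag. What follows is only a rough sketch: the WikiConverter class, the WIKI_MARKUP table and the html_to_wiki helper are names made up for illustration, and the bold mapping is my own assumption about what you would want. The import fallback is there only so the sketch also runs on newer Pythons where the module is called html.parser.

```python
try:
    from html.parser import HTMLParser  # newer Pythons
except ImportError:
    from HTMLParser import HTMLParser   # the module used in this thread

# Hypothetical table: HTML tag -> MediaWiki markup placed at both ends.
# Only the <i> entry comes from the question; <b> is an assumed extra.
WIKI_MARKUP = {"i": "''", "b": "'''"}

class WikiConverter(HTMLParser):
    """Emit MediaWiki-style text from start tag, text and end tag events."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Tags missing from the table contribute nothing to the output.
        self.out.append(WIKI_MARKUP.get(tag, ""))

    def handle_endtag(self, tag):
        self.out.append(WIKI_MARKUP.get(tag, ""))

    def handle_data(self, data):
        # Text chunks are passed through unchanged.
        self.out.append(data)

    def result(self):
        return "".join(self.out)

def html_to_wiki(html):
    parser = WikiConverter()
    parser.feed(html)
    parser.close()  # flush any text the parser is still buffering
    return parser.result()

print(html_to_wiki("some <i>text</i> here"))  # prints: some ''text'' here
```

Tags that are not in the table are simply dropped here; a real converter would need entries (or separate methods) for links, headings, lists and so on.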


Kenneth McDonald wrote:
> I'm writing a program that will parse HTML and (mostly) convert it to 
> MediaWiki format. The two Python modules I'm aware of to do this are 
> HTMLParser and htmllib. However, I'm currently experiencing either real 
> or conceptual difficulty with both, and was wondering if I could get 
> some advice.
> 
> The problem I'm having with HTMLParser is simple; I don't seem to be 
> getting the actual text in the HTML document. I've implemented the 
> do_data method of HTMLParser.HTMLParser in my HTMLParser subclass, but 
> it never seems to receive any data. Is there another way to access the 
> text chunks as they come along?
> 
> HTMLParser would probably be the way to go if I can figure this out. It 
> seems much simpler than htmllib, and satisfies my requirements.
> 
> htmllib will write out the text data (using the AbstractFormatter and 
> AbstractWriter), but my problem here is conceptual. I simply don't 
> understand why all of these different "levels" of abstractness are 
> necessary, nor how to use them. As an example, the html <i>text</i> 
> should be converted to ''text'' (double single-quotes at each end) in my 
> mediawiki markup output. This would obviously be easy to achieve if I 
> simply had an html parser that called a method for each start tag, text 
> chunk, and end tag. But htmllib calls the tag functions in HTMLParser, 
> and then does more things with both a formatter and a writer. To me, 
> both seem unnecessarily complex (though I suppose I can see the benefits 
> of a writer before generators gave the opportunity to simply yield 
> chunks of output to be processed by external code.) In any case, I don't 
> really have a good idea of what I should do with htmllib to get my 
> converted tags, and then content, and then closing converted tags, 
> written out.
> 
> Please feel free to point to examples, code, etc. Probably the simplest 
> solution would be a way to process text content in HTMLParser.HTMLParser.
> 
> Thanks,
> Ken


