HTMLParser fragility

Wed Apr 5 06:33:25 EDT 2006

I've been using HTMLParser to scrape Web sites. The trouble is that 
there's a lot of malformed HTML out there. Real browsers have to be 
written to cope gracefully with it, but HTMLParser does not. Not only 
does it raise an exception on bad markup, but the parser object is left 
in a confused state afterwards, so you cannot continue using it.

The way I'm currently working around this is to do a preliminary 
parsing run with a dummy (non-subclassed) HTMLParser object. Every time 
I hit an HTMLParseError, I add its line number to a set of lines to 
skip, then create a new HTMLParser object and restart the scan from the 
beginning, skipping all the lines I've noted so far. Only when I get to 
the end without further errors do I do the proper parse, with all my 
real handler actions, again skipping the noted lines.
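
Here is a minimal sketch of that workaround, not the exact code I run. 
The names find_bad_lines and parse_skipping are made up for 
illustration, and so are the make_parser and error parameters. Two 
assumptions to be aware of: the sketch feeds the page one line at a 
time and blames whichever line was being fed when the exception 
arrived, rather than reading the line number off the exception; and 
the exception class is a parameter (pass HTMLParseError where the 
parser raises it) because newer versions of the standard parser are 
lenient and may never raise at all.

```python
from html.parser import HTMLParser


def find_bad_lines(text, make_parser=HTMLParser, error=Exception):
    """Pre-parse `text` and return the 0-based indexes of lines to skip.

    `make_parser` builds a fresh throwaway parser and `error` is the
    exception class treated as a parse failure; both are illustrative
    parameters, not part of the HTMLParser API.
    """
    lines = text.splitlines(keepends=True)
    skip = set()
    while True:
        # A parser that has raised is in a confused state, so start over
        # with a new one on every pass.
        parser = make_parser()
        last_fed = None
        try:
            for i, line in enumerate(lines):
                if i in skip:
                    continue
                last_fed = i
                parser.feed(line)
            parser.close()
            return skip  # reached the end without further errors
        except error:
            if last_fed is None:
                raise  # nothing was fed, so there is no line to blame
            # Assumption: blame the line being fed when the error occurred.
            skip.add(last_fed)


def parse_skipping(text, parser, skip):
    """Do the proper parse with the real parser, omitting the noted lines."""
    for i, line in enumerate(text.splitlines(keepends=True)):
        if i not in skip:
            parser.feed(line)
    parser.close()
```

Usage would be along the lines of skip = find_bad_lines(page) followed 
by parse_skipping(page, MyScraper(), skip), where MyScraper stands for 
whatever HTMLParser subclass carries the real handlers. Each failed 
pass adds one new line to the set, so the pre-parse restarts at most 
once per line of the page.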