HTMLParser fragility

Wed Apr 5 06:46:07 EDT 2006

Lawrence D'Oliveiro wrote:
> I've been using HTMLParser to scrape Web sites. The trouble with this 
> is, there's a lot of malformed HTML out there. Real browsers have to be 
> written to cope gracefully with this, but HTMLParser does not. Not only 
> does it raise an exception, but the parser object then gets into a 
> confused state after that so you cannot continue using it.
> 
> The way I'm currently working around this is to do a preliminary parsing 
> run with a plain (non-subclassed) HTMLParser object. Every time I hit 
> HTMLParseError, I note the line number in a set of lines to skip, then 
> create a new HTMLParser object and restart the scan from the beginning, 
> skipping all the lines I've noted so far. Only when I get to the end 
> without further errors do I do the proper parse with all my appropriate 
> actions.
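
For reference, the restart-and-skip loop described above could be sketched
roughly as follows. This is only an illustration, not your actual code. It
assumes the parser's exception carries a 1-based `lineno` attribute, as
HTMLParseError does. The names find_lines_to_skip, cleaned_text, make_parser
and error_type are made up for the example. Skipped lines are fed as blank
lines rather than dropped, so the line numbers reported on later passes still
refer to the original text.

```python
def find_lines_to_skip(text, make_parser, error_type):
    """Return the set of 1-based line numbers that make the parser fail.

    make_parser: zero-argument callable returning a fresh parser with
                 feed() and close() methods (hypothetical parameter).
    error_type:  exception class raised on bad markup; assumed to carry
                 a 1-based `lineno` attribute (hypothetical parameter).
    """
    lines = text.splitlines(keepends=True)
    skip = set()
    while True:
        parser = make_parser()  # a failed parser is discarded, never reused
        try:
            for number, line in enumerate(lines, start=1):
                # Feed a blank line in place of each known-bad line so that
                # the numbering of the remaining lines does not shift.
                parser.feed("\n" if number in skip else line)
            parser.close()
            return skip
        except error_type as exc:
            lineno = getattr(exc, "lineno", None)
            no_progress = (
                lineno is None
                or lineno in skip
                or not 1 <= lineno <= len(lines)
            )
            if no_progress:
                raise  # skipping lines cannot fix this error; give up
            skip.add(lineno)


def cleaned_text(text, skip):
    """Return text with every line listed in skip replaced by a blank line."""
    lines = text.splitlines(keepends=True)
    return "".join(
        "\n" if number in skip else line
        for number, line in enumerate(lines, start=1)
    )
```

The real parse would then be run once over cleaned_text(text, skip) with the
subclassed parser. Each failure costs a full rescan from the top, so a page
with many bad lines gets slow quickly.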

You could try HTML Tidy (via mxTidy, 
http://www.egenix.com/files/python/mxTidy.html) as a first step to get 
well-formed HTML before parsing.

Daniel