HTMLParser fragility
Daniel Dittmar
daniel.dittmar at sap.corp
Wed Apr 5 06:46:07 EDT 2006
Lawrence D'Oliveiro wrote:
> I've been using HTMLParser to scrape Web sites. The trouble with this
> is, there's a lot of malformed HTML out there. Real browsers have to be
> written to cope gracefully with this, but HTMLParser does not. Not only
> does it raise an exception, but the parser object then gets into a
> confused state after that so you cannot continue using it.
>
> The way I'm currently working around this is to do a dummy pre-parsing
> run with a dummy (non-subclassed) HTMLParser object. Every time I hit
> HTMLParseError, I note the line number in a set of lines to skip, then
> create a new HTMLParser object and restart the scan from the beginning,
> skipping all the lines I've noted so far. Only when I get to the end
> without further errors do I do the proper parse with all my appropriate
> actions.
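
The retry loop described in the quoted message can be sketched roughly as follows. This is only an illustration under stated assumptions: current versions of html.parser no longer raise HTMLParseError, so StrictParser and StrictParseError are hypothetical stand-ins invented here to have something that fails with a line number, and find_bad_lines is likewise a made-up name.

```python
from html.parser import HTMLParser


class StrictParseError(Exception):
    """Hypothetical stand-in for the old HTMLParseError; carries a 1-based line number."""

    def __init__(self, message, lineno):
        super().__init__(message)
        self.lineno = lineno


class StrictParser(HTMLParser):
    """Hypothetical strict parser: rejects an end tag that has no matching open tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            raise StrictParseError("unexpected </%s>" % tag, self.getpos()[0])
        # Discard everything opened after the matching start tag, then the tag itself.
        while self.open_tags.pop() != tag:
            pass


def find_bad_lines(text, parser_factory=StrictParser, error=StrictParseError):
    """Re-scan with a fresh parser after every error; return the line numbers to skip."""
    lines = text.splitlines(keepends=True)
    skip = set()
    while True:
        parser = parser_factory()  # a failed parser is never reused
        try:
            for number, line in enumerate(lines, start=1):
                # Feed a bare newline for skipped lines so line numbers stay aligned.
                parser.feed("\n" if number in skip else line)
            parser.close()
            return skip
        except error as exc:
            if exc.lineno in skip:
                raise  # no progress is possible, so give up instead of looping
            skip.add(exc.lineno)
```

Once find_bad_lines returns, the real parse can run with the same skip set applied. Each error costs another scan from the start, so the total work grows with the number of bad lines.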
You could try HTML Tidy, for instance through the mxTidy wrapper
(http://www.egenix.com/files/python/mxTidy.html), as a first step to get
well-formed HTML, and then feed the cleaned output to HTMLParser.
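
A minimal sketch of the tidy-first idea, assuming the tidy command-line program is installed and on the PATH, rather than going through mxTidy. The names run_tidy, TagCollector and scrape are made up for this illustration, and the input is returned unchanged when no tidy executable is found.

```python
import shutil
import subprocess
from html.parser import HTMLParser


def run_tidy(html):
    """Clean html with the tidy command-line tool (assumed to be on the PATH)."""
    exe = shutil.which("tidy")
    if exe is None:
        return html  # assumption: fall back to the raw input when tidy is missing
    result = subprocess.run(
        [exe, "-q", "-utf8", "-asxhtml", "--force-output", "yes"],
        input=html,
        capture_output=True,
        text=True,
        encoding="utf-8",
    )
    # tidy exits with 1 on warnings and 2 on errors, so the exit status is not checked.
    return result.stdout or html


class TagCollector(HTMLParser):
    """Toy parser standing in for the real scraping subclass."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def scrape(html, cleaner=run_tidy):
    """Run the cleanup step first, then parse the cleaned markup once."""
    parser = TagCollector()
    parser.feed(cleaner(html))
    parser.close()
    return parser.tags
```

Keeping the cleanup step as a replaceable callable means a different cleaner can be swapped in without touching the parser subclass.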
Daniel