Which HTMLParser?

Mon Dec 22 18:33:35 EST 2003

Jarek Zgoda <jzgoda at gazeta.usun.pl> writes:
[...]
> If you are not sure that your source is valid HTML, use SGML parser
> instead.

Note that htmllib is a simple subclass of sgmllib, so as far as
invalid HTML is concerned, sgmllib will give you the same results as
htmllib.

HTMLParser.HTMLParser can cope better with XHTML, and treats optional
or missing start/end tags more simply (i.e., better) than sgmllib /
htmllib.
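
To illustrate what "more simply" means here, a minimal sketch follows.
It assumes Python 3, where the HTMLParser module lives on as
html.parser; the TagLogger class and tag_events helper are
hypothetical names made up for this example. The parser does not try
to infer the missing </li> tags, it just reports the tags it actually
sees, and an XHTML-style <br/> arrives as a start event followed by an
end event.

```python
from html.parser import HTMLParser  # Python 3 home of the old HTMLParser module


class TagLogger(HTMLParser):
    """Hypothetical helper: record tag events exactly as the parser reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def tag_events(markup):
    """Feed markup to a fresh TagLogger and return the recorded events."""
    parser = TagLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


# The <li> end tags are omitted, and <br/> uses the XHTML empty-element form.
events = tag_events("<ul><li>one<li>two</ul><br/>")
for kind, tag in events:
    print(kind, tag)
```

No end event is ever generated for li: deciding what an omitted tag
implies is left to the calling code.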

> Personally I recommend F. Lundh's sgmlop -- fast, robust and
> well-written piece of software, real Meisterstueck. Works perfectly on
> Unix, Windows and IBM iSeries (formerly AS/400).

I don't think sgmlop is any more lenient, though.  And it's harder to
modify.

Use mxTidy or uTidylib to clean bad HTML.

John