Which HTMLParser?

John J. Lee jjl at pobox.com
Mon Dec 22 18:33:35 EST 2003


Jarek Zgoda <jzgoda at gazeta.usun.pl> writes:
[...]
> If you are not sure that your source is valid HTML, use SGML parser
> instead.

Note that htmllib is a simple subclass of sgmllib, so the results you
get from sgmllib will be the same as for htmllib as far as this
concern goes.

HTMLParser.HTMLParser copes better with XHTML, and treats optional
or missing start/end tags more simply (i.e. better) than sgmllib /
htmllib.
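
A minimal sketch of what "more simply" means in practice, assuming a
current Python where the class is importable as html.parser.HTMLParser
(in Python 2 it is HTMLParser.HTMLParser). The TagLogger class and the
sample markup are made up for illustration. The parser only reports the
tags it actually sees and makes no attempt to infer the missing </p>
tags:

```python
try:
    from html.parser import HTMLParser  # Python 3 location
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 location

class TagLogger(HTMLParser):
    """Hypothetical helper: records start/end tag events in order."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

parser = TagLogger()
# Two <p> elements with no closing tags, plus an XHTML-style empty tag.
parser.feed("<p>one<p>two<br/>")
parser.close()

for event in parser.events:
    print(event)
```

Neither <p> gets a synthesized end event, and the XHTML-style <br/> is
reported as a start event followed by an end event. Any nesting or
recovery policy is left to the subclass.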


> Personally I recommend F. Lundh's sgmlop -- fast, robust and
> well-written piece of software, real Meisterstueck. Works perfectly on
> Unix, Windows and IBM iSeries (formerly AS/400).

I don't think it's any more lenient, though, and it's harder to modify.

Use mxTidy or uTidylib to clean bad HTML.


John



