HTMLParser fragility

Thu Apr 6 14:16:40 EDT 2006

Richie Hindle wrote:
>
> But Tidy fails on huge numbers of real-world HTML pages.  Simple things like
> misspelled tags make it fail:
>
> >>> from mx.Tidy import tidy
> >>> results = tidy("<html><body><pree>Hello world!</pre></body></html>")

[Various error messages]

> Is there a Python HTML tidier which will do as good a job as a browser?

As pointed out elsewhere, libxml2 will attempt to parse HTML if asked
to:

>>> import libxml2dom
>>> d = libxml2dom.parseString("<html><body><pree>Hello world!</pre></body></html>", html=1)
>>> print d.toString()
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><pree>Hello world!</pree></body></html>

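Note how it repairs the mismatched tags: the stray </pre> is replaced
by a </pree> that matches the opening tag, although the misspelled
element name itself is left alone. The libxml2dom package is available
in the usual place: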

http://www.python.org/pypi/libxml2dom
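
For anyone who cannot install libxml2, the kind of repair shown above
can be roughly approximated with the standard library alone. What
follows is only a minimal sketch, not what libxml2 actually does
internally: it assumes Python 3 and its html.parser module, the names
TagBalancer and balance are made up for illustration, and it only deals
with tags, text and character references (comments and the doctype are
simply dropped).

```python
from html.parser import HTMLParser

# Void elements never take a closing tag, so they are not tracked as open.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TagBalancer(HTMLParser):
    """Hypothetical helper: re-emit HTML with start and end tags balanced."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.open_tags = []   # stack of currently open element names
        self.out = []         # pieces of the repaired document

    def handle_starttag(self, tag, attrs):
        # Copy the start tag through as written and remember it is open.
        self.out.append(self.get_starttag_text())
        if tag not in VOID:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through untouched.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag such as </pre>: drop it
        # Close everything opened since the matching start tag.
        while True:
            name = self.open_tags.pop()
            self.out.append("</%s>" % name)
            if name == tag:
                break

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def close(self):
        super().close()
        # Close whatever is still open at the end of the input.
        while self.open_tags:
            self.out.append("</%s>" % self.open_tags.pop())


def balance(html):
    """Return html with unmatched end tags dropped and open tags closed."""
    parser = TagBalancer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

With the example from earlier in the thread:

>>> print(balance("<html><body><pree>Hello world!</pre></body></html>"))
<html><body><pree>Hello world!</pree></body></html>

This is nowhere near as forgiving as a browser, but it shows the basic
idea of keeping a stack of open elements and closing them as needed.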

Paul