HTMLParser fragility
Paul Boddie
paul at boddie.org.uk
Thu Apr 6 14:16:40 EDT 2006
Richie Hindle wrote:
>
> But Tidy fails on huge numbers of real-world HTML pages. Simple things like
> misspelled tags make it fail:
>
> >>> from mx.Tidy import tidy
> >>> results = tidy("<html><body><pree>Hello world!</pre></body></html>")
[Various error messages]
> Is there a Python HTML tidier which will do as good a job as a browser?
As pointed out elsewhere, libxml2 will attempt to parse HTML if asked
to:
>>> import libxml2dom
>>> d = libxml2dom.parseString("<html><body><pree>Hello world!</pre></body></html>", html=1)
>>> print d.toString()
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><pree>Hello world!</pree></body></html>
Note how it fixes up the mismatched tags: the output closes <pree> with
a matching </pree> in place of the stray </pre>. The libxml2dom package is
available in the usual place:
http://www.python.org/pypi/libxml2dom
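
To make the tag fix-up in the output above a bit more concrete, here is a rough sketch of the same idea using only the standard library's html.parser (Python 3). It is not how libxml2 works internally, and the names TagFixer, fix_tags and VOID are made up for this illustration. The assumed rule is simple: an end tag that matches no open element is dropped, and any elements still open when an enclosing element closes (or when the input ends) are closed under their own names. Comments and the doctype are not re-emitted, and the void-element list is deliberately incomplete.

```python
from html.parser import HTMLParser

# Elements that never take an end tag (small, incomplete set for illustration).
VOID = {"br", "hr", "img", "input", "link", "meta"}


class TagFixer(HTMLParser):
    """Hypothetical helper: re-emits markup with balanced end tags."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.open_tags = []  # stack of currently open element names
        self.out = []        # pieces of the repaired document

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.open_tags.append(tag)
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: emit as written, nothing to track.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag (e.g. </pre> with no <pre> open): drop it
        # Close everything opened since the matching start tag.
        while self.open_tags:
            name = self.open_tags.pop()
            self.out.append("</%s>" % name)
            if name == tag:
                break

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def fix_tags(markup):
    """Return markup with stray end tags dropped and open elements closed."""
    fixer = TagFixer()
    fixer.feed(markup)
    fixer.close()
    # Close anything still open at the end of the input.
    while fixer.open_tags:
        fixer.out.append("</%s>" % fixer.open_tags.pop())
    return "".join(fixer.out)


print(fix_tags("<html><body><pree>Hello world!</pre></body></html>"))
```

For the example document this gives the same element structure as the libxml2 output above, without the added DOCTYPE. For real pages a proper HTML parser such as libxml2 remains the better choice, since this sketch knows nothing about HTML's nesting rules.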
Paul