HTMLParser fragility

Richie Hindle rjh at cyberscience.com
Wed Apr 5 07:30:58 EDT 2006


[Daniel]
> You could try HTMLTidy (http://www.egenix.com/files/python/mxTidy.html) 
> as a first step to get well formed HTML.

But Tidy fails on a huge number of real-world HTML pages.  Something as
simple as a misspelled tag makes it refuse to produce any tidied output:

>>> from mx.Tidy import tidy
>>> results = tidy("<html><body><pree>Hello world!</pre></body></html>")
>>> print results[3]
line 1 column 7 - Warning: inserting missing 'title' element
line 1 column 13 - Error: <pree> is not recognized!
line 1 column 13 - Warning: discarding unexpected <pree>
line 1 column 31 - Warning: discarding unexpected </pre>
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.

Is there a Python HTML tidier which will do as good a job as a browser?
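
For comparison, here is a minimal sketch of what "tolerant" looks like for
the same input. It assumes a recent Python 3 standard library, whose
html.parser is not the 2006-era HTMLParser module this thread is about.
EventLogger and parse_events are hypothetical names made up for this
example. It only logs the tags and text the parser reports. It does not
repair the markup or build a tree the way a browser would.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: record parse events rather than reject bad markup."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called for every start tag, including unrecognised ones like <pree>.
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        # Called for every end tag, even if no matching start tag was seen.
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def parse_events(markup):
    """Feed markup to the parser and return the list of recorded events."""
    parser = EventLogger()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    return parser.events


# Same misspelled-tag document that Tidy rejects above.
events = parse_events("<html><body><pree>Hello world!</pre></body></html>")
for kind, value in events:
    print(kind, value)
```

The mismatched <pree> and </pre> both come through as ordinary events, so
deciding what to do about them is left to the caller.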

-- 
Richie


