HTMLParser fragility
Richie Hindle
rjh at cyberscience.com
Wed Apr 5 07:30:58 EDT 2006
[Daniel]
> You could try HTMLTidy (http://www.egenix.com/files/python/mxTidy.html)
> as a first step to get well formed HTML.
But Tidy refuses to produce any output for a huge number of real-world HTML
pages. Something as simple as a misspelled tag makes it fail:
>>> from mx.Tidy import tidy
>>> results = tidy("<html><body><pree>Hello world!</pre></body></html>")
>>> print results[3]
line 1 column 7 - Warning: inserting missing 'title' element
line 1 column 13 - Error: <pree> is not recognized!
line 1 column 13 - Warning: discarding unexpected <pree>
line 1 column 31 - Warning: discarding unexpected </pre>
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.
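
For contrast, here is a minimal sketch of the browser-like tolerance being
asked for. It assumes a current Python 3, where the standard library's
html.parser reports unknown or mismatched tags as ordinary events instead of
raising. The names TagCollector and collect_events are hypothetical, made up
for this illustration. It only shows that parsing keeps going; it does not
repair or tidy the markup.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records parser events instead of rejecting bad tags."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Unknown tags such as <pree> arrive here like any other start tag.
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        # A mismatched </pre> is reported as-is; no error is raised.
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def collect_events(markup):
    """Feed markup to the tolerant parser and return the recorded events."""
    parser = TagCollector()
    parser.feed(markup)
    parser.close()
    return parser.events


events = collect_events("<html><body><pree>Hello world!</pre></body></html>")
for event in events:
    print(event)
```

The misspelled <pree> and the unmatched </pre> both come through as plain
events, so a caller can decide what to do with them rather than having the
whole document rejected.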
Is there a Python HTML tidier which will do as good a job as a browser?
--
Richie