Parsing broken HTML via Mozilla

G. S. Hayes sjdevnull at yahoo.com
Mon Aug 9 22:25:02 EDT 2004


Walter Dörwald <walter at livinglogic.de> wrote in message news:<mailman.1413.1092080863.5135.python-list at python.org>...

> Hello all!

Hi!

> I'm trying to parse broken HTML with several Python tools.
> Unfortunately none of them work 100% reliable.



What have you tried?



I've been using Tidy with pretty good results; there's a Python
wrapper called uTidylib available at http://utidylib.berlios.de



Make sure to turn on Tidy's "force-output" option; with it set, Tidy
does a reasonable job of parsing fairly broken HTML and can write the
result as plain HTML, XHTML, or several other formats (with lots of
tweaky knobs available to tune the output if you want to).
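
Roughly what that looks like, as a minimal sketch rather than a
definitive recipe. It assumes uTidylib is installed and importable as
"tidy", and that Tidy options are passed to tidy.parseString() as
keyword arguments with hyphens written as underscores (so Tidy's
force-output becomes force_output). The helper names tidy_options and
clean_html are made up for this example:

```python
def tidy_options(xhtml=True):
    """Build keyword options for tidy.parseString() (hypothetical helper)."""
    opts = {
        "force_output": 1,  # emit output even if Tidy reports errors
        "tidy_mark": 0,     # don't add the "generator" meta tag
    }
    # Pick the output flavour: XHTML or plain HTML.
    opts["output_xhtml" if xhtml else "output_html"] = 1
    return opts


def clean_html(broken, xhtml=True):
    """Run broken markup through Tidy and return the cleaned document."""
    # Imported here so the module loads even where uTidylib is missing.
    import tidy  # assumption: uTidylib provides this module

    doc = tidy.parseString(broken, **tidy_options(xhtml))
    return str(doc)
```

Something like clean_html("<p>unclosed <b>text") should then hand back
a complete document; anything Tidy complained about along the way is,
as far as I know, available on the returned document's errors
attribute before it is converted to a string.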


