Parsing broken HTML via Mozilla

Mon Aug 9 22:25:02 EDT 2004

Walter Dörwald <walter at livinglogic.de> wrote in message news:<mailman.1413.1092080863.5135.python-list at python.org>...

> Hello all!

Hi!

> I'm trying to parse broken HTML with several Python tools.
> Unfortunately none of them work 100% reliable.

What have you tried?

I've been using Tidy with pretty good results; there's a Python
wrapper called uTidylib available at http://utidylib.berlios.de

Make sure to turn on the "force output" option (force-output in Tidy's
configuration). With it, Tidy still produces output when it hits errors,
so it does a reasonable job of parsing fairly broken HTML. It can write
the result as plain HTML, XHTML, or other formats such as XML, and there
are lots of tweaky knobs available to tune the output if you want to.
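
Roughly what I mean, as a minimal sketch rather than a definitive recipe.
It assumes uTidylib is installed and importable as `tidy`, and that Tidy
options are passed as keyword arguments with "-" spelled as "_". The
helper names tidy_options and tidy_html are made up for this example.

```python
def tidy_options(xhtml=True):
    """Build keyword options for Tidy (hypothetical helper).

    Option names follow Tidy's configuration names with '-' replaced
    by '_', which is how uTidylib is assumed to accept them.
    """
    opts = {
        "force_output": 1,  # produce output even if Tidy reports errors
        "tidy_mark": 0,     # leave out the "generator" meta tag
    }
    # Choose the output flavour: XHTML or plain HTML.
    opts["output_xhtml" if xhtml else "output_html"] = 1
    return opts


def tidy_html(html, xhtml=True):
    """Run broken HTML through Tidy and return the cleaned markup."""
    import tidy  # uTidylib; imported lazily so the helper above works without it

    doc = tidy.parseString(html, **tidy_options(xhtml))
    return str(doc)
```

Calling tidy_html("<p>unclosed <b>bold") should hand back a complete
document that an ordinary XML or HTML parser can then read. Adjust the
options in tidy_options to taste.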