Parsing broken HTML via Mozilla

Mon Aug 9 20:17:01 EDT 2004

"Walter Dörwald" <walter at livinglogic.de> wrote in message
news:mailman.1413.1092080863.5135.python-list at python.org...
> Hello all!
>
> I'm trying to parse broken HTML with several Python tools.
> Unfortunately none of them works 100% reliably. Problems are
> e.g. nested comments, bare "&" in URLs and "<" in text (e.g.
> "if foo < bar") etc.
>
> All of these pages can be displayed properly in a browser
> so why not reuse the parser in e.g. Mozilla? Is there any
> way to get proper XML out of Mozilla? Calling mozilla on the
> command line would be OK, but it would be better if I could
> use Mozilla like a SAX parser. Is there any project that
> provides this functionality?
>
> Bye,
>     Walter Dörwald
>
Maybe you should preprocess your files with something like
http://www.zope.org/Members/chrisw/StripOGram
which can help you get rid of the stuff you don't want.
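
As a rough illustration of what such a preprocessing step could look
like, here is a minimal sketch using only Python's standard re module.
It is not StripOGram's API, and the function name escape_bare_markup is
made up for this example. It only covers two of the problems mentioned
above (bare "&" in URLs and "<" followed by something that cannot start
a tag); nested comments and cases like "foo <bar" would need more work.

```python
import re

# "&" that does not start a named, decimal, or hex character reference.
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)")

# "<" that is not followed by a letter, "/", "!" or "?", so it cannot
# open a tag, closing tag, comment/doctype, or processing instruction.
BARE_LT = re.compile(r"<(?![A-Za-z/!?])")


def escape_bare_markup(html):
    """Escape stray '&' and '<' so a stricter parser has a better chance.

    Hypothetical helper for illustration only; it is a heuristic and
    will miss some cases (e.g. 'foo <bar' looks like a tag start).
    """
    html = BARE_AMP.sub("&amp;", html)
    return BARE_LT.sub("&lt;", html)


if __name__ == "__main__":
    sample = '<a href="x?a=1&b=2">if foo < bar</a>'
    print(escape_bare_markup(sample))
```

The output of this could then be handed to whichever parser you are
already using; whether that is enough depends on how broken the pages are.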

Tom