libxml2dom - parsing malformed html

Tue Aug 26 12:02:12 EDT 2008

bruce wrote:
> I'm running a quick test with libxml2dom:
> 
> ===============
> import libxml2dom
> 
> aa=libxml2dom.parseString(foo)
> ff=libxml2dom.toString(aa)
> 
> print ff
> ===============
> 
> ----------------------------------
> when i start, foo is:
> <html>
> <body>
> </body>
> </html>
> 
> <html>
> <body>
> .
> .
> .
> </body>
> </html>
> -------------------------------
> when i print ff it's:
> <html>
> <body>
> </body>
> </html>
> -------------------------------
> 
> so it's as if parseString only reads the first "html" tree. I've
> reviewed as much as I can find on libxml2dom to try to figure out how
> I can get it to read/parse/handle both html trees/nodes.
> 
> I know, the html is malformed/screwed-up, but I can't seem to find any app
> (tidy/beautifulsoup) that can "know" which one of the html trees to throw
> out/remove!!
> 
> technically, both html trees are valid; it's just that they shouldn't both
> be in the same file!

What about splitting the string on "<html" and then parsing each part on its own?
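
A minimal sketch of that splitting step, using only the standard library. The helper name split_html_documents is made up for illustration, and it assumes every document starts at a literal "<html" tag (matched case-insensitively) and that "<html" never shows up inside a comment or script. Anything before the first "<html", such as a doctype, stays with the first chunk.

```python
import re

def split_html_documents(text):
    """Split text into chunks, each beginning at an "<html" start tag.

    Hypothetical helper: assumes "<html" only ever appears as a real
    start tag, not inside comments, scripts, or attribute values.
    """
    # Offsets of every "<html" start tag, matched case-insensitively.
    starts = [m.start() for m in re.finditer(r"<html\b", text, re.IGNORECASE)]
    if not starts:
        # No html tag at all: return the input as one chunk (if non-empty).
        stripped = text.strip()
        return [stripped] if stripped else []
    # The first chunk begins at offset 0 so a leading doctype is kept with it.
    bounds = [0] + starts[1:] + [len(text)]
    chunks = [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [c for c in chunks if c]
```

Each chunk can then be passed to libxml2dom.parseString on its own, the same way foo is in the test above. Deciding which of the resulting trees to keep is still up to you.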

Stefan