libxml2dom - parsing malformed html

bruce bedouglas at earthlink.net
Tue Aug 26 11:28:10 EDT 2008


Hi...

I'm running a quick test with libxml2dom:

===============
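import libxml2dom

# foo holds the html text shown below
aa = libxml2dom.parseString(foo)   # parse the text into a DOM document
ff = libxml2dom.toString(aa)       # serialize the document back to a string

print ff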
===============

----------------------------------
when i start, foo is:
<html>
<body>
</body>
</html>

<html>
<body>
.
.
.
</body>
</html>
-------------------------------
when i print ff it's:
<html>
<body>
</body>
</html>
-------------------------------

so it's as if parseString only reads the first "html" tree. i've reviewed
as much as i can find regarding libxml2dom to try to figure out how i can
get it to read/parse/handle both html trees/nodes.

i know, the html is malformed/screwed-up, but i can't seem to find any app
(tidy/beautifulsoup) that can "know" which one of the html trees to throw
out/remove!!

technically, each html tree is valid on its own; it's just that they
shouldn't both be in the same file!!!
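
one workaround that might do, as a rough sketch only: split the text into
separate <html>...</html> pieces before parsing, so each tree can be handed
to parseString on its own. the helper name split_html_documents is made up
for illustration (it is not part of libxml2dom), the second body here is
placeholder content, and the regex assumes the literal text "</html>" never
appears inside a comment or script.

```python
import re

# Hypothetical helper, not part of libxml2dom: find every
# <html>...</html> block in a string holding several concatenated documents.
_HTML_DOC = re.compile(r"<html\b.*?</html\s*>", re.IGNORECASE | re.DOTALL)


def split_html_documents(text):
    """Return each <html>...</html> block found in text, in order."""
    return _HTML_DOC.findall(text)


# Sample input shaped like the file described above; the <p> element is
# placeholder content standing in for the real body.
foo = """<html>
<body>
</body>
</html>

<html>
<body>
<p>second tree</p>
</body>
</html>
"""

docs = split_html_documents(foo)
# Each entry in docs is now a single-tree document that could be passed
# to libxml2dom.parseString separately, as in the test above.
```

this still doesn't decide which tree to throw out, but it would at least
let each one be parsed and inspected separately.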

thoughts/comments appreciated

thanks




