libxml2dom - parsing maligned html

Tue Aug 26 17:22:43 EDT 2008

hi paul...

so you're the guy behind the libxml2dom ehh..!! glad to say hey!

so this really is an issue with libxml2dom. ok, good, at least i know where
the issue is. and yeah, i know the real issue is the fact that the html
isn't valid!! shouldn't have multiple "html" trees...

from what i can tell, this isn't really solved via tidy/beautifulsoup
either, as a multiple html tree structure probably won't be looked at as
being invalid from a token perspective.

ok, i can somehow live with this, i can accommodate it. but tell me, when
the parse module/class for libxml2dom does its thing, why does it not go
forward on the tree when it comes to a </html>, if there's more text in the
string to process???
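
for reference, here's a rough sketch of the kind of accommodation i mean:
split the raw string at each closing </html> tag first, then hand every
chunk to the parser on its own. it uses only the standard library, and
split_html_documents is a made-up helper name, not anything from libxml2dom.

```python
import re

# Matches a closing </html> tag, ignoring case and stray whitespace.
_HTML_END = re.compile(r"</html\s*>", re.IGNORECASE)


def split_html_documents(text):
    """Split text holding several concatenated HTML documents into chunks.

    Hypothetical helper: each chunk ends at a closing </html> tag, so it
    can be parsed as a separate document.
    """
    chunks = []
    start = 0
    for match in _HTML_END.finditer(text):
        chunk = text[start:match.end()].strip()
        if chunk:
            chunks.append(chunk)
        start = match.end()
    # Keep any trailing text that follows the last </html>.
    tail = text[start:].strip()
    if tail:
        chunks.append(tail)
    return chunks
```

each chunk could then go through parseString separately. note that this is
a plain text split, so a literal "</html>" inside a script or comment would
cut a chunk short.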

oh, also, regarding screen parsing/crawling, i've seen a number of sites
that have discussed using a web testing app, like Selenium, and driving a
browser process, in order to really capture all the required data. any
thoughts on the pros/cons of this kind of approach to scraping data...

thanks

-bruce

-----Original Message-----
From: python-list-bounces+bedouglas=earthlink.net at python.org
[mailto:python-list-bounces+bedouglas=earthlink.net at python.org]On Behalf
Of Paul Boddie
Sent: Tuesday, August 26, 2008 8:48 AM
To: python-list at python.org
Subject: Re: libxml2dom - parsing maligned html

On 26 Aug, 17:28, "bruce" <bedoug... at earthlink.net> wrote:
> so it's as if the parseString only reads the initial "html" tree. i've
> reviewed as much as i can find regarding libxml2dom to try to figure out how
> i can get it to read/parse/handle both html trees/nodes.

Maybe there's some possibility to have libxml2 read directly from a
file descriptor and to stop after parsing the first document, leaving
the descriptor open; currently, this isn't supported by libxml2dom,
however. Another possibility is to feed text to libxml2 until it can
return a well-formed document, which I do as part of the
libxml2dom.xmpp module, but I don't really support this feature in the
public API.
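
As a minimal sketch of the "feed text until a document is complete" idea,
the following uses the standard library's html.parser as a stand-in for
libxml2. It is not how libxml2dom.xmpp is implemented, and the names
FirstDocumentWatcher and read_first_document are hypothetical. Input is fed
one line at a time until the first closing </html> tag has been seen; the
unread lines are handed back so they can be parsed as a second document.

```python
from html.parser import HTMLParser


class FirstDocumentWatcher(HTMLParser):
    """Sets `done` once the first closing </html> tag has been seen."""

    def __init__(self):
        super().__init__()
        self.done = False

    def handle_endtag(self, tag):
        # HTMLParser passes tag names in lower case.
        if tag == "html":
            self.done = True


def read_first_document(lines):
    """Feed lines one at a time; return (first_document, remaining_lines)."""
    watcher = FirstDocumentWatcher()
    consumed = []
    lines = iter(lines)
    for line in lines:
        consumed.append(line)
        watcher.feed(line)
        if watcher.done:
            break
    # Whatever the loop did not consume belongs to later documents.
    return "".join(consumed), list(lines)
```

The granularity here is one line, so if two documents share a line, the
start of the second ends up in the first result.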

Again, improvements to libxml2dom may happen if I find the time to do
them.

Paul