libxml2dom - parsing maligned html

Paul Boddie paul at boddie.org.uk
Wed Aug 27 04:44:15 EDT 2008


On 26 Aug, 23:22, "bruce" <bedoug... at earthlink.net> wrote:
>
> ok, i can somehow live with this, i can accommodate it. but tell me, when
> the parse module/class for libxml2dom does its thing, why does it not go
> forward on the tree when it comes to a </html>, if there's more text in the
> string to process???

I imagine that libxml2, which actually does the parsing, stops doing
its work when it has successfully closed all open elements. Perhaps
there's a way of making it go on and potentially complain about
trailing input.
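
As a rough sketch of how one might at least detect such trailing input
before handing the string to the parser: the following uses only the
standard library, and split_trailing_input is a hypothetical helper of
my own naming, not part of libxml2dom or libxml2. It assumes the
parser stops at the first </html>, and the regular expression is
naive about </html> appearing inside comments or scripts.

```python
import re

# Hypothetical helper; not part of libxml2dom or libxml2.
# Matches a closing html tag in any case, allowing whitespace before ">".
_CLOSE_HTML = re.compile(r"</html\s*>", re.IGNORECASE)

def split_trailing_input(text):
    """Split text just after the first </html> tag.

    Returns (document, trailing). If no closing tag is found, the whole
    text is returned as the document and trailing is empty.
    """
    match = _CLOSE_HTML.search(text)
    if match is None:
        return text, ""
    return text[:match.end()], text[match.end():]

s = "<html><body><p>one</p></body></html>\n<p>stray text</p>"
document, trailing = split_trailing_input(s)
if trailing.strip():
    # Report whatever the parser would probably never look at.
    print("trailing input after </html>: %r" % trailing)
```

The document part can then be parsed as usual, and the trailing part
inspected or parsed separately.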

> oh, also, regarding screen parsing/crawling, i've seen a number of sites
> that have discussed using a web testing app, like Selenium, and driving a
> browser process, in order to really capture all the required data. any
> thoughts on the pros/cons of this kind of approach to scraping data...

Once upon a time I used the KPartPlugins to automate Konqueror,
combining them with a DOM implementation, qtxmldom, which let me read
the contents of Web pages from a real browser. Unfortunately, that
technology doesn't work with recent versions of KDE (or PyKDE), and
attempts to use Mozilla via PyXPCOM weren't successful. If you wanted
to pursue this route, my advice would be to ask the Mozilla people,
particularly those who work with PyXPCOM. An alternative might be to
look into the state of bindings for the WebKit browser technologies.

Paul
