python screen scraping/parsing

Paul Boddie paul at boddie.org.uk
Fri Jun 13 15:49:23 EDT 2008


On 13 Jun, 20:10, "bruce" <bedoug... at earthlink.net> wrote:
>
> url ="http://www.pricegrabber.com/rating_summary.php/page=1"

[...]

>         tr =
> "/html/body/div[@id='pgSiteContainer']/div[@id='pgPageContent']/table[2]/tbody/tr[4]"
>
>         tr_=d.xpath(tr)

[...]

> my issue appears to be related to the last "tbody", or tbody/tr[4]...
>
> if i leave off the tbody, i can display data, as the tr_ is an array with
> data...

Yes, I can confirm this.

> with the "tbody" it appears that the tr_ array is not defined, or it has no
> data... however, i can use the DOM tool with firefox to observe the fact
> that the "tbody" is there...

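Yes, but the DOM tool in Firefox shows the tree the browser built after
parsing the page, not the markup as it was sent. HTML allows the tbody
tags to be omitted, and the browser adds the implied tbody element
itself when it builds its DOM. A parser that only reports what is
actually in the markup will not have that node.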

You can confirm that there really is no tbody by serialising the table
and printing it:

table = d.xpath("/html/body/div[@id='pgSiteContainer']"
                "/div[@id='pgPageContent']/table[2]")[0]
print(table.toString())

The query should return a single-element list containing the second
table, and [0] picks out that element so that toString() can serialise
it. You'll see that the raw HTML doesn't have any tbody tags at all.
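
To make the effect easy to reproduce without fetching the page, here is
a minimal sketch using only the standard library's
xml.etree.ElementTree rather than the parser used above. The SAMPLE
markup is a made-up stand-in for the real page (it is an assumption,
not the actual pricegrabber HTML), and ElementTree only supports a
subset of XPath, but the point is the same: a path that goes through
tbody matches nothing when the markup has no tbody tags.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the real page: tables written without tbody tags.
SAMPLE = """
<html><body>
  <div id="pgPageContent">
    <table><tr><td>first table</td></tr></table>
    <table>
      <tr><td>row 1</td></tr>
      <tr><td>row 2</td></tr>
    </table>
  </div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)  # root is the <html> element

# Second table inside the content div; findall returns a list.
table = root.findall("./body/div[@id='pgPageContent']/table[2]")[0]

with_tbody = table.findall("./tbody/tr")   # path copied from a browser DOM view
without_tbody = table.findall("./tr")      # path matching the markup as written
any_depth = table.findall(".//tr")         # matches whether or not tbody exists

print(len(with_tbody))     # 0
print(len(without_tbody))  # 2
print(len(any_depth))      # 2
print(ET.tostring(table, encoding="unicode"))
```

The ".//tr" form is handy when you cannot be sure whether the parser
supplies a tbody, although it would also pick up rows of any nested
tables.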

Paul


