Using Xpath to parse a Yahoo Finance page

MRAB python at mrabarnett.plus.com
Sun Dec 2 21:25:45 EST 2012


On 2012-12-03 01:23, Jason Hsu wrote:
> I'm trying to extract the data on "total assets" from Yahoo Finance using Python 2.7 and lxml.
>
> Here is a special test script I set up to work on this issue:
>
>      import urllib
>      import lxml
>      import lxml.html
>
>      url_local1 = "http://www.smartmoney.com/quote/FAST/?story=financials&timewindow=1&opt=YB&isFinprint=1&framework.view=smi_emptyView"
>      result1 = urllib.urlopen(url_local1)
>      element_html1 = result1.read()
>      doc1 = lxml.html.document_fromstring (element_html1)
>      list_row1 = doc1.xpath(u'.//th[div[text()="Total Assets"]]/following-sibling::td/text()')
>      print list_row1
>
>      url_local2 = "http://finance.yahoo.com/q/bs?s=FAST"
>      result2 = urllib.urlopen(url_local2)
>      element_html2 = result2.read()
>      doc2 = lxml.html.document_fromstring (element_html2)
>      list_row2 = doc2.xpath(u'.//td[strong[text()="Total Assets"]]/following-sibling::td/strong/text()')
>      print list_row2
>
> I'm able to get the row of data on total assets from the Smartmoney page, but I get just an empty list when I try to parse the Yahoo Finance page.
>
The problem is that text()="Total Assets" asks for an exact match on
the text node.

If you look at the HTML itself, you'll see that there's whitespace
around the "Total Assets" part, so the comparison fails and the
predicate selects nothing.

This should work:

list_row2 = doc2.xpath(u'.//td[strong[contains(text(),"Total Assets")]]/following-sibling::td/strong/text()')

(Although I tested it in Python 3.2.)
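
If you want to see the difference without fetching the page, here is a minimal sketch. The HTML fragment and the values 100 and 200 are made up for illustration; they only imitate the whitespace around the label and are not taken from the real Yahoo page. The normalize-space() variant is an alternative I'm adding here: it is a standard XPath 1.0 function that strips the surrounding whitespace, so the comparison can stay an exact match.

```python
import lxml.html

# Hypothetical fragment: the label is surrounded by whitespace, as on the
# page described above. The numbers are placeholders, not real figures.
SAMPLE_HTML = """
<table>
  <tr>
    <td><strong>
      Total Assets
    </strong></td>
    <td><strong>100</strong></td>
    <td><strong>200</strong></td>
  </tr>
</table>
"""

doc = lxml.html.document_fromstring(SAMPLE_HTML)

# Exact comparison: fails because the text node includes the whitespace.
exact = doc.xpath('.//td[strong[text()="Total Assets"]]'
                  '/following-sibling::td/strong/text()')

# Substring comparison: tolerant of the surrounding whitespace.
loose = doc.xpath('.//td[strong[contains(text(),"Total Assets")]]'
                  '/following-sibling::td/strong/text()')

# Exact comparison after trimming and collapsing whitespace.
normalized = doc.xpath('.//td[strong[normalize-space(text())="Total Assets"]]'
                       '/following-sibling::td/strong/text()')

print(exact)
print(loose)
print(normalized)
```

Note that contains() would also match a longer label that merely includes "Total Assets", whereas the normalize-space() form only matches that exact label.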


