Using Xpath to parse a Yahoo Finance page
Stefan Behnel
stefan_ml at behnel.de
Mon Dec 3 01:44:07 EST 2012
MRAB, 03.12.2012 03:25:
> On 2012-12-03 01:23, Jason Hsu wrote:
>> I'm trying to extract the data on "total assets" from Yahoo Finance using
>> Python 2.7 and lxml.
>>
>> Here is a special test script I set up to work on this issue:
>>
>> import urllib
>> import lxml
>> import lxml.html
>>
>> url_local1 = "http://www.smartmoney.com/quote/FAST/?story=financials&timewindow=1&opt=YB&isFinprint=1&framework.view=smi_emptyView"
>>
>> result1 = urllib.urlopen(url_local1)
>> element_html1 = result1.read()
>> doc1 = lxml.html.document_fromstring (element_html1)
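The last three lines are unnecessarily complicated. Just use

    doc1 = lxml.html.parse(url_local1)

Note that parse() returns a tree rather than an element. The tree supports
.xpath() as well; call .getroot() on it if you need the root element.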
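As a rough sketch of that shorter form: parse() also accepts a file-like
object, so the example below feeds it an in-memory page instead of a URL to
stay self-contained. The markup and the figure in it are made up for
illustration (they are not taken from either site), and the code is written
for Python 3.

```python
import io

import lxml.html

# Made-up stand-in for a downloaded page; the value is a placeholder.
html_bytes = (
    b"<html><body><table><tr>"
    b"<td><strong>Total Assets</strong></td>"
    b"<td><strong>1,000</strong></td>"
    b"</tr></table></body></html>"
)

# parse() takes a filename, a URL or a file-like object and returns a tree.
tree = lxml.html.parse(io.BytesIO(html_bytes))
root = tree.getroot()  # the <html> element

cells = root.xpath(
    './/td[strong[text()="Total Assets"]]/following-sibling::td/strong/text()'
)
print(cells)
```

With a real page, the BytesIO object would simply be replaced by the URL or
a local file name.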
>> list_row1 = doc1.xpath(u'.//th[div[text()="Total Assets"]]/following-sibling::td/text()')
>> print list_row1
>>
>> url_local2 = "http://finance.yahoo.com/q/bs?s=FAST"
>> result2 = urllib.urlopen(url_local2)
>> element_html2 = result2.read()
>> doc2 = lxml.html.document_fromstring (element_html2)
>> list_row2 = doc2.xpath(u'.//td[strong[text()="Total Assets"]]/following-sibling::td/strong/text()')
>> print list_row2
>>
>> I'm able to get the row of data on total assets from the Smartmoney page,
>> but I get just an empty list when I try to parse the Yahoo Finance page.
>>
> The problem is that you're asking it to look for an exact match.
>
> If you look at the HTML itself, you'll see that there's whitespace
> around the "Total Assets" part.
>
> This should work:
>
> list_row2 = doc2.xpath(u'.//td[strong[contains(text(),"Total Assets")]]/following-sibling::td/strong/text()')
An expression like contains(text(), "Total Assets") is better written as
contains(., "Total Assets"). The latter tests the complete text content of
the element, whereas the former only looks at its first text node.
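To illustrate the difference, here is a small sketch on made-up markup (not
the actual Yahoo page) in which the label is split across a nested <em> tag.
The table content and the value are hypothetical, and the code is written
for Python 3.

```python
import lxml.html

# Made-up markup: "Total Assets" is spread over two text nodes.
doc = lxml.html.document_fromstring(
    "<html><body><table><tr>"
    "<td><strong>Total <em>Assets</em></strong></td>"
    "<td><strong>100</strong></td>"
    "</tr></table></body></html>"
)

# text() selects the child text nodes of <strong>; contains() only uses the
# first one ("Total "), so the label is not found.
by_first_text_node = doc.xpath(
    './/td[strong[contains(text(), "Total Assets")]]'
    '/following-sibling::td/strong/text()'
)

# "." is the string value of <strong>, i.e. all of its text content joined
# together, so the label is found.
by_string_value = doc.xpath(
    './/td[strong[contains(., "Total Assets")]]'
    '/following-sibling::td/strong/text()'
)

print(by_first_text_node)
print(by_string_value)
```

The first query returns an empty list here, while the second one returns the
cell value.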
Stefan