[Tutor] weather scraping with Beautiful Soup
Stefan Behnel
stefan_ml at behnel.de
Fri Jul 17 14:40:40 CEST 2009
Che M wrote:
> <div class="blueBox">
> <div id="curcondbox">
> <div class="subG b">West of Town, Jamestown, Pennsylvania (PWS)</div>
> <div class="bm10">Updated: <span class="pwsrt" pwsid="KPAJAMES1" pwsunit="english" pwsvariable="lu" value="1247814018">3:00 AM EDT on July 17, 2009</span></div>
> <table cellspacing="0" cellpadding="0" class="full">
> <tr>
> <td class="vaT full">
> <table cellspacing="0" cellpadding="5" class="full">
> <tr>
> <td class="vaM taC"><img src="http://icons-pe.wxug.com/i/c/a/nt_clear.gif" width="42" height="42" alt="Clear" class="condIcon" /></td>
> <td class="vaM taC full">
> <div style="font-size: 17px;"><span class="pwsrt" pwsid="KPAJAMES1" pwsunit="english" pwsvariable="tempf" english="°F" metric="°C" value="60.3">
> <span class="nobr"><span class="b">60.3</span> °F</span>
> </span></div>
>
> The 60.3 is the value I want to extract. It appears to be down within a hierarchy
> something like:
>
> <body
> <div class="blueBox">
> <div id="curcondbox">
> <table
> <table
> <div>
> <span class="nobr">
> <span class="b">
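You may consider using lxml's cssselect support:

    from lxml import html
    # parse() returns an ElementTree; cssselect() is a method of the root element
    doc = html.parse("http://some/url/to/parse.html").getroot()
    # class names are matched case-sensitively, so it must be "blueBox"
    spans = doc.cssselect("div.blueBox > #curcondbox span.b")
    print(spans[0].text)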
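However, I'd rather go for the other "60.3" value, the one in the "value" attribute of the span with pwsvariable="tempf", using XPath:

    # xpath() returns a list with one attribute string per matching span
    print(doc.xpath('//span[@pwsvariable="tempf"]/@value'))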
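
For trying the attribute approach without fetching the page, here is a minimal sketch that runs on a trimmed copy of the markup quoted above. It is an assumption-laden stand-in, not the lxml code: it uses the standard library's html.parser (Python 3) in place of lxml, and the names SNIPPET, TempfFinder and extract_tempf are made up for illustration. The selection rule is the same as in the XPath expression: take the "value" attribute of every span whose pwsvariable is "tempf".

```python
from html.parser import HTMLParser

# Trimmed copy of the quoted markup; only the relevant span is kept.
SNIPPET = '''
<div id="curcondbox">
<div style="font-size: 17px;"><span class="pwsrt" pwsid="KPAJAMES1" pwsunit="english" pwsvariable="tempf" english="°F" metric="°C" value="60.3">
<span class="nobr"><span class="b">60.3</span> °F</span>
</span></div>
</div>
'''


class TempfFinder(HTMLParser):
    """Collect the value attribute of every <span pwsvariable="tempf">."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        attr_map = dict(attrs)
        if tag == "span" and attr_map.get("pwsvariable") == "tempf":
            self.values.append(attr_map.get("value"))


def extract_tempf(markup):
    """Return the matching temperatures as floats (hypothetical helper)."""
    finder = TempfFinder()
    finder.feed(markup)
    finder.close()
    return [float(v) for v in finder.values if v is not None]


temps = extract_tempf(SNIPPET)
print(temps)
```

Reading the attribute rather than the text of span.b keeps the lookup tied to the pwsvariable name, so it should not depend on how deeply the display markup is nested.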
Stefan