Parsing HTML

Samuel Karl Peterson skpeterson at nospam.please.ucdavis.edu
Sun Feb 11 07:42:27 EST 2007


"mtuller" <mituller at gmail.com> on 10 Feb 2007 15:03:36 -0800 didst
step forth and proclaim thus:

> Alright. I have tried everything I can find, but am not getting
> anywhere. I have a web page that has data like this:

[snip]

> What is shown is only a small section.
> 
> I want to extract the 33,699 (which is dynamic) and set the value to a
> variable so that I can insert it into a database.

[snip]

> I have also tried Beautiful Soup, but had trouble understanding the
> documentation.

====================
from BeautifulSoup import BeautifulSoup as parser

soup = parser("""<tr >
<td headers="col1_1"  style="width:21%"   >
<span  class="hpPageText" >LETTER</span></td>
<td headers="col2_1"  style="width:13%; text-align:right"   >
<span  class="hpPageText" >33,699</span></td>
<td headers="col3_1"  style="width:13%; text-align:right"   >
<span  class="hpPageText" >1.0</span></td>
<td headers="col4_1"  style="width:13%; text-align:right"   >
</tr>""")

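# Find the first <td> whose headers attribute is "col2_1", take the text
# inside its <span>, strip the thousands separator and convert to int.
value = int(
    soup.find('td', headers='col2_1').span.contents[0].replace(',', ''))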
====================

> Thanks,

> Mike

Hope that helped.  This code assumes there aren't any td tags with
headers="col2_1" that come before the value you are trying to extract,
because find() returns only the first match.
There are several ways to do things in BeautifulSoup.  You should play
around with BeautifulSoup in the interactive prompt.  It's simply
awesome as long as parsing speed isn't a concern.
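
If the page has more than one cell with headers="col2_1", the
first-match assumption above no longer holds.  Here is a rough sketch
of collecting every matching cell instead.  Note that it uses
html.parser from the Python 3 standard library rather than
BeautifulSoup.  The names CellCollector and extract_counts are made up
for illustration, and the second row of the sample (LEGAL / 1,204) is
invented data, not taken from your page.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> whose headers attribute matches."""

    def __init__(self, headers_value):
        super().__init__()
        self.headers_value = headers_value
        self.in_cell = False   # True while inside a matching <td>
        self.values = []       # one text entry per matching <td>

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            # attrs is a list of (name, value) pairs
            self.in_cell = dict(attrs).get("headers") == self.headers_value
            if self.in_cell:
                self.values.append("")

    def handle_endtag(self, tag):
        # Leaving the cell (or the whole row, if a </td> is missing)
        if tag in ("td", "tr"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.values[-1] += data


def extract_counts(html, headers_value="col2_1"):
    """Return every matching cell as an int, thousands separators removed."""
    collector = CellCollector(headers_value)
    collector.feed(html)
    collector.close()
    return [int(v.strip().replace(",", ""))
            for v in collector.values if v.strip()]


# Hypothetical sample: the first row mirrors the original snippet,
# the second row is invented to show multiple matches.
sample = """<tr>
<td headers="col1_1"><span class="hpPageText">LETTER</span></td>
<td headers="col2_1"><span class="hpPageText">33,699</span></td>
</tr>
<tr>
<td headers="col1_1"><span class="hpPageText">LEGAL</span></td>
<td headers="col2_1"><span class="hpPageText">1,204</span></td>
</tr>"""

counts = extract_counts(sample)
print(counts)
```

You can then pick the entry you want by position, or pair each one
with its col1_1 label before inserting into the database.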

-- 
Sam Peterson
skpeterson At nospam ucdavis.edu
"if programmers were paid to remove code instead of adding it,
software would be much better" -- unknown


