Beautiful Soup Table Parsing

Andreas Perstinger andipersti at gmail.com
Thu Aug 9 03:25:49 EDT 2012


On 09.08.2012 01:58, Tom Russell wrote:
> For instance this code below:
>
> soup = BeautifulSoup(urlopen('http://online.wsj.com/mdc/public/page/2_3021-tradingdiary2.html?mod=mdc_pastcalendar'))
>
> table = soup.find("table",{"class": "mdcTable"})
> for row in table.findAll("tr"):
>      for cell in row.findAll("td"):
>          print cell.findAll(text=True)
>
> brings in a list that looks like this:

[snip]

> What I want to do is only be getting the data for NYSE and nothing
> else so I do not know if that's possible or not. Also I want to do
> something like:
>
> If cell.contents[0] == "Advances":
>      Advances = next cell or whatever??---> this part I am not sure how to do.
>
> Can someone help point me in the right direction to get the first data
> point for the Advances row? I have others I will get as well but
> figure once I understand how to do this I can do the rest.

To get the header row you could do something like:

# the "is not None" guard skips tags that contain no <td> at all
header_row = table.find(
    lambda tag: tag.td is not None and tag.td.string == "NYSE")

From there you can look for the next row you are interested in:

advances_row = header_row.findNextSibling(
    lambda tag: tag.td is not None and tag.td.string == "Advances")
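
As a rough sketch of how the value next to "Advances" could then be read, here is a self-contained version. It is only an illustration under several assumptions: it uses the newer bs4 package on Python 3 (so find_next_sibling and find_all instead of findNextSibling and findAll), the SAMPLE_HTML markup and its numbers are made up rather than taken from the WSJ page, and first_cell_is is a hypothetical helper, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup and values; the real page's layout may differ.
SAMPLE_HTML = """
<table class="mdcTable">
  <tr><td>NYSE</td><td>Latest close</td></tr>
  <tr><td>Issues traded</td><td>3,100</td></tr>
  <tr><td>Advances</td><td>1,500</td></tr>
  <tr><td>NASDAQ</td><td>Latest close</td></tr>
  <tr><td>Advances</td><td>1,200</td></tr>
</table>
"""


def first_cell_is(label):
    """Build a matcher for <tr> tags whose first <td> holds exactly `label`."""
    def matches(tag):
        return (tag.name == "tr"
                and tag.td is not None
                and tag.td.string == label)
    return matches


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", {"class": "mdcTable"})

# Locate the section header, then the first "Advances" row after it.
header_row = table.find(first_cell_is("NYSE"))
advances_row = header_row.find_next_sibling(first_cell_is("Advances"))

# The label sits in the first cell, so the first data point is the second.
cells = advances_row.find_all("td")
nyse_advances = cells[1].get_text(strip=True)
print(nyse_advances)  # prints 1,500 for the sample markup above
```

Because the search starts at the NYSE header row, the "Advances" row of a later section is never reached first.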

You could also iterate through all next siblings of the header_row:

for row in header_row.findNextSiblings("tr"):
    pass  # do something with each row
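
One possible way to fill in the loop body, so that only the NYSE rows are collected, is sketched below. Again this assumes the bs4 package on Python 3 and invented sample markup. The OTHER_SECTIONS set is a hypothetical list of labels that mark the start of the next section; on the real page you would have to check what those header cells actually contain.

```python
from bs4 import BeautifulSoup

# Hypothetical markup and values; the real page's layout may differ.
SAMPLE_HTML = """
<table class="mdcTable">
  <tr><td>NYSE</td><td>Latest close</td></tr>
  <tr><td>Issues traded</td><td>3,100</td></tr>
  <tr><td>Advances</td><td>1,500</td></tr>
  <tr><td>Declines</td><td>1,400</td></tr>
  <tr><td>NASDAQ</td><td>Latest close</td></tr>
  <tr><td>Advances</td><td>1,200</td></tr>
</table>
"""

# Assumed labels that start a different section (not taken from the page).
OTHER_SECTIONS = {"NASDAQ"}

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", {"class": "mdcTable"})
header_row = table.find(
    lambda tag: tag.name == "tr"
    and tag.td is not None
    and tag.td.string == "NYSE")

nyse = {}
for row in header_row.find_next_siblings("tr"):
    texts = [td.get_text(strip=True) for td in row.find_all("td")]
    if len(texts) < 2:
        continue  # skip rows without a label and a value
    if texts[0] in OTHER_SECTIONS:
        break  # reached the next section, stop collecting
    nyse[texts[0]] = texts[1]  # label -> first data point

print(nyse["Advances"])  # prints 1,500 for the sample markup above
```

Storing the rows in a dict keyed by label means the other values (declines and so on) can be looked up the same way once the loop has run.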

Bye, Andreas


