Beautiful Soup Table Parsing

Thu Aug 9 01:43:51 EDT 2012

Tom Russell <tsrdatatech at gmail.com> writes:

> I am parsing out a web page at
> http://online.wsj.com/mdc/public/page/2_3021-tradingdiary2.html?mod=mdc_pastcalendar
> using BeautifulSoup.
>
> My problem is that I can parse into the table where the data I want
> resides but I cannot seem to figure out how to go about grabbing the
> contents of the cell next to my row header I want.
>
> For instance this code below:
>
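> # Imports assumed: Python 2 with BeautifulSoup 3
> from urllib2 import urlopen
> from BeautifulSoup import BeautifulSoup
>
> soup = BeautifulSoup(urlopen('http://online.wsj.com/mdc/public/page/2_3021-tradingdiary2.html?mod=mdc_pastcalendar'))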
>
> table = soup.find("table",{"class": "mdcTable"})
> for row in table.findAll("tr"):
>     for cell in row.findAll("td"):
>         print cell.findAll(text=True)
>
> brings in a list that looks like this:
>
> [u'NYSE']
> [u'Latest close']
> [u'Previous close']
> ...
>
> What I want to do is only be getting the data for NYSE and nothing
> else so I do not know if that's possible or not.

I am quite confident that it is possible (though I do not know
the details).

First thing to note: you can use the "break" statement to leave
a loop early. "break" only exits the innermost loop, so with a
nested loop you need a "break" on both levels. The outer loop's
"break" is best controlled by a variable that records "success".
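
A minimal sketch of the two-level "break", assuming the table has
already been reduced to lists of cell strings. The sample rows and the
name "find_cell" are invented for illustration and are not taken from
the WSJ page:

```python
# Placeholder rows standing in for the parsed table (invented values).
rows = [
    ["Markets Diary"],
    ["NYSE", "Latest close", "Previous close"],
    ["Nasdaq", "Latest close", "Previous close"],
]


def find_cell(rows, wanted):
    """Return (row_index, cell_index) of the first cell equal to `wanted`, or None."""
    found = None  # the "success" variable that controls the outer break
    for r, row in enumerate(rows):
        for c, cell in enumerate(row):
            if cell == wanted:
                found = (r, c)
                break  # leaves the inner loop only
        if found is not None:
            break  # leaves the outer loop once the inner loop succeeded
    return found


position = find_cell(rows, "NYSE")
print(position)
```

Wrapping the loops in a function and using "return" would avoid the
flag altogether; the flag version is shown because it also works when
the loops sit at module level.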

Second thing to note: the "BeautifulSoup" documentation describes
the return values of its methods. "find" and "findAll" return
"BeautifulSoup"'s own "Tag" objects, not "lxml" elements (although
"BeautifulSoup" 4 can optionally use "lxml" as its parser). The
"BeautifulSoup" documentation therefore tells you how to move from
a cell you have found to its neighbouring cells and parent row.
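
As a rough sketch of that navigation, assuming "BeautifulSoup" 4 (the
"bs4" package) and the standard library parser: the HTML below is an
invented stand-in for the real page, the numbers are placeholders, and
"cell_after" is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Invented sample markup; the real WSJ page layout may differ.
HTML = """
<table class="mdcTable">
  <tr><td>NYSE</td><td>Latest close</td><td>Previous close</td></tr>
  <tr><td>Issues traded</td><td>111</td><td>222</td></tr>
  <tr><td>Nasdaq</td><td>Latest close</td><td>Previous close</td></tr>
  <tr><td>Issues traded</td><td>333</td><td>444</td></tr>
</table>
"""


def cell_after(table, header):
    """Return the text of the <td> right after the first <td> whose text is `header`."""
    for cell in table.find_all("td"):
        if cell.get_text(strip=True) == header:
            # find_next_sibling("td") skips whitespace text nodes between cells
            neighbour = cell.find_next_sibling("td")
            return neighbour.get_text(strip=True) if neighbour is not None else None
    return None  # no cell with that text


soup = BeautifulSoup(HTML, "html.parser")
table = soup.find("table", {"class": "mdcTable"})
print(cell_after(table, "NYSE"))
```

"find_all" is the "BeautifulSoup" 4 spelling of "findAll". To restrict
the result to the NYSE block on the real page, you would probably start
collecting rows once the "NYSE" header row is seen and stop at the next
header row, but that depends on how the page is actually laid out.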