newb: BeautifulSoup

Fri Sep 21 03:38:24 EDT 2007

On Sep 20, 9:04 pm, crybaby <joemystery... at gmail.com> wrote:
> I need to traverse a html page with big table that has many row and
> columns.  For example, how to go 35th td tag and do regex to retrieve
> the content.  After that is done, you move down to 15th td tag from
> 35th tag (35+15) and do regex to retrieve the content?

1) You can find your table using one of these methods:

a)
target_table = soup.find('table', id='car_parts')

b)
tables = soup.findAll('table')
target_table = tables[2]

The tables are put in a list in the order that they appear on the
page. List indexes start at 0, so tables[2] is the third table.

2) You can get all the td's in the table using this statement:

all_tds = target_table.findAll('td')

3) You can get the contents of the tags using these statements
(indexes start at 0, so the 35th td is all_tds[34] and the 50th,
35+15, is all_tds[49]):

print all_tds[34].string
print all_tds[49].string

Here is an example:

from BeautifulSoup import BeautifulSoup

doc = """
<html>
    <head>
        <title></title>
    </head>
    <body>
        <table>
        </table>

        <table>
            <tr><td>hello</td></tr>
            <tr><td>world</td><td>goodbye</td></tr>
        </table>
    </body>
</html>
"""

soup = BeautifulSoup(doc)

tables = soup.findAll('table')
target_table = tables[1]

all_tds = target_table.findAll('td')
print all_tds[0].string
print all_tds[2].string

--output:--
hello
goodbye
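
For the regex part of the question, here is a minimal sketch using only
the standard re module. It assumes you have already pulled the cell text
out into a plain list of strings, e.g. with [td.string for td in all_tds].
The helper name match_cells, the sample cell_texts data, and the
four-digit pattern are all made up for illustration; substitute your own.

```python
import re

def match_cells(cell_texts, positions, pattern):
    """Map each 1-based cell position to the first regex match, or None."""
    regex = re.compile(pattern)
    results = {}
    for pos in positions:
        # Convert the 1-based position (35th td) to a 0-based index (34).
        # .string can be None when a td holds nested tags, so fall back to "".
        text = cell_texts[pos - 1] or ""
        m = regex.search(text)
        results[pos] = m.group(0) if m else None
    return results

# Hypothetical data: 50 cells standing in for [td.string for td in all_tds]
cell_texts = ["cell %d: part #%d" % (i, 1000 + i) for i in range(1, 51)]

# The 35th td, then 15 further down (35 + 15 = 50th td)
found = match_cells(cell_texts, [35, 35 + 15], r"\d{4}")
print(found[35])
print(found[50])
```

With the sample data above, this prints 1035 and then 1050. Keeping the
positions 1-based inside the helper means you can write 35 and 35 + 15
exactly as in the question and leave the off-by-one to a single line.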