BeautifulSoup and Problem Tables

Sun Sep 21 13:37:47 EDT 2008

On Sat, 20 Sep 2008 20:51:52 -0700 (PDT), academicedgar at gmail.com wrote:
[snip]
> from BeautifulSoup import BeautifulSoup
> bst=file(r"c:\bstest.htm").read()
> soup=BeautifulSoup(bst)
> rows=soup.findAll('tr')
> len(rows)
> a=len(rows[0].findAll('td'))
> b=len(rows[1].findAll('td'))
> c=len(rows[2].findAll('td'))
> d=len(rows[3].findAll('td'))
> e=len(rows[4].findAll('td'))
> f=len(rows[5].findAll('td'))
> g=len(rows[6].findAll('td'))
> h=len(rows[8].findAll('td'))
> i=len(rows[9].findAll('td'))
> j=len(rows[10].findAll('td'))
> k=rows[1].findAll('td')[1].contents[0]
[snip]
> However, I discovered that my tables have inconsistent numbers of
> rows.  
[snip]
> I have been Googling for some insight into this and I have not been
> successful finding anything. I would really appreciate any suggestions
> or some direction about how to better describe the problem.

Would it be accurate to describe the problem as wanting to
extract the contents of the cth column of the rth row of a
table in spite of various pathologies in the construction of
the table?

If so, it would help to post sample HTML, trimmed to the
minimum that still shows each pathology that must be handled.
I gotta confess, though, that it doesn't take many rowspans or
colspans to put this problem beyond my reach.
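
One way to pin down "row r, column c" when rowspans and colspans
are involved is to expand the table into a grid first, so that a
spanning cell fills every slot it covers.  Below is a rough
sketch of that idea, not something tested against real-world
pages.  It is Python 3 and uses the standard library's
html.parser instead of BeautifulSoup so it runs without extra
installs.  The names CellCollector, expand_grid and cell_at are
made up for this example, and it assumes a single table with no
nested tables.

```python
from html.parser import HTMLParser


def _span(value):
    # Treat a missing, non-numeric, or zero span as 1 (an assumption).
    return int(value) if value and value.isdigit() and int(value) > 0 else 1


class CellCollector(HTMLParser):
    """Collect (text, rowspan, colspan) for each td/th, grouped by tr."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # [text_parts, rowspan, colspan] while in a cell

    def _flush(self):
        # Store the pending cell, if any; this also covers a missing </td>.
        if self._cell is not None:
            parts, rowspan, colspan = self._cell
            self.rows[-1].append(("".join(parts).strip(), rowspan, colspan))
            self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._flush()
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._flush()
            a = dict(attrs)
            self._cell = [[], _span(a.get("rowspan")), _span(a.get("colspan"))]

    def handle_endtag(self, tag):
        if tag in ("td", "th", "tr", "table"):
            self._flush()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0].append(data)


def expand_grid(rows):
    """Map (row, col) -> text, repeating a cell over every slot it spans."""
    grid = {}
    for r, cells in enumerate(rows):
        c = 0
        for text, rowspan, colspan in cells:
            # Skip slots already claimed by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    return grid


def cell_at(html, r, c):
    """Return the text at grid position (r, c), or None if there is none."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    parser._flush()
    return expand_grid(parser.rows).get((r, c))


SAMPLE = """
<table>
<tr><td rowspan="2">A</td><td colspan="2">B</td></tr>
<tr><td>C</td><td>D</td></tr>
</table>
"""

print(cell_at(SAMPLE, 1, 1))  # C
```

In the sample, "A" spans two rows, so the first td of the second
tr lands in column 1 rather than column 0.  Counting td elements
per tr, as in the quoted code, would not show that.  The
expand_grid step only needs (text, rowspan, colspan) tuples, so
the same rows could presumably be built from BeautifulSoup tags
instead.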
