scraping nested tables with BeautifulSoup

Kent Johnson kent at kentsjohnson.com
Tue Apr 4 09:54:22 EDT 2006


Gonzillaaa at gmail.com wrote:
> I'm trying to get the data on the "Central London Property Price Guide"
> box at the left hand side of this page
> http://www.findaproperty.com/regi0018.html
> 
> I have managed to get the data :) but when I start looking for tables I
> only get tables of depth 1. How do I go about accessing inner tables?
> The same happens for links...
> 
> this is what I've got so far
> 
> import sys
> from urllib import urlopen
> from BeautifulSoup import BeautifulSoup
> 
> data = urlopen('http://www.findaproperty.com/regi0018.html').read()
> soup = BeautifulSoup(data)
> 
> for tables in soup('table'):
> 	table = tables('table')
> 	if not table: continue
> 	print table #this returns only 1 table

There's something fishy here. soup('table') should yield all the tables
in the document, even nested ones. For example, this program:

data = '''
<body>
      <table width='100%'>
          <tr><td>
              <TABLE WIDTH='150'>
                  <tr><td>Stuff</td></tr>
              </table>
          </td></tr>
      </table>
</body>
'''

from BeautifulSoup import BeautifulSoup as BS

soup = BS(data)
for table in soup('table'):
    print table.get('width')


prints:
100%
150

Another tidbit - if I open the page in Firefox and save it, then feed that 
saved file to BeautifulSoup, it finds 25 tables, and this code finds the 
table you want:

from BeautifulSoup import BeautifulSoup
data2 = open('regi0018-firefox.html')
soup = BeautifulSoup(data2)

print len(soup('table'))

# the second table with these attributes is the price guide box
priceGuide = soup('table', dict(bgcolor="#e0f0f8", border="0",
                                cellpadding="2", cellspacing="2", width="150"))[1]
print priceGuide.tr


prints:
25
<tr><td bgcolor="#e0f0f8" valign="top"><font face="Arial" 
size="2"><b>Central London Property Price Guide</b></font></td></tr>
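
If that is the right table, pulling the figures out of it is just a matter of 
walking its rows. A rough sketch (the cell layout of that box is an assumption 
on my part, so the joining and stripping may need adjusting):

for row in priceGuide('tr'):
    # calling a tag is shorthand for findAll, so td(text=True) collects the bare strings in the cell
    cells = [''.join(td(text=True)).strip() for td in row('td')]
    print cells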


Looking at the saved file, Firefox has clearly done some cleanup of the 
markup. So I think you have to look at why BS is not processing the original 
data the way you want - it seems to be choking on something in the raw page.
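
A crude way to check (just a sketch - it assumes the problem is BS silently 
dropping part of the document rather than mis-nesting it):

from urllib import urlopen
from BeautifulSoup import BeautifulSoup

data = urlopen('http://www.findaproperty.com/regi0018.html').read()
soup = BeautifulSoup(data)

# if part of the page was dropped, the parsed copy will be much shorter than the raw one
print len(data), len(str(soup))
# the tail of the parsed document shows roughly where parsing stopped
print str(soup)[-500:]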

Kent
