scraping nested tables with BeautifulSoup

Kent Johnson kent at kentsjohnson.com
Tue Apr 4 11:57:53 EDT 2006


Gonzillaaa at gmail.com wrote:
> Hey Kent,
> 
> thanks for your reply. how did you exactly save the file in firefox? if
> I save the file locally I get the same error.

I think I right-clicked on the page and chose "Save page as..."

Here is a program that shows where BS is choking. It finds the last leaf 
node in the parse tree by following the last child of each node, starting 
at the root:

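from urllib import urlopen
from BeautifulSoup import BeautifulSoup

# Fetch and parse the page
data = urlopen('http://www.findaproperty.com/regi0018.html').read()
soup = BeautifulSoup(data)

# Walk down the tree, always taking the last child, until reaching a leaf
tag = soup
while hasattr(tag, 'contents') and tag.contents:
    tag = tag.contents[-1]

# Show what kind of node the parse ended on, and its content
print type(tag)
print tag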


It prints:
<class 'BeautifulSoup.NavigableString'>

<!/BUTTONS>

<TABLE BORDER=0 CELLSPACING=0 CELLPADDING=2 WIDTH=100% BGCOLOR=F0F0F0>
<TD ALIGN=left VALIGN=top>
<snip lots more>

So for some reason BS treats everything from <!/BUTTONS> to the end of 
the document as a single string instead of parsing the tags inside it.
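
The same descent can be tried without fetching the page. This is only a 
minimal sketch: Node and last_leaf are hypothetical stand-ins, not part of 
BeautifulSoup, and the tree is made-up data. Only the `contents` attribute 
is modelled on the real parse tree.

```python
# Hypothetical stand-in for a parse-tree node; not a BeautifulSoup class.
class Node(object):
    def __init__(self, name, contents=None):
        self.name = name
        self.contents = contents or []

    def __repr__(self):
        return '<%s>' % self.name


def last_leaf(node):
    """Follow the last child at each level until reaching a leaf."""
    while hasattr(node, 'contents') and node.contents:
        node = node.contents[-1]
    return node


# Made-up tree. Plain strings have no `contents` attribute, so the
# loop stops as soon as it lands on one.
tree = Node('html', [
    Node('head', [Node('title', ['Listing'])]),
    Node('body', [Node('table'), 'trailing text swallowed as one string']),
])

leaf = last_leaf(tree)
print(type(leaf).__name__)
print(leaf)
```

If the leaf turns out to be one long string where tags were expected, that 
is the point where the parser stopped recognizing markup.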

Kent


