Why doesn't input code return 'plants' as in 'Getting Started with Beautiful Soup' text (on page 30) ?

Peter Otten __peter__ at web.de
Sun Jul 12 05:51:58 EDT 2015


Simon Evans wrote:

> Dear Mark Lawrence, thank you for your advice.
> I take it that I use the input you suggest for the line :
> 
> soup = BeautifulSoup("C:\Beautiful Soup\ecological_pyramid.html",lxml")
> 
> seeing as I have to give the file's full address I therefore have to
> modify your :
> 
> soup = BeautifulSoup(ecological_pyramid,"lxml")
> 
> to :
> 
> soup = BeautifulSoup("C:\Beautiful Soup\ecological_pyramid," "lxml")
> 
> otherwise I get :
> 
> 
>>>> with open("C:\Beautiful Soup\ecologicalpyramid.html"."r")as
>>>> ecological_pyramid: soup = BeautifulSoup(ecological_pyramid,"lxml")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> NameError: name 'ecological_pyramid' is not defined
> 
> 
> so anyway with the input therefore as:
> 
>>>> with open("C:\Beautiful Soup\ecologicalpyramid.html"."r")as
>>>> ecological_pyramid: soup = BeautifulSoup("C:\Beautiful
>>>> Soup\ecological_pyramid,","lxml") producer_entries = soup.find("ul")
>>>> print(producer_entries.li.div.string)

No. If you pass the filename, Beautiful Soup will mistake it for the HTML
itself. You can verify that in the interactive interpreter:

>>> soup = BeautifulSoup(r"C:\Beautiful Soup\ecologicalpyramid.html","lxml")
>>> soup
<html><body><p>C:\Beautiful Soup\ecologicalpyramid.html</p></body></html>

You have to pass an open file to BeautifulSoup, not a filename:

>>> with open(r"C:\Beautiful Soup\ecologicalpyramid.html","r") as f:
...     soup = BeautifulSoup(f, "lxml")
... 
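To make the difference concrete, here is a minimal sketch that parses the
same path both ways. It is only an illustration: the temporary file, the
shortened markup and the helper name parse_both_ways are made up for this
example, and "html.parser" is used instead of "lxml" so that it runs without
lxml installed.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Shortened stand-in for the book's file (assumption, not the real file).
HTML = """<html><body><ul id="producers">
<li class="producerlist"><div class="name">plants</div></li>
</ul></body></html>"""


def parse_both_ways(path):
    """Hypothetical helper: parse the path string and the file contents."""
    # The path string itself is treated as markup, not as a file to open.
    from_name = BeautifulSoup(path, "html.parser")
    # An open file object is read and its contents are parsed.
    with open(path, "r") as f:
        from_file = BeautifulSoup(f, "html.parser")
    return from_name, from_file


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ecologicalpyramid.html")
    with open(path, "w") as f:
        f.write(HTML)
    from_name, from_file = parse_both_ways(path)

print(from_name.find("ul"))        # no <ul>: only the path text was parsed
print(from_file.find("ul")["id"])  # the <ul> from the file contents
```

Recent versions of Beautiful Soup may also emit a warning when the markup
looks like a filename, which is another hint that the wrong thing was passed.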

However, if you look at the data returned by soup.find("ul") you'll see

>>> producer_entries = soup.find("ul")
>>> producer_entries
<ul id="producers">
<li class="producers">
</li><li class="producerlist">
<div class="name">plants</div>
<div class="number">100000</div>
</li>
<li class="producerlist">
<div class="name">algae</div>
<div class="number">100000</div>
</li>
</ul>

The first <li>...</li> node does not contain a div

>>> producer_entries.li
<li class="producers">
</li>

and thus

>>> producer_entries.li.div is None
True

so the AttributeError raised by producer_entries.li.div.string is expected
with the given data. Returning None is Beautiful Soup's way of indicating
that the <li> node contains no <div> at all. If you want to process the
first <li> that does contain a <div>, a straightforward way is to iterate
over the <li> nodes:

>>> for li in producer_entries.find_all("li"):
...     if li.div is not None:
...         print(li.div.string)
...         break # remove if you want all, not just the first
... 
plants
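The same loop can be wrapped in a small function if you need it more than
once. This is only a sketch: first_div_text is a made-up name, the markup is
copied from the output shown earlier, and "html.parser" stands in for "lxml"
so the example has no extra dependency.

```python
from bs4 import BeautifulSoup

# Markup copied from the soup.find("ul") output shown earlier.
MARKUP = """<ul id="producers">
<li class="producers">
</li><li class="producerlist">
<div class="name">plants</div>
<div class="number">100000</div>
</li>
<li class="producerlist">
<div class="name">algae</div>
<div class="number">100000</div>
</li>
</ul>"""


def first_div_text(ul):
    """Hypothetical helper: text of the first <div> in any <li>, or None."""
    for li in ul.find_all("li"):
        if li.div is not None:
            return str(li.div.string)
    return None


soup = BeautifulSoup(MARKUP, "html.parser")
producer_entries = soup.find("ul")

# The unguarded chain fails because the first <li> has no <div>,
# so .div is None and None has no attribute "string".
error = None
try:
    producer_entries.li.div.string
except AttributeError as exc:
    error = exc

first = first_div_text(producer_entries)
print(first)
```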

Taking a second look at the data, you probably want the <li> nodes with
class="producerlist":

>>> for li in soup.find_all("li", attrs={"class": "producerlist"}):
...     print(li.div.string)
... 
plants
algae
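If you also want the numbers, the class filter extends naturally. The
following is a sketch under the same assumptions as the transcript above:
the markup is copied from the earlier output, producer_table is a made-up
name, and "html.parser" replaces "lxml" only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Markup copied from the soup.find("ul") output shown earlier.
MARKUP = """<ul id="producers">
<li class="producers">
</li><li class="producerlist">
<div class="name">plants</div>
<div class="number">100000</div>
</li>
<li class="producerlist">
<div class="name">algae</div>
<div class="number">100000</div>
</li>
</ul>"""


def producer_table(soup):
    """Hypothetical helper: (name, number) pairs for each producerlist <li>."""
    rows = []
    for li in soup.find_all("li", class_="producerlist"):
        name = li.find("div", class_="name")
        number = li.find("div", class_="number")
        # Skip entries that lack either <div> instead of failing on None.
        if name is None or number is None:
            continue
        rows.append((str(name.string), int(number.string)))
    return rows


soup = BeautifulSoup(MARKUP, "html.parser")
rows = producer_table(soup)
print(rows)
```

class_="producerlist" is the keyword form of attrs={"class": "producerlist"};
both select the same nodes here.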




