[Tutor] BeautifulSoup - getting cells without new line characters

jonasmg at softhome.net jonasmg at softhome.net
Sat Apr 1 14:22:23 CEST 2006


Kent Johnson writes: 

> jonasmg at softhome.net wrote:
>> Kent Johnson writes:  
>> 
>> 
>>>jonasmg at softhome.net wrote:  
>>>
>>>
>>>>List of states:
>>>>http://en.wikipedia.org/wiki/U.S._state   
>>>>
>>>>: soup = BeautifulSoup(html)
>>>>: # Get the second table (list of states).
>>>>: table = soup.first('table').findNext('table')
>>>>: print table   
>>>>
>>>>...
>>>><tr>
>>>><td>WY</td>
>>>><td>Wyo.</td>
>>>><td><a href="/wiki/Wyoming" title="Wyoming">Wyoming</a></td>
>>>><td><a href="/wiki/Cheyenne%2C_Wyoming" title="Cheyenne, 
>>>>Wyoming">Cheyenne</a></td>
>>>><td><a href="/wiki/Cheyenne%2C_Wyoming" title="Cheyenne, 
>>>>Wyoming">Cheyenne</a></td>
>>>><td><a href="/wiki/Image:Flag_of_Wyoming.svg" class="image" title=""><img 
>>>>src="http://upload.wikimedia.org/wikipedia/commons/thumb/b/bc/Flag_of_Wyomin 
>>>>g.svg/45px-Flag_of_Wyoming.svg.png" width="45" alt="" height="30" 
>>>>longdesc="/wiki/Image:Flag_of_Wyoming.svg" /></a></td>
>>>></tr>
>>>></table>   
>>>>
>>>>Of each row (tr), I want to get the cells (td): 1,3,4 
>>>>(postal,state,capital). But cells 3 and 4 have anchors. 
>>>
>>>So dig into the cells and get the data from the anchor.  
>>>
>>>cells = row('td')
>>>cells[0].string
>>>cells[2].a.string
>>>cells[3].a.string  
>>>
>>>Kent  
>>>
>>>_______________________________________________
>>>Tutor maillist  -  Tutor at python.org
>>>http://mail.python.org/mailman/listinfo/tutor
>>  
>> 
>> for row in table('tr'):
>>    cells = row('td')
>>    print cells[0]  
>> 
>> IndexError: list index out of range 
> 
> It works for me: 
> 
> 
> In [1]: from BeautifulSoup import BeautifulSoup as bs 
> 
> In [2]: soup=bs('''<tr>
>     ...: <td>WY</td>
>     ...: <td>Wyo.</td>
>     ...: <td><a href="/wiki/Wyoming" title="Wyoming">Wyoming</a></td>
>     ...: <td><a href="/wiki/Cheyenne%2C_Wyoming" title="Cheyenne,
>     ...: Wyoming">Cheyenne</a></td>
>     ...: <td><a href="/wiki/Cheyenne%2C_Wyoming" title="Cheyenne,
>     ...: Wyoming">Cheyenne</a></td>
>     ...: <td><a href="/wiki/Image:Flag_of_Wyoming.svg" class="image" 
> title=""><img
>     ...: 
> src="http://upload.wikimedia.org/wikipedia/commons/thumb/b/bc/Flag_of_Wyomin
>     ...: g.svg/45px-Flag_of_Wyoming.svg.png" width="45" alt="" height="30"
>     ...: longdesc="/wiki/Image:Flag_of_Wyoming.svg" /></a></td>
>     ...: </tr>
>     ...: </table> '''
>     ...:
>     ...:
>     ...:
>     ...: ) 
> 
> In [18]: rows=soup('tr') 
> 
> In [19]: rows
> Out[19]:
> [<tr>
> <td>WY</td>
> <td>Wyo.</td>
> <td><a href="/wiki/Wyoming" title="Wyoming">Wyoming</a></td>
> <td><a href="/wiki/Cheyenne%2C_Wyoming" title="Cheyenne,
> Wyoming">Cheyenne</a></td>
> <td><a href="/wiki/Cheyenne%2C_Wyoming" title="Cheyenne,
> Wyoming">Cheyenne</a></td>
> <td><a href="/wiki/Image:Flag_of_Wyoming.svg" class="image" 
> title=""><img src="http://upload. 
> 
> g.svg/45px-Flag_of_Wyoming.svg.png" width="45" alt="" height="30" 
> longdesc="/wiki/Image:Flag_
> </tr>] 
> 
> In [21]: cells=rows[0]('td') 
> 
> In [22]: cells
> Out[22]:
> [<td>WY</td>,
>   <td>Wyo.</td>,
>   <td><a href="/wiki/Wyoming" title="Wyoming">Wyoming</a></td>,
>   <td><a href="/wiki/Cheyenne%2C_Wyoming" title="Cheyenne,
> Wyoming">Cheyenne</a></td>,
>   <td><a href="/wiki/Cheyenne%2C_Wyoming" title="Cheyenne,
> Wyoming">Cheyenne</a></td>,
>   <td><a href="/wiki/Image:Flag_of_Wyoming.svg" class="image" 
> title=""><img src="http://upload
> n
> g.svg/45px-Flag_of_Wyoming.svg.png" width="45" alt="" height="30" 
> longdesc="/wiki/Image:Flag_ 
> 
> In [23]: cells[0].string
> Out[23]: 'WY' 
> 
> In [24]: cells[2].a.string
> Out[24]: 'Wyoming' 
> 
> In [25]: cells[3].a.string 
> 
> 
> Kent 
> 

Yes, OK. But that way it is only possible to get data from one row (rows[0]):

cells=rows[0]('td') 

And I want to get data from all the rows. I have tried several 'for' statements,
but I cannot get it to work.
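
For reference, here is a minimal sketch of one way to loop over every row. It rests on assumptions not confirmed in this thread: it uses the newer bs4 package on Python 3 rather than the BeautifulSoup 3 import shown above, the sample HTML is a made-up stand-in for the Wikipedia table, and it guesses that the IndexError comes from rows (such as a header row of th cells) that contain no td cells at all. The names html and states are only illustrative.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the real page: one header row (th cells only)
# followed by two data rows shaped like the ones quoted above.
html = """
<table>
<tr><th>Postal</th><th>Abbr.</th><th>State</th><th>Capital</th></tr>
<tr>
<td>WI</td>
<td>Wis.</td>
<td><a href="/wiki/Wisconsin" title="Wisconsin">Wisconsin</a></td>
<td><a href="/wiki/Madison%2C_Wisconsin" title="Madison,
Wisconsin">Madison</a></td>
</tr>
<tr>
<td>WY</td>
<td>Wyo.</td>
<td><a href="/wiki/Wyoming" title="Wyoming">Wyoming</a></td>
<td><a href="/wiki/Cheyenne%2C_Wyoming" title="Cheyenne,
Wyoming">Cheyenne</a></td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

states = []
for row in table.find_all("tr"):
    cells = row.find_all("td")
    # Rows without enough td cells (for example a header row made of th
    # cells) would raise IndexError on cells[0], so skip them.
    if len(cells) < 4:
        continue
    # get_text(strip=True) returns the text inside the cell, including text
    # nested in an anchor, with surrounding whitespace and newlines removed.
    postal = cells[0].get_text(strip=True)
    state = cells[2].get_text(strip=True)
    capital = cells[3].get_text(strip=True)
    states.append((postal, state, capital))

for postal, state, capital in states:
    print(postal, state, capital)
```

If the length check is what was missing, the same guard should carry over to the older API used in this thread, i.e. testing len(row('td')) inside the loop before indexing into it.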

