[Tutor] beautifulsoup - getting an error when NavigableString object is returned
Clay Wiedemann
clay.wiedemann at gmail.com
Sun Mar 4 04:22:05 CET 2007
I wanted to extract the quotes from IMDB quote pages, just to start
learning Python. The quotes are not nested inside a container element,
so I grab the anchor tag that precedes each one. I thought I could then
walk along the following siblings until I hit an HR tag, collecting
speakers and quotes as I pass each <b> and <br>.
But once I tried to step from the anchor to its next sibling and read
the tag name, I kept getting a NavigableString instead of a tag, so
asking for its .name attribute raised an AttributeError.
Any idea why this might happen?
This is the relevant chunk of IMDB's HTML:
<a name="qt0210620"></a>
<b><a href="/name/nm0629454/">Bill</a></b>:
You're supposed to wear the blue dress when I wear this.
<br>
<b><a href="/name/nm0707043/">Mary</a></b>:
I don't want to dress like twins anymore.
<br>
<b><a href="/name/nm0629454/">Bill</a></b>:
We're not twins. We're a trio.
<br>
<hr width="30%">
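A likely cause, sketched against the chunk above: the line break between </a> and <b> is itself a node in the parse tree, so the anchor's next sibling is a string rather than a tag. The sketch below assumes the newer bs4 package with its next_sibling spelling and the "html.parser" backend (the script in this post uses the older BeautifulSoup 3 import and nextSibling); the variable names are only illustrative.

```python
import re

from bs4 import BeautifulSoup, NavigableString

# Shortened copy of the chunk quoted above.
html = """<a name="qt0210620"></a>
<b><a href="/name/nm0629454/">Bill</a></b>:
You're supposed to wear the blue dress when I wear this.
<br>
<hr width="30%">
"""

soup = BeautifulSoup(html, "html.parser")

# Same lookup idea as in the post: match on the HTML attribute called "name".
anchor = soup.find(attrs={"name": re.compile("^qt")})

# The very next node is the newline that follows </a>, not the <b> tag.
first = anchor.next_sibling
print(type(first).__name__, repr(first))

# Asking for the next <b> sibling by tag name steps over that string.
name_tag = anchor.find_next_sibling("b")
print(name_tag.name, name_tag.get_text())
```

If that is what is happening on the real page, checking the node type before reading .name, or asking for the next sibling by tag name, should avoid the error.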
---
And this is what I wrote (and if there are other awful things about
this, I would be happy to know):
#!/usr/bin/env python
import urllib2
from BeautifulSoup import BeautifulSoup
import re
# stubs --------------------------
movietitle_stub = "Nashville" # later: search and pull the first result (if it is a movie?)
movieurl_stub = "http://imdb.com/title/tt0073440/" #and get this
def soupifyPage(target):
    """
    grab html from a page
    probably need real method of checking for failure, huh
    """
    codeReq = urllib2.Request(target)
    response = urllib2.urlopen(codeReq)
    soupyhtml = BeautifulSoup(response)
    return soupyhtml
def pullQuote(curTag):
    # character is in bold
    print curTag.nextSibling.name
    '''
    if curTag.nextSibling.name == 'hr':
        #are done
        return quoteBlock
    print "seeing" + curTag.nextSibling.name
    quoteBlock = quoteBlock + " - " + curTag.nextSibling.name
    curTag = curTag.nextSibling
    '''
quotepage = movieurl_stub + "quotes"
print "Getting this:" + quotepage
print "---------------"
quotebag = soupifyPage(quotepage)
# each quote is preceded by an anchor link whose name begins with qt, for example <a name="qt0229419"></a>
# they end with an HR tag
# they are not nested
quotations = quotebag.findAll(attrs = {'name' : re.compile("^qt")})
for q in quotations:
    #pullQuote(q)
    print q.nextSibling.name # attribute error: "'NavigableString' object has no attribute 'name'"
    print "next!"
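One way the walk could look once string nodes are handled separately from tags. This is only a sketch: it assumes the newer bs4 package (not the BeautifulSoup 3 import used above) with the "html.parser" backend, it runs on the pasted chunk rather than the live page, and pull_quote is a hypothetical helper, not a finished version of pullQuote.

```python
import re

from bs4 import BeautifulSoup, Tag

# The chunk quoted earlier in this post.
html = """<a name="qt0210620"></a>
<b><a href="/name/nm0629454/">Bill</a></b>:
You're supposed to wear the blue dress when I wear this.
<br>
<b><a href="/name/nm0707043/">Mary</a></b>:
I don't want to dress like twins anymore.
<br>
<b><a href="/name/nm0629454/">Bill</a></b>:
We're not twins. We're a trio.
<br>
<hr width="30%">
"""


def pull_quote(anchor):
    """Collect (speaker, line) pairs from the anchor up to the next <hr>."""
    lines = []
    speaker, words = None, []
    for node in anchor.next_siblings:
        if isinstance(node, Tag):
            if node.name == "hr":
                break  # end of this quote block
            if node.name == "b":
                speaker = node.get_text(strip=True)  # speaker is in bold
            elif node.name == "br" and speaker is not None:
                lines.append((speaker, " ".join(words)))
                speaker, words = None, []
        else:
            # String node: drop whitespace and the colon that follows </b>.
            text = node.strip().lstrip(":").strip()
            if text:
                words.append(text)
    return lines


soup = BeautifulSoup(html, "html.parser")
quotes = []
for anchor in soup.find_all(attrs={"name": re.compile("^qt")}):
    quotes.extend(pull_quote(anchor))

for speaker, line in quotes:
    print(speaker + ": " + line)
```

The isinstance check is the part that matters here: only Tag nodes are asked for .name, and everything else is treated as text.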
Thanks,
Clay
- - - - - - -
Clay S. Wiedemann