using urllib2

Alexnb alexnbryan at gmail.com
Sun Jun 29 16:04:04 EDT 2008


Okay, now I ran it in the shell, and this is what happened:

>>> for tabs in soup.findAll('table', {'class': 'luna-Ent'}):
...     tabs.findAll('td')[-1].contents[-1].string
... 
u' '
u' '
u' '
u' '
u' '
u'not complex or compound; single. '
u' '
u' '
u' '
u' '
u' '
u'inconsequential or rudimentary. '
u'unlearned; ignorant. '
u' '
u'unsophisticated; naive; credulous. '
u' '
u'not mixed. '
u' '
u'not mixed. '
u' '
u' '
u' '
u' '
u'). '
u' '
u'(of a lens) having two optical surfaces only. '
u'an ignorant, foolish, or gullible person. '
u'something simple, unmixed, or uncompounded. '
u'cords for controlling the warp threads in forming the shed on draw-looms. '
u'a person of humble origins; commoner. '
u' '
>>> 

However, the definitions are there. I printed the actual soup and they were
there in the format they always were in. So what is the deal!?!

>>> soup.findAll('table', {'class': 'luna-Ent'})
[<table class="luna-Ent"><tr><td valign="top" class="dn">1.</td><td
valign="top">easy to understand, deal with, use, etc.: a simple matter;
simple tools.  </td></tr></table>

See, the first definition is there in the shell output, but the for loop
can't find it. I am wondering whether it matters that the
soup.findAll('table', ...) call above returns a list. Do you think that has
anything to do with the problem?
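
A possible explanation, sketched under assumptions rather than confirmed
against the live page: contents[-1] is only the last child node of the cell.
If a definition ends with an inline tag (the italic example phrase, say)
followed by a trailing space, the last child is just that space. The findAll
result being a list should not matter, since a for loop iterates over a list
without trouble. The sketch below uses Python 3's standard-library
html.parser instead of BeautifulSoup so that it runs on its own; the <i> tag
in SAMPLE and the names DefinitionParser, last_nodes and full_texts are made
up for illustration.

```python
from html.parser import HTMLParser


class DefinitionParser(HTMLParser):
    """Collect every text node of the last <td> in each 'luna-Ent' table."""

    def __init__(self):
        super().__init__()
        self.in_table = False  # inside a <table class="luna-Ent">?
        self.in_td = False     # inside a <td> of such a table?
        self.chunks = []       # text nodes of the current <td>
        self.cells = []        # one list of text nodes per <td> in the table
        self.results = []      # text nodes of the last <td>, one per table

    def handle_starttag(self, tag, attrs):
        if tag == "table" and dict(attrs).get("class") == "luna-Ent":
            self.in_table, self.cells = True, []
        elif tag == "td" and self.in_table:
            self.in_td, self.chunks = True, []

    def handle_data(self, data):
        if self.in_td:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self.in_td:
            self.in_td = False
            self.cells.append(self.chunks)
        elif tag == "table" and self.in_table:
            self.in_table = False
            if self.cells:
                self.results.append(self.cells[-1])


# Hypothetical markup: the <i> element is an assumption about the real page.
SAMPLE = (
    '<table class="luna-Ent"><tr><td valign="top" class="dn">1.</td>'
    '<td valign="top">easy to understand, deal with, use, etc.: '
    '<i>a simple matter; simple tools.</i> </td></tr></table>'
    '<table class="luna-Ent"><tr><td valign="top" class="dn">6.</td>'
    '<td valign="top">not complex or compound; single. </td></tr></table>'
)

parser = DefinitionParser()
parser.feed(SAMPLE)
parser.close()

# Only the last text node of each cell (what contents[-1] would look at).
last_nodes = [chunks[-1] for chunks in parser.results]
# Every text node of each cell, joined into the full definition.
full_texts = ["".join(chunks).strip() for chunks in parser.results]

print(last_nodes)
print(full_texts)
```

With BeautifulSoup itself, joining all text nodes of the cell, for example
u''.join(td.findAll(text=True)), would be the analogous change, though that
is untested against this page.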


Alexnb wrote:
> 
> Actually, after looking at this, the code is practically the same, except
> the definitions. So what COULD be going wrong here?
> 
> Also, I ran the program and decided to print the whole list of definitions
> straight off BeautifulSoup, and I got an interesting result:
> 
> What word would you like to define: simple
> [u' ', u' ', u' ', u' ', u' ', u'not complex or compound; single.
> 
> Those are the first 5 definitions, and later on it does the same thing:
> it only sees a space. Any ideas?
> 
> Alexnb wrote:
>> 
>> Okay, so I've hit a new snag and can't seem to figure out what is wrong.
>> What is happening is the first 4 definitions of the word "simple" don't
>> show up. The html is basically the same, with the exception of noun turning
>> into adj. I'll paste the html of the word cheese, and then the one for
>> simple, and the code I am using to do the work.
>> 
>> line of html for the 2nd def of cheese:
>> 
>> <table class="luna-Ent"><tr><td valign="top" class="dn">2.</td><td
>> valign="top">a definite mass of this substance, often in the shape of a
>> wheel or cylinder. </td></tr></table>
>> 
>> line of html for the 2nd def of simple:
>> 
>> <table class="luna-Ent"><tr><td valign="top" class="dn">2.</td><td
>> valign="top">not elaborate or artificial; plain: a simple style. 
>> </td></tr></table>
>> 
>> code:
>> 
>> import urllib
>> from BeautifulSoup import BeautifulSoup
>> 
>> 
>> def get_defs(term):
>>     url = 'http://dictionary.reference.com/search?q=%s' % term
>>     soup = BeautifulSoup(urllib.urlopen(url))
>> 
>>     for tabs in soup.findAll('table', {'class': 'luna-Ent'}):
>>         yield tabs.findAll('td')[-1].contents[-1].string
>> 
>> word = raw_input("What word would you like to define: ")
>> 
>> mainList = list(get_defs(word))
>> 
>> # Print each definition with a 1-based number.
>> for n, definition in enumerate(mainList):
>>     print str(n + 1) + ".  " + str(definition)
>> 
>> Now, I don't think it is the italics, because one of the definitions that
>> worked had them in it in the same format. Any ideas?
>> 
>> 
>> Jeff McNeil-2 wrote:
>>> 
>>> On Jun 29, 12:50 pm, Alexnb <alexnbr... at gmail.com> wrote:
>>>> No, I figured it out. I guess I never knew that you aren't supposed to
>>>> split a url like "http://www.goo\
>>>> gle.com". But I did, and it gave me all those errors. Anyway, I had a
>>>> question. In the original code you had this for loop:
>>>>
>>>> for tabs in soup.findAll('table', {'class': 'luna-Ent'}):
>>>>         yield tabs.findAll('td')[-1].contents[-1].string
>>>>
>>>> I hate to be a pain, but I was looking at the BeautifulSoup docs, and
>>>> found the findAll thing. But I want to know why you put "for tabs," and
>>>> also why you need the "'table', {'class': 'luna-Ent'}):" part. Like, why
>>>> the curly braces and whatnot?
>>>>
>>>> Jeff McNeil-2 wrote:
>>>>
>>>> > On Jun 27, 10:26 pm, Alexnb <alexnbr... at gmail.com> wrote:
>>>> >> Okay, so I copied your code (and just so you know, I am on a mac right
>>>> >> now and I am using pydev in eclipse), and I got these errors. Any idea
>>>> >> what is up?
>>>>
>>>> >> Traceback (most recent call last):
>>>> >>   File "/Users/Alex/Documents/workspace/beautifulSoup/src/firstExample.py", line 14, in <module>
>>>> >>     print list(get_defs("cheese"))
>>>> >>   File "/Users/Alex/Documents/workspace/beautifulSoup/src/firstExample.py", line 9, in get_defs
>>>> >>     dictionary.reference.com/search?q=%s' % term))
>>>> >>   File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib.py", line 82, in urlopen
>>>> >>     return opener.open(url)
>>>> >>   File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib.py", line 190, in open
>>>> >>     return getattr(self, name)(url)
>>>> >>   File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib.py", line 325, in open_http
>>>> >>     h.endheaders()
>>>> >>   File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/httplib.py", line 856, in endheaders
>>>> >>     self._send_output()
>>>> >>   File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/httplib.py", line 728, in _send_output
>>>> >>     self.send(msg)
>>>> >>   File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/httplib.py", line 695, in send
>>>> >>     self.connect()
>>>> >>   File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/httplib.py", line 663, in connect
>>>> >>     socket.SOCK_STREAM):
>>>> >> IOError: [Errno socket error] (8, 'nodename nor servname provided, or not known')
>>>>
>>>> >> Sorry if it is hard to read.
>>>>
>>>> >> Jeff McNeil-2 wrote:
>>>>
>>>> >> > Well, what about pulling that data out using Beautiful soup? If you
>>>> >> > know the table name and whatnot, try something like this:
>>>>
>>>> >> > #!/usr/bin/python
>>>>
>>>> >> > import urllib
>>>> >> > from BeautifulSoup import BeautifulSoup
>>>>
>>>> >> > def get_defs(term):
>>>> >> >     soup = BeautifulSoup(urllib.urlopen('http://
>>>> >> > dictionary.reference.com/search?q=%s' % term))
>>>>
>>>> >> >     for tabs in soup.findAll('table', {'class': 'luna-Ent'}):
>>>> >> >         yield tabs.findAll('td')[-1].contents[-1].string
>>>>
>>>> >> > print list(get_defs("frog"))
>>>>
>>>> >> > jeff at martian:~$ python test.py
>>>> >> > [u'any tailless, stout-bodied amphibian of the order Anura, including
>>>> >> > the smooth, moist-skinned frog species that live in a damp or
>>>> >> > semiaquatic habitat and the warty, drier-skinned toad species that are
>>>> >> > mostly terrestrial as adults. ', u' ', u' ', u'a French person or a
>>>> >> > person of French descent. ', u'a small holder made of heavy material,
>>>> >> > placed in a bowl or vase to hold flower stems in position. ', u'a
>>>> >> > recessed panel on one of the larger faces of a brick or the like. ',
>>>> >> > u' ', u'to hunt and catch frogs. ', u'French or Frenchlike. ', u'an
>>>> >> > ornamental fastening for the front of a coat, consisting of a button
>>>> >> > and a loop through which it passes. ', u'a sheath suspended from a
>>>> >> > belt and supporting a scabbard. ', u'a device at the intersection of
>>>> >> > two tracks to permit the wheels and flanges on one track to cross or
>>>> >> > branch from the other. ', u'a triangular mass of elastic, horny
>>>> >> > substance in the middle of the sole of the foot of a horse or related
>>>> >> > animal. ']
>>>>
>>>> >> > HTH,
>>>>
>>>> >> > Jeff
>>>>
>>>> >> > On Jun 27, 7:28 pm, Alexnb <alexnbr... at gmail.com> wrote:
>>>> >> >> I have read that multiple times. It is hard to understand, but it did
>>>> >> >> help a little. But I found a bit of a work-around for now, which is not
>>>> >> >> what I ultimately want. However, even when I can get to the page I
>>>> >> >> want, let's say "Http://dictionary.reference.com/browse/cheese", I look
>>>> >> >> in firebug, an extension, and see the definition in javascript,
>>>>
>>>> >> >> <table class="luna-Ent">
>>>> >> >> <tbody>
>>>> >> >> <tr>
>>>> >> >> <td class="dn" valign="top">1.</td>
>>>> >> >> <td valign="top">the curd of milk separated from the whey and prepared
>>>> >> >> in many ways as a food. </td>
>>>>
>>>> >> >> Jeff McNeil-2 wrote:
>>>>
>>>> >> >> > the problem being that if I use code like this to get the html of
>>>> >> >> > that page in python:
>>>>
>>>> >> >> > response = urllib2.urlopen("the website....")
>>>> >> >> > html = response.read()
>>>> >> >> > print html
>>>>
>>>> >> >> > then, I get a bunch of stuff, but it doesn't show me the code with
>>>> >> >> > the table that the definition is in. So I am asking how do I access
>>>> >> >> > this javascript. Also, if someone could point me to a better
>>>> >> >> > reference than the last one, because that really doesn't tell me
>>>> >> >> > much, whether it be a book or anything.
>>>>
>>>> >> >> > I stumbled across this a while back:
>>>> >> >> > http://www.voidspace.org.uk/python/articles/urllib2.shtml.
>>>> >> >> > It covers quite a bit. The urllib2 module is pretty straightforward
>>>> >> >> > once you've used it a few times.  Some of the class naming and
>>>> >> >> > whatnot takes a bit of getting used to (I found that to be the most
>>>> >> >> > confusing bit).
>>>>
>>>> >> >> > On Jun 27, 1:41 pm, Alexnb <alexnbr... at gmail.com> wrote:
>>>> >> >> >> Okay, I tried to follow that, and it is kinda hard. But since you
>>>> >> >> >> obviously know what you are doing, where did you learn this? Or
>>>> >> >> >> where can I learn this?
>>>>
>>>> >> >> >> Maric Michaud wrote:
>>>>
>>>> >> >> >> > On Friday 27 June 2008 10:43:06, Alexnb wrote:
>>>> >> >> >> >> I have never used the urllib or the urllib2. I really have looked
>>>> >> >> >> >> online for help on this issue, and mailing lists, but I can't
>>>> >> >> >> >> figure out my problem because people haven't been helping me,
>>>> >> >> >> >> which is why I am here! :].
>>>> >> >> >> >> Okay, so basically I want to be able to submit a word to
>>>> >> >> >> >> dictionary.com and then get the definitions. However, to start
>>>> >> >> >> >> off learning urllib2, I just want to do a simple google search.
>>>> >> >> >> >> Before you get mad, what I have found on urllib2 hasn't helped
>>>> >> >> >> >> me. Anyway, how would you go about doing this? No, I did not post
>>>> >> >> >> >> the html, but I mean if you want, right click on your browser and
>>>> >> >> >> >> hit view source of the google homepage. Basically what I want to
>>>> >> >> >> >> know is how to submit the values (the search term) and then
>>>> >> >> >> >> search for that value. Here's what I know:
>>>>
>>>> >> >> >> >> import urllib2
>>>> >> >> >> >> response = urllib2.urlopen("http://www.google.com/")
>>>> >> >> >> >> html = response.read()
>>>> >> >> >> >> print html
>>>>
>>>> >> >> >> >> Now I know that all this does is print the source, but that's
>>>> >> >> >> >> about all I know. I know it may be a lot to ask to have someone
>>>> >> >> >> >> show/help me, but I really would appreciate it.
>>>>
>>>> >> >> >> > This example is for google, of course using pygoogle is easier in
>>>> >> >> >> > this case, but this is a valid example for the general case :
>>>>
>>>> >> >> >> >>>>[207]: import urllib, urllib2
>>>>
>>>> >> >> >> > You need to trick the server with an imaginary User-Agent.
>>>>
>>>> >> >> >> >>>>[208]: def google_search(terms) :
>>>> >> >> >> >     return urllib2.urlopen(urllib2.Request(
>>>> >> >> >> >         "http://www.google.com/search?" + urllib.urlencode({'hl':'fr', 'q':terms}),
>>>> >> >> >> >         headers={'User-Agent':'MyNav 1.0 (compatible; MSIE 6.0; Linux'})
>>>> >> >> >> >     ).read()
>>>> >> >> >> >    .....:
>>>>
>>>> >> >> >> >>>>[212]: res = google_search("python & co")
>>>>
>>>> >> >> >> > Now that you have the whole html response, you'll have to parse it
>>>> >> >> >> > to recover the data; a quick & dirty try on the google response page :
>>>>
>>>> >> >> >> >>>>[213]: import re
>>>>
>>>> >> >> >> >>>>[214]: [ re.sub('<.+?>', '', e) for e in re.findall('<h2 class=r>.*?</h2>', res) ]
>>>> >> >> >> > ...[229]:
>>>> >> >> >> > ['Python Gallery',
>>>> >> >> >> >  'Coffret Monty Python And Co 3 DVD : La Premi\xe8re folie des Monty ...',
>>>> >> >> >> >  'Re: os x, panther, python & co: msg#00041',
>>>> >> >> >> >  'Re: os x, panther, python & co: msg#00040',
>>>> >> >> >> >  'Cardiff Web Site Design, Professional web site design services ...',
>>>> >> >> >> >  'Python Properties',
>>>> >> >> >> >  'Frees < Programs < Python < Bin-Co',
>>>> >> >> >> >  'Torb: an interface between Tcl and CORBA',
>>>> >> >> >> >  'Royal Python Morphs',
>>>> >> >> >> >  'Python & Co']
>>>>
>>>> >> >> >> > --
>>>> >> >> >> > _____________
>>>>
>>>> >> >> >> > Maric Michaud
>>>> >> >> >> > --
>>>> >> >> >> >http://mail.python.org/mailman/listinfo/python-list
>>>>
>>>> >> >> >> --
>>>> >> >> >> View this message in
>>>>
>>>> >> context:http://www.nabble.com/using-urllib2-tp18150669p18160312.html
>>>> >> >> >> Sent from the Python - python-list mailing list archive at
>>>> >> Nabble.com.
>>>>
>>> 
>>> The definitions were embedded in tables with a 'luna-Ent' class.  I
>>> pulled all of the tables with that class out, and then returned the
>>> string value of td containing the actual definition. The findAll
>>> method takes an optional dictionary, thus the {}.

-- 
View this message in context: http://www.nabble.com/using-urllib2-tp18150669p18184788.html
Sent from the Python - python-list mailing list archive at Nabble.com.



