using urllib2

Jeff McNeil jeff at jmcneil.net
Sun Jun 29 12:56:33 EDT 2008


On Jun 29, 12:50 pm, Alexnb <alexnbr... at gmail.com> wrote:
> No I figured it out. I guess I never knew that you aren't supposed to split a
> url like "http://www.goo\
> gle.com" But I did and it gave me all those errors. Anyway, I had a
> question. On the original code you had this for loop:
>
> for tabs in soup.findAll('table', {'class': 'luna-Ent'}):
>         yield tabs.findAll('td')[-1].contents[-1].string
>
> I hate to be a pain, but I was looking at the BeautifulSoup docs, and found
> the findAll thing. But I want to know why you put "for tabs," also why you
> need the "'table', {'class': 'luna-Ent'}):" Like why the curly braces and
> whatnot?
>
> Jeff McNeil-2 wrote:
>
> > On Jun 27, 10:26 pm, Alexnb <alexnbr... at gmail.com> wrote:
> >> Okay, so I copied your code(and just so you know I am on a mac right now
> >> and
> >> i am using pydev in eclipse), and I got these errors, any idea what is
> >> up?
>
> >> Traceback (most recent call last):
> >>   File
> >> "/Users/Alex/Documents/workspace/beautifulSoup/src/firstExample.py",
> >> line 14, in <module>
> >>     print list(get_defs("cheese"))
> >>   File
> >> "/Users/Alex/Documents/workspace/beautifulSoup/src/firstExample.py",
> >> line 9, in get_defs
> >>     dictionary.reference.com/search?q=%s' % term))
> >>   File
> >> "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib.py",
> >> line 82, in urlopen
> >>     return opener.open(url)
> >>   File
> >> "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib.py",
> >> line 190, in open
> >>     return getattr(self, name)(url)
> >>   File
> >> "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/urllib.py",
> >> line 325, in open_http
> >>     h.endheaders()
> >>   File
> >> "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/httplib.py",
> >> line 856, in endheaders
> >>     self._send_output()
> >>   File
> >> "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/httplib.py",
> >> line 728, in _send_output
> >>     self.send(msg)
> >>   File
> >> "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/httplib.py",
> >> line 695, in send
> >>     self.connect()
> >>   File
> >> "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/httplib.py",
> >> line 663, in connect
> >>     socket.SOCK_STREAM):
> >> IOError: [Errno socket error] (8, 'nodename nor servname provided, or not
> >> known')
>
> >> Sorry if it is hard to read.
>
> >> Jeff McNeil-2 wrote:
>
> >> > Well, what about pulling that data out using Beautiful soup? If you
> >> > know the table name and whatnot, try something like this:
>
> >> > #!/usr/bin/python
>
> >> > import urllib
> >> > from BeautifulSoup import BeautifulSoup
>
> >> > def get_defs(term):
> >> >     soup = BeautifulSoup(urllib.urlopen('http://
> >> > dictionary.reference.com/search?q=%s' % term))
>
> >> >     for tabs in soup.findAll('table', {'class': 'luna-Ent'}):
> >> >         yield tabs.findAll('td')[-1].contents[-1].string
>
> >> > print list(get_defs("frog"))
>
> >> > jeff at martian:~$ python test.py
> >> > [u'any tailless, stout-bodied amphibian of the order Anura, including
> >> > the smooth, moist-skinned frog species that live in a damp or
> >> > semiaquatic habitat and the warty, drier-skinned toad species that are
> >> > mostly terrestrial as adults. ', u' ', u' ', u'a French person or a
> >> > person of French descent. ', u'a small holder made of heavy material,
> >> > placed in a bowl or vase to hold flower stems in position. ', u'a
> >> > recessed panel on one of the larger faces of a brick or the like. ',
> >> > u' ', u'to hunt and catch frogs. ', u'French or Frenchlike. ', u'an
> >> > ornamental fastening for the front of a coat, consisting of a button
> >> > and a loop through which it passes. ', u'a sheath suspended from a
> >> > belt and supporting a scabbard. ', u'a device at the intersection of
> >> > two tracks to permit the wheels and flanges on one track to cross or
> >> > branch from the other. ', u'a triangular mass of elastic, horny
> >> > substance in the middle of the sole of the foot of a horse or related
> >> > animal. ']
>
> >> > HTH,
>
> >> > Jeff
>
> >> > On Jun 27, 7:28 pm, Alexnb <alexnbr... at gmail.com> wrote:
> >> >> I have read that multiple times. It is hard to understand but it did
> >> help
> >> >> a
> >> >> little. But I found a bit of a work-around for now which is not what I
> >> >> ultimately want. However, even when I can get to the page I want lets
> >> >> say,
> >> >> "Http://dictionary.reference.com/browse/cheese", I look on firebug,
> >> and
> >> >> extension and see the definition in javascript,
>
> >> >> <table class="luna-Ent">
> >> >> <tbody>
> >> >> <tr>
> >> >> <td class="dn" valign="top">1.</td>
> >> >> <td valign="top">the curd of milk separated from the whey and prepared
> >> in
> >> >> many ways as a food. </td>
>
> >> >> Jeff McNeil-2 wrote:
>
> >> >> > the problem being that if I use code like this to get the html of
> >> that
>
> >> >> > page in python:
>
> >> >> > response = urllib2.urlopen("the website....")
> >> >> > html = response.read()
> >> >> > print html
>
> >> >> > then, I get a bunch of stuff, but it doesn't show me the code with
> >> the
> >> >> > table that the definition is in. So I am asking how do I access this
> >> >> > javascript. Also, if someone could point me to a better reference
> >> than
> >> >> the
> >> >> > last one, because that really doesn't tell me much, whether it be a
> >> >> book
> >> >> > or anything.
>
> >> >> > I stumbled across this a while back:
> >> >> >http://www.voidspace.org.uk/python/articles/urllib2.shtml.
> >> >> > It covers quite a bit. The urllib2 module is pretty straightforward
> >> >> > once you've used it a few times.  Some of the class naming and
> >> whatnot
> >> >> > takes a bit of getting used to (I found that to be the most
> >> confusing
> >> >> > bit).
>
> >> >> > On Jun 27, 1:41 pm, Alexnb <alexnbr... at gmail.com> wrote:
> >> >> >> Okay, I tried to follow that, and it is kinda hard. But since you
> >> >> >> obviously
> >> >> >> know what you are doing, where did you learn this? Or where can I
> >> >> learn
> >> >> >> this?
>
> >> >> >> Maric Michaud wrote:
>
> >> >> >> > On Friday 27 June 2008 10:43:06 Alexnb, you wrote:
> >> >> >> >> I have never used the urllib or the urllib2. I really have
> >> looked
> >> >> >> online
> >> >> >> >> for help on this issue, and mailing lists, but I can't figure
> >> out
> >> >> my
> >> >> >> >> problem because people haven't been helping me, which is why I
> >> am
> >> >> >> here!
> >> >> >> >> :].
> >> >> >> >> Okay, so basically I want to be able to submit a word to
> >> >> >> dictionary.com
> >> >> >> >> and
> >> >> >> >> then get the definitions. However, to start off learning
> >> urllib2, I
> >> >> >> just
> >> >> >> >> want to do a simple google search. Before you get mad, what I
> >> have
> >> >> >> found
> >> >> >> >> on
> >> >> >> >> urllib2 hasn't helped me. Anyway, How would you go about doing
> >> >> this.
> >> >> >> No,
> >> >> >> >> I
> >> >> >> >> did not post the html, but I mean if you want, right click on
> >> your
> >> >> >> >> browser
> >> >> >> >> and hit view source of the google homepage. Basically what I
> >> want
> >> >> to
> >> >> >> know
> >> >> >> >> is how to submit the values(the search term) and then search for
> >> >> that
> >> >> >> >> value. Heres what I know:
>
> >> >> >> >> import urllib2
> >> >> >> >> response = urllib2.urlopen("http://www.google.com/")
> >> >> >> >> html = response.read()
> >> >> >> >> print html
>
> >> >> >> >> Now I know that all this does is print the source, but thats
> >> about
> >> >> all
> >> >> >> I
> >> >> >> >> know. I know it may be a lot to ask to have someone show/help
> >> me,
> >> >> but
> >> >> >> I
> >> >> >> >> really would appreciate it.
>
> >> >> >> > This example is for google, of course using pygoogle is easier in
> >> >> this
> >> >> >> > case,
> >> >> >> > but this is a valid example for the general case :
>
> >> >> >> >>>>[207]: import urllib, urllib2
>
> >> >> >> > You need to trick the server with an imaginary User-Agent.
>
> >> >> >> >>>>[208]: def google_search(terms) :
> >> >> >> >     return urllib2.urlopen(urllib2.Request(
> >> >> >> >         "http://www.google.com/search?"
> >> >> >> >         + urllib.urlencode({'hl':'fr', 'q':terms}),
> >> >> >> >         headers={'User-Agent':
> >> >> >> >                  'MyNav 1.0 (compatible; MSIE 6.0; Linux'})
> >> >> >> >     ).read()
> >> >> >> >    .....:
>
> >> >> >> >>>>[212]: res = google_search("python & co")
>
> >> >> >> > Now you got the whole html response, you'll have to parse it to
> >> >> recover
> >> >> >> > datas,
> >> >> >> > a quick & dirty try on google response page :
>
> >> >> >> >>>>[213]: import re
>
> >> >> >> >>>>[214]: [ re.sub('<.+?>', '', e)
> >> >> >> >            for e in re.findall('<h2 class=r>.*?</h2>', res) ]
> >> >> >> > ...[229]:
> >> >> >> > ['Python Gallery',
> >> >> >> >  'Coffret Monty Python And Co 3 DVD : La Premi\xe8re folie des Monty ...',
> >> >> >> >  'Re: os x, panther, python & co: msg#00041',
> >> >> >> >  'Re: os x, panther, python & co: msg#00040',
> >> >> >> >  'Cardiff Web Site Design, Professional web site design services ...',
> >> >> >> >  'Python Properties',
> >> >> >> >  'Frees < Programs < Python < Bin-Co',
> >> >> >> >  'Torb: an interface between Tcl and CORBA',
> >> >> >> >  'Royal Python Morphs',
> >> >> >> >  'Python & Co']
>
> >> >> >> > --
> >> >> >> > _____________
>
> >> >> >> > Maric Michaud
> >> >> >> > --
> >> >> >> >http://mail.python.org/mailman/listinfo/python-list
>
> >> >> >> --
> >> >> >> View this message in
>
> >> context:http://www.nabble.com/using-urllib2-tp18150669p18160312.html
> >> >> >> Sent from the Python - python-list mailing list archive at
> >> Nabble.com.
>
> >> >> > --
> >> >> >http://mail.python.org/mailman/listinfo/python-list
>
> >> >> --
> >> >> View this message in
> >> >> context:http://www.nabble.com/using-urllib2-tp18150669p18165634.html
> >> >> Sent from the Python - python-list mailing list archive at Nabble.com.
>
> >> > --
> >> >http://mail.python.org/mailman/listinfo/python-list
>

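The definitions were embedded in tables with a 'luna-Ent' class.  The
for loop walks over every table with that class: findAll returns a
list of them, and "tabs" is just the name I gave to the current one.
For each table I yield the string value of the last td, which holds
the actual definition.  The first argument to findAll is the tag name;
the second is an optional dictionary mapping attribute names to the
values they must have, thus the {}.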
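
If it helps to see what that loop is doing without BeautifulSoup in the
way, here is a rough sketch of the same idea using only the standard
library's html.parser (Python 3). It is not how BeautifulSoup works
internally, just an illustration of "match a tag name plus an attribute
dictionary, then yield the last cell". The LastCellCollector class and
the SAMPLE_HTML string are made up for this example; the first table
comes from the markup quoted above, the second table is invented, and
nested tables are not handled.

```python
from html.parser import HTMLParser

# Sample markup: the first table is taken from the thread; the "other"
# table is invented to show a non-matching class being skipped.
SAMPLE_HTML = """
<table class="luna-Ent"><tbody><tr>
<td class="dn" valign="top">1.</td>
<td valign="top">the curd of milk separated from the whey and prepared in many ways as a food. </td>
</tr></tbody></table>
<table class="other"><tr><td>not a definition</td></tr></table>
"""


class LastCellCollector(HTMLParser):
    """Hypothetical helper: text of the last <td> in each matching tag."""

    def __init__(self, name, attrs):
        super().__init__()
        self.name = name        # tag name to match, e.g. 'table'
        self.wanted = attrs     # attribute filter, e.g. {'class': 'luna-Ent'}
        self.in_match = False   # currently inside a matching tag?
        self.in_cell = False    # currently inside a <td> of that tag?
        self.cells = []         # td texts of the current matching tag
        self.results = []       # last td text of each matching tag

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # The dictionary says: every listed attribute must have this value.
        if tag == self.name and all(
            attrs.get(key) == value for key, value in self.wanted.items()
        ):
            self.in_match = True
            self.cells = []
        elif tag == "td" and self.in_match:
            self.in_cell = True
            self.cells.append("")

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False
        elif tag == self.name and self.in_match:
            self.in_match = False
            if self.cells:
                self.results.append(self.cells[-1])


def get_defs(html):
    """Yield one definition per matching table, like the loop in the thread."""
    parser = LastCellCollector("table", {"class": "luna-Ent"})
    parser.feed(html)
    parser.close()
    for text in parser.results:
        yield text


print(list(get_defs(SAMPLE_HTML)))
```

Swapping the dictionary for {'class': 'other'} would pick the second
table instead, which is all the {} argument is doing in findAll.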


