using urllib2

Jeff McNeil jeff at jmcneil.net
Fri Jun 27 21:22:42 EDT 2008


Well, what about pulling that data out using BeautifulSoup? If you
know the table's class and whatnot, try something like this:

#!/usr/bin/python

import urllib
from BeautifulSoup import BeautifulSoup


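def get_defs(term):
    # Build the search URL; quote_plus() escapes spaces and other
    # special characters in the term.
    url = 'http://dictionary.reference.com/search?q=%s' % urllib.quote_plus(term)
    soup = BeautifulSoup(urllib.urlopen(url))

    # Each definition sits in the last cell of a 'luna-Ent' table.
    for tabs in soup.findAll('table', {'class': 'luna-Ent'}):
        yield tabs.findAll('td')[-1].contents[-1].string

print list(get_defs("frog"))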

jeff at martian:~$ python test.py
[u'any tailless, stout-bodied amphibian of the order Anura, including
the smooth, moist-skinned frog species that live in a damp or
semiaquatic habitat and the warty, drier-skinned toad species that are
mostly terrestrial as adults. ', u' ', u' ', u'a French person or a
person of French descent. ', u'a small holder made of heavy material,
placed in a bowl or vase to hold flower stems in position. ', u'a
recessed panel on one of the larger faces of a brick or the like. ',
u' ', u'to hunt and catch frogs. ', u'French or Frenchlike. ', u'an
ornamental fastening for the front of a coat, consisting of a button
and a loop through which it passes. ', u'a sheath suspended from a
belt and supporting a scabbard. ', u'a device at the intersection of
two tracks to permit the wheels and flanges on one track to cross or
branch from the other. ', u'a triangular mass of elastic, horny
substance in the middle of the sole of the foot of a horse or related
animal. ']
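
If BeautifulSoup isn't available, roughly the same idea can be sketched with
only the standard library. This is a Python 3 sketch, not the code above: it
swaps BeautifulSoup for html.parser, it runs on a made-up sample string
instead of fetching the live page, and the names LastCellParser,
get_defs_from_html and SAMPLE_HTML are hypothetical. It also keeps all of the
text in the last cell rather than only its last child node, and it assumes the
tables are not nested.

```python
from html.parser import HTMLParser

# Made-up sample markup, modeled on the table quoted in this thread.
# The second definition is invented for illustration.
SAMPLE_HTML = """
<table class="luna-Ent"><tbody><tr>
<td class="dn" valign="top">1.</td>
<td valign="top">the curd of milk separated from the whey and prepared in many ways as a food. </td>
</tr></tbody></table>
<table class="other"><tr><td>ignored</td></tr></table>
<table class="luna-Ent"><tr>
<td class="dn" valign="top">2.</td>
<td valign="top">any substance of similar shape or consistency. </td>
</tr></table>
"""


class LastCellParser(HTMLParser):
    """Collect the text of the last <td> in each <table class="luna-Ent">."""

    def __init__(self):
        super().__init__()
        self.definitions = []
        self._in_table = False   # inside a matching table?
        self._in_cell = False    # inside a <td> of that table?
        self._cell_text = []     # text pieces of the current cell
        self._last_cell = None   # most recently completed cell in the table

    def handle_starttag(self, tag, attrs):
        # Exact match on the class attribute; tables with other classes are skipped.
        if tag == "table" and dict(attrs).get("class") == "luna-Ent":
            self._in_table = True
            self._last_cell = None
        elif tag == "td" and self._in_table:
            self._in_cell = True
            self._cell_text = []

    def handle_data(self, data):
        if self._in_cell:
            self._cell_text.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._in_cell:
            # Remember this cell; a later cell in the same table replaces it.
            self._in_cell = False
            self._last_cell = "".join(self._cell_text).strip()
        elif tag == "table" and self._in_table:
            # End of a matching table: keep whatever cell came last.
            self._in_table = False
            if self._last_cell is not None:
                self.definitions.append(self._last_cell)


def get_defs_from_html(html):
    parser = LastCellParser()
    parser.feed(html)
    parser.close()
    return parser.definitions


print(get_defs_from_html(SAMPLE_HTML))
```

To try it against a real page you would pass the fetched HTML (decoded to a
string) to get_defs_from_html() instead of the sample.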

HTH,

Jeff

On Jun 27, 7:28 pm, Alexnb <alexnbr... at gmail.com> wrote:
> I have read that multiple times. It is hard to understand but it did help a
> little. But I found a bit of a work-around for now which is not what I
> ultimately want. However, even when I can get to the page I want, let's say
> "http://dictionary.reference.com/browse/cheese", I look in Firebug, an
> extension, and see the definition in javascript,
>
> <table class="luna-Ent">
> <tbody>
> <tr>
> <td class="dn" valign="top">1.</td>
> <td valign="top">the curd of milk separated from the whey and prepared in
> many ways as a food. </td>
>
>
>
> Jeff McNeil-2 wrote:
>
> > the problem being that if I use code like this to get the html of that
> > page in python:
>
> > response = urllib2.urlopen("the webiste....")
> > html = response.read()
> > print html
>
> > then, I get a bunch of stuff, but it doesn't show me the code with the
> > table that the definition is in. So I am asking how do I access this
> > javascript. Also, if someone could point me to a better reference than the
> > last one, because that really doesn't tell me much, whether it be a book
> > or anything.
>
> > I stumbled across this a while back:
> >http://www.voidspace.org.uk/python/articles/urllib2.shtml.
> > It covers quite a bit. The urllib2 module is pretty straightforward
> > once you've used it a few times.  Some of the class naming and whatnot
> > takes a bit of getting used to (I found that to be the most confusing
> > bit).
>
> > On Jun 27, 1:41 pm, Alexnb <alexnbr... at gmail.com> wrote:
> >> Okay, I tried to follow that, and it is kinda hard. But since you
> >> obviously know what you are doing, where did you learn this? Or where
> >> can I learn this?
>
> >> Maric Michaud wrote:
>
> >> > On Friday 27 June 2008 10:43:06, Alexnb wrote:
> >> >> I have never used the urllib or the urllib2. I really have looked online
> >> >> for help on this issue, and mailing lists, but I can't figure out my
> >> >> problem because people haven't been helping me, which is why I am here!
> >> >> :].
> >> >> Okay, so basically I want to be able to submit a word to dictionary.com
> >> >> and then get the definitions. However, to start off learning urllib2, I
> >> >> just want to do a simple google search. Before you get mad, what I have
> >> >> found on urllib2 hasn't helped me. Anyway, How would you go about doing
> >> >> this. No, I did not post the html, but I mean if you want, right click on
> >> >> your browser and hit view source of the google homepage. Basically what I
> >> >> want to know is how to submit the values(the search term) and then search
> >> >> for that value. Heres what I know:
>
> >> >> import urllib2
> >> >> response = urllib2.urlopen("http://www.google.com/")
> >> >> html = response.read()
> >> >> print html
>
> >> >> Now I know that all this does is print the source, but thats about all I
> >> >> know. I know it may be a lot to ask to have someone show/help me, but I
> >> >> really would appreciate it.
>
> >> > This example is for google, of course using pygoogle is easier in this
> >> > case, but this is a valid example for the general case :
>
> >> >>>>[207]: import urllib, urllib2
>
> >> > You need to trick the server with an imaginary User-Agent.
>
> >> >>>>[208]: def google_search(terms) :
> >> >     return urllib2.urlopen(urllib2.Request(
> >> >         "http://www.google.com/search?" + urllib.urlencode({'hl':'fr', 'q':terms}),
> >> >         headers={'User-Agent':'MyNav 1.0 (compatible; MSIE 6.0; Linux'})
> >> >     ).read()
> >> >    .....:
>
> >> >>>>[212]: res = google_search("python & co")
>
> >> > Now you got the whole html response, you'll have to parse it to recover
> >> > datas, a quick & dirty try on google response page :
>
> >> >>>>[213]: import re
>
> >> >>>>[214]: [ re.sub('<.+?>', '', e) for e in re.findall('<h2 class=r>.*?</h2>', res) ]
> >> > ...[229]:
> >> > ['Python Gallery',
> >> >  'Coffret Monty Python And Co 3 DVD : La Premi\xe8re folie des Monty ...',
> >> >  'Re: os x, panther, python & co: msg#00041',
> >> >  'Re: os x, panther, python & co: msg#00040',
> >> >  'Cardiff Web Site Design, Professional web site design services ...',
> >> >  'Python Properties',
> >> >  'Frees < Programs < Python < Bin-Co',
> >> >  'Torb: an interface between Tcl and CORBA',
> >> >  'Royal Python Morphs',
> >> >  'Python & Co']
>
> >> > --
> >> > _____________
>
> >> > Maric Michaud
> >> > --
> >> >http://mail.python.org/mailman/listinfo/python-list
>
> >> --
> >> View this message in
> >> context:http://www.nabble.com/using-urllib2-tp18150669p18160312.html
> >> Sent from the Python - python-list mailing list archive at Nabble.com.
>
> --
> View this message in context:http://www.nabble.com/using-urllib2-tp18150669p18165634.html
> Sent from the Python - python-list mailing list archive at Nabble.com.




