Suitable Python code to scrape specific details from web pages.

Peter Pearson ppearson at nowhere.invalid
Tue Aug 12 20:50:55 EDT 2014


On Tue, 12 Aug 2014 15:44:58 -0700 (PDT), Simon Evans wrote:
[snip]
> Dear Programmers, Thank you for your responses. I have installed
> 'Beautiful Soup' and I have the 'Getting Started in Beautiful Soup'
> book, but can't seem to make any progress with it, I am too thick to
> make much use of it. I was hoping I could scrape specified stuff off
> Web pages without using it.

I've only used BeautifulSoup a little and am no expert, but
it lets you do wonderfully complex things with simple code.
Perhaps you can find some examples online; this newsgroup sometimes
has awesome demonstrations of BS prowess.

At the risk of embarrassing myself in public, I'll show you some
code I wrote that scrapes data from a web page containing a
description of a drug.  The drug's web page contains the desired
data in tags that look like this:

<input id="form-widgets-minconcentration" name="form.widgets.minconcentration"
class="text-widget float-field" value="1.0" type="text" />

The following code finds all these tags and builds a dict that lets
you look up the "value" for any given "name".

    import urllib2
    from BeautifulSoup import BeautifulSoup as BS  # BeautifulSoup 3, Python 2
    ...

    def dump_drug_data(url):
        """Fetch data from one drug's URL and print selected fields in columns.
        """
        contents = urllib2.urlopen(url=url).read()
        soup = BS(contents)
        inputs = soup.findAll("input")
        input_dict = dict((i.get("name"), i.get("value")) for i in inputs)
        print(" ".join(f.format(input_dict[n]) for f, n in (
                    ("{0:5s}", "form.widgets.absorption_halflife"),
                    ("{0:5s}", "form.widgets.elimination_halflife"),
                    ("{0:5s}", "form.widgets.minconcentration"),
                    ("{0:5s}", "form.widgets.maxconcentration"),
                    ("{0:13s}", "form.widgets.title"),
                    )))
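Since you mentioned wanting to scrape without BeautifulSoup at all:
for this narrow job (pulling name/value pairs out of <input> tags) the
standard library's html.parser module is enough on its own.  A minimal
sketch, Python 3 this time, with a made-up class name InputCollector:

    from html.parser import HTMLParser

    class InputCollector(HTMLParser):
        """Collect a name -> value dict for every <input> tag seen."""
        def __init__(self):
            super().__init__()
            self.input_dict = {}

        def handle_starttag(self, tag, attrs):
            # attrs arrives as a list of (name, value) pairs
            if tag == "input":
                d = dict(attrs)
                self.input_dict[d.get("name")] = d.get("value")

    html = ('<input id="form-widgets-minconcentration" '
            'name="form.widgets.minconcentration" '
            'class="text-widget float-field" value="1.0" type="text" />')
    parser = InputCollector()
    parser.feed(html)
    print(parser.input_dict["form.widgets.minconcentration"])  # prints 1.0

You give up BeautifulSoup's forgiving parsing and its search methods,
though, so I'd still reach for BS on any page messier than this.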

Try giving a more specific picture of your quest, and it's very
likely that people smarter than me will give you good help.

-- 
To email me, substitute nowhere->spamcop, invalid->net.
