[Tutor] Retrieve data
Steven D'Aprano
steve at pearwood.info
Wed Apr 13 01:03:27 CEST 2011
l.leichtnam at gmail.com wrote:
> Hello everyone,
>
> I would like to retrieve data, and especially the temperature and the weather from http://www.nytimes.com/weather. And I don't know how to do so.
Consider whether the NY Times terms and conditions permit such automated
scraping of their web site.
Be careful you do not abuse their hospitality by hammering their web
site unnecessarily (say, by checking the weather eighty times a minute).
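
One way to avoid hammering a site is to cache each page and only re-download it once the cached copy is old enough. What follows is only a sketch: the class name ThrottledFetcher, the ten-minute default and the injectable fetch and clock arguments are my own illustrative choices, not anything the NY Times or the standard library prescribes.

```python
import time


class ThrottledFetcher:
    """Cache each URL's body and refuse to re-fetch it too often.

    `fetch` is any callable taking a URL and returning the page body.
    It is passed in so the throttling logic can be tried out without
    touching the network (hypothetical design, for illustration only).
    """

    def __init__(self, fetch, min_interval=600.0, clock=time.monotonic):
        self._fetch = fetch
        self._min_interval = min_interval  # seconds between real downloads
        self._clock = clock
        self._cache = {}  # url -> (time of last download, body)

    def get(self, url):
        now = self._clock()
        cached = self._cache.get(url)
        if cached is not None and now - cached[0] < self._min_interval:
            # Cached copy is still fresh: do not hit the server again.
            return cached[1]
        body = self._fetch(url)
        self._cache[url] = (now, body)
        return body
```

Weather does not change minute to minute, so an interval of ten minutes or more loses nothing and keeps you a polite guest.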
Consider whether they have a public API for downloading data directly.
If so, use that. Otherwise:
Use the urllib2 and urllib modules to download the raw HTML source of the
page you are interested in. You may need to use them to login, to set
cookies, set the referer [sic], submit data via forms, change the
user-agent... it's a PITA. Better to use an API if the web site offers one.
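
To give a feel for the header and cookie handling, here is a minimal sketch. It assumes Python 3, where the functionality of urllib2 lives in urllib.request and cookielib became http.cookiejar. The helper names make_opener and make_request and the user-agent string are made up for illustration. Nothing is sent over the network until you call opener.open(request).

```python
import http.cookiejar
import urllib.request


def make_opener():
    """Return an opener that remembers cookies between requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def make_request(url, user_agent, referer=None):
    """Build (but do not send) a request with custom headers."""
    headers = {"User-Agent": user_agent}
    if referer is not None:
        # The header really is spelled "Referer" in HTTP.
        headers["Referer"] = referer
    return urllib.request.Request(url, headers=headers)


opener, jar = make_opener()
request = make_request(
    "http://www.nytimes.com/weather",
    user_agent="weather-fetcher/0.1 (hypothetical example)",
    referer="http://www.nytimes.com/",
)
# To actually download the page you would then do something like:
#     html = opener.open(request).read().decode("utf-8", "replace")
```

Logging in or submitting forms means building POST data with urllib.parse.urlencode and passing it as the request's data argument, which is where the real pain starts.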
Use the htmllib module to parse the source looking for the information
you are after. If their HTML is crap, as it so often is with commercial
websites that should know better, download and install BeautifulSoup,
and use that for parsing the HTML.
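
As a rough illustration of the parsing step, here is a sketch using html.parser.HTMLParser, the Python 3 standard-library parser (htmllib does not exist in Python 3). The sample HTML and the class names "temperature" and "conditions" are invented. The real page will use different markup, so look at its source first.

```python
from html.parser import HTMLParser

# Invented markup standing in for the downloaded page.
SAMPLE_HTML = """
<html><body>
<div class="forecast">
  <span class="temperature">61&deg;F</span>
  <span class="conditions">Partly Cloudy</span>
</div>
</body></html>
"""


class WeatherParser(HTMLParser):
    """Collect the text of <span> elements whose class we care about."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.current = None  # class of the span we are inside, if any
        self.found = {}

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in self.wanted:
            self.current = css_class
            self.found[css_class] = ""

    def handle_data(self, data):
        if self.current is not None:
            self.found[self.current] += data

    def handle_endtag(self, tag):
        if tag == "span":
            self.current = None


parser = WeatherParser(["temperature", "conditions"])
parser.feed(SAMPLE_HTML)
parser.close()
weather = parser.found
```

After this runs, weather maps each class name to the text found inside the matching span, with entities such as &deg; already decoded by the parser.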
Don't be tempted to use regexes for parsing the HTML. That is the wrong
solution. Regexes *seem* like a good idea for parsing HTML, and for
simple tasks they are quick to program, but they invariably end up being
ten times as much work as a proper HTML parser.
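
To see how quickly a regex goes wrong, compare it with a parser on two pieces of markup that mean the same thing to a browser. This is a contrived sketch: both snippets and the class name "temperature" are invented, and it uses Python 3's html.parser.

```python
import re
from html.parser import HTMLParser

# Two equivalent ways a site might write the same element.
VARIANT_A = '<span class="temperature">61</span>'
VARIANT_B = "<span id='now' class='temperature'>61</span>"

# A typical "quick" regex, written against VARIANT_A only.
PATTERN = re.compile(r'<span class="temperature">(.*?)</span>')


def regex_extract(html):
    match = PATTERN.search(html)
    return match.group(1) if match else None


class TemperatureParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.inside = False
        self.text = None

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("class") == "temperature":
            self.inside = True
            self.text = ""

    def handle_data(self, data):
        if self.inside:
            self.text += data

    def handle_endtag(self, tag):
        if tag == "span":
            self.inside = False


def parser_extract(html):
    parser = TemperatureParser()
    parser.feed(html)
    parser.close()
    return parser.text
```

The regex finds the temperature in VARIANT_A but returns None for VARIANT_B, because an extra attribute and a change of quote style are enough to break it. The parser returns the same text for both. Every such variation you patch into the regex is more work, and the site can change its markup again tomorrow.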
If the content you are after requires JavaScript, you're probably out of
luck.
--
Steven