[Tutor] Retrieve data

Steven D'Aprano steve at pearwood.info
Wed Apr 13 01:03:27 CEST 2011


l.leichtnam at gmail.com wrote:
> Hello everyone,
> 
> I would like to retrieve data, especially the temperature and the weather, from http://www.nytimes.com/weather, and I don't know how to do so.

Consider whether the NY Times terms and conditions permit such automated 
scraping of their web site.

Be careful you do not abuse their hospitality by hammering their web 
site unnecessarily (say, by checking the weather eighty times a minute).
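
One way to avoid hammering the site is to cache the last result and 
refuse to download again until some minimum interval has passed. Here is 
a minimal sketch of that idea. The class name ThrottledFetcher, the 
ten-minute default and the injectable clock are all my own assumptions 
for illustration, not anything the NY Times requires or documents.

```python
import time


class ThrottledFetcher:
    """Cache the last result; refetch at most once per min_interval seconds."""

    def __init__(self, fetch, min_interval=600.0, clock=time.monotonic):
        self._fetch = fetch                # zero-argument callable doing the real download
        self._min_interval = min_interval  # assumed default: ten minutes
        self._clock = clock                # injectable so it can be tested without waiting
        self._last_time = None
        self._last_value = None

    def get(self):
        now = self._clock()
        stale = (self._last_time is None
                 or now - self._last_time >= self._min_interval)
        if stale:
            # Only touch the network when the cached copy is old enough.
            self._last_value = self._fetch()
            self._last_time = now
        return self._last_value
```

Wrap whatever function does the download in this and call get() as often 
as you like; the web site only sees one request per interval.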

Consider whether they have a public API for downloading data directly. 
If so, use that. Otherwise:

Use the urllib2 and urllib modules to download the raw HTML source of the 
page you are interested in. You may need to use them to log in, to set 
cookies, set the referer [sic], submit data via forms, change the 
user-agent... it's a PITA. Better to use an API if the web site offers one.
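
A rough sketch of the download step follows. It assumes Python 3, where 
urllib2 and urllib were reorganised into urllib.request (in Python 2 the 
equivalent calls are urllib2.Request and urllib2.urlopen). The function 
names build_request and fetch_html and the user-agent string are made up 
for illustration, and whether the site will accept such a request is 
untested.

```python
import urllib.request


def build_request(url, user_agent="weather-tutorial/0.1", referer=None):
    """Build a GET request with a custom User-Agent and optional Referer."""
    headers = {"User-Agent": user_agent}   # hypothetical user-agent string
    if referer is not None:
        headers["Referer"] = referer       # the HTTP header really is spelled this way
    return urllib.request.Request(url, headers=headers)


def fetch_html(url, timeout=10):
    """Download a page and return its source as text."""
    request = build_request(url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 if the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Usage (performs a real network request, so it is left commented out):
# html = fetch_html("http://www.nytimes.com/weather")
```

Logging in, cookies and form submission need more machinery (for 
example http.cookiejar and urllib.parse.urlencode) and are not shown here.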

Use the htmllib module to parse the source looking for the information 
you are after. If their HTML is crap, as it so often is with commercial 
websites that should know better, download and install BeautifulSoup, 
and use that for parsing the HTML.
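
To give a feel for the parsing step, here is a small sketch using 
html.parser from the Python 3 standard library (the successor to the 
Python 2 htmllib and HTMLParser modules). The sample markup and the class 
names "temp" and "conditions" are invented for this example; I have not 
checked what the real page looks like, so you will need to inspect its 
source and adjust.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text inside every element carrying a given class.

    Limitation: unclosed void tags such as <br> inside a matching
    element would confuse the depth counter.
    """

    def __init__(self, wanted_class):
        super().__init__()
        self._wanted = wanted_class
        self._depth = 0      # > 0 while inside a matching element
        self._buffer = []
        self.results = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            self._depth += 1             # nested tag inside a match
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._wanted in classes:
            self._depth = 1

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:         # the matching element just closed
                self.results.append("".join(self._buffer).strip())
                self._buffer = []

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def extract(html, wanted_class):
    parser = ClassTextExtractor(wanted_class)
    parser.feed(html)
    parser.close()
    return parser.results


# Invented sample markup, standing in for the downloaded page.
SAMPLE = ('<div><span class="temp">72F</span> '
          '<span class="conditions big">Partly <b>cloudy</b></span></div>')
```

With this, extract(SAMPLE, "temp") picks out the temperature text and 
extract(SAMPLE, "conditions") the description, regardless of attribute 
order or nested tags.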

Don't be tempted to use regexes for parsing the HTML. That is the wrong 
solution. Regexes *seem* like a good idea for parsing HTML, and for 
simple tasks they are quick to program, but they invariably end up being 
ten times as much work as a proper HTML parser.
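
A tiny illustration of why regexes are brittle here. The markup is 
invented for the example; the point is only that a harmless change in 
quoting or attributes, which any HTML parser shrugs off, silently breaks 
the pattern.

```python
import re

# A "quick" pattern written against one exact spelling of the markup.
TEMP_RE = re.compile(r'<span class="temp">(\d+)</span>')

# Two hypothetical versions of the same page fragment.
original = '<span class="temp">72</span>'
restyled = "<span id='now' class='temp big'>72</span>"


def regex_temp(html):
    """Return the temperature digits, or None if the pattern misses."""
    match = TEMP_RE.search(html)
    return match.group(1) if match else None
```

regex_temp(original) finds the number, but regex_temp(restyled) returns 
None even though a browser renders both fragments identically. Patching 
the pattern for every such variation is where the extra work comes from.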

If the content you are after is generated by JavaScript in the browser, 
it will not be in the HTML you download, and you're probably out of luck.


-- 
Steven


