[Baypiggies] web scraping best practice question

Andrew Dalke dalke at dalkescientific.com
Tue Nov 3 23:04:12 CET 2009


On Nov 2, 2009, at 10:24 PM, Dennis Reinhardt wrote:
> 1) Save the pages you access so that if you need to re-parse, you  
> have a local copy ... or you hit an error and need to reacquire.

For one project, what I did for this was set up a Squid reverse proxy
and configure it to keep all pages for a few hours. That way I could
test nearly everything, including HTTP error codes, without having to
write a separate file I/O interface and without hitting the remote
server hard while I was debugging things. The only change in my code
was setting http_proxy.
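
A minimal sketch of the http_proxy side of that (not my original
code; it assumes Python 3's urllib and that Squid is listening on its
default port, 3128):

    import os
    import urllib.request

    # Route all urllib requests through the local Squid cache.
    # localhost:3128 is Squid's default port (adjust to your setup).
    os.environ["http_proxy"] = "http://localhost:3128"

    # urllib.request picks up the http_proxy environment variable when
    # it builds its default opener, so the fetch code itself is
    # unchanged.
    with urllib.request.urlopen("http://example.com/page.html") as response:
        html = response.read()
        print(response.status, len(html), "bytes")

Repeated runs while debugging then hit Squid's cache instead of the
remote server, as long as the cached copies have not expired.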


				Andrew
				dalke at dalkescientific.com



