[Baypiggies] web scraping best practice question

Asim Jalis asimjalis at gmail.com
Mon Nov 2 20:28:54 CET 2009


Your script should also check robots.txt.
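For example, here is a minimal sketch using the standard library's
robotparser (urllib.robotparser in Python 3; the host, path, and
user-agent string below are placeholders, and crawl_delay needs
Python 3.6+):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")   # placeholder host
rp.read()

agent = "my-scraper"   # hypothetical user-agent string
if not rp.can_fetch(agent, "http://example.com/some/page"):
    raise SystemExit("robots.txt disallows fetching this path")

# robots.txt can also suggest a request rate (Python 3.6+):
delay = rp.crawl_delay(agent)   # None if there is no Crawl-delay line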

Also, if you are scraping every few seconds, it is possible that the
site will eventually notice and ban your IP. Or they might not. I
guess it's a risk you take.

If you want to reduce this risk, you can contact them directly and ask
them if what you are doing is okay. If your scraping ends up sending
traffic to the site they might be happy to let you scrape. Or they
might have an API that they could point you to. Or they might even pay
you.

Asim

On Mon, Nov 2, 2009 at 11:22 AM, Isaac <hyperneato at gmail.com> wrote:
> Hello Baypiggies.
>
> I wrote a Python script to send a query to a single website. I am
> curious: what is the best practice for the rate of sending requests
> when scraping a single site? I'll have about 4000 requests.
> I thought about _politely_ writing:
>
> import random
> from time import sleep
>
> for x in large_query_list:
>     send_scrape_query(x)
>     t = random.randint(1, 5)
>     sleep(t)  # pause 1-5 seconds between requests
>
> to pause for a pseudo-random duration between each request, so I don't
> put strain on anyone's network. Does anyone have recommendations for
> best practices regarding the rate of sending a set of queries? I missed
> the talk about web scraping from the beginning of the year.
>
> -Isaac
> _______________________________________________
> Baypiggies mailing list
> Baypiggies at python.org
> To change your subscription options or unsubscribe:
> http://mail.python.org/mailman/listinfo/baypiggies
>


More information about the Baypiggies mailing list