Rate limiting a web crawler

Wed Dec 26 13:30:40 EST 2018

On 12/26/18 10:35 AM, Simon Connah wrote:
> Hi,
>
> I want to build a simple web crawler. I know how I am going to do it
> but I have one problem.
>
> Obviously I don't want to negatively impact any of the websites that I
> am crawling so I want to implement some form of rate limiting of HTTP
> requests to specific domain names.
>
> What I'd like is some form of timer which calls a piece of code say
> every 5 seconds or something and that code is what goes off and crawls
> the website.
>
> I'm just not sure on the best way to call code based on a timer.
>
> Could anyone offer some advice on the best way to do this? It will be
> running on Linux and using the python-daemon library to run it as a
> service and will be using at least Python 3.6.
>
> Thanks for any help.

One big piece of information that would help in replies would be an
indication of scale. Is your application crawling just a few sites, so
that you need to pause between accesses to keep the hit rate down, or
are you crawling a large number of sites, so that if you have to delay
crawling a page from one site, you can go off and crawl another in the
meantime?
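
To make the difference between the two cases concrete, here is a rough
sketch of the second one: track, per domain, the earliest time it may be
hit again, and pick whichever pending URL is ready. With only one site
it degenerates into a plain pause between requests. The names
DomainRateLimiter, pick_ready, crawl and fetch are made up for
illustration, and the 5 second interval is just the figure from the
question, not a recommendation.

```python
import time
from urllib.parse import urlparse


class DomainRateLimiter:
    """Remembers the earliest time each domain may be requested again."""

    def __init__(self, min_interval=5.0, clock=time.monotonic):
        self.min_interval = min_interval  # seconds between hits per domain
        self.clock = clock                # injectable so it can be tested
        self.next_allowed = {}            # domain -> earliest allowed time

    def wait_time(self, url):
        """Seconds to wait before this URL's domain may be hit again."""
        domain = urlparse(url).netloc
        if domain not in self.next_allowed:
            return 0.0
        return max(0.0, self.next_allowed[domain] - self.clock())

    def record(self, url):
        """Note that this URL's domain is being requested now."""
        domain = urlparse(url).netloc
        self.next_allowed[domain] = self.clock() + self.min_interval


def pick_ready(limiter, pending):
    """Remove and return the first pending URL whose domain is ready."""
    for i, url in enumerate(pending):
        if limiter.wait_time(url) == 0.0:
            return pending.pop(i)
    return None


def crawl(urls, fetch, limiter, sleep=time.sleep):
    """Fetch every URL, never hitting one domain faster than allowed.

    fetch is a hypothetical callable that does the actual HTTP request.
    """
    pending = list(urls)
    while pending:
        url = pick_ready(limiter, pending)
        if url is None:
            # Every remaining domain is cooling down: sleep until the
            # soonest one becomes available, then look again.
            sleep(min(limiter.wait_time(u) for u in pending))
            continue
        limiter.record(url)
        fetch(url)
```

Calling crawl(urls, fetch, DomainRateLimiter(5.0)) runs in a single
thread with no timer callback at all. Whether that is adequate depends
on the scale question above; a crawler covering many sites would
probably want concurrent fetches on top of the same bookkeeping.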

-- 
Richard Damon