Rate limiting a web crawler

Simon Connah scopensource at gmail.com
Wed Dec 26 14:32:11 EST 2018


On 26/12/2018 18:30, Richard Damon wrote:
> On 12/26/18 10:35 AM, Simon Connah wrote:
>> Hi,
>>
>> I want to build a simple web crawler. I know how I am going to do it
>> but I have one problem.
>>
>> Obviously I don't want to negatively impact any of the websites that I
>> am crawling so I want to implement some form of rate limiting of HTTP
>> requests to specific domain names.
>>
>> What I'd like is some form of timer which calls a piece of code say
>> every 5 seconds or something and that code is what goes off and crawls
>> the website.
>>
>> I'm just not sure on the best way to call code based on a timer.
>>
>> Could anyone offer some advice on the best way to do this? It will be
>> running on Linux and using the python-daemon library to run it as a
>> service and will be using at least Python 3.6.
>>
>> Thanks for any help.
> 
> One big piece of information that would help in replies would be an
> indication of scale. Is your application crawling just a few sites, so
> that you need to pause between accesses to keep the hit rate down, or
> are you calling a number of sites, so that if you are going to delay
> crawling a page from one site, you can go off and crawl another in the
> mean time?
> 

Sorry. I should have stated that.

This is for a minimum viable product, so crawling, say, two or three 
domain names would be enough to start with, but I'd want to grow in 
the future.

I'm building this on AWS, and my idea was to have each web crawler 
instance query a database (DynamoDB) and fetch, say, 10 URLs. If a URL 
hasn't been crawled in the previous 12 to 24 hours it gets recrawled; 
if it has been crawled within that window it is skipped. Once a URL 
has been crawled I would then save the crawl date and time back in the 
database.
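
Roughly, the check could look something like this with boto3 (the 
"crawl_state" table and the attribute names below are placeholders, 
not a finished schema):

import time
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("crawl_state")  # hypothetical table name

RECRAWL_AFTER = 12 * 60 * 60  # seconds; tune between 12 and 24 hours

def should_crawl(url):
    """True if the URL was never crawled or its last crawl is stale."""
    item = table.get_item(Key={"url": url}).get("Item")
    if item is None:
        return True
    # DynamoDB returns numbers as Decimal, so convert before comparing
    return time.time() - float(item["last_crawled"]) > RECRAWL_AFTER

def mark_crawled(url):
    """Record the crawl time once the URL has been fetched."""
    table.put_item(Item={"url": url, "last_crawled": int(time.time())})

Each crawler would just filter its batch of 10 URLs through 
should_crawl() and call mark_crawled() after a successful fetch.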

Doing it that way I could skip the whole timing thing on the daemon 
end and just use database queries to control whether a URL is crawled 
or not. Of course, that would mean that one web crawler would have to 
"lock" a domain name so that multiple instances do not query the same 
domain name in parallel, which would be bad.
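
One way to do that locking, assuming it also lives in DynamoDB, would 
be a conditional write: a crawler claims a domain by inserting a lock 
item that may only be created if it doesn't already exist. The 
"domain_locks" table, attribute names and TTL below are illustrative 
only:

import time
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
locks = dynamodb.Table("domain_locks")  # hypothetical table name

LOCK_TTL = 300  # seconds; expire stale locks (e.g. via a DynamoDB TTL)

def acquire_lock(domain, crawler_id):
    """Claim a domain; fails if another crawler already holds it."""
    try:
        locks.put_item(
            Item={"domain": domain,
                  "owner": crawler_id,
                  "expires": int(time.time()) + LOCK_TTL},
            ConditionExpression="attribute_not_exists(#d)",
            ExpressionAttributeNames={"#d": "domain"},
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # another instance holds this domain
        raise

def release_lock(domain):
    locks.delete_item(Key={"domain": domain})

A crawler would only fetch URLs from domains it managed to lock, and 
release (or let expire) the lock when it is done with that domain.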


