Rate limiting a web crawler

Simon Connah scopensource at gmail.com
Wed Dec 26 14:34:07 EST 2018


On 26/12/2018 19:04, Terry Reedy wrote:
> On 12/26/2018 10:35 AM, Simon Connah wrote:
>> Hi,
>>
>> I want to build a simple web crawler. I know how I am going to do it 
>> but I have one problem.
>>
>> Obviously I don't want to negatively impact any of the websites that I 
>> am crawling so I want to implement some form of rate limiting of HTTP 
>> requests to specific domain names.
>>
>> What I'd like is some form of timer which calls a piece of code say 
>> every 5 seconds or something and that code is what goes off and crawls 
>> the website.
>>
>> I'm just not sure on the best way to call code based on a timer.
>>
>> Could anyone offer some advice on the best way to do this? It will be 
>> running on Linux and using the python-daemon library to run it as a 
>> service and will be using at least Python 3.6.
> 
> You can use asyncio to make repeated non-blocking requests to a web site 
> at timed intervals and to work with multiple websites at once.  You can 
> do the same with tkinter, except that requests would block until a 
> response arrives unless you implemented your own polling.
> 

Thank you. I'll look into asyncio.
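
For reference, here is a rough, standard-library-only sketch of the approach 
Terry describes: one asyncio task per domain, with an asyncio.sleep() between 
requests so each domain is hit at most once every few seconds. The URL list, 
interval, and fetch function are placeholders, not a finished crawler, and it 
sticks to Python 3.6-compatible calls.

import asyncio
import urllib.request

# Hypothetical start URLs grouped by domain; a real crawler would feed
# these from its own queue.
URLS_BY_DOMAIN = {
    "example.com": ["https://example.com/", "https://example.com/about"],
    "example.org": ["https://example.org/"],
}

REQUEST_INTERVAL = 5  # seconds between requests to the same domain


def fetch(url):
    """Blocking fetch; run in a thread so it does not stall the event loop."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()


async def crawl_domain(domain, urls):
    """Crawl one domain, pausing REQUEST_INTERVAL seconds between requests."""
    loop = asyncio.get_event_loop()
    for url in urls:
        body = await loop.run_in_executor(None, fetch, url)
        print(f"{domain}: fetched {url} ({len(body)} bytes)")
        await asyncio.sleep(REQUEST_INTERVAL)


async def main():
    # One task per domain: domains are crawled concurrently, but each
    # individual domain only sees one request every REQUEST_INTERVAL seconds.
    tasks = [crawl_domain(d, u) for d, u in URLS_BY_DOMAIN.items()]
    await asyncio.gather(*tasks)


if __name__ == "__main__":
    asyncio.get_event_loop().run_until_complete(main())

The same structure would work with a non-blocking HTTP client instead of the
run_in_executor() call, but the per-domain sleep is what actually enforces the
rate limit.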


