urllib2 rate limiting

Dimitrios Apostolou jimis at gmx.net
Thu Jan 10 12:17:47 EST 2008


Hello list,

I want to limit the download speed when using urllib2. In particular, 
having several parallel downloads, I want to make sure that their total 
speed doesn't exceed a maximum value.

I can't find a simple way to achieve this. After some research I can 
think of a few approaches, but I'm stuck on the details:

1) Can I override some method in socket.py to achieve this, and perhaps 
make it generic enough to work with libraries other than urllib2? (A 
rough sketch of what I mean follows this list.)

2) There is the urllib.urlretrieve() function, which accepts a reporthook 
parameter. Perhaps I can have the reporthook increment a global counter 
and sleep as necessary when a threshold is exceeded (second sketch below).
However, there is nothing similar in urllib2. Isn't urllib2 supposed 
to be a superset of urllib in functionality? Why is there no reporthook 
parameter in any of urllib2's functions?
Moreover, even the existing reporthook interface doesn't seem quite 
right: reporthook(blocknum, bs, size) is always called with bs=8K, even 
for the last block, and (blocknum*bs > size) is possible if the 
server sends a wrong Content-Length HTTP header.

3) Perhaps I can call filehandle.read(1024) myself and fetch the data in 
as many small chunks as I need, sleeping between them (third sketch 
below). However, I suspect this would be inefficient, and I'm not sure 
how it would interact with urllib2's internal buffering.
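
For (1), here is a rough, untested sketch of the monkey-patching I have 
in mind. It assumes Python 2's socket.py internals (the _sock attribute 
and the _fileobject class), and MAX_BPS is a knob I made up; the counter 
is global so that all sockets together stay under the cap:

    import socket
    import threading
    import time

    MAX_BPS = 64 * 1024          # total cap in bytes/sec, across all sockets
    _RealSocket = socket.socket  # the original class, saved before patching

    _lock = threading.Lock()
    _start = time.time()
    _received = 0

    def _throttle(nbytes):
        # Account for nbytes globally, then sleep until the average
        # rate since _start drops back under MAX_BPS.
        global _received
        _lock.acquire()
        try:
            _received += nbytes
            allowed = _start + _received / float(MAX_BPS)
        finally:
            _lock.release()
        delay = allowed - time.time()
        if delay > 0:
            time.sleep(delay)

    class ThrottledSocket(_RealSocket):
        def __init__(self, *args, **kwargs):
            _RealSocket.__init__(self, *args, **kwargs)
            # _socketobject.__init__ binds recv directly to the C-level
            # socket, so rebind it here or our version is never called.
            self.recv = self._recv

        def _recv(self, bufsize, flags=0):
            data = self._sock.recv(bufsize, flags)
            _throttle(len(data))
            return data

        def makefile(self, mode='r', bufsize=-1):
            # httplib reads through makefile(); hand the file object
            # ourselves rather than _sock so reads stay throttled.
            return socket._fileobject(self, mode, bufsize)

    socket.socket = ThrottledSocket  # urllib2/httplib sockets now throttle

Of course this depends on undocumented socket.py internals, so it is 
fragile, and sockets created before the patch are unaffected.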
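For (2), with plain urllib the reporthook might look like this (MAX_BPS 
is again my made-up knob, and the byte count relies on blocknum*bs, 
which as mentioned is only approximate for the last block):

    import time
    import urllib

    MAX_BPS = 64 * 1024   # cap in bytes per second
    start = time.time()

    def reporthook(blocknum, bs, size):
        # urlretrieve always passes bs=8192, so blocknum*bs only
        # approximates the true byte count on the final block.
        downloaded = blocknum * bs
        allowed = start + downloaded / float(MAX_BPS)
        delay = allowed - time.time()
        if delay > 0:
            time.sleep(delay)

    urllib.urlretrieve('http://example.com/big.iso', 'big.iso', reporthook)

For several parallel downloads the counter would have to be shared 
between threads and protected by a lock.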
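And for (3), this is the manual read loop I am imagining, for a single 
download (fetch_throttled is just an illustrative name; capping the 
total across threads would again need a shared, locked counter):

    import time
    import urllib2

    MAX_BPS = 64 * 1024   # cap in bytes per second
    CHUNK = 8192

    def fetch_throttled(url, outpath):
        # Read the response in small chunks, sleeping whenever we are
        # ahead of the allowed average rate.
        resp = urllib2.urlopen(url)
        out = open(outpath, 'wb')
        start = time.time()
        received = 0
        while True:
            chunk = resp.read(CHUNK)
            if not chunk:
                break
            out.write(chunk)
            received += len(chunk)
            allowed = start + received / float(MAX_BPS)
            delay = allowed - time.time()
            if delay > 0:
                time.sleep(delay)
        out.close()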

So how do you think I can achieve rate limiting in urllib2?


Thanks in advance,
Dimitris

P.S. And something simpler: how can I prevent urllib2 from following 
redirections to foreign hosts? (A sketch of what I have tried follows.)
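
The best I have come up with so far is subclassing HTTPRedirectHandler 
and comparing hosts in redirect_request(), roughly like this (untested; 
SameHostRedirectHandler is my own name):

    import urllib2
    import urlparse

    class SameHostRedirectHandler(urllib2.HTTPRedirectHandler):
        # Refuse any redirect whose target host differs from the
        # host of the original request.
        def redirect_request(self, req, fp, code, msg, headers, newurl):
            orig_host = urlparse.urlsplit(req.get_full_url())[1]
            new_host = urlparse.urlsplit(newurl)[1]
            if new_host != orig_host:
                raise urllib2.HTTPError(req.get_full_url(), code,
                                        "redirect to foreign host refused",
                                        headers, fp)
            return urllib2.HTTPRedirectHandler.redirect_request(
                self, req, fp, code, msg, headers, newurl)

    opener = urllib2.build_opener(SameHostRedirectHandler())
    opener.open('http://example.com/')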


