please critique my thread code

Roger Heathcote usenet at technicalbloke.com
Thu Jun 19 03:24:43 EDT 2008


MRAB wrote:
> On Jun 15, 2:29 pm, wins... at cs.wisc.edu wrote:
>> I wrote a Python program (103 lines, below) to download developer data
>> from SourceForge for research about social networks.
>>
>> Please critique the code and let me know how to improve it.
>>
>> An example use of the program:
>>
>> prompt> python download.py 1 240000
>>
>> The above command downloads data for the projects with IDs between 1
>> and 240000, inclusive. As it runs, it prints status messages, with a
>> plus sign meaning that the project ID exists. Else, it prints a minus
>> sign.
>>
>> Questions:
>>
>> --- Are my setup and use of threads, the queue, and "while True" loop
>> correct or conventional?
>>
>> --- Should the program sleep sometimes, to be nice to the SourceForge
>> servers, and so they don't think this is a denial-of-service attack?
>>
>> --- Someone told me that popen is not thread-safe, and to use
>> mechanize. I installed it and followed an example on the web site.
>> There wasn't a good description of it on the web site, or I didn't
>> find it. Could someone explain what mechanize does?
>>
>> --- How do I choose the number of threads? I am using a MacBook Pro
>> 2.4GHz Intel Core 2 Duo with 4 GB 667 MHz DDR2 SDRAM, running OS
>> 10.5.3.
>>
>> Thank you.
>>
>> Winston
>>
> [snip]
> String methods are quicker than regular expressions, so don't use
> regular expressions if string methods are perfectly adequate. For
> example, you can replace:

<SNIP>

Erm, surely the bottleneck will be bandwidth, not processor/memory?* If 
it isn't then, yes, you run the risk of actually DoSing their servers!

Your Mac will run thousands of threads comfortably, but your router may 
not handle the thousands of TCP/IP connections you throw at it very 
well, especially if it is a domestic model, and sure as hell SourceForge 
aren't going to want more than a handful of concurrent connections from 
you.
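
As a rough sketch of the "handful of connections" idea (not the
original poster's code, which is snipped above): a small fixed pool of
worker threads pulls project IDs from a queue, and each worker pauses
after every request. The names crawl, fetch, num_workers and delay are
made up for illustration, and fetch is whatever function actually
downloads one page.

```python
import queue
import threading
import time


def crawl(project_ids, fetch, num_workers=4, delay=1.0):
    """Call fetch(project_id) for every ID using a small, fixed thread pool.

    At most num_workers requests are in flight at once, and each worker
    sleeps `delay` seconds after every request to go easy on the server.
    Returns a dict mapping project ID to fetch's result (or the exception
    it raised).
    """
    tasks = queue.Queue()
    for pid in project_ids:
        tasks.put(pid)

    results = {}
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                pid = tasks.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker is done
            try:
                outcome = fetch(pid)
            except Exception as exc:  # record the failure, keep going
                outcome = exc
            with results_lock:
                results[pid] = outcome
            time.sleep(delay)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    # Stand-in for a real download: pretend even-numbered projects exist.
    def fake_fetch(pid):
        return pid % 2 == 0

    found = crawl(range(1, 11), fake_fetch, num_workers=3, delay=0.1)
    print("".join("+" if found[pid] else "-" for pid in sorted(found)))
```

The queue is filled before the workers start, so an empty queue is a
safe signal to exit and no "while True" sentinel is needed. Suitable
values for num_workers and delay are a judgment call; the defaults
here are placeholders, not a recommendation from SourceForge.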

Typical SourceForge page ~ 30 KB
Project pages to read = 240000

Total = ~6.8 gigabytes

Maybe send their sysadmin a box of chocolates if you want to grab all 
that in any less than a week and not get your IP blocked! :)
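
The back-of-envelope numbers above can be checked with a few lines of
Python. This is only a sketch: the 30 KB page size and the one-week
window are the rough figures from this message, "K" is taken to mean
1024 bytes, and download_estimate is a made-up helper name.

```python
def download_estimate(pages=240_000, page_bytes=30 * 1024, days=7):
    """Estimate total volume and the pace needed to finish in `days` days."""
    total_bytes = pages * page_bytes
    seconds = days * 24 * 60 * 60
    return {
        "total_gib": total_bytes / 2**30,          # total download size
        "seconds_per_page": seconds / pages,       # pace to finish on time
        "kib_per_second": total_bytes / seconds / 1024,  # average bandwidth
    }


if __name__ == "__main__":
    for name, value in download_estimate().items():
        print(f"{name}: {value:.2f}")
```

With those assumptions the total comes to about 6.87 GiB, which lands
close to the figure quoted above, and finishing in a week means one
page roughly every 2.5 seconds at an average of about 12 KiB/s.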


Roger Heathcote

* Of course, stylistically, MRAB is perfectly right about not wasting 
CPU on regexes where string methods will do, unless you are planning on 
making your searches more elaborate in the future.
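
MRAB's actual example is snipped above, so the following is only a
hypothetical illustration of the general point: pulling a page title
out of some HTML with a regular expression and with plain string
methods. The sample page and both function names are invented for
this sketch.

```python
import re

# Invented sample input, not a real SourceForge page.
PAGE = "<html><head><title>Example Project</title></head></html>"


def title_with_regex(page):
    """Return the text between <title> and </title>, or None."""
    match = re.search(r"<title>(.*?)</title>", page)
    return match.group(1) if match else None


def title_with_str_methods(page):
    """Same result using only str.find and slicing."""
    open_tag, close_tag = "<title>", "</title>"
    start = page.find(open_tag)
    if start == -1:
        return None
    start += len(open_tag)
    end = page.find(close_tag, start)
    if end == -1:
        return None
    return page[start:end]


if __name__ == "__main__":
    print(title_with_regex(PAGE))
    print(title_with_str_methods(PAGE))
```

Both functions give the same answer for a single-line title like this
one. The regex version is shorter and easier to extend later, while
the string-method version avoids the regex engine for a fixed-text
search.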



