please critique my thread code
Pau Freixes
pfreixes at gmail.com
Sun Jun 15 10:22:22 EDT 2008
Hi,
The main while loop in the main thread uses all the CPU time while it spins.
It would be better to put a short sleep between iterations, or to use some
synchronization mechanism between the threads.
And about your questions, IMO:
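
One way to get that synchronization is to let the queue do the waiting: each worker calls task_done() after every item, and the main thread blocks in join() instead of polling qsize(). The following is only a sketch under assumptions: it is written for Python 3 (where the Queue module is named queue), and run_workers and handle are hypothetical names that do not appear in the original script.

```python
import queue
import threading


def run_workers(ids, handle, num_threads=4):
    """Call handle(item) for every item on worker threads; block until all are done.

    run_workers and handle are illustrative names, not part of the original script.
    """
    q = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            item = q.get()          # blocks without spinning the CPU
            try:
                value = handle(item)
                with lock:
                    results.append(value)
            except Exception:
                # Skip failed items so one error does not kill the worker.
                pass
            finally:
                q.task_done()       # tell the queue this item is finished

    for _ in range(num_threads):
        threading.Thread(target=worker, daemon=True).start()

    for item in ids:
        q.put(item)

    q.join()                        # returns once every item was marked done
    return results


if __name__ == "__main__":
    print(sorted(run_workers(range(1, 6), lambda n: n * n)))
```

With join() the main thread sleeps until the last item has been processed, so the fixed three-second wait at the end is no longer needed.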
> --- Are my setup and use of threads, the queue, and "while True" loop
> correct or conventional?
Other approaches may exist, but this one is fine. A separate question is
whether iterating over all 240000 numbers is the most efficient way to
retrieve every project.
> --- Should the program sleep sometimes, to be nice to the SourceForge
> servers, and so they don't think this is a denial-of-service attack?
You are already limiting the number of connections with your concurrent
threads. I don't believe SourceForge has less capacity for requests than
your concurrent threads can generate.
>
> --- Someone told me that popen is not thread-safe, and to use
> mechanize. I installed it and followed an example on the web site.
> There wasn't a good description of it on the web site, or I didn't
> find it. Could someone explain what mechanize does?
I don't know, but if you aren't sure, you can use urllib2.
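
For illustration, here is roughly what the standard-library route looks like. This is a sketch under assumptions: it targets Python 3, where urllib2 was renamed urllib.request, and the helper names build_member_list_request and fetch_member_list are hypothetical. The URL is the one from the quoted script and may no longer be served by SourceForge.

```python
import urllib.request

# URL pattern taken from the quoted script; it may no longer be valid.
MEMBER_LIST_URL = "http://sourceforge.net/project/memberlist.php?group_id=%d"


def build_member_list_request(group_id):
    """Build (but do not send) the request for one project ID."""
    return urllib.request.Request(MEMBER_LIST_URL % group_id)


def fetch_member_list(group_id, timeout=30):
    """Send the request and return the raw response body as bytes."""
    request = build_member_list_request(group_id)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Each call opens its own connection, so the function can be used from several worker threads without shared state.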
>
> --- How do I choose the number of threads? I am using a MacBook Pro
> 2.4GHz Intel Core 2 Duo with 4 GB 667 MHz DDR2 SDRAM, running OS
> 10.5.3.
By default, pthreads on Linux reserves 8 MB for each thread's stack; I don't
know about your MacBook. I think between 64 and 128 threads is fine.
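
If stack memory becomes the limit, the threading module lets you request a smaller stack for threads created afterwards. The following is a sketch under assumptions: it is written for Python 3, the function name and the 256 KiB figure are illustrative choices, and some platforms may reject a given size or not allow changing it at all.

```python
import threading


def run_threads_with_small_stacks(count, stack_bytes=256 * 1024):
    """Start `count` trivial threads with a reduced stack size and wait for them.

    The function name and the 256 KiB default are illustrative assumptions.
    """
    # stack_size() returns the previous setting (0 means the platform default).
    previous = threading.stack_size(stack_bytes)
    finished = []
    lock = threading.Lock()

    def work(index):
        with lock:
            finished.append(index)

    try:
        threads = [threading.Thread(target=work, args=(i,)) for i in range(count)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    finally:
        # Restore the earlier setting for threads created later.
        threading.stack_size(previous)
    return len(finished)


if __name__ == "__main__":
    print(run_threads_with_small_stacks(64))
```

For workers that mostly wait on the network, a small stack is usually enough, but deep recursion inside a thread would need more.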
>
> Thank you.
>
> Winston
>
>
>
> #!/usr/bin/env python
>
> # Winston C. Yang
> # Created 2008-06-14
>
> from __future__ import with_statement
>
> import mechanize
> import os
> import Queue
> import re
> import sys
> import threading
> import time
>
> lock = threading.RLock()
>
> # Make the dot match even a newline.
> error_pattern = re.compile(".*\n<!--pageid login -->\n.*", re.DOTALL)
>
> def now():
>     return time.strftime("%Y-%m-%d %H:%M:%S")
>
> def worker():
>
>     while True:
>
>         try:
>             id = queue.get()
>         except Queue.Empty:
>             continue
>
>         request = mechanize.Request("http://sourceforge.net/project/"\
>                                     "memberlist.php?group_id=%d" % id)
>         response = mechanize.urlopen(request)
>         text = response.read()
>
>         valid_id = not error_pattern.match(text)
>
>         if valid_id:
>             f = open("%d.csv" % id, "w+")
>             f.write(text)
>             f.close()
>
>         with lock:
>             print "\t".join((str(id), now(), "+" if valid_id else "-"))
>
> def fatal_error():
>     print "usage: python application start_id end_id"
>     print
>     print "Get the usernames associated with each SourceForge project with"
>     print "ID between start_id and end_id, inclusive."
>     print
>     print "start_id and end_id must be positive integers and satisfy"
>     print "start_id <= end_id."
>     sys.exit(1)
>
> if __name__ == "__main__":
>
>     if len(sys.argv) == 3:
>
>         try:
>             start_id = int(sys.argv[1])
>
>             if start_id <= 0:
>                 raise Exception
>
>             end_id = int(sys.argv[2])
>
>             if end_id < start_id:
>                 raise Exception
>         except:
>             fatal_error()
>     else:
>         fatal_error()
>
>     # Print the start time.
>     start_time = now()
>     print start_time
>
>     # Create a directory whose name contains the start time.
>     dir = start_time.replace(" ", "_").replace(":", "_")
>     os.mkdir(dir)
>     os.chdir(dir)
>
>     queue = Queue.Queue(0)
>
>     for i in xrange(32):
>         t = threading.Thread(target=worker, name="worker %d" % (i + 1))
>         t.setDaemon(True)
>         t.start()
>
>     for id in xrange(start_id, end_id + 1):
>         queue.put(id)
>
>     # When the queue has size zero, exit in three seconds.
>     while True:
>         if queue.qsize() == 0:
>             time.sleep(3)
>             break
>
>     print now()
--
Pau Freixes
Linux GNU/User