please critique my thread code

Pau Freixes pfreixes at gmail.com
Sun Jun 15 10:22:22 EDT 2008


Hi,

The main "while" loop in the main thread uses all the CPU time. It would be
better to put a short sleep between iterations, or to use some synchronization
method between the threads.
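
One way to do the synchronization, as a minimal sketch: let the queue itself
tell the main thread when the work is finished, using task_done() and join()
instead of polling qsize(). The names run_workers and handle are hypothetical,
and the sketch uses the Python 3 module name "queue" (it is "Queue" in the
Python 2 of the quoted script).

```python
import queue
import threading


def run_workers(ids, handle, num_threads=4):
    """Call handle(item) for every item on worker threads; return the errors."""
    work = queue.Queue()
    errors = []

    def worker():
        while True:
            item = work.get()            # blocks until an item is available
            try:
                handle(item)
            except Exception as exc:     # keep the worker alive on failure
                errors.append((item, exc))
            finally:
                work.task_done()         # always mark the item as finished

    for _ in range(num_threads):
        threading.Thread(target=worker, daemon=True).start()

    for item in ids:
        work.put(item)

    work.join()                          # sleeps until every item is marked done
    return errors
```

With this shape the main thread does not spin, and there is no need to guess
a fixed delay before exiting.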


And about your questions IMO:


> --- Are my setup and use of threads, the queue, and "while True" loop
> correct or conventional?


Maybe there is another way, but this one is good. A separate question is
whether iterating over all 240000 numbers is the most efficient way to
retrieve every project.

> --- Should the program sleep sometimes, to be nice to the SourceForge
> servers, and so they don't think this is a denial-of-service attack?


You are already limiting the number of connections with your number of
concurrent threads. I don't believe SourceForge has less capacity for
requests than your number of concurrent threads.
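
If a pause between requests is wanted anyway, here is a minimal sketch of a
throttle shared by all worker threads. The class name Throttle and the
interval value are assumptions made for illustration; nothing here comes from
SourceForge's own rules.

```python
import threading
import time


class Throttle:
    """Space out calls to wait() by at least `interval` seconds overall."""

    def __init__(self, interval):
        self.interval = interval
        self._lock = threading.Lock()
        self._next_time = time.monotonic()

    def wait(self):
        # Reserve the next free time slot under the lock, then sleep
        # outside the lock so other threads can reserve their own slots.
        with self._lock:
            now = time.monotonic()
            delay = max(0.0, self._next_time - now)
            self._next_time = max(now, self._next_time) + self.interval
        if delay:
            time.sleep(delay)
```

Each worker would call throttle.wait() just before opening its URL, so the
request rate is bounded regardless of how many threads are running.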


>
> --- Someone told me that popen is not thread-safe, and to use
> mechanize. I installed it and followed an example on the web site.
> There wasn't a good description of it on the web site, or I didn't
> find it. Could someone explain what mechanize does?


I don't know, but if you are not sure you can use urllib2.
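
A minimal sketch of fetching a page with the standard library only. It uses
the Python 3 name urllib.request; in the Python 2 of the quoted script the
corresponding call is urllib2.urlopen(url). The function name fetch and the
timeout value are assumptions.

```python
import urllib.request


def fetch(url, timeout=30):
    """Return the body of `url` as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()
```

In the worker, this would be called with the same URL the script already
builds, "http://sourceforge.net/project/memberlist.php?group_id=%d" % id.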



>
> --- How do I choose the number of threads? I am using a MacBook Pro
> 2.4GHz Intel Core 2 Duo with 4 GB 667 MHz DDR2 SDRAM, running OS
> 10.5.3.


By default, pthreads on Linux reserves 8 MB for each thread's stack; I don't
know about your MacBook. I think between 64 and 128 threads is fine.
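
Taking the 8 MB figure above, 128 threads would reserve 128 x 8 MB = 1024 MB
of address space for stacks. If that matters, threading.stack_size() can
request a smaller stack for threads started afterwards. A minimal sketch; the
function name and the 512 KB value are assumptions, and the allowed sizes
depend on the platform.

```python
import threading


def start_small_stack_threads(target, count, stack_bytes=512 * 1024):
    """Start `count` threads running `target` with a reduced stack size."""
    # stack_size() returns the previous setting (0 means platform default)
    # and applies to threads started after this call.
    previous = threading.stack_size(stack_bytes)
    try:
        threads = [threading.Thread(target=target) for _ in range(count)]
        for t in threads:
            t.start()
    finally:
        threading.stack_size(previous)   # restore the earlier setting
    return threads
```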


>
> Thank you.
>
> Winston
>
>
>
> #!/usr/bin/env python
>
> # Winston C. Yang
> # Created 2008-06-14
>
> from __future__ import with_statement
>
> import mechanize
> import os
> import Queue
> import re
> import sys
> import threading
> import time
>
> lock = threading.RLock()
>
> # Make the dot match even a newline.
> error_pattern = re.compile(".*\n<!--pageid login -->\n.*", re.DOTALL)
>
> def now():
>    return time.strftime("%Y-%m-%d %H:%M:%S")
>
> def worker():
>
>    while True:
>
>        try:
>            id = queue.get()
>        except Queue.Empty:
>            continue
>
>        request = mechanize.Request("http://sourceforge.net/project/"\
>                                        "memberlist.php?group_id=%d" % id)
>        response = mechanize.urlopen(request)
>        text = response.read()
>
>        valid_id = not error_pattern.match(text)
>
>        if valid_id:
>            f = open("%d.csv" % id, "w+")
>            f.write(text)
>            f.close()
>
>        with lock:
>            print "\t".join((str(id), now(), "+" if valid_id else "-"))
>
> def fatal_error():
>    print "usage: python application start_id end_id"
>    print
>    print "Get the usernames associated with each SourceForge project with"
>    print "ID between start_id and end_id, inclusive."
>    print
>    print "start_id and end_id must be positive integers and satisfy"
>    print "start_id <= end_id."
>    sys.exit(1)
>
> if __name__ == "__main__":
>
>    if len(sys.argv) == 3:
>
>        try:
>            start_id = int(sys.argv[1])
>
>            if start_id <= 0:
>                raise Exception
>
>            end_id = int(sys.argv[2])
>
>            if end_id < start_id:
>                raise Exception
>        except:
>            fatal_error()
>    else:
>        fatal_error()
>
>    # Print the start time.
>    start_time = now()
>    print start_time
>
>    # Create a directory whose name contains the start time.
>    dir = start_time.replace(" ", "_").replace(":", "_")
>    os.mkdir(dir)
>    os.chdir(dir)
>
>    queue = Queue.Queue(0)
>
>    for i in xrange(32):
>        t = threading.Thread(target=worker, name="worker %d" % (i + 1))
>        t.setDaemon(True)
>        t.start()
>
>    for id in xrange(start_id, end_id + 1):
>        queue.put(id)
>
>    # When the queue has size zero, exit in three seconds.
>    while True:
>        if queue.qsize() == 0:
>            time.sleep(3)
>            break
>
>    print now()



-- 
Pau Freixes
Linux GNU/User