Threaded Design Question

Jun-geun Park junkeun.park at gmail.com
Thu Aug 9 19:06:14 EDT 2007


half.italian at gmail.com wrote:
> Hi all!  I'm implementing one of my first multithreaded apps, and have
> gotten to a point where I think I'm going off track from a standard
> idiom.  Wondering if anyone can point me in the right direction.
> 
> The script will run as a daemon and watch a given directory for new
> files.  Once it determines that a file has finished moving into the
> watch folder, it will kick off a process on one of the files.  Several
> of these could be running at any given time up to a max number of
> threads.
> 
> Here's how I have it designed so far.  The main thread starts a
> Watch(threading.Thread) class that loops and searches a directory for
> files.  It has been passed a Queue.Queue() object (watch_queue), and
> as it finds new files in the watch folder, it adds the file name to
> the queue.
> 
> The main thread then grabs an item off the watch_queue, and kicks off
> processing on that file using another class Worker(threading.Thread).
> 
> My problem is with communicating between the threads as to which files
> are currently processing, or are already present in the watch_queue so
> that the Watch thread does not continuously add unneeded files to the
> watch_queue to be processed.  For example...Watch() finds a file to be
> processed and adds it to the queue.  The main thread sees the file on
> the queue and pops it off and begins processing.  Now the file has
> been removed from the watch_queue, and Watch() thread has no way of
> knowing that the other Worker() thread is processing it, and shouldn't
> pick it up again.  So it will see the file as new and add it to the
> queue again.  PS.. The file is deleted from the watch folder after it
> has finished processing, so that's how i'll know which files to
> process in the long term.
> 
> I made definite progress by creating two queues...watch_queue and
> processing_queue, and then used lists within the classes to store the
> state of which files are processing/watched.
> 
> I think I could pull it off, but it has got very confusing quickly,
> trying to keep each thread's list and the queue always in sync with
> one another.  The easiest solution I can see is if my threads could
> read an item from the queue without removing it from the queue and
> only remove it when I tell it to.  The Watch() thread could then
> just follow what items are on the watch_queue to know what files to
> add, and then the Worker() thread could intentionally remove the item
> from the watch_queue once it has finished processing it.
> 
> Now that I'm writing this out, I see a solution by over-riding or
> wrapping Queue.Queue().get() to give me the behavior I mention above.
> 
> I've noticed .join() and .task_done(), but I'm not sure of how to use
> them properly.  Any suggestions would be greatly appreciated.
> 
> ~Sean
> 

1) Use a (global) hash table with a mutex guarding it. Before a worker
thread starts processing a file, have the worker acquire the mutex, add
the filename to a global dictionary as a key, and release the mutex.
Then, when the watcher thread finds a file, it only has to check whether
the filename is already in the table. The worker should remove the key
again once it has finished with the file, so that a later file with the
same name is not ignored.

If different files may have the same name, you can also key on size or
some other signature. Also, I think individual operations on built-in
types such as dict are atomic in CPython, so you may not strictly need
the mutex for a single insert or lookup. A check followed by an insert
is two operations, though, and in threaded programs it's always a good
habit to use mutexes explicitly.
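
A minimal sketch of the table-plus-mutex idea, under some assumptions:
it uses a set instead of a dictionary, and the names InFlight,
try_claim, release and enqueue_new are made up for illustration. It
also records the filename when the watcher enqueues it rather than when
a worker picks it up, which is a small variation on the description
above so that files still waiting in the queue are covered too. The
module is spelled `queue` here (Python 3); in Python 2 it is `Queue`.

```python
import queue
import threading


class InFlight:
    """Hypothetical helper: filenames that are queued or being processed."""

    def __init__(self):
        self._lock = threading.Lock()
        self._names = set()

    def try_claim(self, name):
        """Start tracking name; return False if it was already tracked."""
        with self._lock:
            # Check and insert happen under one lock, so two threads
            # cannot both claim the same name.
            if name in self._names:
                return False
            self._names.add(name)
            return True

    def release(self, name):
        """Stop tracking name (call after the file has been processed)."""
        with self._lock:
            self._names.discard(name)


def enqueue_new(filenames, in_flight, watch_queue):
    """Put only not-yet-tracked filenames on the queue; return those added."""
    added = []
    for name in filenames:
        if in_flight.try_claim(name):
            watch_queue.put(name)
            added.append(name)
    return added
```

The watcher would call enqueue_new() with the result of each directory
scan, and a worker would call release() only after it has processed and
deleted the file.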

2) If the watcher thread doesn't spend much time "recognizing" files to
process, simply have the watcher thread read many (or all) files in the
directory, put them all in the queue, and wait with .join() until all
items in the queue have been processed. Workers indicate that processing
of the last dequeued item is done by calling .task_done(). (As soon as
every item has been flagged with task_done(), join() will unblock.)
After all files on the queue have been processed, the watcher can move
or remove the processed files and read the directory again. Of course,
you need to keep track of the "recognized" files of that turn.
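
Roughly, one "turn" of the second approach could look like the sketch
below. The names worker, run_batch and process are hypothetical, the
None sentinel used to stop the workers is my own addition, and the
module is spelled `queue` as in Python 3 (`Queue` in Python 2). Only
put(), get(), task_done() and join() come from the queue module itself.

```python
import queue
import threading


def worker(watch_queue, process, done, done_lock):
    """Process filenames from the queue until a None sentinel arrives."""
    while True:
        name = watch_queue.get()
        try:
            if name is None:
                return  # sentinel: this worker shuts down
            process(name)
            with done_lock:
                done.append(name)
        except Exception:
            # A failed file is simply left out of `done` in this sketch.
            pass
        finally:
            # Exactly one task_done() per get(), even on failure.
            watch_queue.task_done()


def run_batch(filenames, process, num_workers=3):
    """Queue one batch of files and block until all are handled."""
    watch_queue = queue.Queue()
    done, done_lock = [], threading.Lock()
    threads = [
        threading.Thread(target=worker,
                         args=(watch_queue, process, done, done_lock))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for name in filenames:
        watch_queue.put(name)
    watch_queue.join()  # unblocks once every put() has its task_done()
    for _ in threads:
        watch_queue.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return sorted(done)
```

The watcher loop would then be: list the directory, call run_batch()
on what it found, clean up the files it returns, and scan again.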


