Thread processing

Wed Mar 5 18:49:00 EST 2003

Dave Brueck <dave at pythonapocrypha.com> writes:

> Your question seems very similar to the one you asked yesterday, so
> even though you're reluctant to go into too much detail, maybe a
> specific example will help you get the answer you're looking for.

It's certainly related. I got some useful scripts working using
queues, but hit some issues when I tried to generalise them. It's been
a long day :-)

You're probably right that I need to get more specific.

OK, here's the detail. I'm trying to automate some monitoring and
administration scripts running against multiple (Oracle) databases.
The pattern I'm writing is getting very repetitive, and I'm interested
in abstracting out some of the control structures.

So my problem is with writing a general enough framework, rather than
with getting a specific example working.

I have a table of databases, including database name, connection
details, and a few other details.

Most of my scripts conform to the pattern:

    open the "monitoring" database
    read the table of databases

    for each database:
        start a thread processing that database

    wait for the threads to complete
    ????

The problem arises at the ???? point. Some of my scripts have nothing
to do in the main thread, and so I could just exit (relying on the
fact that the main thread waits for all non-daemon threads).

In practice, I'd nearly always want to write a log message saying the
processing had finished, so I'd wait for all the threads to complete.
For that, simply joining in any order is fine (but I need to keep a
list of the threads so I can join them all, and hence actually know
when I've finished).
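
A minimal sketch of that "join them all, then log" case, written for
current Python 3. The names process_database and run_all are
hypothetical stand-ins, not code from my actual scripts:

```python
import threading


def process_database(name):
    # Hypothetical stand-in for the real per-database work.
    pass


def run_all(database_names, worker=process_database):
    threads = []
    for name in database_names:
        # Name each thread after its database so it can be identified later.
        t = threading.Thread(target=worker, args=(name,), name=name)
        t.start()
        threads.append(t)

    # Join order does not matter; the loop ends once every thread has ended.
    for t in threads:
        t.join()

    print("processing finished for %d databases" % len(threads))
    return [t.name for t in threads]
```

The list exists only so that the main thread has something to join;
nothing here needs to know which thread finished first.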

In some cases, I have substantial results to report. One case is a
script which simply checks that the database is running - the thread
writes connection errors into a queue, which I then read and report
from. For this usage, I have no problem with a queue (this is the case
which prompted my question yesterday).
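
Roughly, that check looks like the sketch below. The connect callable
is an assumption standing in for the real Oracle connection code, and
check_database and check_all are illustrative names only:

```python
import queue
import threading


def check_database(name, connect, errors):
    # `connect` is a hypothetical callable that raises on failure.
    try:
        connect(name)
    except Exception as exc:
        errors.put((name, str(exc)))


def check_all(database_names, connect):
    errors = queue.Queue()
    threads = [
        threading.Thread(target=check_database,
                         args=(name, connect, errors), name=name)
        for name in database_names
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Every producer has finished, so draining without blocking is safe.
    report = []
    while True:
        try:
            report.append(errors.get_nowait())
        except queue.Empty:
            break
    return sorted(report)
```

Here the queue carries real data (the error details), which is why it
feels like a natural fit for this case.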

In some cases, all I want to do is to log each thread's completion
individually.

The case which got me stuck (in a way) was a stats collection process.
There, the thread procedure collects performance stats from the target
database, and loads them into the monitoring database. When the thread
has finished, I want to do some restructuring of the data. But I
*don't* want to do that in the database thread - it's a separation of
responsibility issue, insofar as the thread's responsibility is simply
to grab the data; cleaning it up after it's on the monitoring database
is a separate job (which can, however, start as soon as the collection
is complete).

The common theme with the latter two is that the main process doesn't
need *any* substantial information from the thread, just its identity.
In practice, I'd set the thread name to the database name, so I could
get that one piece of information. Alternatively, I could key any
information I need off the thread ID. The point is that everything I
need is already known to the main thread, and much of it the worker
doesn't need, so I don't want to pass it in only to have it returned
to me. I could use a queue and have the thread terminate with
q.put(self), but (a) why should I have to? and (b) it hinders my
ability to test the thread procedure in isolation.
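
One possible way around both objections, offered only as a sketch
under my own assumptions: keep the notification in a small wrapper, so
the thread procedure itself never sees the queue. The names
start_notifying and run_and_postprocess are hypothetical, and the code
assumes current Python 3:

```python
import queue
import threading


def start_notifying(worker, name, done):
    """Run worker(name) in a thread that reports itself on `done` at exit."""
    def runner():
        try:
            worker(name)
        finally:
            # Reported even if the worker raised an exception.
            done.put(threading.current_thread())

    t = threading.Thread(target=runner, name=name)
    t.start()
    return t


def run_and_postprocess(names, worker, postprocess):
    done = queue.Queue()
    pending = {start_notifying(worker, name, done) for name in names}
    while pending:
        t = done.get()        # blocks until "the next thread to finish"
        t.join()
        pending.remove(t)
        postprocess(t.name)   # main thread looks up its own data by name
```

The worker is a plain function taking only the database name, so it can
still be called and tested on its own without any queue.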

The basic problem is that I can easily *see* my solution in terms of
waiting for "the next thread to finish". But I can't *code* it that
way. So whatever I do is a workaround, and makes it harder for me to
express what I really want, because the infrastructure is getting in
the way.

If I could see the solution in different terms, maybe I wouldn't be
having such problems. But at the moment, the mindset isn't there.

>> I can use a Queue to have the threads notify when they have finished,
>> but that means rewriting the thread procedures to do the notification,
>> which isn't really what I want. Something like:
>>
>>     q = Queue()
>>     # All the threads need to get q as input, and do
>>     # q.put(self) when the run method finishes
>>
>>     while threads:
>>         thread = q.get()
>>         thread.join()
>
> You don't need to join the thread if it is just going to terminate on its
> own.
>>
>>         threads.remove(thread)
>
> Do you actually need a list of all the threads? Why bother?

So that I can do the "while threads" above. Otherwise, I have no way
of knowing when I have no threads left, and hence no way of knowing
when to *stop* waiting on the queue. (OK, a simple counter is probably
enough here, but keeping the thread objects is more general if I have
a thread subclass with useful extra data in it.)

This feels so much like a workaround, that I can't get comfortable
with it.

[I've had a flash of inspiration here, which may help me, but it's too
late at night to work it through right now. Basically, the queue
approach would be a lot more comfortable, if I didn't *also* have to
keep track of "when all the results are in". I'll think about this a
little more...

Can you explain how you'd structure the above without needing a "while
threads" loop? So that the main thread stops waiting on the queue when
the last worker has finished. If I could see that, I might be closer
to understanding this.]
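
For reference, one way this could be structured, as a sketch and not
necessarily what Dave has in mind: the main thread knows how many
workers it started, so it can simply read exactly that many items from
the queue. The names worker and wait_for_all are hypothetical, and the
worker here is assumed to put its own name on the queue when done:

```python
import queue
import threading


def worker(name, done):
    # Hypothetical worker: real processing would go here.
    done.put(name)


def wait_for_all(names):
    done = queue.Queue()
    for name in names:
        threading.Thread(target=worker, args=(name, done), name=name).start()

    finished = []
    # One get() per started worker: no thread list, no "while threads".
    for _ in range(len(names)):
        finished.append(done.get())
    return finished
```

This only works if every worker is guaranteed to put exactly one item
on the queue, even when it fails; otherwise the final get() blocks
forever.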

>
>>         # post-processing for thread here
>
> This is what you need to clarify - what sort of post-processing needs to
> be done? Do you really need to have a handle to the thread object or just
> the data to be post-processed?

For significant reporting, I need the data (and for that, a Queue is
fine). For simple logging, the thread name is enough. For other cases,
the main thread *has* the data, keyed on the database name (which I
set the thread name to, so I just need the thread object). The data is
of no use to the thread, so passing it to the thread just to get it
passed back is silly and intrusive.

>> That's nasty because I have to write my thread procedures specifically
>> to work with this idiom (and I'm trying to make this stuff as generic
>> as possible, because I have a lot of potential code which will follow
>> this pattern)
>
> Is it really _that_ big of a deal to have the worker thread put done work
> on a queue?

For cases where all the worker is saying is "I've finished", I believe
so. For a start, I have to pass that queue to the worker in the first
place, which constrains the form of my worker.

If I'm trying to write a framework, this means I can't just use a
generic thread (which, when I have no post-processing to do, can be
exactly what I want).

> A common pattern for a pool of worker threads is to dump finished work
> into a queue. IMO that _is_ a good way of doing this as it is very
> flexible, not problematic, and very simple to write in Python.

Maybe I'm still looking at things wrongly. But I've gone through 4 or
5 iterations of this now, and each one has worked for one problem, but
then failed to work for the next one.

Each time I try to tackle a new problem, I end up rewriting the
framework. Which implies to me that I've got the framework wrong. I've
tried a few variations on queues, and messed up each time, so I was
hoping to look at the problem from a different angle (based on my
Windows experience).

Look at this another way, then. Are there any examples of thread pool
frameworks around which I could use as a starting point? I have
searched around and found nothing much. (Aahz has some specific
examples on his site, but nothing generic, and the cookbook doesn't
have anything. Where else could I look?)
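
For readers on current Python: the standard library's
concurrent.futures module (added in Python 3.2, so well after this
post) offers a generic thread pool, and as_completed yields each job as
it finishes. A minimal sketch, with collect_stats and run_pool as
hypothetical names:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def collect_stats(name):
    # Hypothetical stand-in for the real per-database collection.
    return len(name)


def run_pool(names, worker=collect_stats, max_workers=4):
    finished = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the database name it was started for.
        futures = {pool.submit(worker, name): name for name in names}
        for future in as_completed(futures):
            name = futures[future]
            future.result()        # re-raises any exception from the worker
            finished.append(name)  # post-processing for `name` goes here
    return finished
```

The worker takes no queue and returns nothing special; the pool handles
the "which one finished next" bookkeeping.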

> Maybe the specific case would shed light on what you don't like.

Maybe it didn't... :-(

As I say, I can code the examples I currently have, it's just that I
can't do so in a way that doesn't break when I try the next one along
:-(

I'm sorry. I just feel that I'm thinking about things all wrong. I'll
have another go at this tomorrow, in the light of what you've said,
and if I'm still stuck, I'll try posting some real code (none to hand
at the moment)...

Paul.
-- 
This signature intentionally left blank