How to force a thread to stop

Carl J. Van Arsdall cvanarsdall at mvista.com
Wed Jul 26 16:33:19 EDT 2006


bryanjugglercryptographer at yahoo.com wrote:
> Carl J. Van Arsdall wrote:
>   
>> bryanjugglercryptographer at yahoo.com wrote:
>>     
>>> Carl J. Van Arsdall wrote:
>>>
>>> I don't get what threading and Twisted would do for
>>> you. The problem you actually have is that you sometimes
>>> need to terminate these other processes running other programs.
>>> Use spawn, fork/exec*, or maybe one of the popens.
>>>
>>>       
>> I have a strong need for shared memory space in a large distributed
>> environment.
>>     
>
> Distributed shared memory is a tough trick; only a few systems simulate
> it.
>   
Yeah, this I understand; maybe I chose some poor words to describe what 
I wanted. I think this conversation is getting hairy and confusing, so 
I'm going to try to paint a better picture of what's going on.  Maybe 
this will help you understand exactly what's happening, or at least what 
I'm trying to do, because I feel like we're just running in circles.  
After the detailed explanation, whether threads are the obvious choice 
or not, it will be much easier to pick apart what I need and probably 
also easier for me to see your point... so here goes... (sorry it's 
long, but I keep getting dinged for not being thorough enough).

So, I have a distributed build system.  The system is tasked with 
building a fairly complex set of packages that form a product.  The 
system needs to build these packages for 50 architectures using cross 
compilation, with support for 5 different hosts.  Say there are also 
different versions of this with tweaks for various configurations, so 
in the end I might be trying to build 200+ different things at once.  
I have a computing farm of 40 machines to do this for me.  That's the 
high-level scenario without getting too detailed.  There are also 
subsystems that help us manage the machines and such, but I don't want 
to get into that; I'm going to focus on a scenario more abstract than 
the cluster/resource management stuff.

Alright, so manually running builds is going to be crazy and 
unmanageable.  What the people who came before me did to manage this 
scenario was to fork one thread per build.  The threads invoke a series 
of calls that look like

os.system("ssh <host> <command>")

or, for more complex operations, they would spawn a process that ran 
another Python script:

os.system("ssh <host> <script>")
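
A stripped-down sketch of that one-thread-per-build scheme looks 
something like this (the hosts, commands, and build list here are made 
up for illustration):

import os
import threading

def run_build(host, command):
    # os.system blocks this thread (and releases the GIL)
    # while the remote command runs over ssh.
    status = os.system("ssh %s %s" % (host, command))
    print("build on %s exited with status %s" % (host, status))

# One thread per build; these host/command pairs are hypothetical.
builds = [("build-host-1", "make TARGET=ppc"),
          ("build-host-2", "make TARGET=arm")]

threads = [threading.Thread(target=run_build, args=b) for b in builds]
for t in threads:
    t.start()
for t in threads:
    t.join()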

The purpose behind all this was a couple of things:

   * The thread constantly needed information about the state of the 
system (for example, we don't want to end up building the same 
architecture twice)
   * We wanted a centralized point of control for an entire build
   * We needed to be able to use as many machines as possible from a 
central location.

Python threads worked very well for this.  os.system behaves a lot like 
many other I/O operations in Python: the interpreter gives up the GIL 
for the duration of the call.  Each thread could run remote operations 
and we didn't really have any problems.  There wasn't much need to 
fork; all it would have done is increase the amount of memory used by 
the system.

Alright, so this scheme that was first put in place kind of worked.  
There were some problems, though.  For example, when someone did 
something like

os.system("ssh <host> <script>")

we had no good way of knowing what the hell happened in the script.  
Granted, they passed some of it back through shared files over NFS 
mounts, but I really hate that.  It doesn't work well, it's clunky, and 
it's difficult to manage.  There were other problems too, but I just 
wanted to give a sample.
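
One direction we've considered for that particular problem is capturing 
the remote script's output and exit status directly, instead of passing 
files around over NFS.  A rough sketch using the subprocess module (the 
host and script path are invented):

import subprocess

def run_remote(host, script):
    # Run the script over ssh and capture everything it says,
    # rather than reading result files off an NFS mount.
    proc = subprocess.Popen(["ssh", host, script],
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    out, err = proc.communicate()
    return proc.returncode, out, err

# Hypothetical usage:
rc, out, err = run_remote("build-host-1", "/path/to/build_script.py")
if rc != 0:
    print("remote script failed (%d): %s" % (rc, err))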

Alright, so things aren't working, I come on board, and I have a boss 
who wants things done immediately.  What we did was create what we 
called a "Python Execution Framework".  The purpose of the framework 
was to mitigate a number of problems we had, as well as take the burden 
of distribution away from the programmers by providing a few layers of 
abstraction (I'm only going to focus on the distributed part of the 
framework; the rest is irrelevant to the discussion).  The framework 
executes and threads modules (or lists of modules).  Since we had 
limited time, we designed the framework with a "distribution 
environment" in mind, but realized that if we shoot for the top right 
away it will take years to get anything implemented.

Since we knew we eventually wanted a distributed system that could 
execute framework modules entirely on remote machines, we carefully 
designed and prepared the system for this.  That involved some 
abstraction and some simple mechanisms.  However, right now each ssh 
call will be executed from a thread (as they will be done concurrently, 
just like before).  The threads still need to know about the state of 
the system, but we'd also like to be able to issue some type of control 
that is more event driven -- this could be sending the thread a 
terminate message, or sending the thread a message regarding the 
completion of a dependency (we use conditions and events to do this 
synchronization right now).  We hoped that in the case of a 
catastrophic event or a user 'kill' signal the system could take 
control of all the threads (or at least ask them to go away); this is 
what started the conversation in the first place.  We don't want to use 
a polling loop for these threads to check for messages; we wanted 
something event driven (I mistakenly used the word interrupt in earlier 
posts, but I think it still illustrates my point).  It's not only 
important that the threads die, but that they die with grace.  There's 
lots of cleanup work that has to be done when things exit, or things 
end up in an indeterminate state.
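
To make the "ask them to go away" idea concrete, the best we've come up 
with so far is a shared threading.Event that each worker checks between 
its blocking ssh steps, with the cleanup guaranteed by a finally 
clause.  It's not a true interrupt, but there's no busy polling loop 
either; the thread spends its time blocked in ssh and only glances at 
the flag between steps.  A rough sketch (the helpers and step lists are 
hypothetical):

import os
import threading

stop_event = threading.Event()

def run_remote_step(host, step):
    # Stand-in for the real remote call.
    os.system("ssh %s %s" % (host, step))

def cleanup(host):
    # Stand-in for the real cleanup: kill remote jobs, release
    # resources, record that the build ended in a known state.
    print("%s: cleaning up" % host)

def build_worker(host, steps):
    try:
        for step in steps:
            # Check the flag between blocking steps; if the
            # controller has asked us to go away, die with grace.
            if stop_event.is_set():
                print("%s: terminate requested, exiting" % host)
                return
            run_remote_step(host, step)
    finally:
        cleanup(host)

# On a catastrophic event or a user 'kill', the controller asks
# every thread to go away with a single call:
# stop_event.set()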

So, I feel like I have a couple of options:

1) Try moving everything to a process-oriented configuration - we think 
this would be bad from a resource standpoint, and it would make things 
more difficult to move to a fully distributed system later, when I get 
my army of code monkeys.

2) Suck it up and go straight for the distributed system now - managers 
don't like this, but maybe it's easier than I think it's going to be, I 
dunno.

3) See if we can find some other way of getting the threads to terminate.

4) Kill it and clean it up by hand or with helper scripts - we don't 
want to do this either; it's one of the major things we're trying to 
get away from.

Alright, that's still a fairly high-level description.  After all that, 
if threads are still stupid then I think I'll see it much more easily, 
but I hope this starts to clear up the confusion.  I don't really need 
a distributed shared memory environment, but right now I do need shared 
memory, and it needs to be used fairly efficiently.  For a fully 
distributed environment I was going to see what various technologies 
offered for passing data around; I figured they must have some 
mechanism for doing it, or at least for accessing memory from a central 
location (we're set up to do this now with threads, we just need to 
expand the concept to allow nodes to do it remotely).  Right now, based 
on what I have to do, I think threads are the right choice until I can 
look at a better implementation (I hear Twisted is good at what I 
ultimately want to do, but I don't know a thing about it).
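
As a concrete example of the kind of shared state I mean, think of 
something like a lock-protected registry that every thread consults 
before starting work (the class and names are invented for 
illustration):

import threading

class BuildState(object):
    # Central, lock-protected record of which architectures are
    # already being built, so no two threads duplicate work.
    def __init__(self):
        self._lock = threading.Lock()
        self._claimed = {}

    def claim(self, arch):
        # Returns True if the caller may build 'arch', False if
        # another thread has already claimed it.
        self._lock.acquire()
        try:
            if arch in self._claimed:
                return False
            self._claimed[arch] = True
            return True
        finally:
            self._lock.release()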

Alright, if you read all that, thanks, and thanks for your input.  
Whether or not I've agreed with anything, a few colleagues and I 
definitely discuss each idea as it's passed to us.  For that, thanks to 
the Python list!

-carl


-- 

Carl J. Van Arsdall
cvanarsdall at mvista.com
Build and Release
MontaVista Software



