Multiprocessing bug, is my editor (SciTE) impeding my progress?

John Ladasky ladasky at my-deja.com
Fri Dec 9 04:10:48 EST 2011


Thanks once again to everyone for their recommendations; here's a
follow-up.  In summary, I'm still baffled.

I tried ipython, as Marco Nawijn suggested.  If there is some special
setting which returns control to the interpreter when a subprocess
crashes, I haven't found it yet.  Yes, I've been RTFM-ing.  As with
SciTE, everything just hangs, so I went back to SciTE for now.

And I'm doing what Terry Reedy suggested -- I am editing
multiprocessing.Pool in place.  I made a backup, of course.  I am
running SciTE under sudo so that I can edit the system files without
having to chase path and import-statement problems.

So far, I have found no evidence that a string is required anywhere in
the code.  What's the task variable?  It's a deeply-nested tuple
containing no strings -- not even in the WORKING code.  This makes me
wonder whether that traceback is truly complete.

I wrote a routine to display the contents of task, immediately before
the offending put().  Here's a breakdown.


In the WORKING version:

task:  <type 'tuple'>
   <type 'int'> 	0
   <type 'int'> 	0
   <type 'function'> 	<function mapstar at 0xa7ec5a4>
   <type 'tuple'> 	(see below)
   <type 'dict'> 	{}

task[3]:  <type 'tuple'>
   <type 'tuple'> 	(see below)

task[3][0]:  <type 'tuple'>
   <type 'function'> 	<function mean_square_error at 0xa7454fc>
   <type 'tuple'> 	(see below)

task[3][0][1]:  <type 'tuple'>
   <class 'neural.SplitData'> 	(see below)

task[3][0][1][0]:  <class 'neural.SplitData'>
   net		<class 'neural.CascadeArray'>	shape=(2, 3)
   inp 		<type 'numpy.ndarray'> 		shape=(307, 2)
   tgt 		<type 'numpy.ndarray'> 		shape=(307, 2)


By watching this run, I've learned that task[0] and task[1] are
counters for groups of subprocesses and for individual subprocesses,
respectively.  Suppose we have four subprocesses.  When everything is
working, task[:2] == (0, 0) for the first call, then (0, 1), (0, 2),
(0, 3); then (1, 0), (1, 1), (1, 2), and so on.

task[2] points to mapstar, a one-line helper function defined in
multiprocessing.pool that I never modify.  task[4] is an empty
dictionary.  So it looks like everything I provide appears in task[3].

task[3] is just a tuple inside a tuple (which is weird).  task[3][0]
contains the function to be called (in this case, my function,
mean_square_error), and then a tuple containing all of the arguments
to be passed to that function.  The docs say that the function in
question must be defined at the top level of the code so that it's
importable (it is), and that all the arguments to be sent to that
function will be wrapped up in a single tuple -- that is presumably
task[3][0][1].

But that presumption is wrong.  I wrote a function which creates a
collections.namedtuple object of the type SplitData, which contains
the function's arguments.  It's not task[3][0][1] itself, but the
tuple INSIDE it, namely task[3][0][1][0].  More weirdness.  You don't
need to worry about task[3][0][1][0], other than to note that these
are my neural network objects: they are intact, they are the classes I
expect, and they are named as I expect -- and there are NO STRING
objects.
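To convince myself I had the nesting right, I reproduced it in plain
Python.  This is a sketch: mapstar is quoted from memory of pool.py
(it may differ slightly between versions), and SplitData and error()
are trivial stand-ins for my real objects.

```python
from collections import namedtuple

# Trivial stand-ins, just enough to exercise the nesting.
SplitData = namedtuple('SplitData', ['func', 'inp', 'tgt'])

def error(b):
    return b.func(b.inp) - b.tgt

def mapstar(args):           # pool.py's one-liner, from memory;
    return list(map(*args))  # Python 2's map() already returns a list

# A real Pool job couldn't use a lambda (it wouldn't pickle), but this
# sketch never leaves the current process.
item = SplitData(func=lambda x: x + 1, inp=1, tgt=0)
batch = (item,)                            # this is task[3][0][1]
task = (0, 0, mapstar, ((error, batch),), {})

# A pool worker unpacks a task like this and calls func(*args, **kwds):
job, i, func, args, kwds = task
result = func(*args, **kwds)               # [error(item)] == [2]
```

So the extra tuple layers come from map() batching: task[3][0][1] is a
chunk of the iterable, and each element of the chunk (here, one
SplitData) is passed to the function as a single argument.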


Now, are there any differences between the working version of my code
and the buggy version?  Other than a few trivial name changes that I
made deliberately, the structure of task looks the SAME...


task:  <type 'tuple'>
   <type 'int'> 	0
   <type 'int'> 	0
   <type 'function'> 	<function mapstar at 0x88e0a04>
   <type 'tuple'> 	(see below)
   <type 'dict'> 	{}

task[3]:  <type 'tuple'>
   <type 'tuple'> 	(see below)

task[3][0]:  <type 'tuple'>
   <type 'function'> 	<function error at 0x88a5fb4>
   <type 'tuple'> 	(see below)

task[3][0][1]:  <type 'tuple'>
   <class '__main__.SplitData'> 	(see below)

task[3][0][1][0]:  <class '__main__.SplitData'>
   func 	<class 'cascade.Cascade'>	shape=(2, 3)
   inp 		<type 'numpy.ndarray'> 		shape=(307, 2)
   tgt 		<type 'numpy.ndarray'> 		shape=(307, 2)

Again, all the action is in task[3].  I was worried about the empty
dictionary in task[4] at first, but I've seen this {} in the working
program, too.  I'm not sure what it does.

For completeness, here's mean_square_error() from the working program:

def mean_square_error(b):
    out = array([b.net(i) for i in b.inp])
    return sum((out-b.tgt)**2)

And, here's error() from the buggy program.

def error(b):
    out = array([b.func(i) for i in b.inp])
    return sum((out-b.tgt)**2)

I renamed mean_square_error() because I realized that the mean-square
error is the only kind of error I'll ever be computing.  I also
renamed "net" to "func" in SplitData, reflecting the more general
nature of the Cascade class I'm developing, and I mirror that name
change here.  Other than that, I trust you can see that error() and
mean_square_error() are identical.
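Concretely, the two bodies behave identically on a small stand-in
payload.  This sketch substitutes plain lists and a generator sum for
the numpy arrays, so the arithmetic is spelled out with zip() rather
than broadcasting:

```python
from collections import namedtuple

# Plain-Python analogues of the two functions; numpy's array() and
# elementwise arithmetic are replaced by list comprehensions and zip().
def mean_square_error(b):
    out = [b.net(i) for i in b.inp]
    return sum((o - t) ** 2 for o, t in zip(out, b.tgt))

def error(b):
    out = [b.func(i) for i in b.inp]
    return sum((o - t) ** 2 for o, t in zip(out, b.tgt))

Old = namedtuple('Old', ['net', 'inp', 'tgt'])  # working program's layout
New = namedtuple('New', ['func', 'inp', 'tgt'])  # buggy program's layout

def f(x):
    return x + 1

old = Old(net=f, inp=[0, 1], tgt=[1, 1])
new = New(func=f, inp=[0, 1], tgt=[1, 1])
# mean_square_error(old) == error(new) == 1
```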

I can call mean_square_error() directly with a SplitData tuple, and it
works.  I can call error() directly with a SplitData tuple in the
broken program, and it ALSO works.  I'm only having problems when I
try to submit the job through Pool.  I tried putting a print trap in
error(); when I go through Pool, error() never gets called.
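One difference between a direct call and a Pool submission is that
Pool has to pickle the whole task onto a queue before a worker ever
sees it; a direct call involves no pickling at all.  Here's a minimal
sketch of that round trip, with a hypothetical stand-in for my
SplitData payload:

```python
import pickle
from collections import namedtuple

# Hypothetical stand-in for the real SplitData payload.
SplitData = namedtuple('SplitData', ['func', 'inp', 'tgt'])

def double(x):       # toy stand-in; defined at top level, because
    return 2 * x     # functions pickle by reference to their module

item = SplitData(func=double, inp=(1, 2), tgt=(2, 4))
blob = pickle.dumps(item)        # what the task queue does on the way out
restored = pickle.loads(blob)    # what the worker does on the way in
```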

I suppose that the logical next step is to compare the two Pool
instances... onward...  :^P


