[SciPy-user] cow: 'Connection reset by peer' timeout problem?

Simon Saubern s.saubern at chem.csiro.au
Tue Jan 14 00:11:24 EST 2003


I've been using cow to try out some distributed calculations. 
Everything works fine if I use a subset of my data, but when I use 
the full set I get "error: (10054, 'Connection reset by peer')" 
messages on the master unit (see below for full output).

I can operate on larger and larger subsets until the slaves take more 
than about 5 minutes to complete a task; at that point the above 
error appears at the master. I can't tell if this is a problem with 
the master or one of the slaves.
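
To pin down the threshold more precisely than "about 5 minutes", each call could be wrapped in a timer. The following is only a sketch: `timed_call` is a hypothetical helper, not part of cow, and the built-in `sum` stands in for the real `loop_code` call.

```python
import time

def timed_call(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds).

    If func raises, report how long it ran before failing, then re-raise.
    """
    start = time.time()
    try:
        result = func(*args, **kwargs)
    except Exception:
        print("call failed after %.1f s" % (time.time() - start))
        raise
    return result, time.time() - start

# Stand-in for the real cluster call (assumption: any callable works here).
result, elapsed = timed_call(sum, [1, 2, 3])
```

Recording the elapsed time of the last successful run and of the first failing run would show whether the failure really tracks a fixed interval.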

The setup:
10 x slave + master, all Win2K SP-3
Python 2.2.2
latest scipy binary for Win

cowname['data'] = data  # a list 35000 long
lendata = range(7000)   # just use a subset
bessy = None
while not bessy:
    bessy = cowname.loop_code(
        'do something;do something;calc=function(data[x])',
        loop_var='x', inputs={'x': lendata}, returns=['calc'])
    # bessy gets processed here

'data' is quite large and takes a while to transfer over the network. 
But by doing it once and looping over the index, I minimize network 
movements. The 'python' process on each slave uses about 85MB.

Increasing 'lendata' eventually causes the 'Connection reset by peer' 
message to appear, and being a Win2k setup, I have to walk around to 
all the slaves and manually restart the slave process.
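
One possible workaround, sketched here as an assumption rather than anything taken from cow itself, is to split the index list into smaller batches so that no single `loop_code` call keeps the slaves busy for several minutes. The helper name `chunk_indices` and the batch size of 7000 are hypothetical.

```python
def chunk_indices(n_items, batch_size):
    """Split range(n_items) into consecutive lists of at most batch_size indices."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batches = []
    for start in range(0, n_items, batch_size):
        stop = min(start + batch_size, n_items)
        batches.append(list(range(start, stop)))
    return batches

# 35000 items in batches of 7000 (the subset size that is known to work).
batches = chunk_indices(35000, 7000)
```

Each batch could then be passed as `inputs={'x': batch}` in a separate `loop_code` call, with the results concatenated on the master. Whether this avoids the reset has not been tested.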

I can't find anything in the scipy.cow directory that might be 
associated with a timing control.
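
For reference, the standard `socket` module does expose a timing-related option at the socket level. The sketch below is not cow code. It only shows how TCP keepalive can be enabled on a plain socket, on the unconfirmed assumption that an idle connection is being dropped somewhere between master and slave while the slave is busy computing. The function name `enable_keepalive` is hypothetical.

```python
import socket

def enable_keepalive(sock):
    """Turn on TCP keepalive probes for an existing socket and return it."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return sock

# Demonstrate on an unconnected socket; no network traffic is generated.
sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```

The interval between keepalive probes is controlled by the operating system and may by default be much longer than five minutes, so enabling the option alone might not change the behaviour.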

Can anyone help?

Would it be better to take the loop_code code and distribute it in 
advance as a module on each slave, then use loop_apply?

Is it a network issue? The 10 slaves are on 3 different routers (I 
think). I've tried shuffling the order of the slaves around when the 
cluster is started according to my perceived notion of network 
responsiveness, but to no avail.

When the cluster is started, all the slaves are listening on the same 
port. Is this likely to cause conflicts?


Any pointers welcomed.

---------------error output


   File "C:\PROGRA~1\Python22\Lib\site-packages\scipy\cow\cow.py", line 823, in loop_code
     return self.loop_send_recv(package,loop_data,loop_var)
   File "C:\PROGRA~1\Python22\Lib\site-packages\scipy\cow\cow.py", line 847, in loop_send_recv
     results = self._send_recv(package,addendums)
   File "C:\PROGRA~1\Python22\Lib\site-packages\scipy\cow\cow.py", line 345, in _send_recv
     self.last_results = self._recv()
   File "C:\PROGRA~1\Python22\Lib\site-packages\scipy\cow\cow.py", line 303, in _recv
     results.append(worker.recv())
   File "C:\PROGRA~1\Python22\Lib\site-packages\scipy\cow\sync_cluster.py", line 404, in recv
     package = self.channel.read()
   File "C:\PROGRA~1\Python22\Lib\site-packages\scipy\cow\sync_cluster.py", line 164, in read
     x = self.rfile.read()
   File "c:\Program Files\Python22\lib\socket.py", line 228, in read
     new = self._sock.recv(k)
error: (10054, 'Connection reset by peer')
>>>
------------
-- 

Cheers,

Simon


