[SciPy-user] cow: 'Connection reset by peer' timeout problem?
Simon Saubern
s.saubern at chem.csiro.au
Tue Jan 14 00:11:24 EST 2003
I've been using cow to try out some distributed calculations.
Everything works fine if I use a subset of my data, but when I use
the full set I get "error: (10054, 'Connection reset by peer')"
messages on the master unit (see below for full output).
I can operate on larger and larger subsets, but once the slaves take
more than about 5 minutes to complete a task, the above error appears
at the master. I can't tell if this is a problem with the master or
one of the slaves.
The setup:
10 x slave + master, all Win2K SP-3
Python 2.2.2
latest scipy binary for Win
cowname['data'] = data    # a list 35000 long
lendata = range(7000)     # just use a subset
bessy = None
while not bessy:
    bessy = cowname.loop_code(
        'do something;do something;calc=function(data[x])',
        loop_var='x', inputs={'x': lendata}, returns=['calc'])
# bessy gets processed here
'data' is quite large and takes a while to transfer over the network.
But by doing it once and looping over the index, I minimize network
movements. The 'python' process on each slave uses about 85MB.
Increasing 'lendata' eventually causes the 'Connection reset by peer'
message to appear, and being a Win2k setup, I have to walk around to
all the slaves and manually restart the slave process.
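One workaround that might be worth trying is to feed the index list to the cluster in smaller batches, so that no single round trip gets near the 5 minute mark. The following is only a minimal sketch in modern Python: `run_batch` is a hypothetical stand-in for the `cowname.loop_code(...)` call above, and the batch size is a guess that would need tuning. It does not explain the reset; it only tries to sidestep it.

```python
def batches(indices, size):
    """Yield consecutive slices of `indices`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    indices = list(indices)
    for start in range(0, len(indices), size):
        yield indices[start:start + size]


def run_in_batches(indices, size, run_batch):
    """Call `run_batch` once per batch and concatenate the results.

    `run_batch` is hypothetical: it stands in for whatever sends one
    batch of indices to the cluster and returns a list of results.
    """
    results = []
    for batch in batches(indices, size):
        results.extend(run_batch(batch))
    return results


if __name__ == "__main__":
    # Stand-in for the cluster call: square each index locally.
    print(run_in_batches(range(10), 4, lambda xs: [x * x for x in xs]))
```

Because 'data' already lives on each slave, only the short index lists would cross the network on each call.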
I can't find anything in the scipy.cow directory that might be
associated with a timing control.
Can anyone help?
Would it be better to take the loop_code code and distribute it in
advance as a module on each slave, then use loop_apply?
Is it a network issue? The 10 slaves are on 3 different routers (I
think). I've tried shuffling the order of the slaves around when the
cluster is started according to my perceived notion of network
responsiveness, but to no avail.
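If an idle connection is being dropped somewhere between master and slave while a long task runs (an unconfirmed assumption), TCP keepalive is one standard thing to try. The sketch below, in modern Python, only shows the socket option itself. The helper name `make_keepalive_socket` is hypothetical, and where such a socket would be created inside cow is not shown.

```python
import socket


def make_keepalive_socket():
    """Return an unconnected TCP socket with SO_KEEPALIVE switched on."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Ask the OS to probe the peer periodically while the connection is idle.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return sock


if __name__ == "__main__":
    s = make_keepalive_socket()
    # A non-zero value means the option is enabled.
    print(bool(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)))
    s.close()
```

The default interval before the first probe is set by the operating system and is often on the order of hours, so it would probably need tuning at the OS level before it could matter for a 5 minute task.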
When the cluster is started, all the slaves are listening on the same
port. Is this likely to cause conflicts?
Any pointers welcomed.
---------------error output
  File "C:\PROGRA~1\Python22\Lib\site-packages\scipy\cow\cow.py", line 823, in loop_code
    return self.loop_send_recv(package,loop_data,loop_var)
  File "C:\PROGRA~1\Python22\Lib\site-packages\scipy\cow\cow.py", line 847, in loop_send_recv
    results = self._send_recv(package,addendums)
  File "C:\PROGRA~1\Python22\Lib\site-packages\scipy\cow\cow.py", line 345, in _send_recv
    self.last_results = self._recv()
  File "C:\PROGRA~1\Python22\Lib\site-packages\scipy\cow\cow.py", line 303, in _recv
    results.append(worker.recv())
  File "C:\PROGRA~1\Python22\Lib\site-packages\scipy\cow\sync_cluster.py", line 404, in recv
    package = self.channel.read()
  File "C:\PROGRA~1\Python22\Lib\site-packages\scipy\cow\sync_cluster.py", line 164, in read
    x = self.rfile.read()
  File "c:\Program Files\Python22\lib\socket.py", line 228, in read
    new = self._sock.recv(k)
error: (10054, 'Connection reset by peer')
>>>
------------
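The traceback shows the reset surfacing in the master's loop over `worker.recv()`, but not which slave's connection was reset. A minimal sketch of one way to narrow that down, in modern Python: catch the reset per worker and record the name. The `collect` function and `FakeWorker` class are hypothetical stand-ins, not cow's real objects, and `ConnectionResetError` is the modern exception that should correspond to WinSock error 10054.

```python
def collect(workers):
    """Call recv() on each worker; return (results, failed_names).

    `workers` maps a name to any object with a recv() method. These are
    hypothetical stand-ins, not cow's real worker objects.
    """
    results, failed = {}, []
    for name, worker in workers.items():
        try:
            results[name] = worker.recv()
        except ConnectionResetError:
            # Record which connection was reset instead of aborting the loop.
            failed.append(name)
    return results, failed


class FakeWorker:
    """Test double: returns a value, or raises a reset if told to."""

    def __init__(self, value=None, reset=False):
        self.value = value
        self.reset = reset

    def recv(self):
        if self.reset:
            raise ConnectionResetError(10054, "Connection reset by peer")
        return self.value


if __name__ == "__main__":
    ws = {"slave1": FakeWorker(1), "slave2": FakeWorker(reset=True)}
    print(collect(ws))
```

If the same slave shows up in the failed list on every run, that would point at that machine or its network path rather than at the master.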
--
Cheers,
Simon