[SciPy-dev] cow: 'Connection reset by peer' timeout problem?

eric jones eric at enthought.com
Sat Feb 1 05:32:39 EST 2003


Hey Simon,

I don't remember ever seeing this, but it has been about a year since I
used cow heavily.  At the time, the jobs I ran lasted about 1 minute
each, so I didn't run into the 4-minute timeout you are seeing.
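
If keeping each task short is an option, one workaround might be to feed
the index list to the cluster in smaller batches, so that no single call
runs for several minutes. A rough sketch only: `chunks` is a made-up
helper (not part of cow), the batch size of 500 is arbitrary, and the
loop_code call in the comment just mirrors the one from your message.

```python
def chunks(indices, size):
    """Yield consecutive slices of `indices`, each at most `size` long."""
    for start in range(0, len(indices), size):
        yield indices[start:start + size]

# Hypothetical usage with the cluster object from the original post:
#   for batch in chunks(list(range(35000)), 500):
#       partial = cowname.loop_code(code, loop_var='x',
#                                   inputs={'x': batch}, returns=['calc'])
#       ... collect partial results here ...

# Stand-alone demonstration of the batching itself.
batches = list(chunks(list(range(7000)), 500))
```

Whether this helps depends on the connection being dropped per call
rather than per session, which I have not verified.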

I can't think of a technical reason why 4 minutes is a magic number from
the Python code standpoint.  There is a timeout value I believe, but it
wouldn't cause the error you are seeing.

Could it have something to do with ssh timing out and disconnecting?
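
If something between the master and the slaves is dropping idle
connections, turning on TCP keep-alive for the socket might be worth a
try. This is only a sketch using the standard socket module:
`enable_keepalive` is a made-up helper name, not something in cow, and I
have not checked where cow creates its sockets or whether this would
actually cure the 10054 error. Note also that the probe interval is set
by the operating system (commonly around two hours by default), so it
would probably need tuning there to matter at the 4 minute mark.

```python
import socket

def enable_keepalive(sock):
    """Ask the OS to send TCP keep-alive probes on this socket."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return sock

# Demonstration on a fresh, unconnected TCP socket.
s = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
keepalive_on = s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
s.close()
```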

eric

----------------------------------------------
eric jones                    515 Congress Ave
www.enthought.com             Suite 1614
512 536-1057                  Austin, Tx 78701 


> -----Original Message-----
> From: scipy-dev-admin at scipy.net [mailto:scipy-dev-admin at scipy.net] On
> Behalf Of Simon Saubern
> Sent: Wednesday, January 29, 2003 7:34 PM
> To: scipy-dev at scipy.net
> Subject: [SciPy-dev] cow: 'Connection reset by peer' timeout problem?
> 
> I'm re-posting this message here as I didn't get any replies on the
> scipy-users list:
> 
> I've been using cow to try out some distributed calculations.
> Everything works fine if I use a subset of my data, but when I use
> the full set I get "error: (10054, 'Connection reset by peer')"
> messages on the master unit (see below for full output).
> 
> I can operate on larger and larger subsets until I get to the point
> where if the slaves take more than about 4 minutes to complete a
> task, the above error appears at the master.
> 
> That is, connections are established (confirmed using netstat),
> processing occurs on the slaves and keeps going, but the master times
> out after about 4min.
> 
> Is this a 'keep alive' problem? If so, how can I extend the time out
> period?
> 
> The setup:
> 10 x slave + master, all Win2K SP-3
> Python 2.2.2
> latest scipy binary for Win
> 
> cowname['data']=data # a list 35000 long.
> lendata=range(7000) # just use a subset
> bessy=None
> while not bessy:
>      bessy=cowname.loop_code('do something;do something;calc=function(data[x])',
>                              loop_var='x',inputs={'x':lendata},returns=['calc'])
>      bessy gets processed here
> 
> 'data' is quite large and takes a while to transfer over the network.
> But by doing it once and looping over the index, I minimize network
> movements. The 'python' process on each slave uses about 85MB.
> 
> Increasing 'lendata' eventually causes the 'Connection reset by peer'
> message to appear.
> 
> Any pointers welcomed.
> 
> ---------------error output
> 
> 
>    File "C:\PROGRA~1\Python22\Lib\site-packages\scipy\cow\cow.py",
> line 823, in loop_code
>      return self.loop_send_recv(package,loop_data,loop_var)
>    File "C:\PROGRA~1\Python22\Lib\site-packages\scipy\cow\cow.py",
> line 847, in loop_send_recv
>      results = self._send_recv(package,addendums)
>    File "C:\PROGRA~1\Python22\Lib\site-packages\scipy\cow\cow.py",
> line 345, in _send_recv
>      self.last_results = self._recv()
>    File "C:\PROGRA~1\Python22\Lib\site-packages\scipy\cow\cow.py",
> line 303, in _recv
>      results.append(worker.recv())
>    File
> "C:\PROGRA~1\Python22\Lib\site-packages\scipy\cow\sync_cluster.py",
> line 404, in recv
>      package = self.channel.read()
>    File
> "C:\PROGRA~1\Python22\Lib\site-packages\scipy\cow\sync_cluster.py",
> line 164, in read
>      x = self.rfile.read()
>    File "c:\Program Files\Python22\lib\socket.py", line 228, in read
>      new = self._sock.recv(k)
> error: (10054, 'Connection reset by peer')
> >>>
> ------------
> --
> 
> Cheers,
> 
> Simon
> _______________________________________________
> Scipy-dev mailing list
> Scipy-dev at scipy.net
> http://www.scipy.net/mailman/listinfo/scipy-dev





