[IPython-dev] IPCluster failing when starting more than a few engines.

Thu Mar 6 15:37:40 EST 2014

Here's some more information.  Hopefully someone can help, as this problem basically makes IPython parallel unusable.

I had our SAs disable the firewall, and then things work fine: all the engines start up and connect.  With the firewall on, I have to add "--enginessh=host" to the controller_args input to enable SSH port forwarding for the connections.  When I do that and try to launch a single engine on each of 30 separate computers (with a shared file system), I can only connect 5 of them, even though the ipcluster log reports that they all connected fine.

I'm wondering if there is some timing issue with running that many SSH port-forward calls (it looks like 3 ports are set up per engine).
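
One way to narrow down the port-forwarding question is to check, from an engine host, whether the forwarded ports accept TCP connections at all. A minimal sketch, assuming Python 3 and that the host and port numbers are known from the tunnel setup or the connection file; probe_ports is a hypothetical helper, not part of IPython:

```python
import socket


def probe_ports(host, ports, timeout=2.0):
    """Return {port: bool} saying whether a TCP connect to host:port succeeds."""
    results = {}
    for port in ports:
        try:
            # create_connection performs the full TCP handshake or raises.
            conn = socket.create_connection((host, port), timeout=timeout)
        except OSError:
            # Covers refused connections and timeouts alike.
            results[port] = False
        else:
            conn.close()
            results[port] = True
    return results
```

For example, probe_ports("127.0.0.1", [59987]) would test a single locally forwarded port (59987 is the registration port from the log in the quoted message). A successful connect to a forwarded port only shows that the local ssh process is listening, not that the remote end is reachable, so this can rule tunnels out but cannot fully confirm them.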

Any thoughts on what I could try to fix this?

Ted

________________________________________
From: ipython-dev-bounces at scipy.org [ipython-dev-bounces at scipy.org] on behalf of Drain, Theodore R (392P) [theodore.r.drain at jpl.nasa.gov]
Sent: Wednesday, March 05, 2014 11:06 AM
To: IPython developers list
Subject: [IPython-dev] IPCluster failing when starting more than a few engines.

Using the IPython 2.0.0 dev branch synced on 2014-02-24 11:44:52.  I'm running ipcluster start on a set of machines without a shared file system, using SSHEngineSetLauncher.  I have 6 machines with between 4 and 12 cores each.  If I run ipcluster with 2 engines per machine, it works fine.  If I increase it to 3 or more, I start getting engines that fail to connect.

Some failures are plain failures to connect that look like this:
10:43:30.195 [IPEngineApp] Registering with controller at tcp://x.x.x.x:59987
10:43:35.196 [IPEngineApp] CRITICAL | Registration timed out after 5.0 seconds

Other failures are weirder:
10:43:30.184 [IPEngineApp] Registering with controller at tcp://x.x.x.x:59987
10:43:30.249 [IPEngineApp] Starting to monitor the heartbeat signal from the hub every 3010 ms.
10:43:30.251 [IPEngineApp] Using existing profile dir: u'.ipython/profile_dev'
10:43:30.252 [IPEngineApp] Completed registration with id 6
10:43:36.273 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (1 time(s) in a row).
10:43:39.281 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (2 time(s) in a row).
10:43:42.293 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (3 time(s) in a row).
10:44:36.469 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (1 time(s) in a row).
10:44:42.489 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (1 time(s) in a row).

In the second case, if I connect a client to the controller, there is no engine with ID 6 available, even though the engine seems to be getting some heartbeats from the hub.
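
With many engines, the two failure modes above can be told apart mechanically instead of by reading each log. A minimal sketch, assuming each engine's log is available as text and that the message wording matches the excerpts above; summarize_engine_log is a hypothetical helper, not part of IPython:

```python
import re

REGISTERED = re.compile(r"Completed registration with id (\d+)")
TIMED_OUT = re.compile(r"Registration timed out")
MISSED_HB = re.compile(r"No heartbeat in the last \d+ ms")


def summarize_engine_log(text):
    """Summarize one engine log: registered id, timeout flag, heartbeat warnings."""
    summary = {
        "engine_id": None,                # id assigned by the controller, if any
        "registration_timed_out": False,  # first failure mode
        "missed_heartbeat_warnings": 0,   # second failure mode
    }
    for line in text.splitlines():
        match = REGISTERED.search(line)
        if match:
            summary["engine_id"] = int(match.group(1))
        elif TIMED_OUT.search(line):
            summary["registration_timed_out"] = True
        elif MISSED_HB.search(line):
            summary["missed_heartbeat_warnings"] += 1
    return summary
```

An engine that reports an id and also accumulates heartbeat warnings falls into the second category, while one with no id and the timeout flag set falls into the first.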

I've tried adding lines like these to my config file, and they don't help:
c.IPClusterStart.delay = 0.5
c.SSHEngineSetLauncher.delay = 0.5
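
Two other settings relate directly to the timeouts visible in the logs above: the engine's registration timeout (5.0 seconds in the log) and the hub's heartbeat period (the 3010 ms in the log appears consistent with a 3000 ms period). A sketch, assuming the option names EngineFactory.timeout and HeartMonitor.period are available in this dev branch as they are in the IPython 2.x config files; the values are illustrative and have not been tested against this problem:

```python
# ipengine_config.py (assumed option name; value is an illustrative guess)
c = get_config()
c.EngineFactory.timeout = 30.0   # seconds to wait for the controller's registration reply

# ipcontroller_config.py (assumed option name; value is an illustrative guess)
c = get_config()
c.HeartMonitor.period = 5000     # milliseconds between heartbeats sent by the hub
```

Raising these would only mask a slow startup, not fix a broken tunnel, so a change in behavior would mainly help show whether the failures are timing related.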

The number of failures increases with the number of engines being started on each machine.  Trying to start 12 engines on a single machine is almost a complete failure.

Any thoughts on what I should be doing differently?

Thanks,
Ted