[IPython-dev] IPCluster failing when starting more than a few engines.

Drain, Theodore R (392P) theodore.r.drain at jpl.nasa.gov
Wed Mar 5 14:06:00 EST 2014


I'm using the IPython 2.0.0 dev branch, synced on 2014-02-24 11:44:52, and running ipcluster start on a set of machines without a shared file system, using SSHEngineSetLauncher.  I have 6 machines with between 4 and 12 cores each.  With 2 engines per machine, everything works fine.  With 3 or more per machine, some engines fail to connect.

Some failures are failures to connect that look like this:
10:43:30.195 [IPEngineApp] Registering with controller at tcp://x.x.x.x:59987
10:43:35.196 [IPEngineApp] CRITICAL | Registration timed out after 5.0 seconds

Other failures are weirder:
10:43:30.184 [IPEngineApp] Registering with controller at tcp://x.x.x.x:59987
10:43:30.249 [IPEngineApp] Starting to monitor the heartbeat signal from the hub every 3010 ms.
10:43:30.251 [IPEngineApp] Using existing profile dir: u'.ipython/profile_dev'
10:43:30.252 [IPEngineApp] Completed registration with id 6
10:43:36.273 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (1 time(s) in a row).
10:43:39.281 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (2 time(s) in a row).
10:43:42.293 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (3 time(s) in a row).
10:44:36.469 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (1 time(s) in a row).
10:44:42.489 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (1 time(s) in a row).

In the second case, if I connect a client to the controller, there is no engine with ID 6 available, even though the engine seems to be getting some heartbeats from the hub.
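
To tally the two failure modes across many engine logs, something like the following could be used. It is only a sketch: the regular expressions are taken from the log excerpts above and may not match other IPython versions, and summarize_engine_log is a hypothetical helper name, not part of IPython.

```python
import re

# Patterns copied from the log excerpts above (assumption: the wording
# is the same in every engine log being scanned).
REGISTERED = re.compile(r"Completed registration with id (\d+)")
TIMED_OUT = re.compile(r"Registration timed out after ([\d.]+) seconds")
MISSED = re.compile(
    r"No heartbeat in the last (\d+) ms \((\d+) time\(s\) in a row\)"
)


def summarize_engine_log(lines):
    """Summarize one engine's log: registration result and heartbeat warnings."""
    summary = {
        "engine_id": None,                # id from "Completed registration"
        "registration_timed_out": False,  # first failure mode
        "heartbeat_warnings": 0,          # second failure mode
        "max_consecutive_misses": 0,      # largest "N time(s) in a row"
    }
    for line in lines:
        match = REGISTERED.search(line)
        if match:
            summary["engine_id"] = int(match.group(1))
            continue
        if TIMED_OUT.search(line):
            summary["registration_timed_out"] = True
            continue
        match = MISSED.search(line)
        if match:
            summary["heartbeat_warnings"] += 1
            summary["max_consecutive_misses"] = max(
                summary["max_consecutive_misses"], int(match.group(2))
            )
    return summary


if __name__ == "__main__":
    import sys

    # Usage: python summarize.py engine1.log engine2.log ...
    for path in sys.argv[1:]:
        with open(path) as handle:
            print(path, summarize_engine_log(handle))
```

An engine with registration_timed_out set never registered at all, while one with an engine_id and a nonzero heartbeat_warnings count registered but then missed heartbeats.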

I've tried adding lines like these to my config file, but they don't help:
c.IPClusterStart.delay = 0.5
c.SSHEngineSetLauncher.delay = 0.5
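
For context, here is a minimal sketch of how those settings might sit in an ipcluster_config.py for an SSH-launched cluster. The host names are hypothetical placeholders, and the count of 3 engines per host is only the level at which failures start, not a copy of the actual file.

```python
# ipcluster_config.py (sketch only). IPython supplies get_config() when
# it loads this file, so it is not meant to be run on its own.
c = get_config()

# Launch engines over SSH instead of locally.
c.IPClusterEngines.engine_launcher_class = 'SSHEngineSetLauncher'

# Hypothetical host names; each value is the number of engines on that host.
c.SSHEngineSetLauncher.engines = {
    'node1.example.com': 3,
    'node2.example.com': 3,
}

# The two delay settings tried above.
c.IPClusterStart.delay = 0.5
c.SSHEngineSetLauncher.delay = 0.5
```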

The number of failures increases with the number of engines being started on each machine.  Trying to start 12 engines on a single machine is almost a complete failure.

Any thoughts on what I should be doing differently?

Thanks,
Ted


