[IPython-dev] IPCluster failing when starting more than a few engines.

Burkhard Ritter burkhard at ualberta.ca
Fri Mar 7 21:32:50 EST 2014


If I remember correctly, I also had difficulties bringing all engines
up reliably, and it seemed to be due to timing issues with ipcluster.
In the end I just wrote my own scripts to start up my engines. My script
does something like this:

```
# Start N engines, pausing between launches to stagger them.
for ((i=0;i<$N;i++)); do
    nohup nice -n19 ipengine --profile=my_profile \
        --ssh=controller_node --log-to-file &
    sleep 15
done
```

Most of the time I only have two nodes, so I just run these scripts by
hand, but it shouldn't be difficult to extend the script to start
all engines on a number of nodes.
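
A minimal sketch of such an extension, assuming passwordless SSH to each
node and ipengine on the PATH there. The names NODES, ENGINES_PER_NODE,
DELAY and DRY_RUN, and the host names node01/node02, are made up for
illustration; only the ipengine command line comes from the loop above.

```shell
#!/bin/sh
# Sketch only: NODES, ENGINES_PER_NODE, DELAY, and DRY_RUN are hypothetical
# names; the ipengine command line is the one from the loop above.
NODES=${NODES:-"node01 node02"}          # placeholder host names
ENGINES_PER_NODE=${ENGINES_PER_NODE:-4}
DELAY=${DELAY:-15}                       # seconds between launches
DRY_RUN=${DRY_RUN:-1}                    # 1 = only print the ssh commands

ENGINE_CMD='nohup nice -n19 ipengine --profile=my_profile --ssh=controller_node --log-to-file'

start_engines() {
    for node in $NODES; do
        i=0
        while [ "$i" -lt "$ENGINES_PER_NODE" ]; do
            # Background the remote engine and detach its output so that
            # ssh returns immediately.
            remote="$ENGINE_CMD >/dev/null 2>&1 &"
            if [ "$DRY_RUN" = 1 ]; then
                echo "ssh -n $node '$remote'"
            else
                ssh -n "$node" "$remote"
                sleep "$DELAY"
            fi
            i=$((i + 1))
        done
    done
}

start_engines
```

With DRY_RUN=1 (the default in this sketch) it only prints the ssh
commands it would run; setting DRY_RUN=0 launches the engines one at a
time with the same pause between launches as the single-node loop.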

Burkhard

On Thu, Mar 6, 2014 at 3:00 PM, Drain, Theodore R (392P)
<theodore.r.drain at jpl.nasa.gov> wrote:
> Sorry to keep spamming the list but...
>
> It appears the problem I'm having is purely timing (or timeout) based.  If I run ipengine by hand after the controller comes up, I can connect more than 5 engines (so it's not a resource problem).  I then tried hacking hub.py which has a line like this:
>
>         self.registration_timeout = max(5000, 2*self.heartmonitor.period)
>
> If I change that to 60000 (60 seconds), I can get a few more engines to connect but it's basically guesswork as to how many make it up.  And there isn't a config option input for that timeout so that isn't much of a solution even if I could come up w/ a time that worked.
>
> At this point I'm thinking I'm going to have to write my own version of "ipcluster" that runs the controller, sets up the port forwards, and spawns the engines.  Perhaps if I have more control over how that happens, I can get a cluster that will reliably start up.
>
> ________________________________________
> From: ipython-dev-bounces at scipy.org [ipython-dev-bounces at scipy.org] on behalf of Drain, Theodore R (392P) [theodore.r.drain at jpl.nasa.gov]
> Sent: Thursday, March 06, 2014 12:51 PM
> To: IPython developers list
> Subject: Re: [IPython-dev] IPCluster failing when starting more than a few engines.
>
> One further bit of information:  I'm hitting a hard limit of 5 engines connecting using SSH port forwarding.  I can run any number of engines locally and it works fine.  Could there be some kind of ZMQ limit or SSH limit?  The host machine does spawn a huge number of processes  - I count 33 processes created when running ipcluster start with a single remote engine which seems a little excessive.
>
> ________________________________________
> From: ipython-dev-bounces at scipy.org [ipython-dev-bounces at scipy.org] on behalf of Drain, Theodore R (392P) [theodore.r.drain at jpl.nasa.gov]
> Sent: Thursday, March 06, 2014 12:37 PM
> To: IPython developers list
> Subject: Re: [IPython-dev] IPCluster failing when starting more than a few engines.
>
> Here's some more information.  Hopefully someone can help with this as this problem basically makes IPython parallel unusable.
>
> I had our SA's disable the firewall and then things work fine.  All the engines start up and connect.  With the firewall on, I have to add the line "--enginessh=host" to the controller_args input to enable ssh port forwarding for the connections.  When I do that, if I try to launch a single engine on 30 separate computers (with a shared file system), I can only connect 5 of them even though ipcluster log reports that they all connected fine.
>
> I'm wondering if there is some timing issue w/ running that many SSH port forward calls (it looks like 3 ports per engine are set up).
>
> Any thoughts on what I could try to fix this?
>
> Ted
>
> ________________________________________
> From: ipython-dev-bounces at scipy.org [ipython-dev-bounces at scipy.org] on behalf of Drain, Theodore R (392P) [theodore.r.drain at jpl.nasa.gov]
> Sent: Wednesday, March 05, 2014 11:06 AM
> To: IPython developers list
> Subject: [IPython-dev] IPCluster failing when starting more than a few engines.
>
> Using IPython 2.0.0 dev branch sync'ed on 2014-02-24 11:44:52.  Running ipcluster start on a set of machines w/o a shared file system using SSHEngineSetLauncher.  I have 6 machines that have between 4 and 12 cores on each machine.  If I run ipcluster with 2 engines/machine, it works fine.  If I increase it to 3 or higher, I start getting engines that fail to connect.
>
> Some failures are failures to connect that look like this:
> 10:43:30.195 [IPEngineApp] Registering with controller at tcp://x.x.x.x:59987
> 10:43:35.196 [IPEngineApp] CRITICAL | Registration timed out after 5.0 seconds
>
> Other failures are weirder:
> 10:43:30.184 [IPEngineApp] Registering with controller at tcp://x.x.x.x:59987
> 10:43:30.249 [IPEngineApp] Starting to monitor the heartbeat signal from the hub every 3010 ms.
> 10:43:30.251 [IPEngineApp] Using existing profile dir: u'.ipython/profile_dev'
> 10:43:30.252 [IPEngineApp] Completed registration with id 6
> 10:43:36.273 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (1 time(s) in a row).
> 10:43:39.281 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (2 time(s) in a row).
> 10:43:42.293 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (3 time(s) in a row).
> 10:44:36.469 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (1 time(s) in a row).
> 10:44:42.489 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (1 time(s) in a row).
>
> In the second case, if I connect a client to the controller, there is no engine with ID 6 available even though it seems to be getting some heart beats from the hub.
>
> I've tried adding lines like these to my config file and it doesn't help:
> c.IPClusterStart.delay = 0.5
> c.SSHEngineSetLauncher.delay = 0.5
>
> The number of failures increases with the number of engines being started on each machine.  Trying to start 12 engines on a single machine is almost a complete failure.
>
> Any thoughts on what I should be doing differently?
>
> Thanks,
> Ted
> _______________________________________________
> IPython-dev mailing list
> IPython-dev at scipy.org
> http://mail.scipy.org/mailman/listinfo/ipython-dev


