[IPython-dev] Using IPython Cluster with SGE -- help needed

Andreas Hilboll lists at hilboll.de
Mon Aug 5 10:19:27 EDT 2013


Thanks, Matthieu,

answers inline:

On 05.08.2013 16:02, Matthieu Brucher wrote:
> Hi,
> 
> I don't know why the registration was not complete. Is your home
> folder the same on all nodes and on the login node?

Yes, it is. Could this be some firewall issue?

> You won't see 12 jobs. You asked for 12 engines, and they will all be
> submitted in one job and the 12 engines will be started by mpiexec -n
> 12. This is the standard way of using batch schedulers. Ask for some
> cores, run an mpi application on these cores.

Well, then I guess our IT department doesn't like "the standard way". Our
cluster comprises 12 nodes, one 'management' node and 11 'compute' nodes,
and we don't normally have or use MPI.

So to use our multi-node cluster the way our sysadmins want us to, I would
need to submit a total of {n} ipengines via {n} separate calls to ``qsub``.
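For illustration, something along these lines is what I have in mind
(untested sketch; the helper script name ``engine_job.sh`` is made up),
i.e. one ``qsub`` call per engine:

    #!/bin/bash
    # engine_job.sh -- hypothetical helper that starts a single ipengine
    #$ -q all.q
    #$ -S /bin/bash
    #$ -V
    #$ -j y
    source /hb/hilboll/local/anaconda/bin/activate test_py27
    ipengine --profile-dir=/gpfs/hb/hilboll/.config/ipython/profile_nexus_py2.7

and then, from the login node, something like

    for i in $(seq 1 12); do
        qsub -N ipengine_$i -o log_ipengine_$i.log engine_job.sh
    done

but I have no idea whether/how this could be wired into ipcluster so that
the engines still register with the controller.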

Any idea how I can accomplish this?

Thanks for your help!
Andreas.

> 
> You can also try to submit additional engines now that the controller
> is up and running. Check that the configuration files are present and
> readable.
> 
> Cheers,
> 
> 
> 2013/8/5 Andreas Hilboll <lists at hilboll.de>:
>> On 04.08.2013 16:20, Matthieu Brucher wrote:
>>> Hi,
>>>
>>> I guess we may want to start with the ipython documentation on this
>>> topic: http://ipython.org/ipython-doc/stable/parallel/parallel_process.html
>>>
>>> Cheers,
>>>
>>> 2013/8/4 Andreas Hilboll <lists at hilboll.de>:
>>>> Hi,
>>>>
>>>> I would like to use IPython for calculations on our cluster. It's a
>>>> total of 11 compute + 1 management nodes (all running Linux), and we're
>>>> using SGE's qsub to submit jobs. The $HOME directory is shared via NFS
>>>> between all the nodes.
>>>>
>>>> Even after reading the documentation, I'm unsure about how to get things
>>>> running. I assume that I'll have to execute ``ipcluster -n 16`` on all
>>>> compute nodes (they have 16 cores each). I'd have the ipython shell
>>>> (notebook won't work due to firewall restrictions I cannot change) on
>>>> the management node. But how does the management node know about the
>>>> kernels which are running on the compute nodes and waiting for a job?
>>>> And how can I tell the management node that it shall use qsub to submit
>>>> the jobs to the individual kernels?
>>>>
>>>> As I think this is a common use case, I'd be willing to write up a nice
>>>> tutorial about the setup, but I fear I need some help from you guys to
>>>> get things running ...
>>>>
>>>> Cheers,
>>>>
>>>> -- Andreas.
>>>
>>>
>>>
>>
>> Okay, thanks to the good docs, I was able to start a cluster:
>>
>> (test_py27)hilboll@login:~> ipcluster start --profile=nexus_py2.7 -n 12
>> 2013-08-05 15:26:04.264 [IPClusterStart] Using existing profile dir:
>> u'/gpfs/hb/hilboll/.config/ipython/profile_nexus_py2.7'
>> 2013-08-05 15:26:04.272 [IPClusterStart] Starting ipcluster with
>> [daemon=False]
>> 2013-08-05 15:26:04.273 [IPClusterStart] Creating pid file:
>> /gpfs/hb/hilboll/.config/ipython/profile_nexus_py2.7/pid/ipcluster.pid
>> 2013-08-05 15:26:04.273 [IPClusterStart] Starting Controller with
>> SGEControllerLauncher
>> 2013-08-05 15:26:04.289 [IPClusterStart] Job submitted with job id: '60'
>> 2013-08-05 15:26:05.289 [IPClusterStart] Starting 12 Engines with
>> SGEEngineSetLauncher
>> 2013-08-05 15:26:05.306 [IPClusterStart] Job submitted with job id: '61'
>> 2013-08-05 15:26:35.351 [IPClusterStart] Engines appear to have started
>> successfully
>>
>> However, using qstat, I can only see one job in the queue, which is the
>> controller:
>>
>> hilboll@login:~> qstat
>> job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
>> -----------------------------------------------------------------------------------------------------------------
>>      60 0.57500 ipython    hilboll      r     08/05/2013 15:26:06 all.q@login.cluster                1
>>
>>
>> I used the following job template:
>>
>> c.SGEEngineSetLauncher.batch_template = '''#!/bin/bash
>> #$ -N ipython #- job name (optional)
>> #$ -q all.q #- use the queue 'all.q'
>> #$ -S /bin/bash #- required!
>> #$ -V #- use paths/environment from the current shell
>> #$ -j y #- merge STDOUT and STDERR
>> #$ -o log_ipython_{n}.log
>>
>> source /hb/hilboll/local/anaconda/bin/activate test_py27
>> mpiexec -n {n} ipengine --profile-dir={profile_dir}
>> '''
>>
>> If I use a 'blank' ``ipengine --profile-dir={profile_dir}`` instead of
>> the mpiexec call, I get exactly two jobs in the queue, one for the
>> controller and one for the first engine.
>>
>> My naive understanding would be that exactly {n} jobs get submitted via
>> the SGEEngineSetLauncher. Is my expectation wrong?
>>
>> In the logfile, the following message appears 12 times:
>>
>> 2013-08-05 15:26:09.038 [IPEngineApp] Registration timed out after 2.0
>> seconds
>>
>> Any help resolving this issue is greatly appreciated :)
>>
>> Cheers,
>>
>> -- Andreas.
> 
> 
> 

-- 
-- Andreas.


