[scikit-learn] Scaling model selection on a cluster

Vlad Ionescu ionescu.vlad1 at gmail.com
Mon Aug 8 02:48:34 EDT 2016


I don't think they're too fast. I tried with slower models and bigger data
sets as well. I get the best results with n_jobs=20, which is the number of
cores on a single node. Anything below that is considerably slower; anything
above it is mostly the same, sometimes a little slower.

Is there a way to see what each worker is running? Nothing is reported in
the scheduler console window about the workers, just that there is a
connection to the scheduler. Should something be reported about the work
assigned to workers?
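
For what it's worth, here is how I would try to inspect the workers from the
client side (a rough sketch; it assumes dask.distributed's Executor API, which
I believe exposes scheduler_info() and processing(), and the same
my_scheduler:8786 address as in my code below):

    from distributed import Executor  # renamed to Client in later dask.distributed releases

    e = Executor('my_scheduler:8786')
    # Addresses of the workers the scheduler knows about
    print(list(e.scheduler_info()['workers']))
    # Tasks currently assigned to each worker; all-empty lists while the
    # search is running would mean nothing is being shipped to the cluster
    print(e.processing())

Some versions of distributed also serve a diagnostic web page on the scheduler
(port 8787 by default); if the version installed by dask-ssh here has it, that
would show per-worker activity as well.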

If I notice speed benefits going from n_jobs=1 to n_jobs=20, surely there
should be something noticeable above that as well if the distributed part is
running correctly, no? This is a very easily parallelizable task, and my
nodes are in a cluster on the same network. I highly doubt it's (just)
overhead.

Is there anything else that I could look into to try fixing this?
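
One thing I plan to try as a sanity check is pushing a trivial joblib
computation through the same backend, bypassing scikit-learn entirely (a
sketch along the lines of my code below; backend name and scheduler_host are
the same as there, n_jobs=40 is arbitrary):

    import distributed.joblib  # registers the 'distributed' joblib backend
    from joblib import Parallel, delayed, parallel_backend

    with parallel_backend('distributed', scheduler_host='my_scheduler:8786'):
        # 1000 tiny tasks; if the backend is active they should show up on
        # the scheduler/workers rather than run on the local machine
        results = Parallel(n_jobs=40, verbose=1)(
            delayed(pow)(i, 2) for i in range(1000))

If the scheduler console stays silent while that runs, the backend is not
being picked up at all and the searches above were probably running only on
the local cores.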

Fitting 10 folds for each of 10000 candidates, totalling 100000 fits
[Parallel(n_jobs=20)]: Done  10 tasks      | elapsed:    0.7s
[Parallel(n_jobs=20)]: Done 160 tasks      | elapsed:    4.8s
[Parallel(n_jobs=20)]: Done 410 tasks      | elapsed:   12.6s
[Parallel(n_jobs=20)]: Done 760 tasks      | elapsed:   23.7s
[Parallel(n_jobs=20)]: Done 1210 tasks      | elapsed:   37.9s
[Parallel(n_jobs=20)]: Done 1760 tasks      | elapsed:   55.0s
*[Parallel(n_jobs=20)]: Done 2410 tasks      | elapsed:  1.2min*

---

Fitting 10 folds for each of 10000 candidates, totalling 100000 fits
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    6.2s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:   27.5s
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed:  1.0min
*[Parallel(n_jobs=4)]: Done 792 tasks      | elapsed:  1.7min*


---

Fitting 10 folds for each of 10000 candidates, totalling 100000 fits
[Parallel(n_jobs=100)]: Done 250 tasks      | elapsed:    9.1s
[Parallel(n_jobs=100)]: Done 600 tasks      | elapsed:   19.3s
[Parallel(n_jobs=100)]: Done 1050 tasks      | elapsed:   34.0s
[Parallel(n_jobs=100)]: Done 1600 tasks      | elapsed:   49.8s
*[Parallel(n_jobs=100)]: Done 2250 tasks      | elapsed:  1.2min*

If 4 workers do 442 tasks in a minute, then five times as many (20 workers)
should ideally do about 5 x 442 = 2210, which is close to what the n_jobs=20
run achieves. So "double the workers, halve the time" holds very well up to
20 workers, and I have a hard time imagining that it would stop holding at
exactly the number of cores per node.
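
Spelling out the throughput from the elapsed times above (just arithmetic on
the numbers already quoted):

    # tasks per minute for the three runs above
    runs = {4: (792, 1.7), 20: (2410, 1.2), 100: (2250, 1.2)}
    for n_jobs, (tasks, minutes) in sorted(runs.items()):
        print(n_jobs, round(tasks / minutes))
    # n_jobs=4: ~466/min, n_jobs=20: ~2008/min, n_jobs=100: ~1875/min

So scaling from 4 to 20 is close to linear, and beyond 20 it is flat or
slightly worse.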

On Mon, Aug 8, 2016 at 8:25 AM Gael Varoquaux <gael.varoquaux at normalesup.org>
wrote:

> My guess is that your model evaluations are too fast, and that you are
> not getting the benefits of distributed computing as the overhead is
> hiding them.
>
> Anyhow, I don't think that this is ready for prime-time usage. It
> probably requires tweaking and understanding the tradeoffs.
>
> G
>
> On Sun, Aug 07, 2016 at 09:25:47PM +0000, Vlad Ionescu wrote:
> > I copy-pasted the example in the link you gave, only made the search take
> > a longer time. I used dask-ssh to set up worker nodes and a scheduler,
> > then connected to the scheduler in my code.
>
> > Tweaking the n_jobs parameter for the randomized search does not bring any
> > performance benefit. The connection to the scheduler seems to work, but
> > nothing gets assigned to the workers, because the code doesn't scale.
>
> > I am using scikit-learn 0.18.dev0
>
> > Any ideas?
>
> > Code and results are below. Only the n_jobs value was changed between
> > executions. I printed an Executor assigned to my scheduler, and it
> > reported 240 cores.
>
> > import distributed.joblib
> > from joblib import Parallel, parallel_backend
> > from sklearn.datasets import load_digits
> > from sklearn.grid_search import RandomizedSearchCV
> > from sklearn.svm import SVC
> > import numpy as np
>
> > digits = load_digits()
>
> > param_space = {
> >     'C': np.logspace(-6, 6, 100),
> >     'gamma': np.logspace(-8, 8, 100),
> >     'tol': np.logspace(-4, -1, 100),
> >     'class_weight': [None, 'balanced'],
> > }
>
> > model = SVC(kernel='rbf')
> > search = RandomizedSearchCV(model, param_space, cv=3, n_iter=1000,
> >                             verbose=1, n_jobs=200)
>
> > with parallel_backend('distributed', scheduler_host='my_scheduler:8786'):
> >     search.fit(digits.data, digits.target)
>
> > Fitting 3 folds for each of 1000 candidates, totalling 3000 fits
> > [Parallel(n_jobs=200)]: Done   4 tasks      | elapsed:    0.5s
> > [Parallel(n_jobs=200)]: Done 292 tasks      | elapsed:    6.9s
> > [Parallel(n_jobs=200)]: Done 800 tasks      | elapsed:   16.1s
> > [Parallel(n_jobs=200)]: Done 1250 tasks      | elapsed:   24.8s
> > [Parallel(n_jobs=200)]: Done 1800 tasks      | elapsed:   36.0s
> > [Parallel(n_jobs=200)]: Done 2450 tasks      | elapsed:   49.0s
> > [Parallel(n_jobs=200)]: Done 3000 out of 3000 | elapsed:  1.0min finished
>
> > -------------------------------------
>
> > Fitting 3 folds for each of 1000 candidates, totalling 3000 fits
> > [Parallel(n_jobs=20)]: Done  10 tasks      | elapsed:    0.5s
> > [Parallel(n_jobs=20)]: Done 160 tasks      | elapsed:    3.7s
> > [Parallel(n_jobs=20)]: Done 410 tasks      | elapsed:    8.6s
> > [Parallel(n_jobs=20)]: Done 760 tasks      | elapsed:   16.2s
> > [Parallel(n_jobs=20)]: Done 1210 tasks      | elapsed:   25.0s
> > [Parallel(n_jobs=20)]: Done 1760 tasks      | elapsed:   36.2s
> > [Parallel(n_jobs=20)]: Done 2410 tasks      | elapsed:   48.8s
> > [Parallel(n_jobs=20)]: Done 3000 out of 3000 | elapsed:  1.0min finished
>
>
> >
>
> > On Sun, Aug 7, 2016 at 8:31 PM Gael Varoquaux <gael.varoquaux at normalesup.org>
> > wrote:
>
> >     Parallel computing in scikit-learn is built upon joblib. In the
> >     development version of scikit-learn, the included joblib can be
> >     extended with a distributed backend:
> >     http://distributed.readthedocs.io/en/latest/joblib.html
> >     that can distribute code on a cluster.
>
> >     This is still bleeding edge, but this is probably a direction that
> >     will see more development.
>
>
>
> --
>     Gael Varoquaux
>     Researcher, INRIA Parietal
>     NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
>     Phone:  ++ 33-1-69-08-79-68
>     http://gael-varoquaux.info            http://twitter.com/GaelVaroquaux