[scikit-learn] Scaling model selection on a cluster

Vlad Ionescu ionescu.vlad1 at gmail.com
Mon Aug 8 03:59:23 EDT 2016


I realize this is in its early stages and I'd like to help improve it, even if
just by testing on an actual cluster. All of the examples I've seen are very
small, too small to tell from the execution time alone whether they are really
running in parallel. None of them mention how you can check that each worker
is actually doing work either.
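
The closest I have come to a check is something along these lines (a rough
sketch, assuming the Executor API from distributed; 'my_scheduler:8786' and
the 240-task count are placeholders taken from my setup):

from distributed import Executor

def which_host(i):
    # Runs on whichever worker picks up the task; report its hostname.
    import socket
    return socket.gethostname()

e = Executor('my_scheduler:8786')

# Cores the scheduler believes each worker contributes,
# e.g. {'node1:port': 20, 'node2:port': 20, ...}
print(e.ncores())

# Run one trivial task per expected core and collect the hostnames that
# executed them. If only one hostname comes back, the work never left a
# single node.
futures = e.map(which_host, range(240))
print(set(e.gather(futures)))

If there is a more direct way to inspect what each worker is running, that
would be good to document as well.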

If there's anything I can do to help debug this (I realize it could be a
problem on my end though), please let me know.

On Mon, Aug 8, 2016 at 9:48 AM Vlad Ionescu <ionescu.vlad1 at gmail.com> wrote:

> I don't think they're too fast. I tried with slower models and bigger data
> sets as well. I get the best results with n_jobs=20, which is the number of
> cores on a single node. Anything below that is considerably slower; anything
> above is mostly the same, sometimes a little slower.
>
> Is there a way to see what each worker is running? Nothing is reported in
> the scheduler console window about the workers, just that there is a
> connection to the scheduler. Should something be reported about the work
> assigned to workers?
>
> If I see speed benefits going from n_jobs=1 to n_jobs=20, surely there should
> be something noticeable above that as well if the distributed part is running
> correctly, no? This is a very easily parallelizable task, and my nodes are in
> a cluster on the same network. I highly doubt it's (just) overhead.
>
> Is there anything else that I could look into to try fixing this?
>
> Fitting 10 folds for each of 10000 candidates, totalling 100000 fits
> [Parallel(n_jobs=20)]: Done  10 tasks      | elapsed:    0.7s
> [Parallel(n_jobs=20)]: Done 160 tasks      | elapsed:    4.8s
> [Parallel(n_jobs=20)]: Done 410 tasks      | elapsed:   12.6s
> [Parallel(n_jobs=20)]: Done 760 tasks      | elapsed:   23.7s
> [Parallel(n_jobs=20)]: Done 1210 tasks      | elapsed:   37.9s
> [Parallel(n_jobs=20)]: Done 1760 tasks      | elapsed:   55.0s
> *[Parallel(n_jobs=20)]: Done 2410 tasks      | elapsed:  1.2min*
>
> ---
>
> Fitting 10 folds for each of 10000 candidates, totalling 100000 fits
> [Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    6.2s
> [Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:   27.5s
> [Parallel(n_jobs=4)]: Done 442 tasks      | elapsed:  1.0min
> *[Parallel(n_jobs=4)]: Done 792 tasks      | elapsed:  1.7min*
>
>
> ---
>
> Fitting 10 folds for each of 10000 candidates, totalling 100000 fits
> [Parallel(n_jobs=100)]: Done 250 tasks      | elapsed:    9.1s
> [Parallel(n_jobs=100)]: Done 600 tasks      | elapsed:   19.3s
> [Parallel(n_jobs=100)]: Done 1050 tasks      | elapsed:   34.0s
> [Parallel(n_jobs=100)]: Done 1600 tasks      | elapsed:   49.8s
> *[Parallel(n_jobs=100)]: Done 2250 tasks      | elapsed:  1.2min*
>
> If 4 workers do 442 tasks in a minute, then 20 workers (5x) should ideally do
> about 5 x 442 = 2210. So "double the workers, half the time" seems to hold
> very well up to 20 workers. I have a hard time imagining that it would stop
> holding at exactly the number of cores per node.
>
> On Mon, Aug 8, 2016 at 8:25 AM Gael Varoquaux
> <gael.varoquaux at normalesup.org> wrote:
>
>> My guess is that your model evaluations are too fast, and that you are
>> not getting the benefits of distributed computing as the overhead is
>> hiding them.
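>>
>> A quick way to gauge this (just a rough sketch on my part; the dataset and
>> parameters below are placeholders, not a recommendation) is to time one
>> candidate fit locally and compare it to the per-task dispatch overhead,
>> which is typically on the order of milliseconds:
>>
>> import time
>> from sklearn.datasets import load_digits
>> from sklearn.svm import SVC
>>
>> digits = load_digits()
>> start = time.time()
>> # One fit of one candidate; if this takes only milliseconds, scheduling
>> # overhead can easily dominate a distributed run.
>> SVC(kernel='rbf', C=1.0, gamma=0.001).fit(digits.data, digits.target)
>> print('one fit took %.3f s' % (time.time() - start))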
>>
>> Anyhow, I don't think that this is ready for prime-time usage. It
>> probably requires tweaking and understanding the tradeoffs.
>>
>> G
>>
>> On Sun, Aug 07, 2016 at 09:25:47PM +0000, Vlad Ionescu wrote:
>> > I copy-pasted the example in the link you gave, only making the search take
>> > longer. I used dask-ssh to set up worker nodes and a scheduler, then
>> > connected to the scheduler in my code.
>>
>> > Tweaking the n_jobs parameter for the randomized search does not yield any
>> > performance benefit. The connection to the scheduler seems to work, but
>> > nothing appears to get assigned to the workers, because the code doesn't
>> > scale.
>>
>> > I am using scikit-learn 0.18.dev0
>>
>> > Any ideas?
>>
>> > Code and results are below. Only the n_jobs value was changed between
>> > executions. I printed an Executor assigned to my scheduler, and it reported
>> > 240 cores.
>>
>> > import distributed.joblib
>> > from joblib import Parallel, parallel_backend
>> > from sklearn.datasets import load_digits
>> > from sklearn.grid_search import RandomizedSearchCV
>> > from sklearn.svm import SVC
>> > import numpy as np
>>
>> > digits = load_digits()
>>
>> > param_space = {
>> >     'C': np.logspace(-6, 6, 100),
>> >     'gamma': np.logspace(-8, 8, 100),
>> >     'tol': np.logspace(-4, -1, 100),
>> >     'class_weight': [None, 'balanced'],
>> > }
>>
>> > model = SVC(kernel='rbf')
>> > search = RandomizedSearchCV(model, param_space, cv=3, n_iter=1000,
>> >                             verbose=1, n_jobs=200)
>>
>> > with parallel_backend('distributed', scheduler_host='my_scheduler:8786'):
>> >     search.fit(digits.data, digits.target)
>>
>> > Fitting 3 folds for each of 1000 candidates, totalling 3000 fits
>> > [Parallel(n_jobs=200)]: Done   4 tasks      | elapsed:    0.5s
>> > [Parallel(n_jobs=200)]: Done 292 tasks      | elapsed:    6.9s
>> > [Parallel(n_jobs=200)]: Done 800 tasks      | elapsed:   16.1s
>> > [Parallel(n_jobs=200)]: Done 1250 tasks      | elapsed:   24.8s
>> > [Parallel(n_jobs=200)]: Done 1800 tasks      | elapsed:   36.0s
>> > [Parallel(n_jobs=200)]: Done 2450 tasks      | elapsed:   49.0s
>> > [Parallel(n_jobs=200)]: Done 3000 out of 3000 | elapsed:  1.0min finished
>>
>> > -------------------------------------
>>
>> > Fitting 3 folds for each of 1000 candidates, totalling 3000 fits
>> > [Parallel(n_jobs=20)]: Done  10 tasks      | elapsed:    0.5s
>> > [Parallel(n_jobs=20)]: Done 160 tasks      | elapsed:    3.7s
>> > [Parallel(n_jobs=20)]: Done 410 tasks      | elapsed:    8.6s
>> > [Parallel(n_jobs=20)]: Done 760 tasks      | elapsed:   16.2s
>> > [Parallel(n_jobs=20)]: Done 1210 tasks      | elapsed:   25.0s
>> > [Parallel(n_jobs=20)]: Done 1760 tasks      | elapsed:   36.2s
>> > [Parallel(n_jobs=20)]: Done 2410 tasks      | elapsed:   48.8s
>> > [Parallel(n_jobs=20)]: Done 3000 out of 3000 | elapsed:  1.0min finished
>>
>> > On Sun, Aug 7, 2016 at 8:31 PM Gael Varoquaux
>> > <gael.varoquaux at normalesup.org> wrote:
>>
>> >     Parallel computing in scikit-learn is built on joblib. In the
>> >     development version of scikit-learn, the included joblib can be
>> >     extended with a distributed backend:
>> >     http://distributed.readthedocs.io/en/latest/joblib.html
>> >     that can distribute code across a cluster.
>>
>> >     This is still bleeding edge, but it is probably a direction that will
>> >     see more development.
>>
>> --
>>     Gael Varoquaux
>>     Researcher, INRIA Parietal
>>     NeuroSpin/CEA Saclay, Bat 145, 91191 Gif-sur-Yvette France
>>     Phone:  ++ 33-1-69-08-79-68
>>     http://gael-varoquaux.info
>>     http://twitter.com/GaelVaroquaux