From szx9404 at gmail.com Tue Dec 4 20:14:08 2018 From: szx9404 at gmail.com (parker x) Date: Tue, 4 Dec 2018 17:14:08 -0800 Subject: [scikit-learn] Question about contributing to scikit-learn Message-ID: Dear scikit-learn developers, My name is Parker, and I'm a data scientist. Scikit-learn is a great ML library that I use frequently for work and personal projects. I have always wanted to contribute something to the scikit-learn community, and I am wondering if you could give some opinions on the following two ideas for contribution. My first idea is to integrate another Python library, 'imbalanced-learn', into scikit-learn so that people could also use scikit-learn to deal with class imbalance issues. Another idea is to combine the scikit-learn built-in feature selection functions into one automated feature selection function that might benefit users who are not familiar with the feature selection process. Looking forward to your suggestions! And thank you very much for your time! Best, Parker -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Wed Dec 5 17:32:06 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Thu, 6 Dec 2018 09:32:06 +1100 Subject: [scikit-learn] New core dev: Adrin Jalali Message-ID: The Scikit-learn core development team has welcomed a new member, Adrin Jalali, who has been doing some really amazing work in contributing code and reviews since July (aside from occasional contributions since 2014). Congratulations and welcome, Adrin! -------------- next part -------------- An HTML attachment was scrubbed...
URL: From matthieu.brucher at gmail.com Wed Dec 5 17:45:19 2018 From: matthieu.brucher at gmail.com (Matthieu Brucher) Date: Wed, 5 Dec 2018 22:45:19 +0000 Subject: [scikit-learn] Recurrent questions about speed for TfidfVectorizer In-Reply-To: References: <46dd3561-a70c-ea18-282f-26d34b87cf06@gmail.com> Message-ID: Hi all, Sorry for the late reply, lots of things to work on currently. I'll have a look at the roadmap and the pointers to see what could be done to enhance the situation. Cheers, Matthieu On Mon, 26 Nov 2018 at 20:09, Roman Yurchak via scikit-learn < scikit-learn at python.org> wrote: > Tries are interesting, but it appears that while they use less memory > than dicts/maps they are generally slower than dicts for a large number > of elements. See e.g. > https://github.com/pytries/marisa-trie/blob/master/docs/benchmarks.rst. > This is also consistent with the results in the below linked > CountVectorizer PR that aimed to use tries, I think. > > Though maybe e.g. MARISA-Trie (and generally the trie libraries available in > Python) did improve significantly in the 5 years since > https://github.com/scikit-learn/scikit-learn/issues/2639 was done. > > The thing is also that even HashingVectorizer, which doesn't need to handle > the vocabulary, is only moderately faster, so using a better data > structure for the vocabulary might give us its performance at best. > > -- > Roman > > On 26/11/2018 16:28, Andreas Mueller wrote: > > I think tries might be an interesting datastructure, but it really > > depends on where the bottleneck is. > > I'm really surprised they are not used more, but maybe that's just > > because implementations are missing?
> > > > On 11/26/18 8:39 AM, Roman Yurchak via scikit-learn wrote: > >> Hi Matthieu, > >> > >> if you are interested in general questions regarding improving > >> scikit-learn performance, you might want to have a look at the draft > >> roadmap > >> https://github.com/scikit-learn/scikit-learn/wiki/Draft-Roadmap-2018 -- > >> there are a lot of topics where suggestions / PRs on improving performance > >> would be very welcome. > >> > >> For the particular case of TfidfVectorizer, it is a bit different from > >> the rest of the scikit-learn code base in the sense that it's not > >> limited by the performance of numerical calculation but rather that of > >> string processing and counting. TfidfVectorizer is equivalent to > >> CountVectorizer + TfidfTransformer, and the latter has only a marginal > >> computational cost. As to CountVectorizer, last time I checked, its > >> profiling was something along the lines of, > >> - part regexp for tokenization (see token_pattern.findall) > >> - part token counting (see CountVectorizer._count_vocab) > >> - and a comparable part for all the rest > >> > >> Because of that, porting it to Cython is not that immediate, as one is > >> still going to use CPython regexp and token counting in a dict. For > >> instance, HashingVectorizer implements token counting in Cython -- it's > >> faster but not that much faster. Using C++ maps or some less common > >> structures has been discussed in > >> https://github.com/scikit-learn/scikit-learn/issues/2639 > >> > >> Currently, I think, there are ~3 main ways performance could be improved, > >> 1. Optimize the current implementation while remaining in Python. > >> Possible but IMO would require some effort, because there is not much > >> low-hanging fruit left there. Though a new look would definitely be good. > >> > >> 2. Parallelize computations.
There was some earlier discussion about > >> this in scikit-learn issues, but at present, the best way would > >> probably be to add it in dask-ml (see > >> https://github.com/dask/dask-ml/issues/5). HashingVectorizer is already > >> supported. Someone would need to implement CountVectorizer. > >> > >> 3. Rewrite part of the implementation in a lower-level language (e.g. > >> Cython). The question is how maintainable that would be, and whether the > >> performance gains would be worth it. Now that Python 2 will be dropped, > >> at least not having to deal with Py2/3 compatibility for strings in > >> Cython might make things a bit easier. Though, if the processing is in > >> Cython, it might also make using custom tokenizers/analyzers more difficult. > >> > >> On a related topic, I have been experimenting with implementing part > >> of this processing in Rust lately: > >> https://github.com/rth/text-vectorize. So far it looks promising. > >> Though, of course, it will remain a separate project because of language > >> constraints in scikit-learn. > >> > >> In general if you have thoughts on things that can be improved, don't > >> hesitate to open issues, > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Quantitative analyst, Ph.D. Blog: http://blog.audio-tk.com/ LinkedIn: http://www.linkedin.com/in/matthieubrucher -------------- next part -------------- An HTML attachment was scrubbed... URL: From pahome.chen at mirlab.org Wed Dec 5 22:13:03 2018 From: pahome.chen at mirlab.org (lampahome) Date: Thu, 6 Dec 2018 11:13:03 +0800 Subject: [scikit-learn] Is there regression algo with 3-d input?
Message-ID: I want to do time-series regression per week, so the unit of the training data X is the day, e.g. Mon, Tue, Wed, etc. For example, the training data X is like below: X: [ [1,2,3,4,3,2,1] ,[2,2,3,4,3,2,2] ] Each value in a row corresponds to one day of the week, so each row has 7 values. Now suppose I have another feature W for each day, like weather or traffic. I thought expanding X to 3-d is reasonable because W should be attached to each day in X. So what I thought X is: [ [ [1, W-Mon], [2, W-Tue] , [3, W-Wed] , [4, W-Thu] , [3, W-Fri] , [2, W-Sat] , [1, W-Sun] ] , [ [2, W-Mon], [2, W-Tue] , [3, W-Wed] , [4, W-Thu] , [3, W-Fri] , [2, W-Sat] , [2, W-Sun] ] ] It becomes a 3-d input and contains every feature of each day. Does scikit-learn have a regression algorithm that can accept a 3-d input X? Almost all the algorithms I found accept only 2-d input X, e.g.: *X* : array-like or sparse matrix, shape = [n_samples, n_features] -------------- next part -------------- An HTML attachment was scrubbed... URL: From stuart at stuartreynolds.net Wed Dec 5 23:50:32 2018 From: stuart at stuartreynolds.net (Stuart Reynolds) Date: Wed, 5 Dec 2018 20:50:32 -0800 Subject: [scikit-learn] Is there regression algo with 3-d input? In-Reply-To: References: Message-ID: Would the output be different if you simply wrapped the whole process with reshaping the 3-d input to 2-d? On Wed, Dec 5, 2018 at 7:14 PM lampahome wrote: > I want to do time-series regression per week, so the unit of the training > data X is the day, e.g. Mon, Tue, Wed, etc. > > For example, the training data X is like below: > X: > [ [1,2,3,4,3,2,1] > ,[2,2,3,4,3,2,2] ] > Each value in a row corresponds to one day of the week, so each row has 7 values. > > Now suppose I have another feature W for each day, like weather or > traffic. > > I thought expanding X to 3-d is reasonable because W should be > attached to each day in X.
> > So what I thought X is: > [ [ [1, W-Mon], [2, W-Tue] , [3, W-Wed] , [4, W-Thu] , [3, W-Fri] , > [2, W-Sat] , [1, W-Sun] ] > , [ [2, W-Mon], [2, W-Tue] , [3, W-Wed] , [4, W-Thu] , [3, W-Fri] , > [2, W-Sat] , [2, W-Sun] ] ] > It becomes a 3-d input and contains every feature of each day. > > Does scikit-learn have a regression algorithm that can accept a 3-d input X? > Almost all the algorithms I found accept only 2-d input X, e.g.: *X* : array-like or > sparse matrix, shape = [n_samples, n_features] > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pahome.chen at mirlab.org Wed Dec 5 23:54:34 2018 From: pahome.chen at mirlab.org (lampahome) Date: Thu, 6 Dec 2018 12:54:34 +0800 Subject: [scikit-learn] Is there regression algo with 3-d input? In-Reply-To: References: Message-ID: Stuart Reynolds wrote on Thu, 6 Dec 2018 at 12:52: > Would the output be different if you simply wrapped the whole process with > reshaping the 3-d input to 2-d? > >> >> I don't know; I haven't experimented with it. -------------- next part -------------- An HTML attachment was scrubbed... URL: From olivier.grisel at ensta.org Thu Dec 6 05:53:39 2018 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Thu, 6 Dec 2018 11:53:39 +0100 Subject: [scikit-learn] New core dev: Adrin Jalali In-Reply-To: References: Message-ID: Congrats and welcome Adrin! -- Olivier -------------- next part -------------- An HTML attachment was scrubbed...
URL: From sepand.haghighi at yahoo.com Thu Dec 6 12:59:16 2018 From: sepand.haghighi at yahoo.com (Sepand Haghighi) Date: Thu, 6 Dec 2018 17:59:16 +0000 (UTC) Subject: [scikit-learn] PyCM 1.6 released: New machine learning library for confusion matrix statistical analysis References: <124384675.2914980.1544119156218.ref@mail.yahoo.com> Message-ID: <124384675.2914980.1544119156218@mail.yahoo.com> Hi folks, Recently we released a new version of PyCM, a library for confusion matrix statistical analysis. I thought you might find it interesting. PyCM is a multi-class confusion matrix library written in Python that supports both input data vectors and direct matrix input, and a proper tool for post-classification model evaluation that supports most class and overall statistics parameters. PyCM is the Swiss-army knife of confusion matrices, targeted mainly at data scientists who need a broad array of metrics (more than 90) for predictive models and an accurate evaluation of a large variety of classifiers. Version 1.6 changelog: - AUC Value Interpretation (AUCI) added - Example 6 added (unbalanced data) - Anaconda Cloud package added - overall_param and class_param arguments added to stat, save_stat and save_html methods - class_param argument added to save_csv method - _ removed from overall statistics names - README modified - Document modified Repository: https://github.com/sepandhaghighi/pycm Website: http://pycm.shaghighi.ir/ Document: http://pycm.shaghighi.ir/doc/ Paper link: PyCM: Multiclass confusion matrix library in Python Best regards, Sepand Haghighi -------------- next part -------------- An HTML attachment was scrubbed... URL: From pahome.chen at mirlab.org Fri Dec 7 04:00:35 2018 From: pahome.chen at mirlab.org (lampahome) Date: Fri, 7 Dec 2018 17:00:35 +0800 Subject: [scikit-learn] Is there regression algo with 3-d input? In-Reply-To: References: Message-ID: Stuart Reynolds wrote on Thu, 6 Dec 2018 at 12:52:
> Would the output be different if you simply wrapped the whole process with > reshaping the 3-d input to 2-d? > > Sometimes it changes a lot, sometimes it is similar. Maybe a neural network is what I want? -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Sat Dec 8 05:15:23 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Sat, 8 Dec 2018 21:15:23 +1100 Subject: [scikit-learn] Question about contributing to scikit-learn In-Reply-To: References: Message-ID: Hi Parker, We strongly urge new contributors to start with small issues (documentation, small fixes, etc.) to gain confidence in the contribution procedure, etc. Once you've worked on small issues and understand better what comes through the issue tracker, you can consider bigger contributions. We have indeed proposed support for imblearn-like Pipeline extensions ( https://github.com/scikit-learn/scikit-learn/issues/3855#issuecomment-357949997). And yes, we're in need of a contributor there, but I would rather review and merge smaller pieces of your work before a large one that needs a lot of changes before merge. Joel On Wed, 5 Dec 2018 at 12:15, parker x wrote: > Dear scikit-learn developers, > > My name is Parker, and I'm a data scientist. > > Scikit-learn is a great ML library that I use frequently for work and > personal projects. I have always wanted to contribute something to the > scikit-learn community, and I am wondering if you could give some opinions > on the following two ideas for contribution. > > My first idea is to integrate another Python library, 'imbalanced-learn', > into scikit-learn so that people could also use scikit-learn to deal with > class imbalance issues. > > Another idea is to combine the scikit-learn built-in feature selection > functions into one automated feature selection function that might benefit > users who are not familiar with the feature selection process. > > Looking forward to your suggestions!
And thank you very much for your time! > > Best, > Parker > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Sat Dec 8 09:26:15 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Sat, 8 Dec 2018 09:26:15 -0500 Subject: [scikit-learn] New core dev: Adrin Jalali In-Reply-To: References: Message-ID: Congratulations and welcome Adrin! On 12/5/18 5:32 PM, Joel Nothman wrote: > The Scikit-learn core development team has welcomed a new member, > Adrin Jalali, who has been doing some really amazing work in > contributing code and reviews since July (aside from occasional > contributions since 2014). Congratulations and welcome, Adrin! > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Sat Dec 8 12:16:05 2018 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Sat, 8 Dec 2018 18:16:05 +0100 Subject: [scikit-learn] New core dev: Adrin Jalali In-Reply-To: References: Message-ID: <20181208171605.lxyoztlfk56zalrp@phare.normalesup.org> Indeed, welcome Adrin, and thanks a lot for your investment in the package! Gaël On Sat, Dec 08, 2018 at 09:26:15AM -0500, Andreas Mueller wrote: > Congratulations and welcome Adrin! > On 12/5/18 5:32 PM, Joel Nothman wrote: > The Scikit-learn core development team has welcomed a new member, Adrin > Jalali, who has been doing some really amazing work in contributing code > and reviews since July (aside from occasional contributions since 2014). > Congratulations and welcome, Adrin!
> _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Senior Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From jcrudy at gmail.com Sat Dec 8 13:16:46 2018 From: jcrudy at gmail.com (Jason Rudy) Date: Sat, 8 Dec 2018 10:16:46 -0800 Subject: [scikit-learn] check_estimator and score_samples method Message-ID: Hi all, I'm working on updating py-earth for some recent changes in scikit-learn and Cython. It seems like check_estimator has been significantly improved, and I'm working through making py-earth compliant with it. I've hit the following issue, though. It seems check_estimator tests score_samples using only X as an argument, and py-earth's score_samples requires y as well. So, my question is: must score_samples work with just X (and therefore maybe I should just remove it from py-earth), or is it okay to have a score_samples that requires y, in which case I should try to find a workaround for check_estimator? Best, Jason -------------- next part -------------- An HTML attachment was scrubbed... URL: From qinhanmin2005 at sina.com Sat Dec 8 22:47:50 2018 From: qinhanmin2005 at sina.com (Hanmin Qin) Date: Sun, 09 Dec 2018 11:47:50 +0800 Subject: [scikit-learn] New core dev: Adrin Jalali Message-ID: <20181209034750.17491464009F@webmail.sinamail.sina.com.cn> Welcome and thanks for contributing! Hanmin Qin ----- Original Message ----- From: Gael Varoquaux To: Scikit-learn mailing list Subject: Re: [scikit-learn] New core dev: Adrin Jalali Date: 2018-12-09 01:18 Indeed, welcome Adrin, and thanks a lot for your investment in the package!
Gaël On Sat, Dec 08, 2018 at 09:26:15AM -0500, Andreas Mueller wrote: > Congratulations and welcome Adrin! > On 12/5/18 5:32 PM, Joel Nothman wrote: > The Scikit-learn core development team has welcomed a new member, Adrin > Jalali, who has been doing some really amazing work in contributing code > and reviews since July (aside from occasional contributions since 2014). > Congratulations and welcome, Adrin! > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Senior Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From g.lemaitre58 at gmail.com Sun Dec 9 04:09:29 2018 From: g.lemaitre58 at gmail.com (Guillaume Lemaître) Date: Sun, 09 Dec 2018 10:09:29 +0100 Subject: [scikit-learn] New core dev: Adrin Jalali In-Reply-To: <20181209034750.17491464009F@webmail.sinamail.sina.com.cn> Message-ID: <7bmvf8ethadfrfvqfkiadmcp.1544346569794@gmail.com> An HTML attachment was scrubbed... URL: From emmanuelarias30 at gmail.com Sun Dec 9 09:15:13 2018 From: emmanuelarias30 at gmail.com (eamanu15) Date: Sun, 9 Dec 2018 11:15:13 -0300 Subject: [scikit-learn] Question about contributing to scikit-learn In-Reply-To: References: Message-ID: Hello Parker, I can tell you my experience.
I started contributing to sklearn two months ago, and I started with code review; this way I could learn how sklearn is written and what the workflow is, read issues, and try to solve them. Then I made some PRs. I can tell you that the core devs are very friendly and always help you. In particular, I had the most contact with Joel Nothman and Andreas Mueller (thanks guys). So, I hope this helps you in some way =) Regards! Emmanuel -------------- next part -------------- An HTML attachment was scrubbed... URL: From adrin.jalali at gmail.com Sun Dec 9 10:12:45 2018 From: adrin.jalali at gmail.com (Adrin) Date: Sun, 9 Dec 2018 16:12:45 +0100 Subject: [scikit-learn] New core dev: Adrin Jalali In-Reply-To: <7bmvf8ethadfrfvqfkiadmcp.1544346569794@gmail.com> References: <20181209034750.17491464009F@webmail.sinamail.sina.com.cn> <7bmvf8ethadfrfvqfkiadmcp.1544346569794@gmail.com> Message-ID: Thank you all for all the support, patience, and mentorship you've given, and for now having me on board. It's an absolute pleasure working with you :) On Sun, 9 Dec 2018 at 10:10 Guillaume Lemaître wrote: > Congrats Adrin > > Sent from my phone - sorry to be brief and for potential misspellings. > *From:* qinhanmin2005 at sina.com > *Sent:* 9 December 2018 04:50 > *To:* scikit-learn at python.org > *Reply to:* qinhanmin2005 at sina.com; scikit-learn at python.org > *Subject:* Re: [scikit-learn] New core dev: Adrin Jalali > Welcome and thanks for contributing! > > Hanmin Qin > > ----- Original Message ----- > From: Gael Varoquaux > To: Scikit-learn mailing list > Subject: Re: [scikit-learn] New core dev: Adrin Jalali > Date: 2018-12-09 01:18 > > > Indeed, welcome Adrin, and thanks a lot for your investment in the > package! > Gaël > On Sat, Dec 08, 2018 at 09:26:15AM -0500, Andreas Mueller wrote: > > Congratulations and welcome Adrin!
> > On 12/5/18 5:32 PM, Joel Nothman wrote: > > The Scikit-learn core development team has welcomed a new member, Adrin > > Jalali, who has been doing some really amazing work in contributing code > > and reviews since July (aside from occasional contributions since 2014). > > Congratulations and welcome, Adrin! > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > -- > Gael Varoquaux > Senior Researcher, INRIA Parietal > NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France > Phone: ++ 33-1-69-08-79-68 <+33169087968> > http://gael-varoquaux.info http://twitter.com/GaelVaroquaux > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jk231092 at gmail.com Mon Dec 10 00:02:46 2018 From: jk231092 at gmail.com (Jitesh Khandelwal) Date: Mon, 10 Dec 2018 10:32:46 +0530 Subject: [scikit-learn] Agglomerative clustering Message-ID: Hi everyone, I am using agglomerative clustering with an L1 distance matrix as input and the "complete" linkage option. I want to impose an additional constraint. When 2 clusters are combined and the cost of combination is equal for multiple cluster pairs, I want to choose the pair for which the combined cluster has the least size. What is the cleanest and easiest way of achieving this? Thanks, Jitesh -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From gael.varoquaux at normalesup.org Mon Dec 10 00:58:52 2018 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Mon, 10 Dec 2018 06:58:52 +0100 Subject: [scikit-learn] Agglomerative clustering In-Reply-To: References: Message-ID: <20181210055852.tiyx7fa3eq4n277i@phare.normalesup.org> > I want to impose an additional constraint. When 2 clusters are combined and the > cost of combination is equal for multiple cluster pairs, I want to choose the > pair for which the combined cluster has the least size. > What is the cleanest and easiest way of achieving this? I don't think that the public API enables you to do that. So I think that you are going to have to modify the code, and modify the cost heapq to make it a tuple of "(distance, size)". Unfortunately, when doing this, you'll be on your own, as we cannot provide support for modified code. Cheers, Gaël From szx9404 at gmail.com Mon Dec 10 13:00:49 2018 From: szx9404 at gmail.com (parker x) Date: Mon, 10 Dec 2018 10:00:49 -0800 Subject: [scikit-learn] Question about contributing to scikit-learn In-Reply-To: References: Message-ID: Hi Emmanuel and Joel, Thanks very much for your advice. I will take a look at small issues first and see what to contribute from there. Best, Parker eamanu15 wrote on Sun, 9 Dec 2018 at 6:17: > Hello Parker, > > I can tell you my experience. > > I started contributing to sklearn two months ago, and I started with code > review; this way I could learn how sklearn is written and what the > workflow is, read issues, and try to solve them. Then I made some PRs. > > I can tell you that the core devs are very friendly and always help you. > In particular, I had the most contact with Joel Nothman and Andreas Mueller (thanks > guys). > > So, I hope this helps you in some way =) > > Regards!
> Emmanuel > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Mon Dec 10 17:57:19 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Tue, 11 Dec 2018 09:57:19 +1100 Subject: [scikit-learn] check_estimator and score_samples method In-Reply-To: References: Message-ID: We're trying to make check_estimator more flexible ( https://github.com/scikit-learn/scikit-learn/pull/8022) but this is certainly not something we had considered yet. Perhaps suggest it there? Or for now we could just make the check pass if score_samples raises a TypeError when given only X... -------------- next part -------------- An HTML attachment was scrubbed... URL: From pahome.chen at mirlab.org Tue Dec 11 04:09:40 2018 From: pahome.chen at mirlab.org (lampahome) Date: Tue, 11 Dec 2018 17:09:40 +0800 Subject: [scikit-learn] Why some regression algo can predict multiple out? Message-ID: As the title says, apart from sklearn.multioutput.MultiOutputRegressor, almost all regression algorithms in sklearn can only predict 1-d output. Ex: predicts 1-d output sklearn.linear_model.SGDRegressor fit(X, y, coef_init=None, intercept_init=None, sample_weight=None) y : numpy array, shape (n_samples,) Ex: predicts multiple outputs sklearn.linear_model.ElasticNet fit(X, y, check_input=True) y : ndarray, shape (n_samples,) or (n_samples, n_targets) There are two kinds of output for regression methods. What's the difference? -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Tue Dec 11 04:54:47 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Tue, 11 Dec 2018 20:54:47 +1100 Subject: [scikit-learn] Why some regression algo can predict multiple out?
In-Reply-To: References: Message-ID: Yes, some can use a shared model to predict multiple outputs (ElasticNet, DecisionTreeRegressor, MLPRegressor), others can't. Those that can't can be trivially extended to the multiple-output case with MultiOutputRegressor, by learning each output independently. On Tue, 11 Dec 2018 at 20:11, lampahome wrote: > As the title says, apart from sklearn.multioutput.MultiOutputRegressor, almost > all regression algorithms in sklearn can only predict 1-d output. > > Ex: predicts 1-d output > sklearn.linear_model.SGDRegressor > fit(X, y, coef_init=None, intercept_init=None, sample_weight=None) > y : numpy array, shape (n_samples,) > > Ex: predicts multiple outputs > sklearn.linear_model.ElasticNet > fit(X, y, check_input=True) > y : ndarray, shape (n_samples,) or (n_samples, n_targets) > > There are two kinds of output for regression methods. > > What's the difference? > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pahome.chen at mirlab.org Tue Dec 11 06:03:44 2018 From: pahome.chen at mirlab.org (lampahome) Date: Tue, 11 Dec 2018 19:03:44 +0800 Subject: [scikit-learn] Why some regression algo can predict multiple out? In-Reply-To: References: Message-ID: Joel Nothman wrote on Tue, 11 Dec 2018 at 17:56: > Yes, some can use a shared model to predict multiple outputs (ElasticNet, > DecisionTreeRegressor, MLPRegressor), others can't. Those that can't can be > trivially extended to the multiple-output case with MultiOutputRegressor, > by learning each output independently. > > I mean, why can those (ElasticNet, DecisionTreeRegressor, MLPRegressor) predict multiple outputs? What's the theory? Thanks a lot. -------------- next part -------------- An HTML attachment was scrubbed...
URL: From pahome.chen at mirlab.org Wed Dec 12 21:40:31 2018 From: pahome.chen at mirlab.org (lampahome) Date: Thu, 13 Dec 2018 10:40:31 +0800 Subject: [scikit-learn] Difference between linear model and tree-based regressor? Message-ID: Linear models: linear regression, Lasso regression, Elastic Net regression, etc. Tree-based: extra-trees regressor, random forest regressor, etc. What's the difference between them? One point I observe is: 1. linear models can extrapolate, tree-based ones can't. Is that right? -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Thu Dec 13 04:16:28 2018 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Thu, 13 Dec 2018 10:16:28 +0100 Subject: [scikit-learn] benchmarking TargetEncoder Was: ANN Dirty_cat: learning on dirty categories In-Reply-To: References: <20181120205818.vgm5fses2nprgvnl@phare.normalesup.org> <20181120211606.upltvviobudlurxe@phare.normalesup.org> <652a4474-c10c-0df9-e314-e16a415b59b8@gmail.com> <20181120214337.7unwskh7wtei4kj5@phare.normalesup.org> <4c5189a8-4beb-933f-1582-29c964c1cec4@gmail.com> <20181121053818.zwjmj6zgwharwpgp@phare.normalesup.org> <20181121153424.i3b7orguqhm243el@phare.normalesup.org> <52d96d5f-be24-20b0-707d-4e13b1494f38@gmail.com> <20181123084711.l22vhrbwikr5hamh@phare.normalesup.org> Message-ID: Hi all, I finally had some time to start looking at it in the last few days. Some preliminary work can be found here: https://github.com/jorisvandenbossche/target-encoder-benchmarks. Up to now, I have only done some preliminary work to set up the benchmarks (based on Patricio Cerda's code, https://arxiv.org/pdf/1806.00979.pdf), and with some initial datasets (medical charges and employee salaries) compared the different implementations with their default settings.
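For context, the basic quantity all of these implementations estimate is some regularized version of the per-category mean of the target. A minimal sketch of the unregularized idea, with optional smoothing towards the global mean (the function and variable names are illustrative only, not taken from any of the benchmarked libraries):

```python
import pandas as pd

def mean_target_encode(categories, y, smoothing=0.0):
    """Replace each category by the (optionally smoothed) mean of y.

    With smoothing > 0, rare categories are shrunk towards the global
    mean, the usual guard against overfitting on small groups.
    """
    df = pd.DataFrame({"cat": categories, "y": y})
    global_mean = df["y"].mean()
    stats = df.groupby("cat")["y"].agg(["mean", "count"])
    encoding = (stats["count"] * stats["mean"] + smoothing * global_mean) / (
        stats["count"] + smoothing
    )
    return df["cat"].map(encoding).to_numpy()

values = mean_target_encode(["a", "a", "b", "b", "b"], [1.0, 0.0, 1.0, 1.0, 0.0])
# category "a" -> 0.5, category "b" -> 2/3
```

Real implementations differ mainly in how they regularize this estimate and in using cross-fitting / leave-one-out schemes so that a sample's own target does not leak into its encoding, which is exactly what the benchmarks compare.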
So there is still a lot to do (add datasets, investigate the actual differences between the implementations and their results, compare the options in a more structured way, etc.; there are some TODOs listed in the README). However, I am now mostly on holiday for the rest of December. If somebody wants to look at it further, that is certainly welcome; otherwise, it will be a priority for me at the beginning of January. For datasets: additional ideas are welcome. For now, the idea is to add a subset of the Criteo Terabyte Click dataset, and to generate some data. >>> Does that mean you'd be opposed to adding the leave-one-out TargetEncoder >>> I would really like to add it before February >> A few months to get it right is not that bad, is it? > The PR is over a year old already, and you hadn't voiced any opposition > there. As far as I understand, the open PR is not a leave-one-out TargetEncoder? I also did not yet add the CountFeaturizer from that scikit-learn PR, because it is actually quite different (e.g. it doesn't work for regression tasks, as it counts conditional on y). But for classification it could easily be added to the benchmarks. Joris -------------- next part -------------- An HTML attachment was scrubbed... URL: From olivier.grisel at ensta.org Thu Dec 13 09:53:15 2018 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Thu, 13 Dec 2018 15:53:15 +0100 Subject: [scikit-learn] Difference between linear model and tree-based regressor? In-Reply-To: References: Message-ID: They are very different statistical models from a mathematical point of view. See the online scikit-learn documentation or reference textbooks such as "Elements of Statistical Learning" for more details. In practice, linear models tend to be faster to fit on large data, especially when the number of features is large (although it depends on the solver, loss, penalty, data scaling...).
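The contrast is easy to see on a toy example (a minimal sketch with default settings; the quadratic target is just an illustration, not a benchmark):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X.ravel() ** 2  # a target that is clearly not linear in X

linear = LinearRegression().fit(X, y)
tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, y)

# The linear model underfits the quadratic relation (R^2 close to 0),
# while the piecewise-constant tree approximates it well (R^2 close to 1).
print(linear.score(X, y))
print(tree.score(X, y))

# Extrapolation: outside the training range the tree's prediction stays
# within the range of training targets, while the fitted line keeps going.
X_out = np.array([[10.0]])
print(tree.predict(X_out))
print(linear.predict(X_out))
```

The tree interpolates the nonlinear shape well inside the training range but predicts a constant outside it, which is the "cannot extrapolate" behaviour discussed below.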
Linear models cannot fit prediction tasks where the data is not linearly separable (by definition), while tree-based models do not have this restriction. Tree-based models can still underfit in some cases, but for different reasons (e.g. when we limit the depth of the trees). Linear models can be made more expressive via feature engineering (e.g. k-bins discretization, polynomial feature expansion, Nystroem kernel approximation...) and can thereby sometimes be competitive with tree-based models even on tasks that were originally not linearly separable. However, this is not guaranteed either. Cross-validation and parameter tuning are still required to tell which class of model works best for a specific task. As you said, tree-based models "cannot extrapolate" in the sense that their decision function is piecewise constant, while the decision function of a linear model is a hyperplane. Depending on the task, the lack of extrapolation can be considered either a limitation or a benefit (for instance to avoid unrealistic extrapolations such as people with a negative age or size, predicting negative mechanical energy loss via heat dissipation, fractions larger than 100%, or 6-stars-out-of-5 recommendations...). -- Olivier -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbbrown at kuhp.kyoto-u.ac.jp Thu Dec 13 09:58:28 2018 From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.) Date: Thu, 13 Dec 2018 23:58:28 +0900 Subject: [scikit-learn] Difference between linear model and tree-based regressor? In-Reply-To: References: Message-ID: "Elements of Statistical Learning" is on my bookshelf, but even so, that was a great summary! J.B. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jcrudy at gmail.com Thu Dec 13 16:06:41 2018 From: jcrudy at gmail.com (Jason Rudy) Date: Thu, 13 Dec 2018 13:06:41 -0800 Subject: [scikit-learn] check_estimator and score_samples method In-Reply-To: References: Message-ID: Thanks, Joel.
From your response I assume that the use of a y argument to score_samples is not a violation of the sklearn API, so I'll keep the method and find a workaround for the check_estimator test as it's currently written. I'll comment on the issue as well. On Mon, Dec 10, 2018 at 2:58 PM Joel Nothman wrote: > We're trying to make check_estimator more flexible ( > https://github.com/scikit-learn/scikit-learn/pull/8022) but this is > certainly not something we had considered yet. Perhaps suggest it there? > > Or for now we could just make the check pass if score_samples yields a > TypeError with only X... > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Fri Dec 14 10:46:10 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Fri, 14 Dec 2018 10:46:10 -0500 Subject: [scikit-learn] benchmarking TargetEncoder Was: ANN Dirty_cat: learning on dirty categories In-Reply-To: References: <20181120205818.vgm5fses2nprgvnl@phare.normalesup.org> <20181120211606.upltvviobudlurxe@phare.normalesup.org> <652a4474-c10c-0df9-e314-e16a415b59b8@gmail.com> <20181120214337.7unwskh7wtei4kj5@phare.normalesup.org> <4c5189a8-4beb-933f-1582-29c964c1cec4@gmail.com> <20181121053818.zwjmj6zgwharwpgp@phare.normalesup.org> <20181121153424.i3b7orguqhm243el@phare.normalesup.org> <52d96d5f-be24-20b0-707d-4e13b1494f38@gmail.com> <20181123084711.l22vhrbwikr5hamh@phare.normalesup.org> Message-ID: <26d9146b-f673-ba0e-11d6-4266bec48407@gmail.com> On 12/13/18 4:16 AM, Joris Van den Bossche wrote: > Hi all, > > I finally had some time to start looking at it the last days. Some > preliminary work can be found here: > https://github.com/jorisvandenbossche/target-encoder-benchmarks. You continue to be my hero. 
Probably can not look at it in detail before the holidays though :-/ > > Up to now, I only did some preliminary work to set up the benchmarks > (based on Patricio Cerda's code, > https://arxiv.org/pdf/1806.00979.pdf), and with some initial datasets > (medical charges and employee salaries) compared the different > implementations with its default settings. > So there is still a lot to do (add datasets, investigate the actual > differences between the different implementations and results, in a > more structured way compare the options, etc, there are some todo's > listed in the README). However, now I am mostly on holidays for the > rest of December. If somebody wants to further look at it, that is > certainly welcome, otherwise, it will be a priority for me beginning > of January. > > For datasets: additional ideas are welcome. For now, the idea is to > add a subset of the Criteo Terabyte Click dataset, and to generate > some data. > > >>> Does that mean you'd be opposed to adding the leave-one-out TargetEncoder > >>> I would really like to add it before February > >> A few month to get it right is not that bad, is it? > > The PR is over a year old already, and you hadn't voiced any opposition > > there. > > As far as I understand, the open PR is not a leave-one-out TargetEncoder? I would want it to be :-/ > I also did not yet add the CountFeaturizer from that scikit-learn PR, > because it is actually quite different (e.g it doesn't work for > regression tasks, as it counts conditional on y). But for > classification it could be easily added to the benchmarks. I'm confused now. That's what TargetEncoder and leave-one-out TargetEncoder do as well, right? -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jorisvandenbossche at gmail.com Sat Dec 15 07:35:54 2018 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Sat, 15 Dec 2018 13:35:54 +0100 Subject: [scikit-learn] benchmarking TargetEncoder Was: ANN Dirty_cat: learning on dirty categories In-Reply-To: <26d9146b-f673-ba0e-11d6-4266bec48407@gmail.com> References: <20181120205818.vgm5fses2nprgvnl@phare.normalesup.org> <20181120211606.upltvviobudlurxe@phare.normalesup.org> <652a4474-c10c-0df9-e314-e16a415b59b8@gmail.com> <20181120214337.7unwskh7wtei4kj5@phare.normalesup.org> <4c5189a8-4beb-933f-1582-29c964c1cec4@gmail.com> <20181121053818.zwjmj6zgwharwpgp@phare.normalesup.org> <20181121153424.i3b7orguqhm243el@phare.normalesup.org> <52d96d5f-be24-20b0-707d-4e13b1494f38@gmail.com> <20181123084711.l22vhrbwikr5hamh@phare.normalesup.org> <26d9146b-f673-ba0e-11d6-4266bec48407@gmail.com> Message-ID: Op vr 14 dec. 2018 om 16:46 schreef Andreas Mueller : > As far as I understand, the open PR is not a leave-one-out TargetEncoder? > > I would want it to be :-/ > > I also did not yet add the CountFeaturizer from that scikit-learn PR, > because it is actually quite different (e.g it doesn't work for regression > tasks, as it counts conditional on y). But for classification it could be > easily added to the benchmarks. > > I'm confused now. That's what TargetEncoder and leave-one-out > TargetEncoder do as well, right?. > As far as I understand, that is not exactly what those do. The TargetEncoder (as implemented in dirty_cat, category_encoders and hccEncoders) will, for each category, calculate the expected value of the target depending on the category. 
For binary classification this indeed comes to counting the 0's and 1's, and there the information contained in the result might be similar as the sklearn PR, but the format is different: those packages calculate the probability (value between 0 and 1 as number of 1's divided by number of samples in that category) and return that as a single column, instead of returning two columns with the counts for the 0's and 1's. And for regression this is not related to counting anymore, but just the average of the target per category (in practice, the TargetEncoder is computing the same for regression or binary classification: the average of the target per category. But for regression, the CountFeaturizer doesn't work since there are no discrete values in the target to count). Furthermore, all of those implementations in the 3 mentioned packages have some kind of regularization (empirical bayes shrinkage, or KFold or leave-one-out cross-validation), while this is also not present in the CountFeaturizer PR (but this aspect is of course something we want to actually test in the benchmarks). Another thing I noticed in the CountFeaturizer implementation, is that the behaviour differs when y is passed or not. First, I find it a bit strange to do this as it is a quite different behaviour (counting the categories (to just encode the categorical variable with a notion about its frequency in the training set), or counting the target depending on the category is quite different?). But also, when using a transformer in a Pipeline, you don't control the passing of y, I think? So in that way, you always have the behaviour of counting the target. I would find it more logical to have those two things in two separate transformers (if we think the "frequency encoder" is useful enough). 
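To make that distinction concrete, here is a toy sketch in plain Python (invented data; it mirrors neither the scikit-learn PR's code nor the packages' implementations, and leaves out the regularization mentioned above):

```python
from collections import defaultdict

def target_encode(categories, y):
    """One column: the mean of the target per category. Works the same
    for regression and binary classification (no shrinkage/CV here)."""
    totals, counts = defaultdict(float), defaultdict(int)
    for cat, target in zip(categories, y):
        totals[cat] += target
        counts[cat] += 1
    means = {cat: totals[cat] / counts[cat] for cat in totals}
    return [means[cat] for cat in categories]

def count_featurize(categories, y, classes=(0, 1)):
    """One column per class: counts of each class within the sample's
    category. Only defined when y is discrete, hence not for regression."""
    counts = {}
    for cat, target in zip(categories, y):
        counts.setdefault(cat, {c: 0 for c in classes})[target] += 1
    return [[counts[cat][c] for c in classes] for cat in categories]

cats = ["a", "a", "b", "b", "b"]
y = [1, 0, 1, 1, 0]
# category "a" has target mean 0.5 and class counts [1, 1];
# category "b" has target mean 2/3 and class counts [1, 2]
```

For binary targets the two encodings carry similar information (the mean is just the normalized count of 1's), which is the overlap discussed above; for a continuous y only the first function remains meaningful.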
(I need to give this feedback on the PR, but that will be for after the holidays) Joris > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kouichi.matsuda at gmail.com Sat Dec 15 09:02:06 2018 From: kouichi.matsuda at gmail.com (Kouichi Matsuda) Date: Sat, 15 Dec 2018 09:02:06 -0500 Subject: [scikit-learn] MLPClassifier on WIndows 10 is 4 times slower than that on macOS? In-Reply-To: null Message-ID: Hi everyone, I am writing a scikit-learn program that uses MLPClassifier to learn Fashion-MNIST. The following is the program. It's very simple. When I ran it on a Windows 10 notebook (Core i7-8565U, 1.8GHz, 16GB), it took about 4 minutes. However, when I ran it on a MacBook (macOS), it took about 1 minute. Can anyone help me understand why Windows 10 is so slow? Am I missing something? Thanks,

import os
import gzip
import numpy as np

# from https://github.com/zalandoresearch/fashion-mnist/blob/master/utils/mnist_reader.py
def load_mnist(path, kind='train'):
    labels_path = os.path.join(path, '%s-labels-idx1-ubyte.gz' % kind)
    images_path = os.path.join(path, '%s-images-idx3-ubyte.gz' % kind)
    with gzip.open(labels_path, 'rb') as lbpath:
        labels = np.frombuffer(lbpath.read(), dtype=np.uint8, offset=8)
    with gzip.open(images_path, 'rb') as imgpath:
        images = np.frombuffer(imgpath.read(), dtype=np.uint8, offset=16)
        images = images.reshape(len(labels), 784)
    return images, labels

x_train, y_train = load_mnist('data', kind='train')
x_test, y_test = load_mnist('data', kind='t10k')

from sklearn.neural_network import MLPClassifier
import time
import datetime

print(datetime.datetime.today())
start = time.time()
mlp = MLPClassifier()
mlp.fit(x_train, y_train)
print((time.time() - start) / 60)

--- MATSUDA, Kouichi, Ph.D.
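One variable worth ruling out before blaming the hardware is the stopping criterion: MLPClassifier stops fitting once the training loss has failed to improve on the best loss by more than tol for some number of consecutive epochs (later in this thread it emerges that this patience was hardcoded to 2 before scikit-learn 0.20 and defaults to 10 from 0.20 on). A pure-Python sketch of that rule on made-up loss values (illustrative only, not scikit-learn's actual code):

```python
def stopping_epoch(losses, tol=1e-4, n_iter_no_change=10):
    """Return the 1-based epoch at which training would stop, or None.

    Rule sketched here: stop once the loss has failed to improve on the
    best loss seen so far by more than `tol` for `n_iter_no_change`
    consecutive epochs.
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss > best - tol:
            stale += 1          # not enough improvement this epoch
        else:
            stale = 0           # real improvement resets the counter
        best = min(best, loss)
        if stale >= n_iter_no_change:
            return epoch
    return None

plateau = [1.0, 0.5, 0.4999, 0.4998, 0.4997, 0.4996]
# with tol=1e-3 and a patience of 2 (the old hardcoded value), this
# plateau stops at epoch 4; with a patience of 10 it runs to the end
```

Two installs applying different patience values to the same loss curve will report very different wall-clock times for fit, independently of any BLAS differences.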
-------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Sat Dec 15 10:47:33 2018 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Sat, 15 Dec 2018 16:47:33 +0100 Subject: [scikit-learn] MLPClassifier on WIndows 10 is 4 times slower than that on macOS? In-Reply-To: References: Message-ID: <20181215154733.ljxvqdx7jfuhn3nx@phare.normalesup.org> I suspect that it is probably due to the linear-algebra libraries: your scientific Python install on macOS is probably using optimized linear algebra (i.e. optimized numpy and scipy), but your install on Windows is not. I would recommend looking at how you installed your Python distribution on macOS and on Windows, as you likely have installed an optimized one on one of the platforms and not on the other. Cheers, Gaël On Sat, Dec 15, 2018 at 09:02:06AM -0500, Kouichi Matsuda wrote: > Hi?Hi everyone, > I am writing a scikit-learn program to use MLPClassifier to learn > Fashion-MNIST. > The following is the program. It's very simple. > When I ran it on Windows 10 (Core-i7-8565U, 1.8GHz, 16GB) note book, it took > about 4 minutes. > However, when I ran it on MacBook(macOS), it took about 1 minutes. > Does anyone help me to understand the reason why Windows 10 is so slow? > Am I missing something? > Thanks,??
> import os import gzip import numpy as np #from https://github.com/ > zalandoresearch/fashion-mnist/blob/master/utils/mnist_reader.py def load_mnist > (path, kind='train'): labels_path = os.path.join(path,'%s-labels-idx1-ubyte.gz' > % kind) images_path = os.path.join(path,'%s-images-idx3-ubyte.gz' % kind) with > gzip.open(labels_path, 'rb') as lbpath: labels = np.frombuffer(lbpath.read(), > dtype=np.uint8, offset=8) with gzip.open(images_path, 'rb') as imgpath: images > = np.frombuffer(imgpath.read(), dtype=np.uint8, offset=16) images = > images.reshape(len(labels), 784) return images, labels x_train, y_train = > load_mnist('data', kind='train') x_test, y_test = load_mnist('data', kind= > 't10k') from sklearn.neural_network import MLPClassifier import time import > datetime print(datetime.datetime.today()) start = time.time() mlp = > MLPClassifier() mlp.fit(x_train, y_train) print((time.time() - start)/ 60) > --- > MATSUDA, Kouichi, Ph.D. > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Senior Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From minminmail at hotmail.com Sun Dec 16 17:09:22 2018 From: minminmail at hotmail.com (rui min) Date: Sun, 16 Dec 2018 22:09:22 +0000 Subject: [scikit-learn] plan to add the association rule classification algorithm in scikit learn Message-ID: Dear scikit-learn developers, I am Rui from Spain, Granada University. Currently I am planning to write an association rule algorithm in scikit-learn. I don?t know if anyone is working on that. So avoid duplication of the work, I would like to ask here. Hope to hear from you soon. Best Regards Rui -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From joel.nothman at gmail.com Mon Dec 17 01:26:26 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Mon, 17 Dec 2018 17:26:26 +1100 Subject: [scikit-learn] plan to add the association rule classification algorithm in scikit learn In-Reply-To: References: Message-ID: Hi Rui, This has been discussed several times on the mailing list and issue tracker. We are not interested in association rule mining in Scikit-learn for its own purposes. We would be interested in association rule mining only as part of a classification algorithm. Are there such algorithms which are mature and popular enough to meet our inclusion criteria (see our FAQ)? Cheers, Joel On Mon, 17 Dec 2018 at 09:24, rui min wrote: > Dear scikit-learn developers, > > > I am Rui from Spain, Granada University. Currently I am planning to > write an association rule algorithm in scikit-learn. > > I don?t know if anyone is working on that. So avoid duplication of the > work, I would like to ask here. > > > Hope to hear from you soon. > > > > Best Regards > > > > Rui > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mail at sebastianraschka.com Mon Dec 17 01:46:56 2018 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Mon, 17 Dec 2018 00:46:56 -0600 Subject: [scikit-learn] plan to add the association rule classification algorithm in scikit learn In-Reply-To: References: Message-ID: <8F02D137-802B-460C-9F02-B39967FDDB6D@sebastianraschka.com> Hi Rui, I agree with Joel that association rule mining could be a bit tricky to fit nicely within the scikit-learn API. Maybe this could be some transformer class? I thought about that a few years ago but remember that I couldn't come up with a good solution at that point. 
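For readers new to the topic: frequent-itemset mining means finding all item combinations whose support (the fraction of transactions containing them) reaches a threshold; apriori, Eclat and FP-Growth are increasingly clever ways of computing that same answer. A brute-force pure-Python sketch on made-up transactions (illustrative only; exponential in the number of items):

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support=0.6):
    """Brute-force frequent-itemset search; fine for toy data only."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    result = {}
    for size in range(1, len(items) + 1):
        any_frequent = False
        for combo in combinations(items, size):
            support = sum(set(combo) <= t for t in transactions) / n
            if support >= min_support:
                result[combo] = support
                any_frequent = True
        if not any_frequent:
            # apriori property: every subset of a frequent itemset is
            # frequent, so no larger itemset can be frequent either
            break
    return result

transactions = [{"milk", "bread"}, {"milk", "bread", "eggs"}, {"milk"}]
# frequent_itemsets(transactions) keeps ('milk',), ('bread',) and
# ('bread', 'milk'); 'eggs' appears in only 1 of 3 transactions
```

Association rules are then derived from these itemsets by comparing the support of a rule's antecedent and consequent (confidence, lift, ...), which is the second step the mlxtend API discussed below separates out.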
In any case, I have an association rule implementation in mlxtend (http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/), which is based on the apriori algorithm. Some users were asking about the Eclat and FP-Growth algorithms instead of apriori. I would be very happy about a contribution implementing Eclat or FP-Growth (see the issue tracker at https://github.com/rasbt/mlxtend/issues/248), such that instead of

frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)
association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)

one could use

frequent_itemsets = eclat(df, min_support=0.6, use_colnames=True)

or

frequent_itemsets = fpgrowth(df, min_support=0.6, use_colnames=True)
association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)

If you had an alternative algorithm for frequent itemset generation in mind (I am not sure whether others exist, to be honest), I would be happy about that one, too. Best, Sebastian > On Dec 17, 2018, at 12:26 AM, Joel Nothman wrote: > > Hi Rui, > > This has been discussed several times on the mailing list and issue tracker. We are not interested in association rule mining in Scikit-learn for its own purposes. We would be interested in association rule mining only as part of a classification algorithm. Are there such algorithms which are mature and popular enough to meet our inclusion criteria (see our FAQ)? > > Cheers, > > Joel > > On Mon, 17 Dec 2018 at 09:24, rui min wrote: > Dear scikit-learn developers, > > I am Rui from Spain, Granada University. Currently I am planning to write an association rule algorithm in scikit-learn. > I don?t know if anyone is working on that. So avoid duplication of the work, I would like to ask here. > > Hope to hear from you soon.
> > > Best Regards > > > Rui > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From kouichi.matsuda at gmail.com Mon Dec 17 09:54:07 2018 From: kouichi.matsuda at gmail.com (Kouichi Matsuda) Date: Mon, 17 Dec 2018 06:54:07 -0800 Subject: [scikit-learn] MLPClassifier on WIndows 10 is 4 times slower than that on macOS? In-Reply-To: <20181215154733.ljxvqdx7jfuhn3nx@phare.normalesup.org> References: <20181215154733.ljxvqdx7jfuhn3nx@phare.normalesup.org> Message-ID: Thank you for your quick reply. It's very helpful. It turns out to be caused by Anaconda: its Python stops the iterations much earlier, as shown below (with verbose=True). I am not sure why 'n_iter_no_change=10' would be changed in Anaconda. Anaconda might modify the MLPClassifier implementation.

> python learn.py (in pure Python+Scikit-Learn)
...
Iteration 125, loss = 0.26152263
Iteration 126, loss = 0.25705940
Iteration 127, loss = 0.25957841
Training loss did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
0.8496

> python learn.py (in Anaconda)
...
Iteration 23, loss = 0.34410594
Iteration 24, loss = 0.34663903
Iteration 25, loss = 0.34376815
Training loss did not improve more than tol=0.000100 for two consecutive epochs. Stopping.
0.852

Thanks, --- ???? MATSUDA, Kouichi, Ph.D. 2018?12?16?(?) 0:50 Gael Varoquaux : > I suspect that it is probably due to the linear-algebra libraries: your > scientific Python install on macOS is probably using optimized > linear-algebra (ie optimized numpy and scipy), but not your install on > Windows. > > I would recommend you to look at how you installed you Python > distribution on macOS and on Windows, as you likely have installed an > optimized one on one of the platforms and not on the other.
> > Cheers, > > Ga?l > > On Sat, Dec 15, 2018 at 09:02:06AM -0500, Kouichi Matsuda wrote: > > Hi Hi everyone, > > > I am writing a scikit-learn program to use MLPClassifier to learn > > Fashion-MNIST. > > The following is the program. It's very simple. > > When I ran it on Windows 10 (Core-i7-8565U, 1.8GHz, 16GB) note book, it > took > > about 4 minutes. > > However, when I ran it on MacBook(macOS), it took about 1 minutes. > > Does anyone help me to understand the reason why Windows 10 is so slow? > > Am I missing something? > > > Thanks, > > > import os import gzip import numpy as np #from https://github.com/ > > zalandoresearch/fashion-mnist/blob/master/utils/mnist_reader.py def > load_mnist > > (path, kind='train'): labels_path = > os.path.join(path,'%s-labels-idx1-ubyte.gz' > > % kind) images_path = os.path.join(path,'%s-images-idx3-ubyte.gz' % > kind) with > > gzip.open(labels_path, 'rb') as lbpath: labels = > np.frombuffer(lbpath.read(), > > dtype=np.uint8, offset=8) with gzip.open(images_path, 'rb') as imgpath: > images > > = np.frombuffer(imgpath.read(), dtype=np.uint8, offset=16) images = > > images.reshape(len(labels), 784) return images, labels x_train, y_train = > > load_mnist('data', kind='train') x_test, y_test = load_mnist('data', > kind= > > 't10k') from sklearn.neural_network import MLPClassifier import time > import > > datetime print(datetime.datetime.today()) start = time.time() mlp = > > MLPClassifier() mlp.fit(x_train, y_train) print((time.time() - start)/ > 60) > > > > --- > > MATSUDA, Kouichi, Ph.D. 
> > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > -- > Gael Varoquaux > Senior Researcher, INRIA Parietal > NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France > Phone: ++ 33-1-69-08-79-68 > http://gael-varoquaux.info http://twitter.com/GaelVaroquaux > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From g.lemaitre58 at gmail.com Mon Dec 17 10:01:54 2018 From: g.lemaitre58 at gmail.com (=?ISO-8859-1?Q?Guillaume_Lema=EEtre?=) Date: Mon, 17 Dec 2018 16:01:54 +0100 Subject: [scikit-learn] MLPClassifier on WIndows 10 is 4 times slower than that on macOS? In-Reply-To: Message-ID: An HTML attachment was scrubbed... URL: From g.lemaitre58 at gmail.com Mon Dec 17 10:14:52 2018 From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=) Date: Mon, 17 Dec 2018 16:14:52 +0100 Subject: [scikit-learn] MLPClassifier on WIndows 10 is 4 times slower than that on macOS? In-Reply-To: References: Message-ID: I checked on 0.20.1 using scikit-learn shipped by Anaconda and both seem to have the same default. On Mon, 17 Dec 2018 at 16:01, Guillaume Lema?tre wrote: > could you provide the scikit-learn version in both case? > > Sent from my phone - sorry to be brief and potential misspell. > *From:* kouichi.matsuda at gmail.com > *Sent:* 17 December 2018 15:56 > *To:* scikit-learn at python.org > *Reply to:* scikit-learn at python.org > *Subject:* Re: [scikit-learn] MLPClassifier on WIndows 10 is 4 times > slower than that on macOS? > > Thank you for your quick reply. It's very helpful. > It's because of Anaconda: Its python stops the iteration soon as follows > (w/ verbose=True). > I am not sure why 'n_iter_no_change=10' is changed in Anaconda. 
> Anaconda might modify the MLPClassifier implementation. > > > python learn.py (in pure Python+Scikit-Learn) > ... > > Iteration 125, loss = 0.26152263 > > Iteration 126, loss = 0.25705940 > > Iteration 127, loss = 0.25957841 > > Training loss did not improve more than tol=0.000100 for 10 consecutive > epochs. Stopping. > 0.8496 > > > python learn.py (in Anaconda) > ... > Iteration 23, loss = 0.34410594 > Iteration 24, loss = 0.34663903 > Iteration 25, loss = 0.34376815 > Training loss did not improve more than tol=0.000100 for two consecutive > epochs. Stopping. > 0.852 > > Thanks, > > > --- > ???? MATSUDA, Kouichi, Ph.D. > > > 2018?12?16?(?) 0:50 Gael Varoquaux : > >> I suspect that it is probably due to the linear-algebra libraries: your >> scientific Python install on macOS is probably using optimized >> linear-algebra (ie optimized numpy and scipy), but not your install on >> Windows. >> >> I would recommend you to look at how you installed you Python >> distribution on macOS and on Windows, as you likely have installed an >> optimized one on one of the platforms and not on the other. >> >> Cheers, >> >> Ga?l >> >> On Sat, Dec 15, 2018 at 09:02:06AM -0500, Kouichi Matsuda wrote: >> > Hi Hi everyone, >> >> > I am writing a scikit-learn program to use MLPClassifier to learn >> > Fashion-MNIST. >> > The following is the program. It's very simple. >> > When I ran it on Windows 10 (Core-i7-8565U, 1.8GHz, 16GB) note book, it >> took >> > about 4 minutes. >> > However, when I ran it on MacBook(macOS), it took about 1 minutes. >> > Does anyone help me to understand the reason why Windows 10 is so slow? >> > Am I missing something? 
>> >> > Thanks, >> >> > import os import gzip import numpy as np #from https://github.com/ >> > zalandoresearch/fashion-mnist/blob/master/utils/mnist_reader.py def >> load_mnist >> > (path, kind='train'): labels_path = os.path.join(path,'% >> s-labels-idx1-ubyte.gz' >> > % kind) images_path = os.path.join(path,'%s-images-idx3-ubyte.gz' % >> kind) with >> > gzip.open(labels_path, 'rb') as lbpath: labels = np.frombuffer( >> lbpath.read(), >> > dtype=np.uint8, offset=8) with gzip.open(images_path, 'rb') as >> imgpath: images >> > = np.frombuffer(imgpath.read(), dtype=np.uint8, offset=16) images = >> > images.reshape(len(labels), 784) return images, labels x_train, >> y_train = >> > load_mnist('data', kind='train') x_test, y_test = load_mnist('data', >> kind= >> > 't10k') from sklearn.neural_network import MLPClassifier import time >> import >> > datetime print(datetime.datetime.today()) start = time.time() mlp = >> > MLPClassifier() mlp.fit(x_train, y_train) print((time.time() - start)/ >> 60) >> >> >> > --- >> > MATSUDA, Kouichi, Ph.D. >> >> > _______________________________________________ >> > scikit-learn mailing list >> > scikit-learn at python.org >> > https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> -- >> Gael Varoquaux >> Senior Researcher, INRIA Parietal >> NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France >> Phone: ++ 33-1-69-08-79-68 <+33169087968> >> http://gael-varoquaux.info >> http://twitter.com/GaelVaroquaux >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > -- Guillaume Lemaitre INRIA Saclay - Parietal team Center for Data Science Paris-Saclay https://glemaitre.github.io/ -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From t3kcit at gmail.com Mon Dec 17 22:07:31 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Mon, 17 Dec 2018 22:07:31 -0500 Subject: [scikit-learn] plan to add the association rule classification algorithm in scikit learn In-Reply-To: <8F02D137-802B-460C-9F02-B39967FDDB6D@sebastianraschka.com> References: <8F02D137-802B-460C-9F02-B39967FDDB6D@sebastianraschka.com> Message-ID: Can we add this to the FAQ as out of scope? Sebastian: feel free to put more into mlxtend :P On 12/17/18 1:46 AM, Sebastian Raschka wrote: > Hi Rui, > > I agree with Joel that association rule mining could be a bit tricky to fit nicely within the scikit-learn API. Maybe this could be some transformer class? I thought about that a few years ago but remember that I couldn't come up with a good solution at that point. > > In any case, I have an association rule implementation in mlxtend (http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/), which is based on the apriori algorithm. Some users were asking about Eclat and FP-Growth algorithms, instead of apriori. If you are interested in such a contribution, i.e., implementing Eclat or FP-Growth such that instead of > > frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True) > association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7) > > one could use > > frequent_itemsets = eclat(df, min_support=0.6, use_colnames=True) > > or > > frequent_itemsets = fpgrowth(df, min_support=0.6, use_colnames=True) > association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7) > > I would be very happy about such a contribution (see issue tracker at https://github.com/rasbt/mlxtend/issues/248) > > If you had an alternative algorithm for frequent itemset generation in mind (I am not sure if others exist, to be honest). I would also be happy about that one, too. 
> > Best, > Sebastian > >> On Dec 17, 2018, at 12:26 AM, Joel Nothman wrote: >> >> Hi Rui, >> >> This has been discussed several times on the mailing list and issue tracker. We are not interested in association rule mining in Scikit-learn for its own purposes. We would be interested in association rule mining only as part of a classification algorithm. Are there such algorithms which are mature and popular enough to meet our inclusion criteria (see our FAQ)? >> >> Cheers, >> >> Joel >> >> On Mon, 17 Dec 2018 at 09:24, rui min wrote: >> Dear scikit-learn developers, >> >> I am Rui from Spain, Granada University. Currently I am planning to write an association rule algorithm in scikit-learn. >> I don?t know if anyone is working on that. So avoid duplication of the work, I would like to ask here. >> >> Hope to hear from you soon. >> >> >> Best Regards >> >> >> Rui >> >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From dmitrii.ignatov at gmail.com Tue Dec 18 03:17:00 2018 From: dmitrii.ignatov at gmail.com (Dmitry Ignatov) Date: Tue, 18 Dec 2018 11:17:00 +0300 Subject: [scikit-learn] plan to add the association rule classification algorithm in scikit learn In-Reply-To: <8F02D137-802B-460C-9F02-B39967FDDB6D@sebastianraschka.com> References: <8F02D137-802B-460C-9F02-B39967FDDB6D@sebastianraschka.com> Message-ID: Hi All, Just a short comment to "If you had an alternative algorithm for frequent itemset generation in mind (I am not sure if others exist, to be honest). I would also be happy about that one, too." 
There are many other techniques and their modifications for related problems like sequence mining, see e.g. here: http://www.philippe-fournier-viger.com/spmf/. In my opinion, a notable difference for practice exists between frequent itemsets and closed (frequent) itemsets; the latter may reduce an output drastically. However, combinatorial explosion w.r.t. the number of produced patterns is an issue here. Best, Dmitry ??, 17 ???. 2018 ?. ? 10:12, Sebastian Raschka : > Hi Rui, > > I agree with Joel that association rule mining could be a bit tricky to > fit nicely within the scikit-learn API. Maybe this could be some > transformer class? I thought about that a few years ago but remember that I > couldn't come up with a good solution at that point. > > In any case, I have an association rule implementation in mlxtend ( > http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/), > which is based on the apriori algorithm. Some users were asking about Eclat > and FP-Growth algorithms, instead of apriori. If you are interested in such > a contribution, i.e., implementing Eclat or FP-Growth such that instead of > > frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True) > association_rules(frequent_itemsets, metric="confidence", > min_threshold=0.7) > > one could use > > frequent_itemsets = eclat(df, min_support=0.6, use_colnames=True) > > or > > frequent_itemsets = fpgrowth(df, min_support=0.6, use_colnames=True) > association_rules(frequent_itemsets, metric="confidence", > min_threshold=0.7) > > I would be very happy about such a contribution (see issue tracker at > https://github.com/rasbt/mlxtend/issues/248) > > If you had an alternative algorithm for frequent itemset generation in > mind (I am not sure if others exist, to be honest). I would also be happy > about that one, too. 
> > Best, > Sebastian > > > On Dec 17, 2018, at 12:26 AM, Joel Nothman > wrote: > > > > Hi Rui, > > > > This has been discussed several times on the mailing list and issue > tracker. We are not interested in association rule mining in Scikit-learn > for its own purposes. We would be interested in association rule mining > only as part of a classification algorithm. Are there such algorithms which > are mature and popular enough to meet our inclusion criteria (see our FAQ)? > > > > Cheers, > > > > Joel > > > > On Mon, 17 Dec 2018 at 09:24, rui min wrote: > > Dear scikit-learn developers, > > > > I am Rui from Spain, Granada University. Currently I am planning to > write an association rule algorithm in scikit-learn. > > I don't know if anyone is working on that. So to avoid duplication of > the work, I would like to ask here. > > > > Hope to hear from you soon. > > > > > > Best Regards > > > > > > Rui > > > > > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kouichi.matsuda at gmail.com Tue Dec 18 07:16:00 2018 From: kouichi.matsuda at gmail.com (Kouichi Matsuda) Date: Tue, 18 Dec 2018 21:16:00 +0900 Subject: [scikit-learn] MLPClassifier on WIndows 10 is 4 times slower than that on macOS? In-Reply-To: References: Message-ID: Great! Thanks. Whoops, the latest Anaconda does not support the latest scikit-learn... >>> print(sklearn.__version__) 0.19.2 I should have checked the change log ...
orz >> n_iter_no_change parameter now at 10 from previously hardcoded 2. #9456 by Nicholas Nadeau. It might be confusing to change it to be stricter. Thanks and sorry for bothering you. --- MATSUDA, Kouichi, Ph.D. On Tue, 18 Dec 2018 at 0:17, Guillaume Lemaître wrote: > I checked on 0.20.1 using scikit-learn shipped by Anaconda and both seem > to have the same default. > > On Mon, 17 Dec 2018 at 16:01, Guillaume Lemaître > wrote: > >> could you provide the scikit-learn version in both cases? >> >> Sent from my phone - sorry to be brief and potential misspell. >> *From:* kouichi.matsuda at gmail.com >> *Sent:* 17 December 2018 15:56 >> *To:* scikit-learn at python.org >> *Reply to:* scikit-learn at python.org >> *Subject:* Re: [scikit-learn] MLPClassifier on WIndows 10 is 4 times >> slower than that on macOS? >> >> Thank you for your quick reply. It's very helpful. >> It's because of Anaconda: its Python stops the iterations early, as follows >> (w/ verbose=True). >> I am not sure why 'n_iter_no_change=10' was changed in Anaconda. >> Anaconda might modify the MLPClassifier implementation. >> >> > python learn.py (in pure Python+Scikit-Learn) >> ... >> >> Iteration 125, loss = 0.26152263 >> >> Iteration 126, loss = 0.25705940 >> >> Iteration 127, loss = 0.25957841 >> >> Training loss did not improve more than tol=0.000100 for 10 consecutive >> epochs. Stopping. >> 0.8496 >> >> > python learn.py (in Anaconda) >> ... >> Iteration 23, loss = 0.34410594 >> Iteration 24, loss = 0.34663903 >> Iteration 25, loss = 0.34376815 >> Training loss did not improve more than tol=0.000100 for two consecutive >> epochs. Stopping. >> 0.852 >> >> Thanks, >> >> >> --- >> MATSUDA, Kouichi, Ph.D. >> >> >> On Sun, 16 Dec 2018 at 0:50, Gael Varoquaux wrote: >> >>> I suspect that it is probably due to the linear-algebra libraries: your >>> scientific Python install on macOS is probably using optimized >>> linear-algebra (ie optimized numpy and scipy), but not your install on >>> Windows.
>>> >>> I would recommend you to look at how you installed your Python >>> distribution on macOS and on Windows, as you likely have installed an >>> optimized one on one of the platforms and not on the other. >>> >>> Cheers, >>> >>> Gaël >>> >>> On Sat, Dec 15, 2018 at 09:02:06AM -0500, Kouichi Matsuda wrote: >>> > Hi everyone, >>> >>> > I am writing a scikit-learn program to use MLPClassifier to learn >>> > Fashion-MNIST. >>> > The following is the program. It's very simple. >>> > When I ran it on a Windows 10 (Core-i7-8565U, 1.8GHz, 16GB) notebook, >>> it took >>> > about 4 minutes. >>> > However, when I ran it on a MacBook (macOS), it took about 1 minute. >>> > Can anyone help me understand the reason why Windows 10 is so slow? >>> > Am I missing something? >>> >>> > Thanks, >>>
>>> > import os
>>> > import gzip
>>> > import numpy as np
>>> >
>>> > # from https://github.com/zalandoresearch/fashion-mnist/blob/master/utils/mnist_reader.py
>>> > def load_mnist(path, kind='train'):
>>> >     labels_path = os.path.join(path, '%s-labels-idx1-ubyte.gz' % kind)
>>> >     images_path = os.path.join(path, '%s-images-idx3-ubyte.gz' % kind)
>>> >     with gzip.open(labels_path, 'rb') as lbpath:
>>> >         labels = np.frombuffer(lbpath.read(), dtype=np.uint8, offset=8)
>>> >     with gzip.open(images_path, 'rb') as imgpath:
>>> >         images = np.frombuffer(imgpath.read(), dtype=np.uint8, offset=16)
>>> >         images = images.reshape(len(labels), 784)
>>> >     return images, labels
>>> >
>>> > x_train, y_train = load_mnist('data', kind='train')
>>> > x_test, y_test = load_mnist('data', kind='t10k')
>>> >
>>> > from sklearn.neural_network import MLPClassifier
>>> > import time
>>> > import datetime
>>> >
>>> > print(datetime.datetime.today())
>>> > start = time.time()
>>> > mlp = MLPClassifier()
>>> > mlp.fit(x_train, y_train)
>>> > print((time.time() - start) / 60)
>>> >>> >>> > --- >>> > MATSUDA, Kouichi, Ph.D.
>>> >>> > _______________________________________________ >>> > scikit-learn mailing list >>> > scikit-learn at python.org >>> > https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >>> -- >>> Gael Varoquaux >>> Senior Researcher, INRIA Parietal >>> NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France >>> Phone: ++ 33-1-69-08-79-68 <+33169087968> >>> http://gael-varoquaux.info >>> http://twitter.com/GaelVaroquaux >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> > > -- > Guillaume Lemaitre > INRIA Saclay - Parietal team > Center for Data Science Paris-Saclay > https://glemaitre.github.io/ > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From olivier.grisel at ensta.org Tue Dec 18 12:05:21 2018 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Tue, 18 Dec 2018 18:05:21 +0100 Subject: [scikit-learn] MLPClassifier on WIndows 10 is 4 times slower than that on macOS? In-Reply-To: References: Message-ID: You should probably just "conda update scikit-learn": scikit-learn 0.20.1 is available on the official anaconda channel for all supported operating systems: https://anaconda.org/anaconda/scikit-learn -- Olivier -------------- next part -------------- An HTML attachment was scrubbed... URL: From kouichi.matsuda at gmail.com Tue Dec 18 18:58:32 2018 From: kouichi.matsuda at gmail.com (Kouichi Matsuda) Date: Wed, 19 Dec 2018 08:58:32 +0900 Subject: [scikit-learn] MLPClassifier on WIndows 10 is 4 times slower than that on macOS? In-Reply-To: References: Message-ID: Great! Thanks! On Wed, 19 Dec 2018 at 2:07 PM, Olivier Grisel <olivier.grisel at ensta.org> wrote: > You should probably just "conda update scikit-learn": > > scikit-learn 0.20.1 is available on the official anaconda channel for all > supported operating systems: > https://anaconda.org/anaconda/scikit-learn > -- > Olivier > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Wed Dec 19 17:27:21 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 19 Dec 2018 17:27:21 -0500 Subject: [scikit-learn] Next Sprint In-Reply-To: <20181120192519.gbagzrvzzqljglme@phare.normalesup.org> References: <20181115144120.nnpufumsosmpamov@phare.normalesup.org> <10d2f8f2-efbf-c72d-2d96-b1b585003a47@gmail.com> <621b1350-2112-e8b0-c7f1-cbc739f0262e@gmail.com> <7f141338-1d9f-e516-3d7e-cb8232a0720f@gmail.com> <20181120192519.gbagzrvzzqljglme@phare.normalesup.org> Message-ID: <1b8d4167-f588-2264-5f72-9d59258c9422@gmail.com> Can we please nail down dates for a sprint? On 11/20/18 2:25 PM, Gael Varoquaux wrote: > On Tue, Nov 20, 2018 at 08:15:07PM +0100, Olivier Grisel wrote: >> We can also do Paris in April / May or June if that's ok with Joel and better >> for Andreas. > Absolutely. > > My thoughts here are that I want to minimize transportation, partly > because flying has a large carbon footprint. Also, for personal reasons, > I am not sure that I will be able to make it to Austin in July, but I > realize that this is a pretty bad argument. > > We're happy to try to host in Paris whenever it's most convenient and to > try to help with travel for those not in Paris.
> > Gaël > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From t3kcit at gmail.com Wed Dec 19 17:31:23 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 19 Dec 2018 17:31:23 -0500 Subject: [scikit-learn] benchmarking TargetEncoder Was: ANN Dirty_cat: learning on dirty categories In-Reply-To: References: <20181120205818.vgm5fses2nprgvnl@phare.normalesup.org> <20181120211606.upltvviobudlurxe@phare.normalesup.org> <652a4474-c10c-0df9-e314-e16a415b59b8@gmail.com> <20181120214337.7unwskh7wtei4kj5@phare.normalesup.org> <4c5189a8-4beb-933f-1582-29c964c1cec4@gmail.com> <20181121053818.zwjmj6zgwharwpgp@phare.normalesup.org> <20181121153424.i3b7orguqhm243el@phare.normalesup.org> <52d96d5f-be24-20b0-707d-4e13b1494f38@gmail.com> <20181123084711.l22vhrbwikr5hamh@phare.normalesup.org> <26d9146b-f673-ba0e-11d6-4266bec48407@gmail.com> Message-ID: <8fa34e9c-7205-4803-16c1-7c52b09178ee@gmail.com> On 12/15/18 7:35 AM, Joris Van den Bossche wrote: > On Fri, 14 Dec 2018 at 16:46, Andreas Mueller wrote: > >> As far as I understand, the open PR is not a leave-one-out >> TargetEncoder? > I would want it to be :-/ >> I also did not yet add the CountFeaturizer from that scikit-learn >> PR, because it is actually quite different (e.g. it doesn't work >> for regression tasks, as it counts conditional on y). But for >> classification it could be easily added to the benchmarks. > I'm confused now. That's what TargetEncoder and leave-one-out > TargetEncoder do as well, right? > > > As far as I understand, that is not exactly what those do. The > TargetEncoder (as implemented in dirty_cat, category_encoders and > hccEncoders) will, for each category, calculate the expected value of > the target depending on the category.
For binary classification this > indeed comes to counting the 0's and 1's, and there the information > contained in the result might be similar to the sklearn PR, but the > format is different: those packages calculate the probability (value > between 0 and 1 as number of 1's divided by number of samples in that > category) and return that as a single column, instead of returning two > columns with the counts for the 0's and 1's. This is a standard case of the "binary special case", right? For multi-class you need multiple columns, right? Doing a single column for binary makes sense, I think. > And for regression this is not related to counting anymore, but just > the average of the target per category (in practice, the TargetEncoder > is computing the same for regression or binary classification: the > average of the target per category. But for regression, the > CountFeaturizer doesn't work since there are no discrete values in the > target to count). I guess CountFeaturizer was not implemented with regression in mind. Actually being able to do regression and classification in the same estimator shows that "CountFeaturizer" is probably the wrong name. > > Furthermore, all of those implementations in the 3 mentioned packages > have some kind of regularization (empirical bayes shrinkage, or KFold > or leave-one-out cross-validation), while this is also not present in > the CountFeaturizer PR (but this aspect is of course something we want > to actually test in the benchmarks). > > Another thing I noticed in the CountFeaturizer implementation, is that > the behaviour differs when y is passed or not. First, I find it a bit > strange to do this as it is a quite different behaviour (counting the > categories (to just encode the categorical variable with a notion > about its frequency in the training set), or counting the target > depending on the category is quite different?). But also, when using a
But also, when using a > transformer in a Pipeline, you don't control the passing of y, I > think? So in that way, you always have the behaviour of counting the > target. > I would find it more logical to have those two things in two separate > transformers (if we think the "frequency encoder" is useful enough). > (I need to give this feedback on the PR, but that will be for after > the holidays) > I'm pretty sure I mentioned that before, I think optional y is bad. I just thought it was weird but the pipeline argument is a good one. -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Wed Dec 19 17:33:02 2018 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Wed, 19 Dec 2018 23:33:02 +0100 Subject: [scikit-learn] Next Sprint In-Reply-To: <1b8d4167-f588-2264-5f72-9d59258c9422@gmail.com> References: <20181115144120.nnpufumsosmpamov@phare.normalesup.org> <10d2f8f2-efbf-c72d-2d96-b1b585003a47@gmail.com> <621b1350-2112-e8b0-c7f1-cbc739f0262e@gmail.com> <7f141338-1d9f-e516-3d7e-cb8232a0720f@gmail.com> <20181120192519.gbagzrvzzqljglme@phare.normalesup.org> <1b8d4167-f588-2264-5f72-9d59258c9422@gmail.com> Message-ID: <20181219223302.zsz2no2wkngyi2cu@phare.normalesup.org> I would propose the week of Feb 25th, as I heard people say that they might be available at this time. It is good for many people, or should we organize a doodle? G On Wed, Dec 19, 2018 at 05:27:21PM -0500, Andreas Mueller wrote: > Can we please nail down dates for a sprint? > On 11/20/18 2:25 PM, Gael Varoquaux wrote: > > On Tue, Nov 20, 2018 at 08:15:07PM +0100, Olivier Grisel wrote: > > > We can also do Paris in April / May or June if that's ok with Joel and better > > > for Andreas. > > Absolutely. > > My thoughts here are that I want to minimize transportation, partly > > because flying has a large carbon footprint. 
Also, for personal reasons, > > I am not sure that I will be able to make it to Austin in July, but I > > realize that this is a pretty bad argument. > > We're happy to try to host in Paris whenever it's most convenient and to > > try to help with travel for those not in Paris. > > Gaël > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Senior Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From pahome.chen at mirlab.org Thu Dec 20 02:09:34 2018 From: pahome.chen at mirlab.org (lampahome) Date: Thu, 20 Dec 2018 15:09:34 +0800 Subject: [scikit-learn] time complexity of tree-based model? Message-ID: I'm doing some benchmarks in my experiments, and I almost always use ensemble-based regressors. What is the time complexity if I use a random forest regressor? Assume I only set *n_estimators=100* and leave the other parameters at their defaults. thx -------------- next part -------------- An HTML attachment was scrubbed... URL: From mail at sebastianraschka.com Thu Dec 20 02:19:48 2018 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Thu, 20 Dec 2018 01:19:48 -0600 Subject: [scikit-learn] time complexity of tree-based model? In-Reply-To: References: Message-ID: <9CEAACA6-67A4-4382-AA5B-BDD6788D9905@sebastianraschka.com> Say n is the number of examples and m is the number of features, then a naive implementation of a balanced binary decision tree is O(m * n^2 log n). I think scikit-learn's decision tree caches the sorted features, so this reduces to O(m * n log n).
Then, multiply that O(m * n log n) by the number of decision trees in the forest. Best, Sebastian > On Dec 20, 2018, at 1:09 AM, lampahome wrote: > > I'm doing some benchmarks in my experiments, and I almost always use ensemble-based regressors. > > What is the time complexity if I use a random forest regressor? Assume I only set n_estimators=100 and leave the other parameters at their defaults. > > thx > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From aneto at chatdesk.com Thu Dec 20 10:20:15 2018 From: aneto at chatdesk.com (Aneto) Date: Thu, 20 Dec 2018 09:20:15 -0600 Subject: [scikit-learn] How to keep a model running in memory? Message-ID: Hi scikit-learn community, We currently use scikit-learn for a model that generates predictions on a server endpoint. We would like to keep the model running in memory instead of having to re-load the model for every new request that comes in to the server. Can you please point us in the right direction for this? Any tutorials or examples would help. In case it's helpful, we use Flask for our web server. Thank you! Aneto -------------- next part -------------- An HTML attachment was scrubbed... URL: From rhochmuth at alteryx.com Thu Dec 20 11:12:18 2018 From: rhochmuth at alteryx.com (Roland Hochmuth) Date: Thu, 20 Dec 2018 16:12:18 +0000 Subject: [scikit-learn] How to keep a model running in memory? In-Reply-To: References: Message-ID: Hi Liam, Not sure I have the complete context for what you are trying to do, but have you considered using Python multiprocessing to start a separate process? The lifecycle of that process could start when the Flask server starts up or on the first request. The separate process would load and run the model. Depending on what you would like to do, some form of IPC mechanism, such as gRPC, could be used to control or get updates from the model process.
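The long-lived-worker pattern described just above can be sketched with only the Python standard library. This is an illustrative sketch rather than Roland's actual setup: the Pipe stands in for the gRPC channel he mentions, and the doubling function is a hypothetical stand-in for a fitted model's predict method.

```python
from multiprocessing import Pipe, Process

def model_worker(conn):
    # Load the model once when the process starts; a trivial function
    # plays the role of model.predict in this sketch.
    predict = lambda x: x * 2
    while True:
        request = conn.recv()        # block until the server sends work
        if request is None:          # sentinel value: shut down cleanly
            break
        conn.send(predict(request))  # reply with the "prediction"

if __name__ == "__main__":
    parent_conn, child_conn = Pipe()
    worker = Process(target=model_worker, args=(child_conn,))
    worker.start()                   # the model now lives in the worker's memory
    parent_conn.send(21)             # a request handler forwards the input
    print(parent_conn.recv())        # prints 42
    parent_conn.send(None)           # ask the worker to exit
    worker.join()
```

The web server keeps `parent_conn` around for the lifetime of the process, so requests never pay the model-loading cost again; swapping the Pipe for gRPC changes the transport but not the structure.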
Regards --Roland From: scikit-learn on behalf of Aneto Reply-To: Scikit-learn mailing list Date: Thursday, December 20, 2018 at 8:21 AM To: "scikit-learn at python.org" Cc: Liam Geron Subject: [scikit-learn] How to keep a model running in memory? Hi scikit learn community, We currently use scikit-learn for a model that generates predictions on a server endpoint. We would like to keep the model running in memory instead of having to re-load the model for every new request that comes in to the server. Can you please point us in the right direction for this? Any tutorials or examples. In case it's helpful, we use Flask for our web server. Thank you! Aneto -------------- next part -------------- An HTML attachment was scrubbed... URL: From leefrance79 at gmail.com Thu Dec 20 12:14:35 2018 From: leefrance79 at gmail.com (=?utf-8?B?7J207J246rec?=) Date: Thu, 20 Dec 2018 12:14:35 -0500 Subject: [scikit-learn] Submission Message-ID: <6C947338-7776-44C5-908D-862C3F76135A@gmail.com> Leon LEE Leefrance79 at gmail.com Skype: leefrance7979 From t3kcit at gmail.com Thu Dec 20 12:44:24 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Thu, 20 Dec 2018 12:44:24 -0500 Subject: [scikit-learn] Next Sprint In-Reply-To: <20181219223302.zsz2no2wkngyi2cu@phare.normalesup.org> References: <20181115144120.nnpufumsosmpamov@phare.normalesup.org> <10d2f8f2-efbf-c72d-2d96-b1b585003a47@gmail.com> <621b1350-2112-e8b0-c7f1-cbc739f0262e@gmail.com> <7f141338-1d9f-e516-3d7e-cb8232a0720f@gmail.com> <20181120192519.gbagzrvzzqljglme@phare.normalesup.org> <1b8d4167-f588-2264-5f72-9d59258c9422@gmail.com> <20181219223302.zsz2no2wkngyi2cu@phare.normalesup.org> Message-ID: Works for me! On 12/19/18 5:33 PM, Gael Varoquaux wrote: > I would propose the week of Feb 25th, as I heard people say that they > might be available at this time. It is good for many people, or should we > organize a doodle? 
> > G > > On Wed, Dec 19, 2018 at 05:27:21PM -0500, Andreas Mueller wrote: > >> Can we please nail down dates for a sprint? > >> On 11/20/18 2:25 PM, Gael Varoquaux wrote: > >>> On Tue, Nov 20, 2018 at 08:15:07PM +0100, Olivier Grisel wrote: > >>>> We can also do Paris in April / May or June if that's ok with Joel and better > >>>> for Andreas. > >>> Absolutely. > >>> My thoughts here are that I want to minimize transportation, partly > >>> because flying has a large carbon footprint. Also, for personal reasons, > >>> I am not sure that I will be able to make it to Austin in July, but I > >>> realize that this is a pretty bad argument. > >>> We're happy to try to host in Paris whenever it's most convenient and to > >>> try to help with travel for those not in Paris. > >>> Gaël > >>> _______________________________________________ > >>> scikit-learn mailing list > >>> scikit-learn at python.org > >>> https://mail.python.org/mailman/listinfo/scikit-learn > >> _______________________________________________ > >> scikit-learn mailing list > >> scikit-learn at python.org > >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From rhochmuth at alteryx.com Thu Dec 20 13:07:31 2018 From: rhochmuth at alteryx.com (Roland Hochmuth) Date: Thu, 20 Dec 2018 18:07:31 +0000 Subject: [scikit-learn] How to keep a model running in memory? In-Reply-To: References: Message-ID: <5464DD75-2F91-4A7F-9F8D-959CE33C8480@alteryx.com> Hi Liam, I would suggest starting out by taking a look at the gRPC quickstart for Python at https://grpc.io/docs/quickstart/python.html and then modifying that example to do what you would like. The Flask server would launch the separate process using multiprocessing. The model process would create a gRPC service endpoint. The Flask server would wait for the model process to start and then establish a gRPC connection as a client to the gRPC service endpoint of the model process. The gRPC service of the model process would have methods, such as trainModel or getModelStatus, etc.
When an http request occurs on the Flask http server, the server would then invoke the gRPC methods in the model process. I hope that helps. Regards --Roland From: Liam Geron Date: Thursday, December 20, 2018 at 9:53 AM To: Roland Hochmuth Cc: Scikit-learn mailing list Subject: Re: [scikit-learn] How to keep a model running in memory? Hi Roland, Thanks for the suggestion! I'll certainly look into gRPC or similar frameworks. Currently we have multiprocessing, but it's not used to that same extent. How would the second process have a sort of "listener" to respond to incoming requests if it is running persistently? Thanks so much for the help. Best, Liam On Thu, Dec 20, 2018 at 11:12 AM Roland Hochmuth > wrote: Hi Liam, Not sure I have the complete context for what you are trying to do, but have you considered using Python multiprocessing to start a separate process? The lifecycle of that process could start when the Flask server starts-up or on the first request. The separate process would load and run the model. Depending on what you would like to do, some form of IPC mechanism, such as gRPC could be used to control or get updates from the model process. Regards --Roland From: scikit-learn > on behalf of Aneto > Reply-To: Scikit-learn mailing list > Date: Thursday, December 20, 2018 at 8:21 AM To: "scikit-learn at python.org" > Cc: Liam Geron > Subject: [scikit-learn] How to keep a model running in memory? Hi scikit learn community, We currently use scikit-learn for a model that generates predictions on a server endpoint. We would like to keep the model running in memory instead of having to re-load the model for every new request that comes in to the server. Can you please point us in the right direction for this? Any tutorials or examples. In case it's helpful, we use Flask for our web server. Thank you! Aneto -------------- next part -------------- An HTML attachment was scrubbed... 
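Independent of gRPC, the core of the original question in this thread is to avoid reloading the estimator on every request. A framework-agnostic sketch of that load-once pattern follows; `get_model`, `handle_request`, and the temporary model path are illustrative names, not scikit-learn API, and in a Flask app the same caching would typically live at module import time so every request handler shares one in-memory estimator.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# One-time setup: train and persist a small stand-in model to disk.
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
MODEL_PATH = os.path.join(tempfile.mkdtemp(), "model.joblib")
dump(LogisticRegression(max_iter=1000).fit(X, y), MODEL_PATH)

_model = None  # module-level cache, filled on first use

def get_model():
    """Load the model from disk only once; later calls reuse the same object."""
    global _model
    if _model is None:
        _model = load(MODEL_PATH)
    return _model

def handle_request(features):
    # A request handler calls this; no disk I/O after the first request.
    return int(get_model().predict([features])[0])
```

With this shape, restarting the server is the only time the model is read from disk; the separate-process and gRPC variants discussed above add isolation on top of the same idea.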
URL: From adrin.jalali at gmail.com Thu Dec 20 14:32:46 2018 From: adrin.jalali at gmail.com (Adrin) Date: Thu, 20 Dec 2018 20:32:46 +0100 Subject: [scikit-learn] Next Sprint In-Reply-To: References: <20181115144120.nnpufumsosmpamov@phare.normalesup.org> <10d2f8f2-efbf-c72d-2d96-b1b585003a47@gmail.com> <621b1350-2112-e8b0-c7f1-cbc739f0262e@gmail.com> <7f141338-1d9f-e516-3d7e-cb8232a0720f@gmail.com> <20181120192519.gbagzrvzzqljglme@phare.normalesup.org> <1b8d4167-f588-2264-5f72-9d59258c9422@gmail.com> <20181219223302.zsz2no2wkngyi2cu@phare.normalesup.org> Message-ID: It'll be the least favourable week of February for me, but I can make do. On Thu, 20 Dec 2018 at 18:45 Andreas Mueller wrote: > Works for me! > > On 12/19/18 5:33 PM, Gael Varoquaux wrote: > > I would propose the week of Feb 25th, as I heard people say that they > > might be available at this time. It is good for many people, or should we > > organize a doodle? > > > > G > > > > On Wed, Dec 19, 2018 at 05:27:21PM -0500, Andreas Mueller wrote: > >> Can we please nail down dates for a sprint? > >> On 11/20/18 2:25 PM, Gael Varoquaux wrote: > >>> On Tue, Nov 20, 2018 at 08:15:07PM +0100, Olivier Grisel wrote: > >>>> We can also do Paris in April / May or June if that's ok with Joel > and better > >>>> for Andreas. > >>> Absolutely. > >>> My thoughts here are that I want to minimize transportation, partly > >>> because flying has a large carbon footprint. Also, for personal > reasons, > >>> I am not sure that I will be able to make it to Austin in July, but I > >>> realize that this is a pretty bad argument. > >>> We're happy to try to host in Paris whenever it's most convenient and > to > >>> try to help with travel for those not in Paris. 
> >>> Gaël > >>> _______________________________________________ > >>> scikit-learn mailing list > >>> scikit-learn at python.org > >>> https://mail.python.org/mailman/listinfo/scikit-learn > >> _______________________________________________ > >> scikit-learn mailing list > >> scikit-learn at python.org > >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From alexandre.gramfort at inria.fr Thu Dec 20 15:19:04 2018 From: alexandre.gramfort at inria.fr (Alexandre Gramfort) Date: Thu, 20 Dec 2018 21:19:04 +0100 Subject: [scikit-learn] Next Sprint In-Reply-To: References: <20181115144120.nnpufumsosmpamov@phare.normalesup.org> <10d2f8f2-efbf-c72d-2d96-b1b585003a47@gmail.com> <621b1350-2112-e8b0-c7f1-cbc739f0262e@gmail.com> <7f141338-1d9f-e516-3d7e-cb8232a0720f@gmail.com> <20181120192519.gbagzrvzzqljglme@phare.normalesup.org> <1b8d4167-f588-2264-5f72-9d59258c9422@gmail.com> <20181219223302.zsz2no2wkngyi2cu@phare.normalesup.org> Message-ID: ok for me Alex On Thu, Dec 20, 2018 at 8:35 PM Adrin wrote: > > It'll be the least favourable week of February for me, but I can make do. > > On Thu, 20 Dec 2018 at 18:45 Andreas Mueller wrote: >> >> Works for me! >> >> On 12/19/18 5:33 PM, Gael Varoquaux wrote: >> > I would propose the week of Feb 25th, as I heard people say that they >> > might be available at this time. It is good for many people, or should we >> > organize a doodle? >> > >> > G >> > >> > On Wed, Dec 19, 2018 at 05:27:21PM -0500, Andreas Mueller wrote: >> >> Can we please nail down dates for a sprint?
>> >> On 11/20/18 2:25 PM, Gael Varoquaux wrote: >> >>> On Tue, Nov 20, 2018 at 08:15:07PM +0100, Olivier Grisel wrote: >> >>>> We can also do Paris in April / May or June if that's ok with Joel and better >> >>>> for Andreas. >> >>> Absolutely. >> >>> My thoughts here are that I want to minimize transportation, partly >> >>> because flying has a large carbon footprint. Also, for personal reasons, >> >>> I am not sure that I will be able to make it to Austin in July, but I >> >>> realize that this is a pretty bad argument. >> >>> We're happy to try to host in Paris whenever it's most convenient and to >> >>> try to help with travel for those not in Paris. >> >>> Gaël >> >>> _______________________________________________ >> >>> scikit-learn mailing list >> >>> scikit-learn at python.org >> >>> https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> >> scikit-learn mailing list >> >> scikit-learn at python.org >> >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From pahome.chen at mirlab.org Thu Dec 20 20:57:16 2018 From: pahome.chen at mirlab.org (lampahome) Date: Fri, 21 Dec 2018 09:57:16 +0800 Subject: [scikit-learn] Does random forest work if there are very few features? Message-ID: I read the docs and know tree-based models are built using entropy or Gini impurity. When the model creates leaf nodes, it splits based on the features, right? Ex: I have 2 features A and B, and I split with A. So I have left and right nodes based on A. It should have the best shape if I create nodes based on A, right?
Now if I have 100 estimators but only two features, do I get different trees which are all based on feature A? Or do the trees based on A all have the same shape, because they were created from feature A? thx -------------- next part -------------- An HTML attachment was scrubbed... URL: From olivier.grisel at ensta.org Fri Dec 21 10:00:00 2018 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Fri, 21 Dec 2018 16:00:00 +0100 Subject: [scikit-learn] Next Sprint In-Reply-To: References: <20181115144120.nnpufumsosmpamov@phare.normalesup.org> <10d2f8f2-efbf-c72d-2d96-b1b585003a47@gmail.com> <621b1350-2112-e8b0-c7f1-cbc739f0262e@gmail.com> <7f141338-1d9f-e516-3d7e-cb8232a0720f@gmail.com> <20181120192519.gbagzrvzzqljglme@phare.normalesup.org> <1b8d4167-f588-2264-5f72-9d59258c9422@gmail.com> <20181219223302.zsz2no2wkngyi2cu@phare.normalesup.org> Message-ID: Ok for me. The last 3 weeks of February are fine for me. Le jeu. 20 déc. 2018 à 21:21, Alexandre Gramfort < alexandre.gramfort at inria.fr> a écrit : > ok for me > > Alex > > On Thu, Dec 20, 2018 at 8:35 PM Adrin wrote: > > > > It'll be the least favourable week of February for me, but I can make do. > > > > On Thu, 20 Dec 2018 at 18:45 Andreas Mueller wrote: > >> > >> Works for me! > >> > >> On 12/19/18 5:33 PM, Gael Varoquaux wrote: > >> > I would propose the week of Feb 25th, as I heard people say that they > >> > might be available at this time. It is good for many people, or > should we > >> > organize a doodle? > >> > > >> > G > >> > > >> > On Wed, Dec 19, 2018 at 05:27:21PM -0500, Andreas Mueller wrote: > >> >> Can we please nail down dates for a sprint? > >> >> On 11/20/18 2:25 PM, Gael Varoquaux wrote: > >> >>> On Tue, Nov 20, 2018 at 08:15:07PM +0100, Olivier Grisel wrote: > >> >>>> We can also do Paris in April / May or June if that's ok with Joel > and better > >> >>>> for Andreas. > >> >>> Absolutely.
> >> >>> My thoughts here are that I want to minimize transportation, partly > >> >>> because flying has a large carbon footprint. Also, for personal > reasons, > >> >>> I am not sure that I will be able to make it to Austin in July, but > I > >> >>> realize that this is a pretty bad argument. > >> >>> We're happy to try to host in Paris whenever it's most convenient > and to > >> >>> try to help with travel for those not in Paris. > >> >>> Gaël > >> >>> _______________________________________________ > >> >>> scikit-learn mailing list > >> >>> scikit-learn at python.org > >> >>> https://mail.python.org/mailman/listinfo/scikit-learn > >> >> _______________________________________________ > >> >> scikit-learn mailing list > >> >> scikit-learn at python.org > >> >> https://mail.python.org/mailman/listinfo/scikit-learn > >> > >> _______________________________________________ > >> scikit-learn mailing list > >> scikit-learn at python.org > >> https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rth.yurchak at pm.me Sat Dec 22 10:58:21 2018 From: rth.yurchak at pm.me (Roman Yurchak) Date: Sat, 22 Dec 2018 15:58:21 +0000 Subject: [scikit-learn] Next Sprint In-Reply-To: References: <20181115144120.nnpufumsosmpamov@phare.normalesup.org> <20181120192519.gbagzrvzzqljglme@phare.normalesup.org> <1b8d4167-f588-2264-5f72-9d59258c9422@gmail.com> <20181219223302.zsz2no2wkngyi2cu@phare.normalesup.org> Message-ID: That works for me as well. On 21/12/2018 16:00, Olivier Grisel wrote: > Ok for me. The last 3 weeks of February are fine for me. > > Le jeu. 20 déc.
2018 ? 21:21, Alexandre Gramfort > > a ?crit?: > > ok for me > > Alex > > On Thu, Dec 20, 2018 at 8:35 PM Adrin > wrote: > > > > It'll be the least favourable week of February for me, but I can > make do. > > > > On Thu, 20 Dec 2018 at 18:45 Andreas Mueller > wrote: > >> > >> Works for me! > >> > >> On 12/19/18 5:33 PM, Gael Varoquaux wrote: > >> > I would propose? the week of Feb 25th, as I heard people say > that they > >> > might be available at this time. It is good for many people, > or should we > >> > organize a doodle? > >> > > >> > G > >> > > >> > On Wed, Dec 19, 2018 at 05:27:21PM -0500, Andreas Mueller wrote: > >> >> Can we please nail down dates for a sprint? > >> >> On 11/20/18 2:25 PM, Gael Varoquaux wrote: > >> >>> On Tue, Nov 20, 2018 at 08:15:07PM +0100, Olivier Grisel wrote: > >> >>>> We can also do Paris in April / May or June if that's ok > with Joel and better > >> >>>> for Andreas. > >> >>> Absolutely. > >> >>> My thoughts here are that I want to minimize transportation, > partly > >> >>> because flying has a large carbon footprint. Also, for > personal reasons, > >> >>> I am not sure that I will be able to make it to Austin in > July, but I > >> >>> realize that this is a pretty bad argument. > >> >>> We're happy to try to host in Paris whenever it's most > convenient and to > >> >>> try to help with travel for those not in Paris. 
From g.lemaitre58 at gmail.com  Sat Dec 22 11:27:39 2018
From: g.lemaitre58 at gmail.com (Guillaume Lemaître)
Date: Sat, 22 Dec 2018 17:27:39 +0100
Subject: [scikit-learn] Next Sprint
In-Reply-To: 
Message-ID: <32r9p0j48h2ubjbl55ir612a.1545496059736@gmail.com>

Works for me as well.

Sent from my phone - sorry for being brief and for potential misspellings.

----- Original Message -----
From: scikit-learn at python.org
Sent: 22 December 2018 17:17
To: scikit-learn at python.org
Reply to: rth.yurchak at pm.me; scikit-learn at python.org
Cc: rth.yurchak at pm.me
Subject: Re: [scikit-learn] Next Sprint

That works for me as well.

On 21/12/2018 16:00, Olivier Grisel wrote:
> Ok for me. The last 3 weeks of February are fine for me.

From pahome.chen at mirlab.org  Mon Dec 24 22:15:42 2018
From: pahome.chen at mirlab.org (lampahome)
Date: Tue, 25 Dec 2018 11:15:42 +0800
Subject: [scikit-learn] Any way to tune the parameters better than GridSearchCV?
Message-ID: 

Take random forest as an example: if I give n_estimators from 10 to 10000
(10, 100, 1000, 10000) to grid search, I find from the result that
n_estimators=100 is the best, but I don't know whether something lower or
greater than 100 is better.

How should I decide? Brute force, or are there any tools better than
GridSearchCV? Thanks.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jbbrown at kuhp.kyoto-u.ac.jp  Mon Dec 24 22:27:01 2018
From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.)
Date: Tue, 25 Dec 2018 12:27:01 +0900
Subject: [scikit-learn] Any way to tune the parameters better than GridSearchCV?
In-Reply-To: 
References: 
Message-ID: 

> Take random forest as an example: if I give n_estimators from 10 to 10000
> (10, 100, 1000, 10000) to grid search, I find from the result that
> n_estimators=100 is the best, but I don't know whether something lower or
> greater than 100 is better.
> How should I decide? Brute force, or are there any tools better than
> GridSearchCV?
>

A simple but nonetheless practical solution is to
(1) start with an upper bound on the number of trees you are willing to accept in the model,
(2) obtain its performance (ACC, MCC, F1, etc.) as the starting reference point,
(3) systematically lower the number of trees (log2 scale-down, fixed-size decrement, etc.),
(4) obtain the reduced forest's performance,
(5) repeat (3)-(4) until [performance(reference) - performance(current forest size)] > tolerance.

You can encapsulate that in a function which then returns the final model you obtain.
From the model object, the number of trees can be obtained.

J.B.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mail at sebastianraschka.com  Mon Dec 24 23:15:01 2018
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Mon, 24 Dec 2018 22:15:01 -0600
Subject: [scikit-learn] Any way to tune the parameters better than GridSearchCV?
In-Reply-To: 
References: 
Message-ID: <2B6DACD8-F35D-4C3B-BB4E-601F5E75BACE@sebastianraschka.com>

I would like to make a related suggestion, but instead of focusing on the
upper bound for the number of trees, rather on how to choose the lower
bound. From a theoretical perspective, it doesn't make sense to me how
fewer trees can result in a better-performing random forest model in terms
of generalization performance. If you observe better performance on the
same independent test set with fewer trees, I would say that this is
likely not a good indicator of better generalization performance; it could
be due to overfitting, train/test set resampling, and/or picking up
artifacts in the dataset.

As a general suggestion, I would choose a reasonable number of trees that
seems computationally feasible given the size of the dataset and the
number of hyperparameters to compare via model selection. Then, after
tuning, I would use the best hyperparameter setting with 10x more trees
and see if you notice any significant difference in the cross-validation
performance.
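The stepwise shrinking suggested in this thread (start from an upper bound on the number of trees, repeatedly reduce, stop once the drop versus the reference exceeds a tolerance) can be sketched in a few lines. This is only an illustrative sketch, not code from the thread: the toy dataset, the halving schedule, the `shrink_forest` name, and the `tol` threshold are all assumptions, and held-out accuracy stands in for whichever metric (ACC, MCC, F1, ...) you prefer.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def shrink_forest(n_start=256, tol=0.01):
    """Halve the tree count until held-out accuracy drops more than
    `tol` below the reference score of the largest forest."""
    fit = lambda n: RandomForestClassifier(
        n_estimators=n, random_state=0).fit(X_train, y_train)
    score = lambda m: accuracy_score(y_test, m.predict(X_test))

    best = fit(n_start)          # step (1): upper-bound forest
    ref_score = score(best)      # step (2): reference performance
    n = n_start // 2
    while n >= 1:                # steps (3)-(5): shrink until the drop exceeds tol
        candidate = fit(n)
        if ref_score - score(candidate) > tol:
            break                # dropped too far; keep the previous model
        best = candidate
        n //= 2
    return best

model = shrink_forest()
print("trees kept:", len(model.estimators_))
```

Using cross-validation in place of the single train/test split above would make the stopping decision less sensitive to one particular resampling, which is essentially Sebastian's caveat in this thread.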
Next, I would use the model and fit it to the whole training set with
those best hyperparameters, and evaluate the performance on the
independent test set.

Best,
Sebastian

> On Dec 24, 2018, at 9:27 PM, Brown J.B. via scikit-learn wrote:
>
> A simple but nonetheless practical solution is to
> (1) start with an upper bound on the number of trees you are willing to accept in the model,
> (2) obtain its performance (ACC, MCC, F1, etc.) as the starting reference point,
> (3) systematically lower the number of trees (log2 scale-down, fixed-size decrement, etc.),
> (4) obtain the reduced forest's performance,
> (5) repeat (3)-(4) until [performance(reference) - performance(current forest size)] > tolerance.
>
> You can encapsulate that in a function which then returns the final model you obtain.
> From the model object, the number of trees can be obtained.
>
> J.B.

From pahome.chen at mirlab.org  Wed Dec 26 04:26:40 2018
From: pahome.chen at mirlab.org (lampahome)
Date: Wed, 26 Dec 2018 17:26:40 +0800
Subject: [scikit-learn] How to grab subsets from train sets when bootstrap=False in RF regressor?
Message-ID: 

As title.

An RF regressor builds each tree by grabbing part of the training data,
aka bootstrapping.

If I set bootstrap=False, how does the model grab the data?

The reason I'm interested is that when I set it to False, the MSE and MAE
go down, which means False is better.

-------------- next part --------------
An HTML attachment was scrubbed...
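For context on the bootstrap=False question: in scikit-learn's forests, every tree is then trained on the full training set, so the only remaining randomness comes from feature subsampling at each split (max_features). A small sketch illustrating this; the toy dataset and settings are assumptions for illustration, and with all features considered, fully grown trees on identical data come out the same.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)

# bootstrap=False: every tree sees the complete training set.
rf = RandomForestRegressor(n_estimators=5, bootstrap=False,
                           max_features=None, random_state=0).fit(X, y)

# With all samples and all features at every split, the fully grown
# trees are identical, so each individual tree predicts the same values.
per_tree = np.array([tree.predict(X) for tree in rf.estimators_])
print(np.allclose(per_tree, per_tree[0]))  # True
```

With bootstrap=True (the default) or max_features smaller than the number of features, the per-tree predictions would generally differ.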
URL:

From t3kcit at gmail.com  Thu Dec 27 16:33:10 2018
From: t3kcit at gmail.com (Andreas Mueller)
Date: Thu, 27 Dec 2018 16:33:10 -0500
Subject: [scikit-learn] How to grab subsets from train sets when bootstrap=False in RF regressor?
In-Reply-To: 
References: 
Message-ID: 

It uses all the data.

On 12/26/18 4:26 AM, lampahome wrote:
> As title.
>
> An RF regressor builds each tree by grabbing part of the training
> data, aka bootstrapping.
>
> If I set bootstrap=False, how does the model grab the data?
>
> The reason I'm interested is that when I set it to False, the MSE and
> MAE go down, which means False is better.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From t3kcit at gmail.com  Thu Dec 27 17:59:59 2018
From: t3kcit at gmail.com (Andreas Mueller)
Date: Thu, 27 Dec 2018 17:59:59 -0500
Subject: [scikit-learn] Draft of a Scikit-learn governance document
Message-ID: <1a1c9e5f-389c-f5f9-1552-a71a5513ff96@gmail.com>

Hi all.

I just posted a proposal for a scikit-learn governance document as a PR:
https://github.com/scikit-learn/scikit-learn/pull/12878

The core devs have already discussed this to some degree, but I think it
would be great to involve the greater community in finalizing this.
Any feedback is welcome.

Cheers,
Andy

From joel.nothman at gmail.com  Mon Dec 31 20:26:41 2018
From: joel.nothman at gmail.com (Joel Nothman)
Date: Tue, 1 Jan 2019 12:26:41 +1100
Subject: [scikit-learn] ANN: Scikit-learn 0.20.2 released
Message-ID: 

A bug fix release of scikit-learn, version 0.20.2, was released a couple of
weeks ago. It is not yet on the Conda default channel, but should be
available on PyPI and conda-forge. Thank you to all who contributed.
As well as the changes listed at
https://scikit-learn.org/0.20/whats_new.html#version-0-20-2 and
documentation improvements, we also corrected an error in packaging the
source distribution for the previous release: we have made sure to use the
latest Cython this time.

We still anticipate that there will be a further release in the 0.20
series to fix regressions from 0.19 to 0.20.

Happy new year, and happy learning!

The scikit-learn developer team

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From qinhanmin2005 at sina.com  Mon Dec 31 22:16:50 2018
From: qinhanmin2005 at sina.com (Hanmin Qin)
Date: Tue, 01 Jan 2019 11:16:50 +0800
Subject: [scikit-learn] ANN: Scikit-learn 0.20.2 released
Message-ID: <20190101031651.1066E4140092@webmail.sinamail.sina.com.cn>

0.20.2 is now available on the conda default channel. Happy new year to
everyone!

The scikit-learn developer team

----- Original Message -----
From: Joel Nothman
To: Scikit-learn user and developer mailing list
Subject: [scikit-learn] ANN: Scikit-learn 0.20.2 released
Date: 2019-01-01 09:28

A bug fix release of scikit-learn, version 0.20.2, was released a couple of
weeks ago. It is not yet on the Conda default channel, but should be
available on PyPI and conda-forge. Thank you to all who contributed.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: