From olivier.grisel at ensta.org Thu Sep 1 04:43:59 2016 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Thu, 1 Sep 2016 10:43:59 +0200 Subject: [scikit-learn] Declaring numpy and scipy dependencies? In-Reply-To: <36f5d0ef-397d-f5bc-c312-19793482fb06@gmail.com> References: <195faf56-d8c6-49e0-7fd7-5bb4f1b22931@gmail.com> <98971054-939E-416C-BA47-AE5AD515E170@sebastianraschka.com> <705a27d4-3643-bc9b-11a8-80ba0f6752bf@gmail.com> <36f5d0ef-397d-f5bc-c312-19793482fb06@gmail.com> Message-ID: I would be +1 to add the dependencies on numpy and scipy to the binary wheels only. We don't have the tools yet, but this could be implemented in the auditwheel tool that is already used to generate the manylinux1-compatible wheels for Linux. -- Olivier From popeye2408 at googlemail.com Thu Sep 1 14:28:26 2016 From: popeye2408 at googlemail.com (Daniel Seeliger) Date: Thu, 1 Sep 2016 20:28:26 +0200 Subject: [scikit-learn] Confidence Estimation for Regressor Predictions Message-ID: <3A554CF0-3DD8-4DC0-ACE2-1E0491D815DE@googlemail.com> Dear all, For classifiers I make use of the predict_proba method to compute a Gini coefficient or entropy to get an estimate of how "sure" the model is about an individual prediction. Is there anything similar I could use for regression models? I guess for a RandomForest I could simply use the individual predictions of each tree in clf.estimators_ and compute a standard deviation, but this is not a generic approach I can use for other regressors like the GradientBoostingRegressor or an SVR.
Thanks a lot for your help, Daniel From Dale.T.Smith at macys.com Thu Sep 1 14:32:09 2016 From: Dale.T.Smith at macys.com (Dale T Smith) Date: Thu, 1 Sep 2016 18:32:09 +0000 Subject: [scikit-learn] Confidence Estimation for Regressor Predictions In-Reply-To: <3A554CF0-3DD8-4DC0-ACE2-1E0491D815DE@googlemail.com> References: <3A554CF0-3DD8-4DC0-ACE2-1E0491D815DE@googlemail.com> Message-ID: There is a scikit-learn-contrib project with confidence intervals for random forests. https://github.com/scikit-learn-contrib/forest-confidence-interval __________________________________________________________________________________________ Dale Smith | Macy's Systems and Technology | IFS eCommerce | Data Science and Capacity Planning | 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn
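Daniel's per-tree idea can be sketched directly; the data and sizes below are synthetic and purely illustrative. The spread of the individual tree predictions in estimators_ gives a rough per-sample uncertainty, and the forest-confidence-interval project linked above implements a more principled variant of the same idea.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.uniform(size=(200, 3))
y = X[:, 0] + 0.1 * rng.normal(size=200)

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Each tree in estimators_ gives its own prediction; their spread across
# trees is a rough per-sample uncertainty estimate.
per_tree = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])
point = per_tree.mean(axis=0)   # same as forest.predict(X[:5])
spread = per_tree.std(axis=0)   # larger spread = less certain prediction
```

As Daniel notes, this trick relies on the ensemble structure and does not carry over to a GradientBoostingRegressor (whose trees are sequential corrections, not independent votes) or an SVR.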
From rth.yurchak at gmail.com Thu Sep 1 15:45:01 2016 From: rth.yurchak at gmail.com (Roman Yurchak) Date: Thu, 1 Sep 2016 21:45:01 +0200 Subject: [scikit-learn] Confidence Estimation for Regressor Predictions In-Reply-To: References: <3A554CF0-3DD8-4DC0-ACE2-1E0491D815DE@googlemail.com> Message-ID: <57C8853D.7030109@gmail.com> I'm also interested to know whether there are any projects similar to scikit-learn-contrib/forest-confidence-interval for linear_model or SVM regressors. In the general case, I think you could get a quick first-order approximation of the confidence interval for your regressor by taking the standard deviation of predictions obtained by fitting on different subsets of your data, i.e. cross_validation.cross_val_score().std() with a fixed set of estimator parameters, or some multiple of it (e.g. 2*std). Though this will probably not match the mathematical definition of a confidence interval exactly. -- Roman
From Dale.T.Smith at macys.com Thu Sep 1 15:55:02 2016 From: Dale.T.Smith at macys.com (Dale T Smith) Date: Thu, 1 Sep 2016 19:55:02 +0000 Subject: [scikit-learn] Confidence Estimation for Regressor Predictions In-Reply-To: <57C8853D.7030109@gmail.com> References: <3A554CF0-3DD8-4DC0-ACE2-1E0491D815DE@googlemail.com> <57C8853D.7030109@gmail.com> Message-ID: Confidence intervals for linear models are well known - see any statistics book or look it up on Wikipedia. You should be able to calculate everything you need for a linear model just from the information the estimator provides. Note that the R-squared provided by linear_model appears to be what statisticians call the adjusted R-squared.
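Dale's point about linear models in action: for ordinary least squares the interval follows straight from the textbook formulas, so nothing beyond numpy is needed. A minimal sketch on synthetic data (the 1.96 factor uses the large-sample normal approximation rather than the exact t quantile):

```python
import numpy as np

rng = np.random.RandomState(0)
n = 100
X = np.column_stack([np.ones(n), rng.uniform(size=(n, 2))])  # intercept + 2 features
beta_true = np.array([1.0, 2.0, -1.0])
y = X @ beta_true + 0.5 * rng.normal(size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])      # unbiased noise variance estimate
cov = sigma2 * np.linalg.inv(X.T @ X)          # covariance of the coefficient estimates
se = np.sqrt(np.diag(cov))                     # standard error per coefficient
lower, upper = beta - 1.96 * se, beta + 1.96 * se  # approx. 95% CI per coefficient
```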
__________________________________________________________________________________________ Dale Smith | Macy's Systems and Technology | IFS eCommerce | Data Science and Capacity Planning | 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com
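Roman's cross_val_score(...).std() suggestion upthread amounts to a couple of lines; note that it measures the fold-to-fold spread of the model's *score*, not a per-prediction interval, so as he says it is only a loose proxy. The sketch below is on synthetic data and uses the sklearn.model_selection import (at the time of this thread the function lived in sklearn.cross_validation):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.uniform(size=(120, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=120)

# R^2 score on each of 5 folds, with fixed estimator parameters.
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5)
band = 2 * scores.std()   # rough +/- band around the mean score (Roman's 2*std)
```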
From jurafejfar at gmail.com Thu Sep 1 16:00:50 2016 From: jurafejfar at gmail.com (Jiří Fejfar) Date: Thu, 1 Sep 2016 22:00:50 +0200 Subject: [scikit-learn] Confidence Estimation for Regressor Predictions In-Reply-To: <57C8853D.7030109@gmail.com> References: <3A554CF0-3DD8-4DC0-ACE2-1E0491D815DE@googlemail.com> <57C8853D.7030109@gmail.com> Message-ID: Maybe you can also use the bootstrap method published by Efron? See https://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf It is implemented in the resampling module with a replacement option, if I understand correctly. J.
From rth.yurchak at gmail.com Thu Sep 1 17:13:45 2016 From: rth.yurchak at gmail.com (Roman Yurchak) Date: Thu, 1 Sep 2016 23:13:45 +0200 Subject: [scikit-learn] Confidence Estimation for Regressor Predictions In-Reply-To: References: <3A554CF0-3DD8-4DC0-ACE2-1E0491D815DE@googlemail.com> <57C8853D.7030109@gmail.com> Message-ID: <57C89A09.4090100@gmail.com> Dale, I meant all the methods in sklearn.linear_model. Linear regression is well known, but for, say, ridge regression it does not look that simple: http://stats.stackexchange.com/a/15417 . Thanks for mentioning the bootstrap method! -- Roman
From jeff1evesque at yahoo.com Fri Sep 2 00:19:09 2016 From: jeff1evesque at yahoo.com (Jeffrey Levesque) Date: Fri, 2 Sep 2016 00:19:09 -0400 Subject: [scikit-learn] Confidence Estimation for Regressor Predictions In-Reply-To: <57C89A09.4090100@gmail.com> References: <3A554CF0-3DD8-4DC0-ACE2-1E0491D815DE@googlemail.com> <57C8853D.7030109@gmail.com> <57C89A09.4090100@gmail.com> Message-ID: Hi All, I am also interested in determining a confidence level associated with an SVM or SVR prediction. Is there a nice way to generalize this confidence regardless of the kernel chosen, for the given SVM or SVR implementation? Last year I tried the 'predict_proba' method, which was not very good when implemented generically: - https://github.com/jeff1evesque/machine-learning/issues/1924#issuecomment-159491052 The 'decision_function' performed a little better. But are my examples poor because the sample data is too small for accurate confidence measurements?
Would both 'decision_function' and 'predict_proba' improve if my dataset was much larger, or should I customize the latter methods? Feel free to make any comments on the above GitHub issue. I've spent more time on the web tools of that repository than on understanding the fundamentals of predictions. Forgive me ahead of time. Thank you, Jeff Levesque https://github.com/jeff1evesque
From Dale.T.Smith at macys.com Fri Sep 2 08:21:27 2016 From: Dale.T.Smith at macys.com (Dale T Smith) Date: Fri, 2 Sep 2016 12:21:27 +0000 Subject: [scikit-learn] Confidence Estimation for Regressor Predictions In-Reply-To: <57C89A09.4090100@gmail.com> References: <3A554CF0-3DD8-4DC0-ACE2-1E0491D815DE@googlemail.com> <57C8853D.7030109@gmail.com> <57C89A09.4090100@gmail.com> Message-ID: Roman, Research from the 1970s that's not well known indicates that the bias for t-statistics, for instance, cancels out in the numerator and denominator. I should have written up something showing how to do the relevant statistical diagnostics for ridge regression, but I was laid off from an earlier job. Lasso regression is a very different story.
__________________________________________________________________________________________ Dale Smith | Macy's Systems and Technology | IFS eCommerce | Data Science and Capacity Planning | 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com
From Dale.T.Smith at macys.com Fri Sep 2 08:34:03 2016 From: Dale.T.Smith at macys.com (Dale T Smith) Date: Fri, 2 Sep 2016 12:34:03 +0000 Subject: [scikit-learn] Confidence Estimation for Regressor Predictions In-Reply-To: References: <3A554CF0-3DD8-4DC0-ACE2-1E0491D815DE@googlemail.com> <57C8853D.7030109@gmail.com> <57C89A09.4090100@gmail.com> Message-ID: I do not know of any research related to any estimators except linear_model and forests of trees. Knowledge of the underlying distributions is required for confidence intervals, and the jackknife and bootstrap are the most common methods to obtain this information from the data. If anyone knows of these techniques applied more widely in machine learning to measure confidence intervals, please post the references.
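The bootstrap that Jiří and Dale mention can be sketched by refitting on rows resampled with replacement and taking percentiles of the refits' predictions. This is only a rough percentile interval on synthetic data, not the distribution-free conformal procedure from the paper cited in this thread; the estimator and parameter choices below are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.uniform(size=(150, 3))
y = X[:, 0] - 2 * X[:, 1] + 0.2 * rng.normal(size=150)
X_new = X[:5]   # points we want an interval for

# Refit on bootstrap resamples (rows drawn with replacement) and collect
# the prediction each refit makes for the new points.
preds = []
for _ in range(200):
    idx = rng.randint(0, len(X), size=len(X))
    model = Ridge(alpha=0.1).fit(X[idx], y[idx])
    preds.append(model.predict(X_new))
preds = np.array(preds)

lower = np.percentile(preds, 2.5, axis=0)    # ~95% percentile interval
upper = np.percentile(preds, 97.5, axis=0)
```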
I think providing these measures in scikit-learn-contrib would give the entire project features other packages don't have. Here's an example of the work done on the StatML side, "Distribution-Free Predictive Inference for Regression" http://www.stat.cmu.edu/~ryantibs/papers/conformal.pdf Note the use of leave-one-covariate-out to estimate variable importance. __________________________________________________________________________________________ Dale Smith | Macy's Systems and Technology | IFS eCommerce | Data Science and Capacity Planning | 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com
Thank you, Jeff Levesque https://github.com/jeff1evesque > On Sep 1, 2016, at 5:13 PM, Roman Yurchak wrote: > > Dale, I meant for all the methods in scikit.linear_model. Linear > regression is well known, but say for ridge regression that does not > look that simple http://stats.stackexchange.com/a/15417 . > Thanks for mentioning the bootstrap method! > > -- > Roman > >> On 01/09/16 21:55, Dale T Smith wrote: >> Confidence intervals for linear models are well known - see any statistics book or look it up on Wikipedia. You should be able to calculate everything you need for a linear model just from the information the estimator provides. Note the Rsquared provided by linear_model appears to be what statisticians call the adjusted-Rsquared. >> >> >> ________________________________________________________________________ >> Dale Smith | Macy's Systems and Technology | >> IFS eCommerce | Data Science and Capacity Planning >> | 5985 State Bridge Road, Johns Creek, GA 30097 | >> | dale.t.smith at macys.com >> >> >> -----Original Message----- >> From: scikit-learn >> [mailto:scikit-learn-bounces+dale.t.smith=macys.com at python.org] On >> Behalf Of Roman Yurchak >> Sent: Thursday, September 1, 2016 3:45 PM >> To: Scikit-learn user and developer mailing list >> Subject: Re: [scikit-learn] Confidence Estimation for Regressor >> Predictions >> >> EXT MSG: >> >> I'm also interested to know if there are any projects similar to scikit-learn-contrib/forest-confidence-interval for linear_model or SVM regressors. >> >> In the general case, I think you could get a quick first order approximation of the confidence interval for your regressor, if you take the standard deviation of predictions obtained by fitting different subsets of your data using, >> cross_validation.cross_val_score( ).std() with a fixed set of estimator parameters? Or some multiple of it (e.g. >> 2*std).
Though this will probably not match exactly the mathematical definition of a confidence interval. >> -- >> Roman
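For forests specifically, the per-tree standard deviation Daniel described at the top of the thread can be computed directly from `estimators_`. A minimal sketch on synthetic data — the spread is a dispersion measure across trees, not a calibrated interval:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# One row per tree: shape (n_estimators, n_samples)
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
point = per_tree.mean(axis=0)    # the forest's prediction is the mean over trees
spread = per_tree.std(axis=0)    # per-sample spread across trees
```

The scikit-learn-contrib forest-confidence-interval package mentioned earlier refines this naive spread with the jackknife-after-bootstrap correction.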
From drraph at gmail.com Wed Sep 7 08:17:46 2016 From: drraph at gmail.com (Raphael C) Date: Wed, 7 Sep 2016 13:17:46 +0100 Subject: [scikit-learn] How to get the factorization from NMF in scikit learn Message-ID: I am trying to use NMF from scikit learn. Given a matrix A this should give me a factorization into matrices W and H so that WH is approximately equal to A. As a sanity check I tried the following: from sklearn.decomposition import NMF import numpy as np A = np.array([[0,1,0],[1,0,1],[1,1,0]]) nmf = NMF(n_components=3, init='random', random_state=0) print nmf.components_ This gives me a single 3 by 3 matrix as output. What is this representing? I want the two matrices W and H from the factorization. How can I get these two matrices? I am sure I am just missing something simple.
Raphael

From zephyr14 at gmail.com Wed Sep 7 08:32:16 2016 From: zephyr14 at gmail.com (Vlad Niculae) Date: Wed, 7 Sep 2016 08:32:16 -0400 Subject: [scikit-learn] How to get the factorization from NMF in scikit learn In-Reply-To: References: Message-ID: Hi Raphael, The other matrix in the factorization is the output of nmf.transform(A). In your example you forgot to fit the estimator; if you're just interested in the decomposition, the recommended way is to get it in one line with W = nmf.fit_transform(A). While the mathematical description doesn't make it immediately obvious, the scikit-learn API makes a distinction between the two factors W, H based on whether they're in the samples or the features direction. W is a representation of the samples in the learned latent space, shape (n_samples, n_components). Meanwhile, H is a representation of the features, so it's useful to store it *in the transformer* in case more samples arise with the same feature representation (e.g., at test time) and you want to transform them. HTH, Vlad
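Concretely, Vlad's answer applied to the matrix from the original question (`max_iter` is raised here only as a precaution so the solver converges):

```python
import numpy as np
from sklearn.decomposition import NMF

A = np.array([[0, 1, 0], [1, 0, 1], [1, 1, 0]], dtype=float)

nmf = NMF(n_components=3, init='random', random_state=0, max_iter=1000)
W = nmf.fit_transform(A)   # samples in the latent space: (n_samples, n_components)
H = nmf.components_        # features side, stored on the transformer: (n_components, n_features)

print(W.shape, H.shape)            # (3, 3) (3, 3)
print(np.abs(A - W @ H).max())     # reconstruction error of the factorization
```

New rows with the same columns as A can later be projected into the same latent space with `nmf.transform(...)`, which is exactly why H lives on the estimator.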
From piotr.bialecki at hotmail.de Wed Sep 7 14:03:46 2016 From: piotr.bialecki at hotmail.de (Piotr Bialecki) Date: Wed, 7 Sep 2016 18:03:46 +0000 Subject: [scikit-learn] Tuning custom parameters using grid_search Message-ID: Hi all, I am currently tuning some parameters of my xgboost model using scikit's grid_search, e.g.:

param_test1 = {'max_depth': range(3, 10, 2),
               'min_child_weight': range(1, 6, 2)}
gsearch1 = GridSearchCV(estimator = XGBClassifier(learning_rate=0.1, n_estimators=762,
                                                  max_depth=5, min_child_weight=1, gamma=0,
                                                  subsample=0.8, colsample_bytree=0.8,
                                                  objective='binary:logistic', nthread=4,
                                                  scale_pos_weight=1, seed=2809),
                        param_grid=param_test1,
                        scoring='roc_auc',
                        n_jobs=6,
                        iid=False, cv=5)

Before that I preprocessed my dataset X with some different methods. These preprocessing steps have some parameters too, which I would like to tune. I know that it is possible to tune the parameters of the preprocessing steps, if they are part of my pipeline. E.g. if I am using PCA, I could tune the parameter n_components, right? But what if I have some "custom" preprocessing code with some parameters? Is it possible to create a scikit-compatible "object" of my custom code in order to tune the parameters in the pipeline with grid search? Imagine I would like to write a custom method FeatureMultiplier() with a parameter multiplier_value. Is it possible to create a scikit-compatible class out of this method and tune it with grid search? I thought I saw a talk about exactly this topic at some PyData in 2016 or 2015, but unfortunately I cannot find the video of it. Maybe I misunderstood the presentation at that time. Best regards, Piotr
From jmschreiber91 at gmail.com Wed Sep 7 14:11:36 2016 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Wed, 7 Sep 2016 14:11:36 -0400 Subject: [scikit-learn] Tuning custom parameters using grid_search In-Reply-To: References: Message-ID: You can use a pipeline object to contain both feature selection/transformation steps and an estimator. All elements of a pipeline can then be tuned using gridsearch. You can see a simple example here: http://scikit-learn.org/stable/modules/pipeline.html You may also be interested in seeing if the FeatureUnion object can serve the same purpose as your FeatureMultiplier.

From mail at sebastianraschka.com Wed Sep 7 14:26:55 2016 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Wed, 7 Sep 2016 14:26:55 -0400 Subject: [scikit-learn] Tuning custom parameters using grid_search In-Reply-To: References: Message-ID: Hi, Piotr,

> These preprocessing steps have some parameters too, which I would like to tune.
> I know that it is possible to tune the parameters of the preprocessing steps,
> if they are part of my pipeline.
> E.g. if I am using PCA, I could tune the parameter n_components, right?
>
> But what if I have some "custom" preprocessing code with some parameters?
> Is it possible to create a scikit-compatible "object" of my custom code in order to tune the
> parameters in the pipeline with grid search?

Yeah, you could use the Pipeline class or the `make_pipeline` function, then you can create a custom estimator using the BaseEstimator class like so:

class CustomEstimator(BaseEstimator):

    def __init__(self, my_param=None):
        pass

    def fit_transform(self, X, y=None):
        return self.fit(X).transform(X)

    def transform(self, X, y=None):
        return X

    def fit(self, X, y=None):
        return self

pipe = make_pipeline(CustomEstimator(), LogisticRegression())
grid = {'customestimator__my_param': [3],
        'logisticregression__C': [0.1, 1.0, 10.0]}

gsearch1 = GridSearchCV(estimator=pipe, param_grid=grid)
gsearch1.fit(X, y)

Then, you can put in your desired preprocessing stuff into fit and transform.
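The skeleton above can be made concrete for the FeatureMultiplier from the question — the class and its multiplier_value parameter are hypothetical names taken from Piotr's message, and the data below is synthetic. The one convention to respect is that __init__ stores each constructor argument under an attribute of the same name, so that get_params/set_params (and hence GridSearchCV) can clone and reconfigure the estimator:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

class FeatureMultiplier(BaseEstimator, TransformerMixin):
    def __init__(self, multiplier_value=1.0):
        self.multiplier_value = multiplier_value  # same name -> grid-searchable

    def fit(self, X, y=None):
        return self                               # stateless transformer

    def transform(self, X):
        return np.asarray(X) * self.multiplier_value

rng = np.random.RandomState(0)
X = rng.randn(60, 4)
y = (X[:, 0] + 0.1 * rng.randn(60) > 0).astype(int)

pipe = make_pipeline(FeatureMultiplier(), LogisticRegression())
grid = {'featuremultiplier__multiplier_value': [0.5, 1.0, 2.0],
        'logisticregression__C': [0.1, 1.0]}
search = GridSearchCV(pipe, grid, cv=3).fit(X, y)
print(search.best_params_)
```

`make_pipeline` derives the step names (`featuremultiplier`, `logisticregression`) from the lowercased class names, which is where the double-underscore parameter keys come from.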
Best, Sebastian

From mail at sebastianraschka.com Wed Sep 7 14:38:29 2016 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Wed, 7 Sep 2016 14:38:29 -0400 Subject: [scikit-learn] Mailing list "slow"?
Message-ID: <73E2228E-8A05-4942-B8A0-CD8A406BD505@sebastianraschka.com> Hi, all, I noticed that it takes forever now until something is posted on the mailing list after I send it out. Since the switch to Python.org, it takes about ~15-45 min after hitting "send". I've noticed this for months now and was wondering if this is normal or if there's something going on with my particular mailing list account? (Besides the mailing list, my email usually arrives within 1-2 seconds, so it's not a problem with my email client or server in general.) Best, Sebastian

From piotr.bialecki at hotmail.de Wed Sep 7 15:16:43 2016 From: piotr.bialecki at hotmail.de (Piotr Bialecki) Date: Wed, 7 Sep 2016 19:16:43 +0000 Subject: [scikit-learn] Tuning custom parameters using grid_search In-Reply-To: References: Message-ID: Hi Sebastian, thanks a lot. That was exactly what I was looking for! :) I will have a look into the base classes of other preprocessing steps as well. @Jacob Thank you too! :) Greets, Piotr

From olivier.grisel at ensta.org Thu Sep 8 09:01:40 2016 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Thu, 8 Sep 2016 15:01:40 +0200 Subject: [scikit-learn] Mailing list "slow"? In-Reply-To: <73E2228E-8A05-4942-B8A0-CD8A406BD505@sebastianraschka.com> References: <73E2228E-8A05-4942-B8A0-CD8A406BD505@sebastianraschka.com> Message-ID: I have not noticed it myself. Let me try to time this email to check: sent at 3:01pm CEST.

From olivier.grisel at ensta.org Thu Sep 8 09:02:39 2016 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Thu, 8 Sep 2016 15:02:39 +0200 Subject: [scikit-learn] Mailing list "slow"? In-Reply-To: References: <73E2228E-8A05-4942-B8A0-CD8A406BD505@sebastianraschka.com> Message-ID: It's already in the archive: https://mail.python.org/pipermail/scikit-learn/2016-September/000495.html -- Olivier

From gael.varoquaux at normalesup.org Thu Sep 8 09:06:09 2016 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Thu, 8 Sep 2016 15:06:09 +0200 Subject: [scikit-learn] Mailing list "slow"? In-Reply-To: References: <73E2228E-8A05-4942-B8A0-CD8A406BD505@sebastianraschka.com> Message-ID: <20160908130609.GJ35579@phare.normalesup.org> I received it.
G On Thu, Sep 08, 2016 at 03:02:39PM +0200, Olivier Grisel wrote: > It's already in the archive: > https://mail.python.org/pipermail/scikit-learn/2016-September/000495.html -- Gael Varoquaux Researcher, INRIA Parietal NeuroSpin/CEA Saclay, Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux

From Dale.T.Smith at macys.com Thu Sep 8 09:09:29 2016 From: Dale.T.Smith at macys.com (Dale T Smith) Date: Thu, 8 Sep 2016 13:09:29 +0000 Subject: [scikit-learn] Mailing list "slow"? In-Reply-To: <20160908130609.GJ35579@phare.normalesup.org> References: <73E2228E-8A05-4942-B8A0-CD8A406BD505@sebastianraschka.com> <20160908130609.GJ35579@phare.normalesup.org> Message-ID: Likewise here in the U.S. - Atlanta, GA. __________________________________________________________________________________________ Dale Smith | Macy's Systems and Technology | IFS eCommerce | Data Science and Capacity Planning | 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com -----Original Message----- From: scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=macys.com at python.org] On Behalf Of Gael Varoquaux Sent: Thursday, September 8, 2016 9:06 AM To: Scikit-learn user and developer mailing list Subject: Re: [scikit-learn] Mailing list "slow"? EXT MSG: I received it.
From se.raschka at gmail.com Thu Sep 8 09:30:33 2016 From: se.raschka at gmail.com (Sebastian Raschka) Date: Thu, 8 Sep 2016 09:30:33 -0400 Subject: [scikit-learn] Mailing list "slow"? In-Reply-To: References: <73E2228E-8A05-4942-B8A0-CD8A406BD505@sebastianraschka.com> <20160908130609.GJ35579@phare.normalesup.org> Message-ID: <1C96E1A9-E9FC-4112-B889-A7B2AD9D3D25@gmail.com> Thanks! So it must be something on my side (or sth. weird with this email account in combination with the Python mailing list). Sorry for spamming, but let me try using my gmail account and send 2 mails simultaneously (I will later delete one of the two). 9:30:30 AM EDT (from gmail)

From mail at sebastianraschka.com Thu Sep 8 09:29:52 2016 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Thu, 8 Sep 2016 09:29:52 -0400 Subject: [scikit-learn] Mailing list "slow"? In-Reply-To: References: <73E2228E-8A05-4942-B8A0-CD8A406BD505@sebastianraschka.com> <20160908130609.GJ35579@phare.normalesup.org> Message-ID: Thanks! So it must be something on my side (or sth. weird with this email account in combination with the Python mailing list). Sorry for spamming, but let me try using my gmail account and send 2 mails simultaneously (I will later delete one of the two). 9:29:50 AM EDT

From se.raschka at gmail.com Thu Sep 8 09:48:22 2016 From: se.raschka at gmail.com (Sebastian Raschka) Date: Thu, 8 Sep 2016 09:48:22 -0400 Subject: [scikit-learn] Mailing list "slow"? In-Reply-To: References: <73E2228E-8A05-4942-B8A0-CD8A406BD505@sebastianraschka.com> <20160908130609.GJ35579@phare.normalesup.org> Message-ID: <017A1D1D-3489-4A9F-8E39-186527253F61@gmail.com> Okay, it's my @sebastianraschka.com domain then: it took ~15 minutes this time (gmail ~ 1 min). Maybe the former is going through a more rigorous filtering on the mailserver since it is an unknown domain name or so. In any case, I will use my gmail address on the mailing list then, sorry for the bother :P

From klonuo at gmail.com Thu Sep 8 14:40:26 2016 From: klonuo at gmail.com (klo uo) Date: Thu, 8 Sep 2016 20:40:26 +0200 Subject: [scikit-learn] Fwd: Loading file in libsvm format In-Reply-To: References: Message-ID: ---------- Forwarded message ---------- From: klo uo Date: Thu, Sep 8, 2016 at 8:25 PM Subject: Loading file in libsvm format To: scikit-learn-general at lists.sourceforge.net Hi, I produced a file in libsvm format: