From moyi.dang at gmail.com  Sat Oct  1 09:34:10 2016
From: moyi.dang at gmail.com (Moyi Dang)
Date: Sat, 1 Oct 2016 09:34:10 -0400
Subject: [scikit-learn] Why does scikit-learn's HashingVectorizer give negative values?

Hi,

I'm trying to make the HashingVectorizer work for online learning. To do this, I need it to give actual token counts. scikit-learn's HashingVectorizer doesn't give token counts; by default it gives counts normalized with either the l1 or l2 norm. I need the raw token counts, so I set norm=None. After I do this I no longer get decimals, but I still get negative numbers.

It seems the negatives can be removed by setting non_negative=True, which takes the absolute value of the entries. However, I don't understand why the negatives are there in the first place, or what they mean. I'm not sure whether the absolute values correspond to the token counts.

Can someone please help explain what the HashingVectorizer is doing? How do I get the HashingVectorizer to return token counts?

You can replicate my results with the following code - I'm using the 20newsgroups dataset which comes with scikit-learn:

from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)

from sklearn.feature_extraction.text import HashingVectorizer

# default: l2-normalized values per document (floats, positive and negative)
cv = HashingVectorizer(stop_words='english')
X_train = cv.fit_transform(twenty_train.data)
print(X_train)

# produces integer results, both positive and negative
cv = HashingVectorizer(stop_words='english', norm=None)
X_train = cv.fit_transform(twenty_train.data)
print(X_train)

# produces only positive results, but not sure if they correspond to counts
cv = HashingVectorizer(stop_words='english', norm=None, non_negative=True)
X_train = cv.fit_transform(twenty_train.data)
print(X_train)

From rth.yurchak at gmail.com  Sat Oct  1 10:17:40 2016
From: rth.yurchak at gmail.com (Roman Yurchak)
Date: Sat, 1 Oct 2016 16:17:40 +0200
Subject: [scikit-learn] Why does scikit-learn's HashingVectorizer give negative values?
Message-ID: <57EFC584.7020608@gmail.com>

On 01/10/16 15:34, Moyi Dang wrote:
> However, I don't understand why the negatives are there in the first
> place, or what they mean. I'm not sure if the absolute values are
> corresponding to the token counts.
>
> Can someone please help explain what the HashingVectorizer is doing? How
> do I get the HashingVectorizer to return token counts?

Hi Moyi,

it's a mechanism to compensate for hash collisions, see
https://github.com/scikit-learn/scikit-learn/issues/7513
The absolute values are token counts for most practical applications (if
you don't have too many collisions). There will be a PR shortly to make
this more consistent.

From tevang3 at gmail.com  Sat Oct  1 10:59:11 2016
From: tevang3 at gmail.com (Thomas Evangelidis)
Date: Sat, 1 Oct 2016 16:59:11 +0200
Subject: [scikit-learn] suggested machine learning algorithm

Dear scikit-learn users and developers,

I have a dataset consisting of 42 observations (molnames) and 4 variables (VDWAALS, EEL, EGB, ESURF), with which I want to make a predictive model that estimates the experimental value (Expr).
I tried multivariate linear regression using 10,000 bootstrap repeats, each time using 21 observations for training and the remaining 21 for testing, but the average correlation was only R = 0.1727 +- 0.19779.

molname         VDWAALS    EEL        EGB        ESURF     Expr
CHEMBL108457    -20.4848   -96.5826    23.4584   -5.4045   -7.27193
CHEMBL388269    -50.3860    28.9403   -51.5147   -6.4061   -6.8022
CHEMBL244078    -49.1466   -21.9869    17.7999   -6.4588   -6.61742
CHEMBL244077    -53.4365   -32.8943    34.8723   -7.0384   -6.61742
CHEMBL396772    -51.4111   -34.4904    36.0326   -6.5443   -5.82207
........

I would like your advice about what other machine learning algorithms I could try with these data. E.g., can I make a decision tree, or are the observations and variables too few to avoid overfitting? I could include more variables, but the observations will always remain 42.

I would greatly appreciate any advice!

Thomas

From ericmajinglong at gmail.com  Sat Oct  1 14:37:35 2016
From: ericmajinglong at gmail.com (Eric Ma)
Date: Sat, 1 Oct 2016 14:37:35 -0400
Subject: [scikit-learn] suggested machine learning algorithm

Hi Thomas,

A number of people I've learned from have given me the following "recipe", which I hold to loosely:

1. Start with Random Forest - it should be able to give you a good baseline predictive capacity.

2. Say you don't care about interpretability, but only about predictive value: keep tweaking the RF parameters (use grid search + cross-validation), or switch to gradient boosting.

3. Say you do care about interpretability: use RF's feature_importances_ to get the features that are important for prediction, then try linear regression on just those. You may also want to multiply those features together to get their "interaction" products. (This uses RF as a feature selection method; a rough sketch follows below.)

Beyond this, I am sure more "expert" types will be able to chime in, and also correct me if I've said anything wrong here.

Cheers,
Eric
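A minimal sketch of steps 1 and 3 of that recipe (the arrays are random placeholders standing in for the 42x4 descriptor table above, so every number printed is illustrative only):

```
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(42)
X = rng.randn(42, 4)  # placeholder for the VDWAALS, EEL, EGB, ESURF columns
y = rng.randn(42)     # placeholder for the Expr values

# Step 1: random forest as a predictive baseline (R^2 via cross-validation)
rf = RandomForestRegressor(n_estimators=500, random_state=42)
print(cross_val_score(rf, X, y, cv=5).mean())

# Step 3: inspect feature importances to pick features for a linear model
rf.fit(X, y)
for name, imp in zip(['VDWAALS', 'EEL', 'EGB', 'ESURF'],
                     rf.feature_importances_):
    print(name, imp)
```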
From aadral at gmail.com  Sat Oct  1 14:48:44 2016
From: aadral at gmail.com (Alexey Dral)
Date: Sat, 1 Oct 2016 19:48:44 +0100
Subject: [scikit-learn] suggested machine learning algorithm

Hi Thomas,

What quality do you get on the training set?

There is no silver bullet, but there is a quite common technique you can use to find out whether you are using an appropriate algorithm. You can take a look at the difference between "train" and "validation" quality in learning curves (see the learning curves example in the scikit-learn docs). If you see a big gap, you can reduce the complexity of your model to overcome overfitting (reduce the interaction parameter / number of variables / iterations / ...). If you see a small gap, you can try to increase the model complexity to fit your data better.

Moreover, I see you have a tiny dataset and use a 50/50 split. I presume that you will train the "production" model on the whole available dataset. In that case, I suggest you use more data for training and an almost leave-one-out (LOO) approach to better estimate your predictive quality. But be really cautious about cross-validation, as you can easily overfit your data.

--
Yours sincerely,
Alexey A. Dral
https://www.linkedin.com/in/alexey-dral

From se.raschka at gmail.com  Sat Oct  1 15:58:39 2016
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Sat, 1 Oct 2016 15:58:39 -0400
Subject: [scikit-learn] suggested machine learning algorithm

Maybe it's worth switching to LOOCV, since you may have a bit of a pessimistic bias here due to the small training set size (in bootstrapping you only have asymptotically a fraction of 0.632 unique samples for training). I would try both linear and nonlinear models; instead of adding more features, maybe also try to eliminate some features via L1 regularization, feature selection, or feature extraction, in addition to trying different algorithms like random forests, Gaussian processes, RBF kernel SVM regression, and so forth.
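A minimal sketch of that LOOCV comparison across a few model families (placeholder arrays again; the model list is illustrative, not a recommendation):

```
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = rng.randn(42, 4)  # placeholder descriptors
y = rng.randn(42)     # placeholder experimental values

models = {
    'linear': LinearRegression(),
    'lasso (L1)': LassoCV(),
    'random forest': RandomForestRegressor(n_estimators=500, random_state=0),
    'rbf svr': SVR(kernel='rbf'),
}
for name, model in models.items():
    # one held-out prediction per observation (leave-one-out)
    pred = cross_val_predict(model, X, y, cv=LeaveOneOut())
    print(name, np.corrcoef(y, pred)[0, 1])  # correlation R, as reported above
```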
From joel.nothman at gmail.com  Sat Oct  1 18:11:42 2016
From: joel.nothman at gmail.com (Joel Nothman)
Date: Sun, 2 Oct 2016 09:11:42 +1100
Subject: [scikit-learn] Why does scikit-learn's HashingVectorizer give negative values?
In-Reply-To: <57EFC584.7020608@gmail.com>
References: <57EFC584.7020608@gmail.com>

Negative values are not really there to compensate for hash collisions. They are there because they make the hashed vector space an approximation to the full vector space under the inner product.

On 2 October 2016 at 00:17, Roman Yurchak wrote:
> it's a mechanism to compensate for hash collisions, see
> https://github.com/scikit-learn/scikit-learn/issues/7513
> The absolute values are token counts for most practical applications (if
> you don't have too many collisions). There will be a PR shortly to make
> this more consistent.

From jblackburne at gmail.com  Sun Oct  2 02:19:53 2016
From: jblackburne at gmail.com (Jeff Blackburne)
Date: Sat, 1 Oct 2016 23:19:53 -0700
Subject: [scikit-learn] Strange behavior when I add a member to a cython struct

Hi,

As part of my work on PR #4899 (categorical splits for tree-based learners), I want to add a pointer member to the Node struct in sklearn/tree/_tree.pxd. But when I do this, it causes some of the unit tests to fail in the 32-bit Appveyor (Windows) CI. (Actually, it usually causes them to hang indefinitely.) I'm testing this with the latest commit on master.

The patch I'm applying is listed in full below; it's tiny. If you like, I can make a new PR to demonstrate the behavior.

Does anyone know why this would happen, and only on 32-bit Windows?

Thanks,
Jeff

```
diff --git a/sklearn/tree/_tree.pxd b/sklearn/tree/_tree.pxd
index dbf0545..b80e7bb 100644
--- a/sklearn/tree/_tree.pxd
+++ b/sklearn/tree/_tree.pxd
@@ -32,6 +32,7 @@ cdef struct Node:
     DOUBLE_t impurity                 # Impurity of the node (i.e., the value of the criterion)
     SIZE_t n_node_samples             # Number of samples at the node
     DOUBLE_t weighted_n_node_samples  # Weighted number of samples at the node
+    UINT32_t *foo


 cdef class Tree:
diff --git a/sklearn/tree/_tree.pyx b/sklearn/tree/_tree.pyx
index 4e8160f..a2f8117 100644
--- a/sklearn/tree/_tree.pyx
+++ b/sklearn/tree/_tree.pyx
@@ -68,9 +68,9 @@ cdef SIZE_t INITIAL_STACK_SIZE = 10
 # Repeat struct definition for numpy
 NODE_DTYPE = np.dtype({
     'names': ['left_child', 'right_child', 'feature', 'threshold', 'impurity',
-              'n_node_samples', 'weighted_n_node_samples'],
+              'n_node_samples', 'weighted_n_node_samples', 'foo'],
     'formats': [np.intp, np.intp, np.intp, np.float64, np.float64, np.intp,
-                np.float64],
+                np.float64, np.intp],
     'offsets': [
         <Py_ssize_t> &(<Node*> NULL).left_child,
         <Py_ssize_t> &(<Node*> NULL).right_child,
@@ -78,7 +78,8 @@ NODE_DTYPE = np.dtype({
         <Py_ssize_t> &(<Node*> NULL).threshold,
         <Py_ssize_t> &(<Node*> NULL).impurity,
         <Py_ssize_t> &(<Node*> NULL).n_node_samples,
-        <Py_ssize_t> &(<Node*> NULL).weighted_n_node_samples
+        <Py_ssize_t> &(<Node*> NULL).weighted_n_node_samples,
+        <Py_ssize_t> &(<Node*> NULL).foo
     ]
 })
```

From tevang3 at gmail.com  Sun Oct  2 08:23:50 2016
From: tevang3 at gmail.com (Thomas Evangelidis)
Date: Sun, 2 Oct 2016 14:23:50 +0200
Subject: [scikit-learn] suggested machine learning algorithm

On 1 October 2016 at 20:48, Alexey Dral wrote:
> You can take a look at the difference between "train" and "validation"
> quality in learning curves. If you see a big gap, you can reduce the
> complexity of your model to overcome overfitting. If you see a small gap,
> you can try to increase the model complexity to fit your data better.

Hi Alexey,

are the "Training examples" in the learning curves the number of observations used for training? Don't you think my dataset is kind of small (42 observations) to use that technique?

From aadral at gmail.com  Sun Oct  2 18:52:39 2016
From: aadral at gmail.com (Alexey Dral)
Date: Sun, 2 Oct 2016 23:52:39 +0100
Subject: [scikit-learn] suggested machine learning algorithm

2016-10-02 13:23 GMT+01:00 Thomas Evangelidis:
> are the "Training examples" in the learning curves the number of
> observations used for training? Don't you think my dataset is kind of
> small (42 observations) to use that technique?

Yes, it is really a tiny dataset =). You don't necessarily need to plot the learning curve over the number of training observations; for instance, you can plot it over the number of iterations.

--
Yours sincerely,
Alexey A. Dral
https://www.linkedin.com/in/alexey-dral

From jbbrown at kuhp.kyoto-u.ac.jp  Mon Oct  3 00:05:13 2016
From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.)
Date: Mon, 3 Oct 2016 13:05:13 +0900
Subject: [scikit-learn] ANN Scikit-learn 0.18 released

Hello community,

Congratulations on the release of 0.18! While I'm merely a casual user and wish I could contribute more often, I thank everyone for their time and efforts!

2016-10-01 1:58 GMT+09:00 Andreas Mueller:
> We've got a lot in the works already for 0.19.
>
> * multiple metrics for cross validation (#7388 et al.)

I've done something like this in my internal model building and selection libraries. My solution has been to have:
- each metric object be able to explain a "distance from optimal"
- a metric collection object, which can be built by either explicit instantiation or calculation using data
- a pareto curve calculation object
- a ranker for the points on the pareto curve, with the ability to select the N-best points

While there are certainly smarter interfaces and implementations, here is an example of one of my doctests that may help get this PR started. My apologies that my old docstring argument notation doesn't match the commonly used standards.

Hope this helps,
J.B. Brown
Kyoto University

class TrialRanker(object):
    """An object for handling the generic mechanism of selecting optimal
    trials from a collection of trials."""

    def SelectBest(self, metricSets, paretoAlg, preProcessor=None):
        """Select the best [metricSets] by using the [paretoAlg] pareto
        selection object. Note that it is actually the [paretoAlg] that
        specifies how many optimal [metricSets] to select.

        Data may be pre-processed into a form necessary for the [paretoAlg]
        by using the [preProcessor] that is a MetricSetConverter.

        Return: an EvaluatedMetricSet if [paretoAlg] selects only one
        metric set, otherwise a list of EvaluatedMetricSet objects.

        >>> from pareto.paretoDecorators import MinNormSelector
        >>> from pareto import OriginBasePareto
        >>> pAlg = MinNormSelector(OriginBasePareto())

        >>> from metrics.TwoClassMetrics import Accuracy, Sensitivity
        >>> from metrics.metricSet import EvaluatedMetricSet
        >>> met1 = EvaluatedMetricSet.BuildByExplicitValue(
        ...     [(Accuracy, 0.7), (Sensitivity, 0.9)])
        >>> met1.SetTitle("Example1")
        >>> met1.associatedData = range(5)  # property set/get
        >>> met2 = EvaluatedMetricSet.BuildByExplicitValue(
        ...     [(Accuracy, 0.8), (Sensitivity, 0.6)])
        >>> met2.SetTitle("Example2")
        >>> met2.SetAssociatedData("abcdef")  # explicit method call
        >>> met3 = EvaluatedMetricSet.BuildByExplicitValue(
        ...     [(Accuracy, 0.5), (Sensitivity, 0.5)])
        >>> met3.SetTitle("Example3")
        >>> met3.associatedData = float

        >>> from metrics.metricSet.converters import OptDistConverter

        >>> ranker = TrialRanker()  # pAlg selects met1
        >>> best = ranker.SelectBest((met1, met2, met3),
        ...                          pAlg, OptDistConverter())
        >>> best.VerboseDescription(True)
        >>> str(best)
        'Example1: 2 metrics; Accuracy=0.700; Sensitivity=0.900'
        >>> best.associatedData
        [0, 1, 2, 3, 4]

        >>> pAlg = MinNormSelector(OriginBasePareto(), nSelect=2)
        >>> best = ranker.SelectBest((met1, met2, met3),
        ...                          pAlg, OptDistConverter())
        >>> for metSet in best:
        ...     metSet.VerboseDescription(True)
        ...     str(metSet)
        ...     str(metSet.associatedData)
        'Example1: 2 metrics; Accuracy=0.700; Sensitivity=0.900'
        '[0, 1, 2, 3, 4]'
        'Example2: 2 metrics; Accuracy=0.800; Sensitivity=0.600'
        'abcdef'

        >>> from metrics.TwoClassMetrics import PositivePredictiveValue
        >>> met4 = EvaluatedMetricSet.BuildByExplicitValue(
        ...     [(Accuracy, 0.7), (PositivePredictiveValue, 0.5)])
        >>> best = ranker.SelectBest((met1, met2, met3, met4),
        ...                          pAlg, OptDistConverter())
        Traceback (most recent call last):
        ...
        ValueError: Metric sets contain differing Metrics.
        """

From Victor.Poughon at cnes.fr  Mon Oct  3 05:21:24 2016
From: Victor.Poughon at cnes.fr (Poughon Victor)
Date: Mon, 3 Oct 2016 09:21:24 +0000
Subject: [scikit-learn] sample_weight for cohen_kappa_score
Message-ID: <3E55146A6A81B44A9CB69CAB65908CEA3558C867@TW-MBX-P01.cnesnet.ad.cnes.fr>

Hello,

I'd like to use sample weights together with sklearn.metrics.cohen_kappa_score, in a similar way to other metrics which have this argument. Is it as simple as forwarding the weights to the confusion_matrix call? [0]

If yes, I'm happy to work on the pull request.

In that case the other argument "weights" might be confusing, but it's too late to rename it, right?

Cheers,

Victor Poughon

[0] https://github.com/scikit-learn/scikit-learn/blob/dee786a/sklearn/metrics/classification.py#L331

From klonuo at gmail.com  Mon Oct  3 07:30:44 2016
From: klonuo at gmail.com (klo uo)
Date: Mon, 3 Oct 2016 13:30:44 +0200
Subject: [scikit-learn] Generate data from trained naive bayes

Hi,

because Naive Bayes is a generative model, does that mean that I can somehow generate data based on a trained model?

For example:

clf = BernoulliNB()
clf.fit(train, labels)

Can I generate data for a specific label?

Thanks,
Klo
From t3kcit at gmail.com  Mon Oct  3 09:07:56 2016
From: t3kcit at gmail.com (Andreas Mueller)
Date: Mon, 3 Oct 2016 09:07:56 -0400
Subject: [scikit-learn] Generate data from trained naive bayes
Message-ID: <476b1ca4-c7fa-89ec-81da-c42f9e7abb69@gmail.com>

Hi Klo.

Yes, you could, but as the model is very simple, that's usually not very interesting. It stores, for each label, an independent Bernoulli distribution for each feature; these are stored in feature_log_prob_. I would suggest you look at this attribute rather than sample from the distribution. To sample from it, you would have to exponentiate it and then sample from these Bernoulli distributions.

Andy

On 10/03/2016 07:30 AM, klo uo wrote:
> because Naive Bayes is a generative model, does that mean that I can
> somehow generate data based on a trained model?
>
> clf = BernoulliNB()
> clf.fit(train, labels)
>
> Can I generate data for a specific label?

From t3kcit at gmail.com  Mon Oct  3 09:09:54 2016
From: t3kcit at gmail.com (Andreas Mueller)
Date: Mon, 3 Oct 2016 09:09:54 -0400
Subject: [scikit-learn] sample_weight for cohen_kappa_score
In-Reply-To: <3E55146A6A81B44A9CB69CAB65908CEA3558C867@TW-MBX-P01.cnesnet.ad.cnes.fr>
Message-ID: <71ddf2ec-cf8a-c6bd-3134-cd1bc7f5e360@gmail.com>

Hm, it sounds like "weights" should maybe have been called "weighting"? Not sure if it's worth changing now, as we have already released it. And I think passing the sample weights to the confusion matrix is correct. There should be tests for weighted metrics to confirm that. PR welcome.

On 10/03/2016 05:21 AM, Poughon Victor wrote:
> I'd like to use sample weights together with sklearn.metrics.cohen_kappa_score,
> in a similar way to other metrics which have this argument. Is it as simple as
> forwarding the weights to the confusion_matrix call?

From klonuo at gmail.com  Mon Oct  3 11:08:39 2016
From: klonuo at gmail.com (klo uo)
Date: Mon, 3 Oct 2016 17:08:39 +0200
Subject: [scikit-learn] Generate data from trained naive bayes
In-Reply-To: <476b1ca4-c7fa-89ec-81da-c42f9e7abb69@gmail.com>

Thanks Andy, I can follow up to the point "...and then sample from these Bernoulli distributions".

From the data in `feature_log_prob_`, I would guess it contains a single feature (the feature means from the training data) for each class. I can see how I can sample from `feature_log_prob_`...

From klonuo at gmail.com  Mon Oct  3 11:09:32 2016
From: klonuo at gmail.com (klo uo)
Date: Mon, 3 Oct 2016 17:09:32 +0200
Subject: [scikit-learn] Generate data from trained naive bayes
References: <476b1ca4-c7fa-89ec-81da-c42f9e7abb69@gmail.com>

On Mon, Oct 3, 2016 at 5:08 PM, klo uo wrote:
> I can see how I can sample from `feature_log_prob_`...

I meant I cannot see

From gael.varoquaux at normalesup.org  Mon Oct  3 11:14:15 2016
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Mon, 3 Oct 2016 17:14:15 +0200
Subject: [scikit-learn] Welcome Raghav to the core-dev team
Message-ID: <20161003151415.GF20745@phare.normalesup.org>

Hi,

We have the pleasure to welcome Raghav RV to the core-dev team. Raghav (@raghavrv) has been working on scikit-learn for more than a year. In particular, he implemented the rewrite of the cross-validation utilities, which is quite dear to my heart.

Welcome Raghav!

Gaël

From manojkumarsivaraj334 at gmail.com  Mon Oct  3 11:23:55 2016
From: manojkumarsivaraj334 at gmail.com (Manoj Kumar)
Date: Mon, 3 Oct 2016 11:23:55 -0400
Subject: [scikit-learn] Generate data from trained naive bayes

Hi,

feature_log_prob_ is an array of shape (n_classes, n_features). exp(feature_log_prob_[class_ind, feature_ind]) gives P(X_{feature_ind} = 1 | class_ind).

Using the conditional independence assumptions of Naive Bayes, you can use this to sample each feature independently given the class.

Hope that helps.

--
Manoj,
http://github.com/MechCoder

From nfliu at uw.edu  Mon Oct  3 11:22:33 2016
From: nfliu at uw.edu (Nelson Liu)
Date: Mon, 3 Oct 2016 08:22:33 -0700
Subject: [scikit-learn] Welcome Raghav to the core-dev team

Yay! Congrats, Raghav!
From ashu.9412 at gmail.com  Mon Oct  3 11:27:40 2016
From: ashu.9412 at gmail.com (Devashish Deshpande)
Date: Mon, 3 Oct 2016 20:57:40 +0530
Subject: [scikit-learn] Welcome Raghav to the core-dev team

Congratulations Raghav!!

From yenchenlin1994 at gmail.com  Mon Oct  3 11:28:58 2016
From: yenchenlin1994 at gmail.com (lin yenchen)
Date: Mon, 03 Oct 2016 15:28:58 +0000
Subject: [scikit-learn] Welcome Raghav to the core-dev team

Congrats, Raghav!

From krishnakalyan3 at gmail.com  Mon Oct  3 11:39:22 2016
From: krishnakalyan3 at gmail.com (Krishna Kalyan)
Date: Mon, 3 Oct 2016 17:39:22 +0200
Subject: [scikit-learn] Welcome Raghav to the core-dev team

Congrats Raghav. :)

From ronnie.ghose at gmail.com  Mon Oct  3 11:40:15 2016
From: ronnie.ghose at gmail.com (Ronnie Ghose)
Date: Mon, 3 Oct 2016 11:40:15 -0400
Subject: [scikit-learn] Welcome Raghav to the core-dev team

congrats! :)

From ragvrv at gmail.com  Mon Oct  3 12:09:13 2016
From: ragvrv at gmail.com (Raghav R V)
Date: Mon, 3 Oct 2016 18:09:13 +0200
Subject: [scikit-learn] Welcome Raghav to the core-dev team

Thanks everyone! Looking forward to contributing more :D
From nelle.varoquaux at gmail.com  Mon Oct  3 12:21:51 2016
From: nelle.varoquaux at gmail.com (Nelle Varoquaux)
Date: Mon, 3 Oct 2016 09:21:51 -0700
Subject: [scikit-learn] Welcome Raghav to the core-dev team

Congratulations Raghav!

From ragvrv at gmail.com  Mon Oct  3 12:23:35 2016
From: ragvrv at gmail.com (Raghav R V)
Date: Mon, 3 Oct 2016 18:23:35 +0200
Subject: [scikit-learn] ANN Scikit-learn 0.18 released

Hi Brown,

Thanks for the email. There is a working PR here: https://github.com/scikit-learn/scikit-learn/pull/7388

Would you be so kind as to take a look at it and comment on how helpful the proposed API is for your use case?

Thanks
From manojkumarsivaraj334 at gmail.com  Mon Oct  3 12:24:05 2016
From: manojkumarsivaraj334 at gmail.com (Manoj Kumar)
Date: Mon, 3 Oct 2016 12:24:05 -0400
Subject: [scikit-learn] Welcome Raghav to the core-dev team

Congratulations!

--
Manoj,
http://github.com/MechCoder

From aakash at klugtek.co.in  Mon Oct  3 12:48:05 2016
From: aakash at klugtek.co.in (Aakash Agarwal)
Date: Mon, 3 Oct 2016 22:18:05 +0530
Subject: [scikit-learn] Welcome Raghav to the core-dev team

Congrats Raghav!

--
Thanks,
Aakash
From siddharthgupta234 at gmail.com  Mon Oct  3 12:53:19 2016
From: siddharthgupta234 at gmail.com (Siddharth Gupta)
Date: Mon, 3 Oct 2016 22:23:19 +0530
Subject: [scikit-learn] Welcome Raghav to the core-dev team

Congrats Raghav! :D

From se.raschka at gmail.com  Mon Oct  3 13:06:59 2016
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Mon, 3 Oct 2016 13:06:59 -0400
Subject: [scikit-learn] Welcome Raghav to the core-dev team

Congrats Raghav! And thanks a lot for all the great work on the model_selection module!

From jmschreiber91 at gmail.com  Mon Oct  3 13:32:30 2016
From: jmschreiber91 at gmail.com (Jacob Schreiber)
Date: Mon, 3 Oct 2016 10:32:30 -0700
Subject: [scikit-learn] Welcome Raghav to the core-dev team

Congrats Raghav!
From klonuo at gmail.com Mon Oct 3 13:45:29 2016
From: klonuo at gmail.com (klo uo)
Date: Mon, 3 Oct 2016 19:45:29 +0200
Subject: [scikit-learn] Generate data from trained naive bayes
In-Reply-To: References: <476b1ca4-c7fa-89ec-81da-c42f9e7abb69@gmail.com>
Message-ID: 

Hi Manoj, thanks for your reply. Sorry to say, but I still don't understand how to generate a new feature.

In this example I have `X` with shape (1000, 64) and 5 unique classes. `feature_log_prob_` has shape (5, 64). I can generate, for example, uniform data with `r = np.random.rand(64)`. Now how can I generate new features from the trained classifier?

> On Mon, Oct 3, 2016 at 5:23 PM, Manoj Kumar wrote:
>
> Hi,
>
> feature_log_prob_ is an array of size (n_classes, n_features).
>
> exp(feature_log_prob_[class_ind, feature_ind]) gives
> P(X_{feature_ind} = 1 | class_ind)
>
> Using the conditional independence assumptions of NaiveBayes, you can use
> this to sample each feature independently given the class.
>
> Hope that helps.

From manojkumarsivaraj334 at gmail.com Mon Oct 3 14:20:09 2016
From: manojkumarsivaraj334 at gmail.com (Manoj Kumar)
Date: Mon, 3 Oct 2016 14:20:09 -0400
Subject: [scikit-learn] Generate data from trained naive bayes
In-Reply-To: References: <476b1ca4-c7fa-89ec-81da-c42f9e7abb69@gmail.com>
Message-ID: 

Let's say you would like to generate just the first feature of 1000 samples with label 0.

The distribution of the first feature conditioned on label 0 follows a Bernoulli distribution (as suggested by the name) with parameter exp(feature_log_prob_[0, 0]). You could then generate the first feature of these 1000 samples by just doing

    first_feature = bernoulli.rvs(exp(feature_log_prob_[0, 0]), size=1000)

And follow the same approach for all the other features with the corresponding parameters. (They are conditionally independent.)
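To make that recipe concrete end to end, here is a minimal, self-contained sketch. The toy data and all variable names are made up for illustration; only `feature_log_prob_` and `scipy.stats.bernoulli` come from the discussion above:

    import numpy as np
    from scipy.stats import bernoulli
    from sklearn.naive_bayes import BernoulliNB

    # Toy data shaped like the example above: 1000 binary samples,
    # 64 features, 5 classes.
    rng = np.random.RandomState(0)
    X = rng.randint(0, 2, size=(1000, 64))
    y = rng.randint(0, 5, size=1000)
    clf = BernoulliNB().fit(X, y)

    # P(X_j = 1 | class 0) for every feature j.
    p = np.exp(clf.feature_log_prob_[0])        # shape (64,)

    # Conditional independence: sample each feature on its own.
    X_new = np.column_stack([bernoulli.rvs(p_j, size=1000) for p_j in p])
    print(X_new.shape)                          # (1000, 64)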
From cs14btech11041 at iith.ac.in Mon Oct 3 14:25:51 2016
From: cs14btech11041 at iith.ac.in (Ibrahim Dalal)
Date: Mon, 3 Oct 2016 23:55:51 +0530
Subject: [scikit-learn] Random Forest with Bootstrapping
Message-ID: 

Dear Developers,

From whatever little knowledge I gained last night about Random Forests, each tree is trained on a sub-sample of the original dataset (usually drawn with replacement).

(Note: Please do correct me if I am not making any sense.)

RandomForestClassifier has an option of 'bootstrap'. The API states the following

> The sub-sample size is always the same as the original input sample size
> but the samples are drawn with replacement if bootstrap=True (default).

Now, what I am not able to understand is - if the entire dataset is used to train each of the trees, then how does the classifier estimate the OOB error? None of the entries of the dataset is out-of-bag for any of the trees. (Pardon me if all this sounds BS)

Help this mere mortal.

Thanks

From se.raschka at gmail.com Mon Oct 3 14:32:52 2016
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Mon, 3 Oct 2016 14:32:52 -0400
Subject: [scikit-learn] Random Forest with Bootstrapping
In-Reply-To: References: 
Message-ID: 

> From whatever little knowledge I gained last night about Random Forests,
> each tree is trained on a sub-sample of the original dataset (usually
> drawn with replacement).

Yes, that should be correct!

> Now, what I am not able to understand is - if the entire dataset is used
> to train each of the trees, then how does the classifier estimate the
> OOB error?

If you take an n-sized bootstrap sample, where n is the number of samples in your dataset, you have asymptotically 0.632 * n unique samples in your bootstrap set. Or, in other words, 0.368 * n samples are not used for growing the respective tree (and can be used to compute the OOB error). As far as I understand, the random forest OOB score is then computed as the average OOB of each tree (correct me if I am wrong!).

Best,
Sebastian
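That 0.632 / 0.368 split is easy to check empirically. A throwaway simulation (not scikit-learn code; the sizes and names are arbitrary):

    import numpy as np

    rng = np.random.RandomState(0)
    n = 10000

    # Draw n-sized bootstrap samples (indices drawn with replacement) and
    # measure which fraction of the original indices each one contains.
    fractions = [np.unique(rng.randint(0, n, size=n)).size / float(n)
                 for _ in range(100)]

    print(np.mean(fractions))   # ~0.632; the remaining ~0.368 are out-of-bag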
From aadral at gmail.com Mon Oct 3 14:34:04 2016
From: aadral at gmail.com (Alexey Dral)
Date: Mon, 3 Oct 2016 19:34:04 +0100
Subject: [scikit-learn] Random Forest with Bootstrapping
In-Reply-To: References: 
Message-ID: 

Hi,

From the docs (http://scikit-learn.org/stable/auto_examples/ensemble/plot_ensemble_oob.html):

> The RandomForestClassifier is trained using bootstrap aggregation, where
> each new tree is fit from a bootstrap sample of the training observations
> z_i = (x_i, y_i). The out-of-bag (OOB) error is the average error for each
> z_i calculated using predictions from the trees that do not contain z_i in
> their respective bootstrap sample. This allows the RandomForestClassifier
> to be fit and validated whilst being trained [1].

If you sample with replacement, there is a high chance that some z_i are not included in the training phase of a given tree. That tree is then involved in the estimation of the OOB error for those z_i. I hope this makes it a little bit clearer.

-- 
Yours sincerely,
Alexey A. Dral
https://www.linkedin.com/in/alexey-dral
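In code, that estimate is exposed directly on the estimator. A minimal usage sketch (the dataset choice is arbitrary):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    iris = load_iris()

    # bootstrap=True is the default; oob_score=True makes the forest score
    # each sample using only the trees that did not see it during fitting.
    clf = RandomForestClassifier(n_estimators=100, oob_score=True,
                                 random_state=0)
    clf.fit(iris.data, iris.target)

    print(clf.oob_score_)   # out-of-bag accuracy estimate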
From desitter.gravity at gmail.com Mon Oct 3 14:36:11 2016
From: desitter.gravity at gmail.com (desitter.gravity at gmail.com)
Date: Mon, 3 Oct 2016 11:36:11 -0700
Subject: [scikit-learn] Welcome Raghav to the core-dev team
In-Reply-To: References: <20161003151415.GF20745@phare.normalesup.org>
Message-ID: 

Excellent Raghav! Open Source Rules the World!

> On Mon, Oct 3, 2016 at 10:32 AM, Jacob Schreiber wrote:
>
> Congrats Raghav!
From zephyr14 at gmail.com Mon Oct 3 15:04:03 2016
From: zephyr14 at gmail.com (Vlad Niculae)
Date: Mon, 3 Oct 2016 15:04:03 -0400
Subject: [scikit-learn] Welcome Raghav to the core-dev team
In-Reply-To: References: <20161003151415.GF20745@phare.normalesup.org>
Message-ID: 

Awesome! Congrats Raghav and thank you for all your contributions!

> On Mon, Oct 3, 2016 at 1:32 PM, Jacob Schreiber wrote:
>
> Congrats Raghav!
From cs14btech11041 at iith.ac.in Mon Oct 3 15:05:55 2016
From: cs14btech11041 at iith.ac.in (Ibrahim Dalal)
Date: Tue, 4 Oct 2016 00:35:55 +0530
Subject: [scikit-learn] Random Forest with Bootstrapping
In-Reply-To: References: 
Message-ID: 

Hi,

Thank you for the reply. Please bear with me for a while.

Where did this number, 0.632, come from? I have no background in statistics (which appears to be the case here!). Or let me rephrase my query: what is this bootstrap sampling all about? I searched the web, but didn't get satisfactory results.

Thanks

> On Tue, Oct 4, 2016 at 12:02 AM, Sebastian Raschka wrote:
>
> If you take an n-sized bootstrap sample, where n is the number of samples
> in your dataset, you have asymptotically 0.632 * n unique samples in your
> bootstrap set.
From se.raschka at gmail.com Mon Oct 3 15:15:18 2016
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Mon, 3 Oct 2016 15:15:18 -0400
Subject: [scikit-learn] Random Forest with Bootstrapping
In-Reply-To: References: 
Message-ID: <18DFDA25-0236-486C-B23D-4E1118EC4803@gmail.com>

The probability that a given sample from a dataset of size n is *not* drawn as a bootstrap sample is

    P(not_chosen) = (1 - 1/n)^n

since you have a 1/n chance of drawing a particular sample in each draw (bootstrapping involves drawing with replacement), and you repeat the draw n times to get an n-sized bootstrap sample.

This is asymptotically 1/e, approx. 0.368 (i.e., for very, very large n).

Then, you can compute the probability of a sample being chosen as

    P(chosen) = 1 - (1 - 1/n)^n, approx. 0.632

Best,
Sebastian
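A quick numeric check of that limit:

    import math

    for n in (10, 100, 1000, 10**6):
        print(n, round((1 - 1.0 / n) ** n, 5))
    # 10       0.34868
    # 100      0.36603
    # 1000     0.3677
    # 1000000  0.36788

    print(round(1 / math.e, 5))  # 0.36788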
> > > > Thanks > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From klonuo at gmail.com Mon Oct 3 15:18:30 2016 From: klonuo at gmail.com (klo uo) Date: Mon, 3 Oct 2016 21:18:30 +0200 Subject: [scikit-learn] Generate data from trained naive bayes In-Reply-To: References: <476b1ca4-c7fa-89ec-81da-c42f9e7abb69@gmail.com> Message-ID: Great. Thanks for your time Manoj Cheers, Klo On Mon, Oct 3, 2016 at 8:20 PM, Manoj Kumar wrote: > Let's say you would like to generate just the first feature of 1000 > samples with label 0. > > The distribution of the first feature conditioned on label 1 follows a > Bernoulli distribution (as suggested by the name) with parameter > "exp(feature_log_prob_[0, 0])". You could then generate the first feature > of these 1000 samples by just doing > > first_feature = bernoulli.rvs(exp(feature_log_prob_[0, 0]), size=1000) > > And follow the same approach for all the other features with the > corresponding parameters. (They are conditionally independent) > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From se.raschka at gmail.com Mon Oct 3 15:20:06 2016 From: se.raschka at gmail.com (Sebastian Raschka) Date: Mon, 3 Oct 2016 15:20:06 -0400 Subject: [scikit-learn] Random Forest with Bootstrapping In-Reply-To: <18DFDA25-0236-486C-B23D-4E1118EC4803@gmail.com> References: <18DFDA25-0236-486C-B23D-4E1118EC4803@gmail.com> Message-ID: <4977A46A-2064-42EE-8853-4E4C799776A3@gmail.com> Or maybe more intuitively, you can visualize this asymptotic behavior e.g., via import matplotlib.pyplot as plt vs = [] for n in range(5, 201, 5): v = 1 - (1. - 1./n)**n vs.append(v) plt.plot([n for n in range(5, 201, 5)], vs, marker='o', markersize=6, alpha=0.5,) plt.xlabel('n') plt.ylabel('1 - (1 - 1/n)^n') plt.xlim([0, 210]) plt.show() > On Oct 3, 2016, at 3:15 PM, Sebastian Raschka wrote: > > Say the probability that a given sample from a dataset of size n is *not* drawn as a bootstrap sample is > > P(not_chosen) = (1 - 1\n)^n > > Since you have a 1/n chance to draw a particular sample (since bootstrapping involves drawing with replacement), which you repeat n times to get a n-sized bootstrap sample. > > This is asymptotically "1/e approx. 0.368? (i.e., for very, very large n) > > Then, you can compute the probability of a sample being chosen as > > P(chosen) = 1 - (1 - 1/n)^n approx. 0.632 > > Best, > Sebastian > >> On Oct 3, 2016, at 3:05 PM, Ibrahim Dalal via scikit-learn wrote: >> >> Hi, >> >> Thank you for the reply. Please bear with me for a while. >> >> From where did this number, 0.632, come? I have no background in statistics (which appears to be the case here!). Or let me rephrase my query: what is this bootstrap sampling all about? Searched the web, but didn't get satisfactory results. 
From t3kcit at gmail.com Mon Oct 3 15:25:54 2016
From: t3kcit at gmail.com (Andreas Mueller)
Date: Mon, 3 Oct 2016 15:25:54 -0400
Subject: [scikit-learn] Welcome Raghav to the core-dev team
In-Reply-To: References: <20161003151415.GF20745@phare.normalesup.org>
Message-ID: <3d961045-39d0-8c81-1deb-2f6b7332ff1e@gmail.com>

Congrats, hope to see lots more ;)

On 10/03/2016 12:09 PM, Raghav R V wrote:
> Thanks everyone! Looking forward to contributing more :D
From cs14btech11041 at iith.ac.in Mon Oct 3 15:36:54 2016
From: cs14btech11041 at iith.ac.in (Ibrahim Dalal)
Date: Tue, 4 Oct 2016 01:06:54 +0530
Subject: [scikit-learn] Random Forest with Bootstrapping
In-Reply-To: <4977A46A-2064-42EE-8853-4E4C799776A3@gmail.com>
Message-ID: 

Hi,

That helped a lot. Thank you very much. I have one more (silly?) doubt though.

Won't an n-sized bootstrapped sample have repeated entries? Say we have an original dataset of size 100. A bootstrap sample (say, B) of size 100 is drawn from this set. Since about 37 of the original samples are left out (theoretically at least), some of the samples in B must be repeated?
From se.raschka at gmail.com Mon Oct 3 15:59:38 2016
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Mon, 3 Oct 2016 15:59:38 -0400
Subject: [scikit-learn] Random Forest with Bootstrapping
In-Reply-To: References: <18DFDA25-0236-486C-B23D-4E1118EC4803@gmail.com> <4977A46A-2064-42EE-8853-4E4C799776A3@gmail.com>
Message-ID: 

> Won't an n-sized bootstrapped sample have repeated entries?

Yeah, you'll definitely have duplications; that's why (if you have an infinitely large n) only 0.632*n samples are unique ;).
Say your dataset is

    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

(where the numbers represent the indices of your data points). Then a bootstrap sample could be

    [9, 1, 1, 0, 4, 4, 5, 7, 9, 9]

and your left-out (out-of-bag) sample is consequently

    [2, 3, 6, 8]
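The same toy example written out with numpy (the seed is arbitrary, so the printed values will differ per run/seed):

    import numpy as np

    rng = np.random.RandomState(0)
    idx = np.arange(10)                                  # dataset indices

    boot = rng.choice(idx, size=idx.size, replace=True)  # bootstrap sample
    oob = np.setdiff1d(idx, boot)                        # indices never drawn

    print(boot)   # e.g. [5 0 3 3 7 9 3 5 2 4]
    print(oob)    # e.g. [1 6 8]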
From cs14btech11041 at iith.ac.in Mon Oct 3 16:03:52 2016
From: cs14btech11041 at iith.ac.in (Ibrahim Dalal)
Date: Tue, 4 Oct 2016 01:33:52 +0530
Subject: [scikit-learn] Random Forest with Bootstrapping
In-Reply-To: References: <18DFDA25-0236-486C-B23D-4E1118EC4803@gmail.com> <4977A46A-2064-42EE-8853-4E4C799776A3@gmail.com>
Message-ID: 

So what is the point of having duplicate entries in your training set? That seems like pure overhead. Sorry, but you will again have to help me here.
> > > > On Tue, Oct 4, 2016 at 12:50 AM, Sebastian Raschka > wrote: > > Or maybe more intuitively, you can visualize this asymptotic behavior > e.g., via > > > > import matplotlib.pyplot as plt > > > > vs = [] > > for n in range(5, 201, 5): > > v = 1 - (1. - 1./n)**n > > vs.append(v) > > > > plt.plot([n for n in range(5, 201, 5)], vs, marker='o', > > markersize=6, > > alpha=0.5,) > > > > plt.xlabel('n') > > plt.ylabel('1 - (1 - 1/n)^n') > > plt.xlim([0, 210]) > > plt.show() > > > > > On Oct 3, 2016, at 3:15 PM, Sebastian Raschka > wrote: > > > > > > Say the probability that a given sample from a dataset of size n is > *not* drawn as a bootstrap sample is > > > > > > P(not_chosen) = (1 - 1\n)^n > > > > > > Since you have a 1/n chance to draw a particular sample (since > bootstrapping involves drawing with replacement), which you repeat n times > to get a n-sized bootstrap sample. > > > > > > This is asymptotically "1/e approx. 0.368? (i.e., for very, very large > n) > > > > > > Then, you can compute the probability of a sample being chosen as > > > > > > P(chosen) = 1 - (1 - 1/n)^n approx. 0.632 > > > > > > Best, > > > Sebastian > > > > > >> On Oct 3, 2016, at 3:05 PM, Ibrahim Dalal via scikit-learn < > scikit-learn at python.org> wrote: > > >> > > >> Hi, > > >> > > >> Thank you for the reply. Please bear with me for a while. > > >> > > >> From where did this number, 0.632, come? I have no background in > statistics (which appears to be the case here!). Or let me rephrase my > query: what is this bootstrap sampling all about? Searched the web, but > didn't get satisfactory results. > > >> > > >> > > >> Thanks > > >> > > >> On Tue, Oct 4, 2016 at 12:02 AM, Sebastian Raschka < > se.raschka at gmail.com> wrote: > > >>> From whatever little knowledge I gained last night about Random > Forests, each tree is trained with a sub-sample of original dataset > (usually with replacement)?. > > >> > > >> Yes, that should be correct! > > >> > > >>> Now, what I am not able to understand is - if entire dataset is used > to train each of the trees, then how does the classifier estimates the OOB > error? None of the entries of the dataset is an oob for any of the trees. > (Pardon me if all this sounds BS) > > >> > > >> If you take an n-size bootstrap sample, where n is the number of > samples in your dataset, you have asymptotically 0.632 * n unique samples > in your bootstrap set. Or in other words 0.368 * n samples are not used for > growing the respective tree (to compute the OOB). As far as I understand, > the random forest OOB score is then computed as the average OOB of each tee > (correct me if I am wrong!). > > >> > > >> Best, > > >> Sebastian > > >> > > >>> On Oct 3, 2016, at 2:25 PM, Ibrahim Dalal via scikit-learn < > scikit-learn at python.org> wrote: > > >>> > > >>> Dear Developers, > > >>> > > >>> From whatever little knowledge I gained last night about Random > Forests, each tree is trained with a sub-sample of original dataset > (usually with replacement)?. > > >>> > > >>> (Note: Please do correct me if I am not making any sense.) > > >>> > > >>> RandomForestClassifier has an option of 'bootstrap'. The API states > the following > > >>> > > >>> The sub-sample size is always the same as the original input sample > size but the samples are drawn with replacement if bootstrap=True (default). > > >>> > > >>> Now, what I am not able to understand is - if entire dataset is used > to train each of the trees, then how does the classifier estimates the OOB > error? 
From se.raschka at gmail.com Mon Oct 3 16:28:36 2016
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Mon, 3 Oct 2016 16:28:36 -0400
Subject: [scikit-learn] Random Forest with Bootstrapping
In-Reply-To: References: <18DFDA25-0236-486C-B23D-4E1118EC4803@gmail.com> <4977A46A-2064-42EE-8853-4E4C799776A3@gmail.com>
Message-ID: 

Originally, this technique was used to estimate a sampling distribution. Think of the drawing with replacement as a work-around for generating *new* data from a population that is simulated by this repeated sampling from the given dataset with replacement.

For more details, I'd recommend reading the original literature, e.g.,

Efron, Bradley. 1979. "Bootstrap Methods: Another Look at the Jackknife." The Annals of Statistics 7 (1). Institute of Mathematical Statistics: 1-26.

There's also a whole book on this topic:

Efron, Bradley, and Robert Tibshirani. 1994. An Introduction to the Bootstrap. Chapman & Hall.

Or, more relevant to this particular application, maybe see

Breiman, L., 1996. Bagging predictors. Machine Learning, 24(2), pp. 123-140.

"Tests on real and simulated data sets using classification and regression trees and subset selection in linear regression show that bagging can give substantial gains in accuracy. The vital element is the instability of the prediction method. If perturbing the learning set can cause significant changes in the predictor constructed, then bagging can improve accuracy."
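Breiman's point is easy to reproduce in miniature: bag an unstable estimator (a fully grown decision tree) and compare cross-validated accuracy. A rough sketch on made-up synthetic data, not a benchmark:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    tree = DecisionTreeClassifier(random_state=0)
    bagged = BaggingClassifier(base_estimator=tree, n_estimators=100,
                               bootstrap=True, random_state=0)

    print(np.mean(cross_val_score(tree, X, y, cv=5)))    # single unstable tree
    print(np.mean(cross_val_score(bagged, X, y, cv=5)))  # typically higher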
From jaquesgrobler at gmail.com Tue Oct 4 04:14:54 2016
From: jaquesgrobler at gmail.com (Jaques Grobler)
Date: Tue, 4 Oct 2016 10:14:54 +0200
Subject: [scikit-learn] Welcome Raghav to the core-dev team
In-Reply-To: <3d961045-39d0-8c81-1deb-2f6b7332ff1e@gmail.com>
References: <20161003151415.GF20745@phare.normalesup.org> <3d961045-39d0-8c81-1deb-2f6b7332ff1e@gmail.com>
Message-ID: 

Congrats Raghav!

2016-10-03 21:25 GMT+02:00 Andreas Mueller:
> Congrats, hope to see lots more ;)
>>>> In particular, he implemented the rewrite of the cross-validation
>>>> utilities, which is quite dear to my heart.
>>>>
>>>> Welcome Raghav!
>>>>
>>>> Gaël

From Victor.Poughon at cnes.fr Tue Oct 4 05:10:04 2016
From: Victor.Poughon at cnes.fr (Poughon Victor)
Date: Tue, 4 Oct 2016 09:10:04 +0000
Subject: [scikit-learn] sample_weight for cohen_kappa_score
In-Reply-To: <71ddf2ec-cf8a-c6bd-3134-cd1bc7f5e360@gmail.com>
References: <3E55146A6A81B44A9CB69CAB65908CEA3558C867@TW-MBX-P01.cnesnet.ad.cnes.fr>, <71ddf2ec-cf8a-c6bd-3134-cd1bc7f5e360@gmail.com>
Message-ID: <3E55146A6A81B44A9CB69CAB65908CEA3558E06E@TW-MBX-P01.cnesnet.ad.cnes.fr>

I had a go at a PR (with a caveat for testing): https://github.com/scikit-learn/scikit-learn/pull/7569

Victor Poughon

________________________________________
From: scikit-learn [scikit-learn-bounces+victor.poughon=cnes.fr at python.org] on behalf of Andreas Mueller [t3kcit at gmail.com]
Sent: Monday, 3 October 2016 15:09
To: Scikit-learn user and developer mailing list
Subject: Re: [scikit-learn] sample_weight for cohen_kappa_score

Hm, it sounds like "weights" should have been called "weighting", maybe?
Not sure if it's worth changing now, as we released it already.
And I think passing the weighting to the confusion matrix is correct.
There should be tests for weighted metrics to confirm that. PR welcome.

On 10/03/2016 05:21 AM, Poughon Victor wrote:
> Hello,
>
> I'd like to use sample weights together with sklearn.metrics.cohen_kappa_score,
> in a similar way to other metrics which have this argument. Is it as simple as
> forwarding the weights to the confusion_matrix call? [0]
>
> If yes I'm happy to work on the pull request.
>
> In that case the other argument "weights" might be confusing, but it's too late
> to rename it, right?
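To make the forwarding idea concrete, kappa can be computed by hand from a sample-weighted confusion matrix using only existing scikit-learn calls (a sketch with made-up labels and weights, following the standard kappa definition):

import numpy as np
from sklearn.metrics import confusion_matrix

y1 = np.array([0, 1, 1, 0, 1])
y2 = np.array([0, 1, 0, 0, 1])
w = np.array([1.0, 2.0, 1.0, 0.5, 1.0])  # hypothetical sample weights

C = confusion_matrix(y1, y2, sample_weight=w).astype(float)
n = C.sum()
p_observed = np.trace(C) / n  # weighted observed agreement
p_expected = (C.sum(axis=0) * C.sum(axis=1)).sum() / n**2  # weighted chance agreement
kappa = (p_observed - p_expected) / (1.0 - p_expected)
print(kappa)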
> Cheers,
>
> Victor Poughon
>
> [0] https://github.com/scikit-learn/scikit-learn/blob/dee786a/sklearn/metrics/classification.py#L331

From joel.nothman at gmail.com Tue Oct 4 06:43:02 2016
From: joel.nothman at gmail.com (Joel Nothman)
Date: Tue, 4 Oct 2016 21:43:02 +1100
Subject: [scikit-learn] Welcome Raghav to the core-dev team
In-Reply-To: References: <20161003151415.GF20745@phare.normalesup.org> <3d961045-39d0-8c81-1deb-2f6b7332ff1e@gmail.com>
Message-ID:

Congratulations, Raghav! Thanks for your dedication, as a student and mentor in GSoC, but at all other times too!

On 4 October 2016 at 19:14, Jaques Grobler wrote:
> Congrats Raghav!
>
> 2016-10-03 21:25 GMT+02:00 Andreas Mueller :
>> Congrats, hope to see lots more ;)
>>
>> On 10/03/2016 12:09 PM, Raghav R V wrote:
>> Thanks everyone! Looking forward to contributing more :D
>>
>> On Mon, Oct 3, 2016 at 5:40 PM, Ronnie Ghose wrote:
>>> congrats! :)
>>>
>>> On Mon, Oct 3, 2016 at 11:28 AM, lin yenchen wrote:
>>>> Congrats, Raghav!
>>>>
>>>> Nelson Liu wrote on Mon, 3 Oct 2016 at 11:27 PM:
>>>>> Yay! Congrats, Raghav!
>>>>>
>>>>> On Mon, Oct 3, 2016 at 8:14 AM, Gael Varoquaux <gael.varoquaux at normalesup.org> wrote:
>>>>> Hi,
>>>>>
>>>>> We have the pleasure to welcome Raghav RV to the core-dev team. Raghav
>>>>> (@raghavrv) has been working on scikit-learn for more than a year. In
>>>>> particular, he implemented the rewrite of the cross-validation
>>>>> utilities, which is quite dear to my heart.
>>>>>
>>>>> Welcome Raghav!
>>>>>
>>>>> Gaël
From cs14btech11041 at iith.ac.in Tue Oct 4 06:44:06 2016
From: cs14btech11041 at iith.ac.in (Ibrahim Dalal)
Date: Tue, 4 Oct 2016 16:14:06 +0530
Subject: [scikit-learn] Random Forest with Bootstrapping
In-Reply-To: References: <18DFDA25-0236-486C-B23D-4E1118EC4803@gmail.com> <4977A46A-2064-42EE-8853-4E4C799776A3@gmail.com>
Message-ID:

Hi,

So why is using a bootstrap sample of size n better than just a random subset of size 0.632*n in Random Forest?

Thanks

On Tue, Oct 4, 2016 at 1:58 AM, Sebastian Raschka <se.raschka at gmail.com> wrote:
> Originally, this technique was used to estimate a sampling distribution.
> Think of the drawing with replacement as a work-around for generating *new*
> data from a population that is simulated by this repeated sampling from the
> given dataset with replacement.
>
> For more details, I'd recommend reading the original literature, e.g.,
>
> Efron, Bradley. 1979. "Bootstrap Methods: Another Look at the Jackknife."
> The Annals of Statistics 7 (1). Institute of Mathematical Statistics: 1-26.
>
> There's also a whole book on this topic:
>
> Efron, Bradley, and Robert Tibshirani. 1994. An Introduction to the
> Bootstrap. Chapman & Hall.
>
> Or more relevant to this particular application, maybe see
>
> Breiman, L., 1996. Bagging predictors. Machine Learning, 24(2), pp.123-140.
>
> "Tests on real and simulated data sets using classification and regression
> trees and subset selection in linear regression show that bagging can give
> substantial gains in accuracy. The vital element is the instability of the
> prediction method. If perturbing the learning set can cause significant
> changes in the predictor constructed, then bagging can improve accuracy."
>
> > On Oct 3, 2016, at 4:03 PM, Ibrahim Dalal via scikit-learn wrote:
> >
> > So what is the point of having duplicate entries in your training set?
> > This seems just pure overhead. Sorry, but you will again have to help me here.
> >
> > On Tue, Oct 4, 2016 at 1:29 AM, Sebastian Raschka wrote:
> > > Hi,
> > >
> > > That helped a lot. Thank you very much. I have one more (silly?) doubt though.
> > >
> > > Won't an n-sized bootstrapped sample have repeated entries? Say we have an
> > > original dataset of size 100. A bootstrap sample (say, B) of size 100 is drawn
> > > from this set. Since about 37 of the original samples are left out (theoretically
> > > at least), some of the samples in B must be repeated?
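One way to probe the size-n-versus-0.632*n question empirically is to compare bagging with bootstrap samples against bagging with subsamples of size 0.632*n drawn without replacement. A rough sketch (my own illustration, with an arbitrary synthetic dataset; in practice the two often perform similarly):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

boot = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                         bootstrap=True, random_state=0)   # classic bagging, n-sized bootstrap
sub = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                        bootstrap=False, max_samples=0.632, random_state=0)  # 0.632*n subsampling

print(cross_val_score(boot, X, y).mean())
print(cross_val_score(sub, X, y).mean())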
> > > On Tue, Oct 4, 2016 at 12:50 AM, Sebastian Raschka <se.raschka at gmail.com> wrote:
> > > Or maybe more intuitively, you can visualize this asymptotic behavior e.g., via
> > >
> > > import matplotlib.pyplot as plt
> > >
> > > vs = []
> > > for n in range(5, 201, 5):
> > >     v = 1 - (1. - 1./n)**n
> > >     vs.append(v)
> > >
> > > plt.plot([n for n in range(5, 201, 5)], vs, marker='o',
> > >          markersize=6,
> > >          alpha=0.5,)
> > >
> > > plt.xlabel('n')
> > > plt.ylabel('1 - (1 - 1/n)^n')
> > > plt.xlim([0, 210])
> > > plt.show()
> > >
> > > > On Oct 3, 2016, at 3:15 PM, Sebastian Raschka wrote:
> > > >
> > > > Say the probability that a given sample from a dataset of size n is *not*
> > > > drawn as a bootstrap sample is
> > > >
> > > > P(not_chosen) = (1 - 1/n)^n
> > > >
> > > > since you have a 1/n chance to draw a particular sample (bootstrapping
> > > > involves drawing with replacement), which you repeat n times to get an
> > > > n-sized bootstrap sample.
> > > >
> > > > This is asymptotically "1/e approx. 0.368" (i.e., for very, very large n).
> > > >
> > > > Then, you can compute the probability of a sample being chosen as
> > > >
> > > > P(chosen) = 1 - (1 - 1/n)^n approx. 0.632
> > > >
> > > > Best,
> > > > Sebastian
> > > >
> > > >> On Oct 3, 2016, at 3:05 PM, Ibrahim Dalal via scikit-learn wrote:
> > > >>
> > > >> Hi,
> > > >>
> > > >> Thank you for the reply. Please bear with me for a while.
> > > >>
> > > >> Where did this number, 0.632, come from? I have no background in
> > > >> statistics (which appears to be the case here!). Or let me rephrase my
> > > >> query: what is this bootstrap sampling all about? Searched the web, but
> > > >> didn't get satisfactory results.
> > > >>
> > > >> Thanks
> > > >>
> > > >> On Tue, Oct 4, 2016 at 12:02 AM, Sebastian Raschka <se.raschka at gmail.com> wrote:
> > > >>> From whatever little knowledge I gained last night about Random
> > > >>> Forests, each tree is trained with a sub-sample of the original dataset
> > > >>> (usually with replacement).
> > > >>
> > > >> Yes, that should be correct!
> > > >>
> > > >>> Now, what I am not able to understand is - if the entire dataset is
> > > >>> used to train each of the trees, then how does the classifier estimate the
> > > >>> OOB error? None of the entries of the dataset is OOB for any of the
> > > >>> trees. (Pardon me if all this sounds BS)
> > > >>
> > > >> If you take an n-sized bootstrap sample, where n is the number of
> > > >> samples in your dataset, you have asymptotically 0.632 * n unique samples
> > > >> in your bootstrap set. Or in other words, 0.368 * n samples are not used for
> > > >> growing the respective tree (to compute the OOB). As far as I understand,
> > > >> the random forest OOB score is then computed as the average OOB of each tree
> > > >> (correct me if I am wrong!).
> > > >>
> > > >> Best,
> > > >> Sebastian
> > > >>
> > > >>> On Oct 3, 2016, at 2:25 PM, Ibrahim Dalal via scikit-learn wrote:
> > > >>>
> > > >>> Dear Developers,
> > > >>>
> > > >>> From whatever little knowledge I gained last night about Random
> > > >>> Forests, each tree is trained with a sub-sample of the original dataset
> > > >>> (usually with replacement).
> > > >>>
> > > >>> (Note: Please do correct me if I am not making any sense.)
> > > >>>
> > > >>> RandomForestClassifier has an option of 'bootstrap'. The API
> > > >>> states the following:
> > > >>>
> > > >>> The sub-sample size is always the same as the original input
> > > >>> sample size but the samples are drawn with replacement if bootstrap=True
> > > >>> (default).
> > > >>>
> > > >>> Now, what I am not able to understand is - if the entire dataset is
> > > >>> used to train each of the trees, then how does the classifier estimate the
> > > >>> OOB error? None of the entries of the dataset is OOB for any of the
> > > >>> trees. (Pardon me if all this sounds BS)
> > > >>>
> > > >>> Help this mere mortal.
> > > >>>
> > > >>> Thanks

From Dale.T.Smith at macys.com Tue Oct 4 08:15:41 2016
From: Dale.T.Smith at macys.com (Dale T Smith)
Date: Tue, 4 Oct 2016 12:15:41 +0000
Subject: [scikit-learn] Random Forest with Bootstrapping
In-Reply-To: References: <18DFDA25-0236-486C-B23D-4E1118EC4803@gmail.com> <4977A46A-2064-42EE-8853-4E4C799776A3@gmail.com>
Message-ID:

Search for "jackknife" at Wikipedia. That will give you a quick overview. Then you will have the background to read the papers below. While you are at Wikipedia, you may want to read about the bootstrap and random forests as well.

__________________________________________________________________________________________
Dale Smith | Macy's Systems and Technology | IFS eCommerce | Data Science
770-658-5176 | 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com

From: scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=macys.com at python.org] On Behalf Of Ibrahim Dalal via scikit-learn
Sent: Tuesday, October 4, 2016 6:44 AM
To: Scikit-learn user and developer mailing list
Cc: Ibrahim Dalal
Subject: Re: [scikit-learn] Random Forest with Bootstrapping

Hi,

So why is using a bootstrap sample of size n better than just a random subset of size 0.632*n in Random Forest?

Thanks

On Tue, Oct 4, 2016 at 1:58 AM, Sebastian Raschka <se.raschka at gmail.com> wrote:

Originally, this technique was used to estimate a sampling distribution.
Think of the drawing with replacement as a work-around for generating *new* data from a population that is simulated by this repeated sampling from the given dataset with replacement.

For more details, I'd recommend reading the original literature, e.g.,

Efron, Bradley. 1979. "Bootstrap Methods: Another Look at the Jackknife." The Annals of Statistics 7 (1). Institute of Mathematical Statistics: 1-26.

There's also a whole book on this topic:

Efron, Bradley, and Robert Tibshirani. 1994. An Introduction to the Bootstrap. Chapman & Hall.

Or more relevant to this particular application, maybe see

Breiman, L., 1996. Bagging predictors. Machine Learning, 24(2), pp.123-140.

"Tests on real and simulated data sets using classification and regression trees and subset selection in linear regression show that bagging can give substantial gains in accuracy. The vital element is the instability of the prediction method. If perturbing the learning set can cause significant changes in the predictor constructed, then bagging can improve accuracy."

From urvesh.patel11 at gmail.com Tue Oct 4 17:39:32 2016
From: urvesh.patel11 at gmail.com (urvesh patel)
Date: Tue, 4 Oct 2016 14:39:32 -0700
Subject: [scikit-learn] Adding a function that Calculates Weight of Evidence and Information Value
Message-ID:

I have been using R extensively until the last few months, when I started using Python. I noticed that Python doesn't have a function to compute information value and weight of evidence. Detailed explanation - http://multithreaded.stitchfix.com/blog/2015/08/13/weight-of-evidence/

I have version 0 of this concept ready and I would like to contribute it to scikit-learn so that more and more people can use it. What are the steps I need to follow in order to do so?

--
Thanking You,

Urvesh Patel
Data Ninja
Udacity

From blrstartuphire at gmail.com Wed Oct 5 05:58:03 2016
From: blrstartuphire at gmail.com (Startup Hire)
Date: Wed, 5 Oct 2016 15:28:03 +0530
Subject: [scikit-learn] Identifying column names of Non-zero values
Message-ID:

Hi Pypers,

Hope you are doing well.

I am working on a project to find out the column names of non-zero values at a row level. How can this be done efficiently in Python with a pandas DataFrame?

For example:

Column1  Column2  Column3  Column4  Column5  Column6  Column7 | New column to be created
1        1        1        0        0        0        1      | Column1,Column2,Column3,Column7

I might have to do it on approximately a million rows.

Regards,
Sanant
From samo.turk at gmail.com Wed Oct 5 07:35:25 2016
From: samo.turk at gmail.com (Samo Turk)
Date: Wed, 5 Oct 2016 13:35:25 +0200
Subject: [scikit-learn] Identifying column names of Non-zero values
In-Reply-To: References: Message-ID:

Something like this might work:

def non_zero(row, columns):
    return list(columns[~(row == 0)])

df.apply(lambda x: non_zero(x, df.columns), axis=1)

Cheers,
Samo

On Wed, Oct 5, 2016 at 11:58 AM, Startup Hire wrote:
> Hi Pypers,
>
> Hope you are doing well.
>
> I am working on a project to find out the column names of non-zero values
> at a row level. How can this be done efficiently in Python with a pandas DataFrame?
>
> I might have to do it on approximately a million rows.
>
> Regards,
> Sanant

From maciek at wojcikowski.pl Wed Oct 5 07:53:28 2016
From: maciek at wojcikowski.pl (=?UTF-8?Q?Maciek_W=C3=B3jcikowski?=)
Date: Wed, 5 Oct 2016 13:53:28 +0200
Subject: [scikit-learn] Identifying column names of Non-zero values
In-Reply-To: References: Message-ID:

Hi Sanant and Samo,

An even easier and faster solution:
> df.columns[(df.values != 0).any(axis=0)]
Or, if for some reason != 0 does not work for you:
> df.columns[(~(df.values == 0)).any(axis=0)]

----
Pozdrawiam, | Best regards,
Maciek Wójcikowski
maciek at wojcikowski.pl

2016-10-05 13:35 GMT+02:00 Samo Turk :
> Something like this might work:
>
> def non_zero(row, columns):
>     return list(columns[~(row == 0)])
>
> df.apply(lambda x: non_zero(x, df.columns), axis=1)
>
> Cheers,
> Samo

From blrstartuphire at gmail.com Wed Oct 5 08:13:19 2016
From: blrstartuphire at gmail.com (Startup Hire)
Date: Wed, 5 Oct 2016 17:43:19 +0530
Subject: [scikit-learn] Identifying column names of Non-zero values
In-Reply-To: References: Message-ID:

Hi Samo,

Thanks a lot. It works at a row level, and I can append the result as a new column to the main dataframe to do further analysis.
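For a million rows, the row-wise apply can be slow; a vectorized alternative is a dot product against the column names (a sketch, assuming all columns are numeric; the sample data is made up):

import pandas as pd

df = pd.DataFrame({'Column1': [1, 0, 1], 'Column2': [1, 1, 0], 'Column3': [0, 1, 1]})
# (df != 0) gives a boolean mask; the dot product concatenates the names of the True columns
new_col = (df != 0).astype(int).dot(df.columns + ',').str.rstrip(',')
df['New column'] = new_col
print(df)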
Regards,
Sanant

On Wed, Oct 5, 2016 at 5:05 PM, Samo Turk wrote:
> Something like this might work:
>
> def non_zero(row, columns):
>     return list(columns[~(row == 0)])
>
> df.apply(lambda x: non_zero(x, df.columns), axis=1)
>
> Cheers,
> Samo

From jiri.borovec at fel.cvut.cz Wed Oct 5 09:13:45 2016
From: jiri.borovec at fel.cvut.cz (=?UTF-8?B?SmnFmcOtIEJvcm92ZWM=?=)
Date: Wed, 5 Oct 2016 15:13:45 +0200
Subject: [scikit-learn] wrapper for GraphCut or GridCut
Message-ID:

Hello,
I was thinking about adding GraphCut (http://www.csd.uwo.ca/~yuri/Papers/pami01.pdf) or GridCut (http://www.gridcut.com/), both of which are already implemented in C/C++, and some of which also have Python wrappers. What is the position on this task, i.e. including GraphCut in this library by reusing their C/C++ code and adding wrappers?

Thanks
--
Best regards, Jiri Borovec
------------------------------------------------------------------------
Ing. Jiri Borovec, MSc
PhD student at CMP CTU, http://cmp.felk.cvut.cz/~borovji3

From t3kcit at gmail.com Wed Oct 5 11:08:56 2016
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 5 Oct 2016 11:08:56 -0400
Subject: [scikit-learn] wrapper for GraphCut or GridCut
In-Reply-To: References: Message-ID: <9536dd4e-151a-3a03-a422-1a3a6384b5fc@gmail.com>

Hi Jiri.
I think both are better suited for scikit-image. I think Emanuelle there is actually working on graph cut right now. I'd ask on the scikit-image mailing list what the current status is.

Best,
Andy

On 10/05/2016 09:13 AM, Jiří Borovec wrote:
> Hello,
> I was thinking about adding GraphCut (http://www.csd.uwo.ca/~yuri/Papers/pami01.pdf)
> or GridCut (http://www.gridcut.com/), both of which are already implemented in C/C++,
> and some of which also have Python wrappers. What is the position on this task,
> i.e. including GraphCut in this library by reusing their C/C++ code and adding wrappers?
>
> Thanks
> --
> Best regards, Jiri Borovec
From jiri.borovec at fel.cvut.cz Wed Oct 5 11:19:39 2016
From: jiri.borovec at fel.cvut.cz (=?UTF-8?B?SmnFmcOtIEJvcm92ZWM=?=)
Date: Wed, 5 Oct 2016 17:19:39 +0200
Subject: [scikit-learn] wrapper for GraphCut or GridCut
In-Reply-To: <9536dd4e-151a-3a03-a422-1a3a6384b5fc@gmail.com>
References: <9536dd4e-151a-3a03-a422-1a3a6384b5fc@gmail.com>
Message-ID:

Hello,
for regular graphs, like the image grids that GridCut handles (https://github.com/willemolding/gridcut-python), it would indeed be better to have it in skimage, but for general graphs I would keep it in sklearn. I think that you already have a wrapper for GraphCut (https://github.com/amueller/gco_python), though I found this one (https://github.com/yujiali/pygco) to be better.

--
Best regards, Jiri Borovec
------------------------------------------------------------------------
Ing. Jiri Borovec, MSc
PhD student at CMP CTU, http://cmp.felk.cvut.cz/~borovji3

On 5 October 2016 at 17:08, Andreas Mueller wrote:
> Hi Jiri.
> I think both are better suited for scikit-image. I think Emanuelle there is
> actually working on graph cut right now. I'd ask on the scikit-image mailing
> list what the current status is.
>
> Best,
> Andy

From t3kcit at gmail.com Wed Oct 5 11:19:36 2016
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 5 Oct 2016 11:19:36 -0400
Subject: [scikit-learn] Adding a function that Calculates Weight of Evidence and Information Value
In-Reply-To: References: Message-ID: <458656da-494a-eed0-cf38-347a146e987b@gmail.com>

Hey Urvesh.
That looks interesting. We recently added mutual information based feature selection. To add this to scikit-learn, we would like to see that this is an established method, for example via citations or forks or some other way. If it's only a year old (the date of the blog post) that might be a bit fresh for us, and you can add it to scikit-learn contrib.

We would also like to see that there are cases when it works better than what is already established and what we have, like mutual info based selection.

It looks like WOE is just the coefficient vector of Naive Bayes, right? I don't quite understand the information value at a glance, though.

Andy

On 10/04/2016 05:39 PM, urvesh patel wrote:
> I have been using R extensively until the last few months, when I started using
> Python. I noticed that Python doesn't have a function to compute
> information value and weight of evidence.
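For reference, the definitions behind those two terms: for each bin i of a discretized feature, WoE_i = ln(%events_i / %non-events_i), and IV = sum_i (%events_i - %non-events_i) * WoE_i. A minimal sketch of both (my own illustration of the standard formulas, not Urvesh's version 0; the data is made up):

import numpy as np

y = np.array([1, 0, 1, 1, 0, 0, 1, 0])     # binary target
bins = np.array([0, 0, 1, 1, 1, 2, 2, 2])  # a feature already discretized into bins

iv = 0.0
for b in np.unique(bins):
    pct_event = y[bins == b].sum() / float(y.sum())              # share of events in bin b
    pct_nonevent = (1 - y[bins == b]).sum() / float((1 - y).sum())  # share of non-events in bin b
    woe = np.log(pct_event / pct_nonevent)   # weight of evidence of bin b
    iv += (pct_event - pct_nonevent) * woe   # information value contribution of bin b
    print(b, woe)
print('IV:', iv)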
> Detailed explanation - http://multithreaded.stitchfix.com/blog/2015/08/13/weight-of-evidence/
>
> I have version 0 of this concept ready and I would like to contribute it to
> scikit-learn so that more and more people can use it. What are the steps I
> need to follow in order to do so?
>
> --
> Thanking You,
>
> Urvesh Patel
> Data Ninja
> Udacity

From t3kcit at gmail.com Wed Oct 5 11:25:24 2016
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 5 Oct 2016 11:25:24 -0400
Subject: [scikit-learn] wrapper for GraphCut or GridCut
In-Reply-To: References: <9536dd4e-151a-3a03-a422-1a3a6384b5fc@gmail.com>
Message-ID: <4509db43-1673-5b18-a77d-70e7f04042ea@gmail.com>

On 10/05/2016 11:19 AM, Jiří Borovec wrote:
> Hello,
> for regular graphs, like the image grids that GridCut handles
> (https://github.com/willemolding/gridcut-python), it would indeed be better
> to have it in skimage, but for general graphs I would keep it in sklearn.
I disagree. Why would it be in scikit-learn? It's not a learning algorithm. It doesn't have the same interface at all. It does something pretty unrelated to machine learning. And in vision, you often have other graphs if you work with superpixels.
> I think that you already have a wrapper for GraphCut
> (https://github.com/amueller/gco_python), though I found this one
> (https://github.com/yujiali/pygco) to be better.
Cool. I did the minimal port for what I needed at the time. Since I was mostly interested in learning, I switched to using QPBO.

Andy

From urvesh.patel11 at gmail.com Wed Oct 5 11:41:35 2016
From: urvesh.patel11 at gmail.com (urvesh patel)
Date: Wed, 5 Oct 2016 08:41:35 -0700
Subject: [scikit-learn] Adding a function that Calculates Weight of Evidence and Information Value
In-Reply-To: <458656da-494a-eed0-cf38-347a146e987b@gmail.com>
References: <458656da-494a-eed0-cf38-347a146e987b@gmail.com>
Message-ID:

Hi Andreas,

You are correct about weight of evidence. Information Value is a fancy term, but it is very similar to mutual information.

This method is also used widely with uplift random forests and other incremental modeling problems, where the goal is to find the subset of the population that will contribute to the ROI goal, as opposed to the users who would have purchased anyway and the users on whom a promotion has a negative effect.

Citations for Information Value that I found -
http://www.mwsug.org/proceedings/2013/AA/MWSUG-2013-AA14.pdf
http://documentation.statsoft.com/STATISTICAHelp.aspx?path=WeightofEvidence/WeightofEvidenceWoEIntroductoryOverview

More on Uplift Random Forest or Incremental Modeling -
https://www.linkedin.com/pulse/need-more-lift-try-uplift-models-jeffrey-strickland-ph-d-cmsp

PS - The function I have has a special flag for uplift modeling. If this flag is set, then Information Value and weight of evidence are calculated accordingly.

On Wed, Oct 5, 2016 at 8:19 AM, Andreas Mueller wrote:
> Hey Urvesh.
> That looks interesting. We recently added mutual information based feature
> selection.
> To add this to scikit-learn, we would like to see that this is an
> established method, for example via citations
> or forks or some other way.
> If it's only a year old (the date of the blog post) that might be a bit
> fresh for us, and you can add it to scikit-learn contrib.
>
> We would also like to see that there are cases when it works better than
> what is already established and what we have, like mutual info based selection.
>
> It looks like WOE is just the coefficient vector of Naive Bayes, right?
> I don't quite understand the information value at a glance, though.
>
> Andy
>
> On 10/04/2016 05:39 PM, urvesh patel wrote:
>> I have been using R extensively until the last few months, when I started
>> using Python. I noticed that Python doesn't have a function to compute
>> information value and weight of evidence.

--
Thanking You,

Urvesh Patel
Columbia University
*Masters in Operations Research*

From themismavridis at gmail.com Thu Oct 6 11:45:41 2016
From: themismavridis at gmail.com (Themis Mavridis)
Date: Thu, 6 Oct 2016 17:45:41 +0200
Subject: [scikit-learn] out of core in ARD or Bayesian Ridge Regression
Message-ID:

I would like to perform out-of-core training using Bayesian Ridge Regression or ARD. Is there any plan to implement such functionality?

Thanks,
Themis

From olivier.grisel at ensta.org Fri Oct 7 03:59:04 2016
From: olivier.grisel at ensta.org (Olivier Grisel)
Date: Fri, 7 Oct 2016 09:59:04 +0200
Subject: [scikit-learn] out of core in ARD or Bayesian Ridge Regression
In-Reply-To: References: Message-ID:

I don't think anybody is working on this, but you had better check the open github pull requests.

Best,
--
Olivier

From aakash at klugtek.co.in Fri Oct 7 09:51:44 2016
From: aakash at klugtek.co.in (Aakash Agarwal)
Date: Fri, 7 Oct 2016 19:21:44 +0530
Subject: [scikit-learn] MLP Classifier error in 0.18 version
Message-ID:

Hi Guys,

I have been playing around with the MLP classifier lately. I have about 450 inputs to classify, and each input is a vector of size 50. I am trying to fit the model with 90% of the data as the training set.

Size of training data: (398, 50)
Size of testing data: (45, 50)

MLP instantiation:
gen_class = MLPClassifier(hidden_layer_sizes=(200,), max_iter=3000,
                          learning_rate='adaptive', alpha=0.025, warm_start=True)

Batch size is auto, so it is taking 200 as the batch_size.
But when I am fitting the classifier model, I am getting the following error:

Traceback (most recent call last):
  File "intent_detection_classifier_selection.py", line 452, in <module>
    sk_class.gen_class_fitting(gen_class, corp_lsi_train, train_label)
  File "intent_detection_classifier_selection.py", line 77, in gen_class_fitting
    gen_class.fit(data, label)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 612, in fit
    return self._fit(X, y, incremental=False)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 372, in _fit
    intercept_grads, layer_units, incremental)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 509, in _fit_stochastic
    coef_grads, intercept_grads)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 225, in _backprop
    loss = LOSS_FUNCTIONS[self.loss](y, activations[-1])
  File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/_base.py", line 222, in log_loss
    return -np.sum(y_true * np.log(y_prob)) / y_prob.shape[0]
ValueError: operands could not be broadcast together with shapes (200,128) (200,125)

Thanks,
Aakash

From t3kcit at gmail.com Fri Oct 7 11:47:55 2016
From: t3kcit at gmail.com (Andreas Mueller)
Date: Fri, 7 Oct 2016 11:47:55 -0400
Subject: [scikit-learn] MLP Classifier error in 0.18 version
In-Reply-To: References: Message-ID: <96977a13-28b8-5f18-d716-aa5106fddbfb@gmail.com>

Hi.
Can you provide a self-contained example to reproduce on the issue-tracker?
Maybe you used warm_start=True but changed something about the dataset, like going from 125 classes to 128?

This works:

import numpy as np
from sklearn.neural_network import MLPClassifier

gen_class = MLPClassifier(hidden_layer_sizes=(200,), max_iter=3000,
                          learning_rate='adaptive', alpha=0.025, warm_start=True)
X_train = np.random.uniform(size=(398, 50))
y_train = np.random.uniform(size=398) > .5
gen_class.fit(X_train, y_train)

best,
Andy

On 10/07/2016 09:51 AM, Aakash Agarwal wrote:
> Hi Guys,
>
> I have been playing around with the MLP classifier lately. I have about 450
> inputs to classify, and each input is a vector of size 50. I am trying to fit
> the model with 90% of the data as the training set.
>
> Size of training data: (398, 50)
> Size of testing data: (45, 50)
>
> MLP instantiation:
> gen_class = MLPClassifier(hidden_layer_sizes=(200,), max_iter=3000,
>                           learning_rate='adaptive', alpha=0.025, warm_start=True)
>
> Batch size is auto, so it is taking 200 as the batch_size.
> But when I am fitting the classifier model, I am getting the following error:
>
> Traceback (most recent call last):
>   File "intent_detection_classifier_selection.py", line 452, in <module>
>     sk_class.gen_class_fitting(gen_class, corp_lsi_train, train_label)
>   File "intent_detection_classifier_selection.py", line 77, in gen_class_fitting
>     gen_class.fit(data, label)
>   File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 612, in fit
>     return self._fit(X, y, incremental=False)
>   File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 372, in _fit
>     intercept_grads, layer_units, incremental)
>   File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 509, in _fit_stochastic
>     coef_grads, intercept_grads)
>   File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/multilayer_perceptron.py", line 225, in _backprop
>     loss = LOSS_FUNCTIONS[self.loss](y, activations[-1])
>   File "/usr/local/lib/python2.7/dist-packages/sklearn/neural_network/_base.py", line 222, in log_loss
>     return -np.sum(y_true * np.log(y_prob)) / y_prob.shape[0]
> ValueError: operands could not be broadcast together with shapes (200,128) (200,125)
>
> Thanks,
> Aakash

From t3kcit at gmail.com Fri Oct 7 11:48:53 2016
From: t3kcit at gmail.com (Andreas Mueller)
Date: Fri, 7 Oct 2016 11:48:53 -0400
Subject: [scikit-learn] out of core in ARD or Bayesian Ridge Regression
In-Reply-To: References: Message-ID: <660bc83c-5a0f-48ed-9932-19b35fa4fd17@gmail.com>

I don't think anyone is working on this. I'm not sure what optimizer is best for this. Maybe EP would be interesting.

On 10/07/2016 03:59 AM, Olivier Grisel wrote:
> I don't think anybody is working on this, but you had better check the
> open github pull requests.
>
> Best,

From aakash at klugtek.co.in Fri Oct 7 14:52:31 2016
From: aakash at klugtek.co.in (Aakash Agarwal)
Date: Sat, 8 Oct 2016 00:22:31 +0530
Subject: [scikit-learn] MLP Classifier error in 0.18 version
In-Reply-To: <96977a13-28b8-5f18-d716-aa5106fddbfb@gmail.com>
References: <96977a13-28b8-5f18-d716-aa5106fddbfb@gmail.com>
Message-ID:

Hi Andy,

Thanks for the quick reply. Basically, I am randomly choosing 90% of the training data from the data set and fitting the classifier again and again. The first few runs work fine, but after that it fails. So, as you mentioned, standalone fitting works, but warm_start seems to be the issue. Since I was choosing the data randomly, the set of labels in a single batch was not constant over the iterations, and the model could not handle labels unseen in the previous fit, and thus failed.

Thanks a lot for the valuable inputs.

Aakash

On Fri, Oct 7, 2016 at 9:17 PM, Andreas Mueller wrote:
> Hi.
> Can you provide a self-contained example to reproduce on the issue-tracker?
> Maybe you used warm_start=True but changed something about the dataset,
> like going from 125 classes to 128?
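To make that failure mode concrete, a minimal sketch (hypothetical data; it mimics the class set changing between warm-started fits, which is what the shape mismatch above points to):

import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.random.uniform(size=(100, 50))
clf = MLPClassifier(warm_start=True)
clf.fit(X, np.random.randint(0, 3, size=100))  # first fit sees classes {0, 1, 2}
clf.fit(X, np.random.randint(0, 4, size=100))  # a new class appears -> shape mismatch in 0.18

If incremental fitting is the goal, one option is partial_fit with the full class set declared up front:

clf = MLPClassifier()
clf.partial_fit(X, np.random.randint(0, 3, size=100), classes=np.arange(4))
clf.partial_fit(X, np.random.randint(0, 4, size=100))  # new labels are fine now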
> This works:
>
> import numpy as np
> from sklearn.neural_network import MLPClassifier
>
> gen_class = MLPClassifier(hidden_layer_sizes=(200,), max_iter=3000,
>                           learning_rate='adaptive', alpha=0.025, warm_start=True)
> X_train = np.random.uniform(size=(398, 50))
> y_train = np.random.uniform(size=398) > .5
> gen_class.fit(X_train, y_train)
>
> best,
> Andy
>
> On 10/07/2016 09:51 AM, Aakash Agarwal wrote:
> > Hi Guys,
> >
> > I have been playing around with the MLP classifier lately. I have about 450
> > inputs to classify, and each input is a vector of size 50. I am trying to fit
> > the model with 90% of the data as the training set.
> >
> > Size of training data: (398, 50)
> > Size of testing data: (45, 50)
> >
> > MLP instantiation:
> > gen_class = MLPClassifier(hidden_layer_sizes=(200,), max_iter=3000,
> >                           learning_rate='adaptive', alpha=0.025, warm_start=True)
> >
> > Batch size is auto, so it is taking 200 as the batch_size. But when I am
> > fitting the classifier model, I am getting the following error:
> >
> > ValueError: operands could not be broadcast together with shapes (200,128) (200,125)
> >
> > Thanks,
> > Aakash

--
Thanks,
Aakash

From drraph at gmail.com Mon Oct 10 06:55:52 2016
From: drraph at gmail.com (Raphael C)
Date: Mon, 10 Oct 2016 11:55:52 +0100
Subject: [scikit-learn] Using logistic regression with count proportions data
Message-ID:

I am trying to perform regression where my dependent variable is constrained to be between 0 and 1. This constraint comes from the fact that it represents a count proportion. That is, counts in some category divided by a total count.

In the literature it seems that one common way to tackle this is to use logistic regression. However, it appears that in scikit-learn logistic regression is only available as a classifier (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). Is that right?

Is there another way to perform regression using scikit-learn where the dependent variable is a count proportion?

Thanks for any help.
Raphael

From drraph at gmail.com Mon Oct 10 07:03:28 2016
From: drraph at gmail.com (Raphael C)
Date: Mon, 10 Oct 2016 12:03:28 +0100
Subject: [scikit-learn] Using logistic regression with count proportions data
In-Reply-To: References: Message-ID:

I just noticed this about the glm package in R.
http://stats.stackexchange.com/a/26779/53128

"The glm function in R allows 3 ways to specify the formula for a logistic regression model.

The most common is that each row of the data frame represents a single observation and the response variable is either 0 or 1 (or a factor with 2 levels, or another variable with only 2 unique values).

Another option is to use a 2 column matrix as the response variable with the first column being the counts of 'successes' and the second column being the counts of 'failures'.

You can also specify the response as a proportion between 0 and 1, then specify another column as the 'weight' that gives the total number that the proportion is from (so a response of 0.3 and a weight of 10 is the same as 3 'successes' and 7 'failures')."

Either of the last two options would do for me. Does scikit-learn support either of these last two options?

Raphael

On 10 October 2016 at 11:55, Raphael C wrote:
> I am trying to perform regression where my dependent variable is
> constrained to be between 0 and 1. This constraint comes from the fact
> that it represents a count proportion. That is, counts in some category
> divided by a total count.
>
> In the literature it seems that one common way to tackle this is to
> use logistic regression. However, it appears that in scikit-learn
> logistic regression is only available as a classifier
> (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).
> Is that right?
>
> Is there another way to perform regression using scikit-learn where
> the dependent variable is a count proportion?
>
> Thanks for any help.
>
> Raphael

From sean.violante at gmail.com Mon Oct 10 07:08:28 2016
From: sean.violante at gmail.com (Sean Violante)
Date: Mon, 10 Oct 2016 13:08:28 +0200
Subject: [scikit-learn] Using logistic regression with count proportions data
In-Reply-To: References: Message-ID:

It should be the sample_weight argument of fit:
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

On Mon, Oct 10, 2016 at 1:03 PM, Raphael C wrote:
> I just noticed this about the glm package in R.
> http://stats.stackexchange.com/a/26779/53128
>
> Either of the last two options would do for me. Does scikit-learn
> support either of these last two options?
>
> Raphael

From drraph at gmail.com Mon Oct 10 07:15:17 2016
From: drraph at gmail.com (Raphael C)
Date: Mon, 10 Oct 2016 12:15:17 +0100
Subject: [scikit-learn] Using logistic regression with count proportions data
In-Reply-To: References: Message-ID:

How do I use sample_weight for my use case?

In my case, is "y" an array of 0s and 1s, and sample_weight then an array of real numbers between 0 and 1, where I should make sure to set sample_weight[i] = 0 when y[i] = 0?

Raphael

On 10 October 2016 at 12:08, Sean Violante wrote:
> It should be the sample_weight argument of fit:
> http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

From sean.violante at gmail.com Mon Oct 10 07:22:11 2016
From: sean.violante at gmail.com (Sean Violante)
Date: Mon, 10 Oct 2016 13:22:11 +0200
Subject: [scikit-learn] Using logistic regression with count proportions data
In-Reply-To: References: Message-ID:

No (but please check!) - the sample weights should be the counts for the respective label (0/1).

[I am actually puzzled about the glm help file - proportions lose track of how often an input data 'row' was present relative to the others, though you could recover this by repeating the row 'n' times.]

On Mon, Oct 10, 2016 at 1:15 PM, Raphael C wrote:
> How do I use sample_weight for my use case?
>
> In my case, is "y" an array of 0s and 1s, and sample_weight then an
> array of real numbers between 0 and 1, where I should make sure to set
> sample_weight[i] = 0 when y[i] = 0?
>
> Raphael

From drraph at gmail.com Mon Oct 10 09:48:32 2016
From: drraph at gmail.com (Raphael C)
Date: Mon, 10 Oct 2016 14:48:32 +0100
Subject: [scikit-learn] Using logistic regression with count proportions data
In-Reply-To: References: Message-ID:

On 10 October 2016 at 12:22, Sean Violante wrote:
> No (but please check!) - the sample weights should be the counts for the
> respective label (0/1).
>
> [I am actually puzzled about the glm help file - proportions lose track of
> how often an input data 'row' was present relative to the others, though
> you could recover this by repeating the row 'n' times.]

I think we might be talking at cross purposes. I have a matrix X where each row is a feature vector. I also have an array y where y[i] is a real number between 0 and 1. I would like to build a regression model that predicts the y values given the X rows.

Now, each y[i] value in fact comes from simply counting the number of positively labelled elements in a particular set (set i) and dividing by the number of elements in that set. So I can easily fit this into the model given by the R package glm, by replacing each y[i] value with a pair of "number of positives" and "number of negatives" (this is case 2 in the docs I quoted), or by using case 3, which asks for y[i] plus the total number of elements in set i.

I don't see how a single integer for sample_weight[i] would cover this information, but I am sure I must have misunderstood. At best it seems it could cover the number of positive values, but this is missing half the information.

Raphael
>> >> >> >> You can also specify the response as a proportion between 0 and 1, >> >> then specify another column as the 'weight' that gives the total >> >> number that the proportion is from (so a response of 0.3 and a weight >> >> of 10 is the same as 3 'successes' and 7 'failures')." >> >> >> >> Either of the last two options would do for me. Does scikit-learn >> >> support either of these last two options? >> >> >> >> Raphael >> >> >> >> On 10 October 2016 at 11:55, Raphael C wrote: >> >> > I am trying to perform regression where my dependent variable is >> >> > constrained to be between 0 and 1. This constraint comes from the >> >> > fact >> >> > that it represents a count proportion. That is counts in some >> >> > category >> >> > divided by a total count. >> >> > >> >> > In the literature it seems that one common way to tackle this is to >> >> > use logistic regression. However, it appears that in scikit learn >> >> > logistic regression is only available as a classifier >> >> > >> >> > >> >> > (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html >> >> > ) . Is that right? >> >> > >> >> > Is there another way to perform regression using scikit learn where >> >> > the dependent variable is a count proportion? >> >> > >> >> > Thanks for any help. >> >> > >> >> > Raphael >> >> _______________________________________________ >> >> scikit-learn mailing list >> >> scikit-learn at python.org >> >> https://mail.python.org/mailman/listinfo/scikit-learn >> > >> > >> > >> > _______________________________________________ >> > scikit-learn mailing list >> > scikit-learn at python.org >> > https://mail.python.org/mailman/listinfo/scikit-learn >> > >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > From sean.violante at gmail.com Mon Oct 10 10:04:45 2016 From: sean.violante at gmail.com (Sean Violante) Date: Mon, 10 Oct 2016 16:04:45 +0200 Subject: [scikit-learn] Using logistic regression with count proportions data In-Reply-To: References: Message-ID: sorry yes there was a misunderstanding: I meant for each feature configuration you should pass in two rows (one for the positive cases and one for the negative) and the sample weight being the corresponding count for that configuration and class and I am saying that the total count is important because you could have a situation where one feature combination occurs 10 times and another feature combination 1000 times On Mon, Oct 10, 2016 at 3:48 PM, Raphael C wrote: > On 10 October 2016 at 12:22, Sean Violante > wrote: > > no ( but please check !) > > > > sample weights should be the counts for the respective label (0/1) > > > > [ I am actually puzzled about the glm help file - proportions loses how > > often an input data 'row' was present relative to the other - though you > > could do this by repeating the row 'n' times] > > I think we might be talking at cross purposes. > > I have a matrix X where each row is a feature vector. I also have an > array y where y[i] is a real number between 0 and 1. I would like to > build a regression model that predicts the y values given the X rows. 
> > Now each y[i] value in fact comes from simply counting the number of > positive labelled elements in a particular set (set i) and dividing by > the number of elements in that set. So I can easily fit this into the > model given by the R package glm by replacing each y[i] value by a > pair of "Number of positives" and "Number of negatives" (this is case > 2 in the docs I quoted) or using case 3 which asks for the y[i] plus > the total number of elements in set i. > > I don't see how a single integer for sample_weight[i] would cover this > information but I am sure I must have misunderstood. At best it seems > it could cover the number of positive values but this is missing half > the information. > > Raphael > > > > > On Mon, Oct 10, 2016 at 1:15 PM, Raphael C wrote: > >> > >> How do I use sample_weight for my use case? > >> > >> In my case is "y" an array of 0s and 1s and sample_weight then an > >> array real numbers between 0 and 1 where I should make sure to set > >> sample_weight[i]= 0 when y[i] = 0? > >> > >> Raphael > >> > >> On 10 October 2016 at 12:08, Sean Violante > >> wrote: > >> > should be the sample weight function in fit > >> > > >> > > >> > http://scikit-learn.org/stable/modules/generated/ > sklearn.linear_model.LogisticRegression.html > >> > > >> > On Mon, Oct 10, 2016 at 1:03 PM, Raphael C wrote: > >> >> > >> >> I just noticed this about the glm package in R. > >> >> http://stats.stackexchange.com/a/26779/53128 > >> >> > >> >> " > >> >> The glm function in R allows 3 ways to specify the formula for a > >> >> logistic regression model. > >> >> > >> >> The most common is that each row of the data frame represents a > single > >> >> observation and the response variable is either 0 or 1 (or a factor > >> >> with 2 levels, or other varibale with only 2 unique values). > >> >> > >> >> Another option is to use a 2 column matrix as the response variable > >> >> with the first column being the counts of 'successes' and the second > >> >> column being the counts of 'failures'. > >> >> > >> >> You can also specify the response as a proportion between 0 and 1, > >> >> then specify another column as the 'weight' that gives the total > >> >> number that the proportion is from (so a response of 0.3 and a weight > >> >> of 10 is the same as 3 'successes' and 7 'failures')." > >> >> > >> >> Either of the last two options would do for me. Does scikit-learn > >> >> support either of these last two options? > >> >> > >> >> Raphael > >> >> > >> >> On 10 October 2016 at 11:55, Raphael C wrote: > >> >> > I am trying to perform regression where my dependent variable is > >> >> > constrained to be between 0 and 1. This constraint comes from the > >> >> > fact > >> >> > that it represents a count proportion. That is counts in some > >> >> > category > >> >> > divided by a total count. > >> >> > > >> >> > In the literature it seems that one common way to tackle this is to > >> >> > use logistic regression. However, it appears that in scikit learn > >> >> > logistic regression is only available as a classifier > >> >> > > >> >> > > >> >> > (http://scikit-learn.org/stable/modules/generated/ > sklearn.linear_model.LogisticRegression.html > >> >> > ) . Is that right? > >> >> > > >> >> > Is there another way to perform regression using scikit learn where > >> >> > the dependent variable is a count proportion? > >> >> > > >> >> > Thanks for any help. 
> >> >> > > >> >> > Raphael > >> >> _______________________________________________ > >> >> scikit-learn mailing list > >> >> scikit-learn at python.org > >> >> https://mail.python.org/mailman/listinfo/scikit-learn > >> > > >> > > >> > > >> > _______________________________________________ > >> > scikit-learn mailing list > >> > scikit-learn at python.org > >> > https://mail.python.org/mailman/listinfo/scikit-learn > >> > > >> _______________________________________________ > >> scikit-learn mailing list > >> scikit-learn at python.org > >> https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From michael.eickenberg at gmail.com Mon Oct 10 10:46:17 2016 From: michael.eickenberg at gmail.com (Michael Eickenberg) Date: Mon, 10 Oct 2016 16:46:17 +0200 Subject: [scikit-learn] Using logistic regression with count proportions data In-Reply-To: References: Message-ID: Here is a possibly useful comment of larsmans on stackoverflow about exactly this procedure http://stackoverflow.com/questions/26604175/how-to-predict-a-continuous-dependent-variable-that-expresses-target-class-proba/26614131#comment41846816_26614131 On Mon, Oct 10, 2016 at 4:04 PM, Sean Violante wrote: > sorry yes there was a misunderstanding: > > I meant for each feature configuration you should pass in two rows (one > for the positive cases and one for the negative) > and the sample weight being the corresponding count for that configuration > and class > > and I am saying that the total count is important because you could have > a situation where > one feature combination occurs 10 times and another feature combination > 1000 times > > > > > > On Mon, Oct 10, 2016 at 3:48 PM, Raphael C wrote: > >> On 10 October 2016 at 12:22, Sean Violante >> wrote: >> > no ( but please check !) >> > >> > sample weights should be the counts for the respective label (0/1) >> > >> > [ I am actually puzzled about the glm help file - proportions loses how >> > often an input data 'row' was present relative to the other - though you >> > could do this by repeating the row 'n' times] >> >> I think we might be talking at cross purposes. >> >> I have a matrix X where each row is a feature vector. I also have an >> array y where y[i] is a real number between 0 and 1. I would like to >> build a regression model that predicts the y values given the X rows. >> >> Now each y[i] value in fact comes from simply counting the number of >> positive labelled elements in a particular set (set i) and dividing by >> the number of elements in that set. So I can easily fit this into the >> model given by the R package glm by replacing each y[i] value by a >> pair of "Number of positives" and "Number of negatives" (this is case >> 2 in the docs I quoted) or using case 3 which asks for the y[i] plus >> the total number of elements in set i. >> >> I don't see how a single integer for sample_weight[i] would cover this >> information but I am sure I must have misunderstood. At best it seems >> it could cover the number of positive values but this is missing half >> the information. 
>> >> Raphael >> >> > >> > On Mon, Oct 10, 2016 at 1:15 PM, Raphael C wrote: >> >> >> >> How do I use sample_weight for my use case? >> >> >> >> In my case is "y" an array of 0s and 1s and sample_weight then an >> >> array real numbers between 0 and 1 where I should make sure to set >> >> sample_weight[i]= 0 when y[i] = 0? >> >> >> >> Raphael >> >> >> >> On 10 October 2016 at 12:08, Sean Violante >> >> wrote: >> >> > should be the sample weight function in fit >> >> > >> >> > >> >> > http://scikit-learn.org/stable/modules/generated/sklearn. >> linear_model.LogisticRegression.html >> >> > >> >> > On Mon, Oct 10, 2016 at 1:03 PM, Raphael C wrote: >> >> >> >> >> >> I just noticed this about the glm package in R. >> >> >> http://stats.stackexchange.com/a/26779/53128 >> >> >> >> >> >> " >> >> >> The glm function in R allows 3 ways to specify the formula for a >> >> >> logistic regression model. >> >> >> >> >> >> The most common is that each row of the data frame represents a >> single >> >> >> observation and the response variable is either 0 or 1 (or a factor >> >> >> with 2 levels, or other varibale with only 2 unique values). >> >> >> >> >> >> Another option is to use a 2 column matrix as the response variable >> >> >> with the first column being the counts of 'successes' and the second >> >> >> column being the counts of 'failures'. >> >> >> >> >> >> You can also specify the response as a proportion between 0 and 1, >> >> >> then specify another column as the 'weight' that gives the total >> >> >> number that the proportion is from (so a response of 0.3 and a >> weight >> >> >> of 10 is the same as 3 'successes' and 7 'failures')." >> >> >> >> >> >> Either of the last two options would do for me. Does scikit-learn >> >> >> support either of these last two options? >> >> >> >> >> >> Raphael >> >> >> >> >> >> On 10 October 2016 at 11:55, Raphael C wrote: >> >> >> > I am trying to perform regression where my dependent variable is >> >> >> > constrained to be between 0 and 1. This constraint comes from the >> >> >> > fact >> >> >> > that it represents a count proportion. That is counts in some >> >> >> > category >> >> >> > divided by a total count. >> >> >> > >> >> >> > In the literature it seems that one common way to tackle this is >> to >> >> >> > use logistic regression. However, it appears that in scikit learn >> >> >> > logistic regression is only available as a classifier >> >> >> > >> >> >> > >> >> >> > (http://scikit-learn.org/stable/modules/generated/sklearn. >> linear_model.LogisticRegression.html >> >> >> > ) . Is that right? >> >> >> > >> >> >> > Is there another way to perform regression using scikit learn >> where >> >> >> > the dependent variable is a count proportion? >> >> >> > >> >> >> > Thanks for any help. 
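To make the bookkeeping concrete, here is a minimal sketch of the two-weighted-rows encoding Sean describes above (the data are made up for illustration; the only API assumption is that LogisticRegression.fit accepts sample_weight, as discussed earlier in the thread). Each set i with proportion y[i] out of n[i] elements becomes one row labelled 1 with weight y[i] * n[i] and one row labelled 0 with weight (1 - y[i]) * n[i], which preserves exactly the total-count information that a bare proportion loses (a response of 0.3 with a total of 10 becomes 3 'successes' and 7 'failures', matching the glm docs quoted above):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: 3 sets, 4 features each, with observed
# proportions y and set sizes n.
X = np.array([[0.1, 2.0, 1.5, 0.3],
              [1.2, 0.4, 2.2, 0.8],
              [0.5, 1.1, 0.9, 1.7]])
y = np.array([0.3, 0.7, 0.5])   # proportion of 'successes' per set
n = np.array([10, 1000, 50])    # total count per set

# Duplicate each feature row: once as a 'success' (label 1) and once
# as a 'failure' (label 0), weighted by the corresponding counts.
X2 = np.vstack([X, X])
y2 = np.concatenate([np.ones(len(X)), np.zeros(len(X))])
w = np.concatenate([y * n, (1 - y) * n])

clf = LogisticRegression()
clf.fit(X2, y2, sample_weight=w)
print(clf.predict_proba(X)[:, 1])  # fitted proportions per set

This is equivalent to repeating each row the corresponding number of times, as Sean notes, but without materializing the repeated rows.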
>> >> >> > >> >> >> > Raphael
>> >> >> _______________________________________________
>> >> >> scikit-learn mailing list
>> >> >> scikit-learn at python.org
>> >> >> https://mail.python.org/mailman/listinfo/scikit-learn
>> >> >
>> >> >
>> >> >
>> >> > _______________________________________________
>> >> > scikit-learn mailing list
>> >> > scikit-learn at python.org
>> >> > https://mail.python.org/mailman/listinfo/scikit-learn
>> >> >
>> >> _______________________________________________
>> >> scikit-learn mailing list
>> >> scikit-learn at python.org
>> >> https://mail.python.org/mailman/listinfo/scikit-learn
>> >
>> >
>> >
>> > _______________________________________________
>> > scikit-learn mailing list
>> > scikit-learn at python.org
>> > https://mail.python.org/mailman/listinfo/scikit-learn
>> >
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From siddharthgupta234 at gmail.com Tue Oct 11 01:18:57 2016
From: siddharthgupta234 at gmail.com (Siddharth Gupta)
Date: Tue, 11 Oct 2016 10:48:57 +0530
Subject: [scikit-learn] Doubt regarding issue timeline
Message-ID:

Hello fellas,
I have a doubt. Suppose I volunteer to work on an issue but, due to some unavoidable scenario, fail to work on it for some time: when should I let the community know? I guess it depends on the issue/bug, but on average how much time should one take to resolve an issue?

Regards Siddharth Gupta,
Ph: 9871012292
Linkedin | Github | Codechef | Twitter | Facebook
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jaquesgrobler at gmail.com Tue Oct 11 01:49:30 2016
From: jaquesgrobler at gmail.com (Jaques Grobler)
Date: Tue, 11 Oct 2016 07:49:30 +0200
Subject: [scikit-learn] Doubt regarding issue timeline
In-Reply-To: References: Message-ID:

I'd say a 'standup'-ish approach could work with this - every day or three, if you find yourself getting pulled off the issue by other work, life, etc., perhaps take a moment at a set time to post on the progress/blocking factors if needed - even if it's just 'can't work on this today'. Yes, this could potentially get spammy, but it gives nice transparency, and if it's urgent to finish the issue soon, like before a release, the community can know whether or not it needs to be handed over - or if you believe you'll still have time.
This doesn't have to be a rule - more of a guideline - but the community will always have a fairly recent status update, even if the person can't touch the issue for weeks.

Just my thoughts on it :)

On Tuesday, 11 October 2016, Siddharth Gupta wrote:
> Hello fellas,
> I have a doubt. Suppose I volunteer to work on an issue but, due to some
> unavoidable scenario, fail to work on it for some time: when should I let
> the community know? I guess it depends on the issue/bug, but on average
> how much time should one take to resolve an issue?
>
> Regards Siddharth Gupta,
> Ph: 9871012292
> Linkedin | Github | Codechef | Twitter | Facebook
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From j.vanschoren at tue.nl Tue Oct 11 05:11:00 2016 From: j.vanschoren at tue.nl (Joaquin Vanschoren) Date: Tue, 11 Oct 2016 09:11:00 +0000 Subject: [scikit-learn] Welcome Raghav to the core-dev team In-Reply-To: References: <20161003151415.GF20745@phare.normalesup.org> <3d961045-39d0-8c81-1deb-2f6b7332ff1e@gmail.com> Message-ID: A bit late, but heartfelt congrats to Raghav :) On Tue, Oct 4, 2016 at 12:43 PM Joel Nothman wrote: > Congratulations, Raghav! Thanks for your dedication, as a student and > mentor in GSoC, but at all other times too! > > On 4 October 2016 at 19:14, Jaques Grobler > wrote: > > Congrats Raghav! > > 2016-10-03 21:25 GMT+02:00 Andreas Mueller : > > Congrats, hope to see lot's more ;) > > > On 10/03/2016 12:09 PM, Raghav R V wrote: > > Thanks everyone! Looking forward to contributing more :D > > On Mon, Oct 3, 2016 at 5:40 PM, Ronnie Ghose > wrote: > > congrats! :) > > On Mon, Oct 3, 2016 at 11:28 AM, lin yenchen > wrote: > > Congrats, Raghav! > > Nelson Liu ? 2016?10?3? ?? ??11:27??? > > Yay! Congrats, Raghav! > > On Mon, Oct 3, 2016 at 8:14 AM, Gael Varoquaux < > gael.varoquaux at normalesup.org> wrote: > > Hi, > > We have the pleasure to welcome Raghav RV to the core-dev team. Raghav > (@raghavrv) has been working on scikit-learn for more than a year. In > particular, he implemented the rewrite of the cross-validation utilities, > which is quite dear to my heart. > > Welcome Raghav! > > Ga?l > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From gabit7 at gmail.com Tue Oct 11 07:29:20 2016 From: gabit7 at gmail.com (Gabriel Trautmann) Date: Tue, 11 Oct 2016 14:29:20 +0300 Subject: [scikit-learn] HashingVectorizer slow in version 0.18 Message-ID: Hi, After upgrading to scikit-learn 0.18 HashingVectorizer is about 10 times slower. Before: scikit-learn 0.17. Numpy 1.11.2. Python 3.5.2 AMD64 Vectorizing 20newsgroup 11314 documents Vectorization completed in 4.594092130661011 seconds, resulting shape (11314, 1048576) After upgrade: scikit-learn 0.18. Numpy 1.11.2. 
Python 3.5.2 AMD64
Vectorizing 20newsgroup 11314 documents
Vectorization completed in 43.587692737579346 seconds, resulting shape (11314, 1048576)

Code:

import time, sklearn, platform, numpy
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import HashingVectorizer

data_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)
print('scikit-learn {}. Numpy {}. Python {} {}'.format(sklearn.__version__,
      numpy.version.full_version, platform.python_version(), platform.machine()))

vectorizer = HashingVectorizer()
print("Vectorizing 20newsgroup {} documents".format(len(data_train.data)))
start = time.time()
data = vectorizer.fit_transform(data_train.data)
print("Vectorization completed in ", time.time() - start,
      ' seconds, resulting shape ', data.shape)

Should I submit a bug report?

Thank you,
Gabriel Trautmann
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From olivier.grisel at ensta.org Tue Oct 11 08:02:47 2016
From: olivier.grisel at ensta.org (Olivier Grisel)
Date: Tue, 11 Oct 2016 14:02:47 +0200
Subject: [scikit-learn] HashingVectorizer slow in version 0.18
In-Reply-To: References: Message-ID:

I cannot reproduce such a degradation on my machine:

(sklearn-0.17)ogrisel at is146148:~/code/scikit-learn$ python ~/tmp/bench_vectorizer.py
scikit-learn 0.17.1. Numpy 1.11.2. Python 3.5.0 x86_64
Vectorizing 20newsgroup 11314 documents
Vectorization completed in 4.033604383468628 seconds, resulting shape (11314, 1048576)

(sklearn-0.18) ogrisel at is146148:~/code/scikit-learn$ python ~/tmp/bench_vectorizer.py
scikit-learn 0.18. Numpy 1.11.2. Python 3.5.0 x86_64
Vectorizing 20newsgroup 11314 documents
Vectorization completed in 4.990509510040283 seconds, resulting shape (11314, 1048576)

Which operating system are you using?

Please feel free to open an issue on the tracker anyway.

--
Olivier

From gabit7 at gmail.com Tue Oct 11 08:19:24 2016
From: gabit7 at gmail.com (Gabriel Trautmann)
Date: Tue, 11 Oct 2016 15:19:24 +0300
Subject: [scikit-learn] HashingVectorizer slow in version 0.18
In-Reply-To: References: Message-ID:

Thank you for your response. I have Windows 7 Enterprise 64 bit / Intel Xeon E5 2640 CPU; same problem on two similar machines.

python-3.5.2-amd64.exe - fresh installation
numpy-1.11.2+mkl-cp35-cp35m-win_amd64.whl - from Christoph Gohlke
scipy-0.18.1-cp35-cp35m-win_amd64.whl
pip install scikit-learn

On the same Python instance, if I downgrade to version 0.17 it is much faster:

pip uninstall scikit-learn
pip install scikit-learn==0.17

I will open an issue after I test on more machines or if someone else can reproduce the problem.

On Tue, Oct 11, 2016 at 3:02 PM, Olivier Grisel wrote:
> I cannot reproduce such a degradation on my machine:
>
> (sklearn-0.17)ogrisel at is146148:~/code/scikit-learn$ python
> ~/tmp/bench_vectorizer.py
> scikit-learn 0.17.1. Numpy 1.11.2. Python 3.5.0 x86_64
> Vectorizing 20newsgroup 11314 documents
> Vectorization completed in 4.033604383468628 seconds, resulting
> shape (11314, 1048576)
>
> (sklearn-0.18) ogrisel at is146148:~/code/scikit-learn$ python
> ~/tmp/bench_vectorizer.py
> scikit-learn 0.18. Numpy 1.11.2. Python 3.5.0 x86_64
> Vectorizing 20newsgroup 11314 documents
> Vectorization completed in 4.990509510040283 seconds, resulting
> shape (11314, 1048576)
>
> Which operating system are you using?
>
> Please feel free to open an issue on the tracker anyway.
> > -- > Olivier > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From piotr.bialecki at hotmail.de Tue Oct 11 08:32:54 2016 From: piotr.bialecki at hotmail.de (Piotr Bialecki) Date: Tue, 11 Oct 2016 12:32:54 +0000 Subject: [scikit-learn] ANN Scikit-learn 0.18 released In-Reply-To: References: <40a62931-bf5a-7c59-7253-69418d53f196@gmail.com> <20160929052856.GA1123098@phare.normalesup.org> <77756fd4-5ad9-e51d-51e1-aa60274d2117@gmail.com> Message-ID: Congratulations to all contributors! I would like to update to the new version using conda, but apparently it is not available: ~$ conda update scikit-learn Fetching package metadata ....... Solving package specifications: .......... # All requested packages already installed. # packages in environment at /home/pbialecki/anaconda2: # scikit-learn 0.17.1 np110py27_2 Should I reinstall scikit? Best regards, Piotr On 03.10.2016 18:23, Raghav R V wrote: Hi Brown, Thanks for the email. There is a working PR here at https://github.com/scikit-learn/scikit-learn/pull/7388 Would you be kind to take a look at it and comment how helpful the proposed API is for your use case? Thanks On Mon, Oct 3, 2016 at 6:05 AM, Brown J.B. > wrote: Hello community, Congratulations on the release of 0.19 ! While I'm merely a casual user and wish I could contribute more often, I thank everyone for their time and efforts! 2016-10-01 1:58 GMT+09:00 Andreas Mueller <t3kcit at gmail.com>: We've got a lot in the works already for 0.19. * multiple metrics for cross validation (#7388 et al.) I've done something like this in my internal model building and selection libraries. My solution has been to have -each metric object be able to explain a "distance from optimal" -a metric collection object, which can be built by either explicit instantiation or calculation using data -a pareto curve calculation object -a ranker for the points on the pareto curve, with the ability to select the N-best points. While there are certainly smarter interfaces and implementations, here is an example of one of my doctests that may help get this PR started. My apologies that my old docstring argument notation doesn't match the commonly used standards. Hope this helps, J.B. Brown Kyoto University 26 class TrialRanker(object): 27 """An object for handling the generic mechanism of selecting optimal 28 trials from a colletion of trials.""" 43 def SelectBest(self, metricSets, paretoAlg, 44 preProcessor=None): 45 """Select the best [metricSets] by using the 46 [paretoAlg] pareto selection object. Note that it is actually 47 the [paretoAlg] that specifies how many optimal [metricSets] to 48 select. 49 50 Data may be pre-processed into a form necessary for the [paretoAlg] 51 by using the [preProcessor] that is a MetricSetConverter. 52 53 Return: an EvaluatedMetricSet if [paretoAlg] selects only one 54 metric set, otherwise a list of EvaluatedMetricSet objects. 55 56 >>> from pareto.paretoDecorators import MinNormSelector 57 >>> from pareto import OriginBasePareto 58 >>> pAlg = MinNormSelector(OriginBasePareto()) 59 60 >>> from metrics.TwoClassMetrics import Accuracy, Sensitivity 61 >>> from metrics.metricSet import EvaluatedMetricSet 62 >>> met1 = EvaluatedMetricSet.BuildByExplicitValue( 63 ... 
[(Accuracy, 0.7), (Sensitivity, 0.9)]) 64 >>> met1.SetTitle("Example1") 65 >>> met1.associatedData = range(5) # property set/get 66 >>> met2 = EvaluatedMetricSet.BuildByExplicitValue( 67 ... [(Accuracy, 0.8), (Sensitivity, 0.6)]) 68 >>> met2.SetTitle("Example2") 69 >>> met2.SetAssociatedData("abcdef") # explicit method call 70 >>> met3 = EvaluatedMetricSet.BuildByExplicitValue( 71 ... [(Accuracy, 0.5), (Sensitivity, 0.5)]) 72 >>> met3.SetTitle("Example3") 73 >>> met3.associatedData = float 74 75 >>> from metrics.metricSet.converters import OptDistConverter 76 77 >>> ranker = TrialRanker() # pAlg selects met1 78 >>> best = ranker.SelectBest((met1,met2,met3), 79 ... pAlg, OptDistConverter()) 80 >>> best.VerboseDescription(True) 81 >>> str(best) 82 'Example1: 2 metrics; Accuracy=0.700; Sensitivity=0.900' 83 >>> best.associatedData 84 [0, 1, 2, 3, 4] 85 86 >>> pAlg = MinNormSelector(OriginBasePareto(), nSelect=2) 87 >>> best = ranker.SelectBest((met1,met2,met3), 88 ... pAlg, OptDistConverter()) 89 >>> for metSet in best: 90 ... metSet.VerboseDescription(True) 91 ... str(metSet) 92 ... str(metSet.associatedData) 93 'Example1: 2 metrics; Accuracy=0.700; Sensitivity=0.900' 94 '[0, 1, 2, 3, 4]' 95 'Example2: 2 metrics; Accuracy=0.800; Sensitivity=0.600' 96 'abcdef' 97 98 >>> from metrics.TwoClassMetrics import PositivePredictiveValue 99 >>> met4 = EvaluatedMetricSet.BuildByExplicitValue( 100 ... [(Accuracy, 0.7), (PositivePredictiveValue, 0.5)]) 101 >>> best = ranker.SelectBest((met1,met2,met3,met4), 102 ... pAlg, OptDistConverter()) 103 Traceback (most recent call last): 104 ... 105 ValueError: Metric sets contain differing Metrics. _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From maciek at wojcikowski.pl Tue Oct 11 08:39:07 2016 From: maciek at wojcikowski.pl (=?UTF-8?Q?Maciek_W=C3=B3jcikowski?=) Date: Tue, 11 Oct 2016 14:39:07 +0200 Subject: [scikit-learn] ANN Scikit-learn 0.18 released In-Reply-To: References: <40a62931-bf5a-7c59-7253-69418d53f196@gmail.com> <20160929052856.GA1123098@phare.normalesup.org> <77756fd4-5ad9-e51d-51e1-aa60274d2117@gmail.com> Message-ID: Hi Piotr, I've been there - most probably some package is blocking you to update via numpy dependency. Try to update numpy first and the conflicting package should pop up: "conda update numpy=1.11" ---- Pozdrawiam, | Best regards, Maciek W?jcikowski maciek at wojcikowski.pl 2016-10-11 14:32 GMT+02:00 Piotr Bialecki : > Congratulations to all contributors! > > I would like to update to the new version using conda, but apparently it > is not available: > > ~$ conda update scikit-learn > Fetching package metadata ....... > Solving package specifications: .......... > > # All requested packages already installed. > # packages in environment at /home/pbialecki/anaconda2: > # > scikit-learn 0.17.1 np110py27_2 > > Should I reinstall scikit? > > > Best regards, > Piotr > > > > On 03.10.2016 18:23, Raghav R V wrote: > > Hi Brown, > > Thanks for the email. There is a working PR here at > > https://github.com/scikit-learn/scikit-learn/pull/7388 > > Would you be kind to take a look at it and comment how helpful the > proposed API is for your use case? 
> > Thanks > > > On Mon, Oct 3, 2016 at 6:05 AM, Brown J.B. > wrote: > >> Hello community, >> >> Congratulations on the release of 0.19 ! >> While I'm merely a casual user and wish I could contribute more often, I >> thank everyone for their time and efforts! >> >> 2016-10-01 1:58 GMT+09:00 Andreas Mueller < >> t3kcit at gmail.com>: >> >> We've got a lot in the works already for 0.19. >>>> >>>> * multiple metrics for cross validation (#7388 et al.) >>>> >>> >> I've done something like this in my internal model building and selection >> libraries. >> My solution has been to have >> -each metric object be able to explain a "distance from optimal" >> -a metric collection object, which can be built by either explicit >> instantiation or calculation using data >> -a pareto curve calculation object >> -a ranker for the points on the pareto curve, with the ability to >> select the N-best points. >> >> While there are certainly smarter interfaces and implementations, here is >> an example of one of my doctests that may help get this PR started. >> My apologies that my old docstring argument notation doesn't match the >> commonly used standards. >> >> Hope this helps, >> J.B. Brown >> Kyoto University >> >> 26 class TrialRanker(object): >> >> 27 """An object for handling the generic mechanism of selecting >> optimal >> 28 trials from a colletion of trials.""" >> >> 43 def SelectBest(self, metricSets, paretoAlg, >> >> 44 preProcessor=None): >> >> 45 """Select the best [metricSets] by using >> the >> 46 [paretoAlg] pareto selection object. Note that it is >> actually >> 47 the [paretoAlg] that specifies how many optimal [metricSets] >> to >> 48 select. >> >> 49 >> >> 50 Data may be pre-processed into a form necessary for the >> [paretoAlg] >> 51 by using the [preProcessor] that is a >> MetricSetConverter. >> 52 >> >> 53 Return: an EvaluatedMetricSet if [paretoAlg] selects only >> one >> 54 metric set, otherwise a list of EvaluatedMetricSet >> objects. >> 55 >> >> 56 >>> from pareto.paretoDecorators import >> MinNormSelector >> 57 >>> from pareto import OriginBasePareto >> >> 58 >>> pAlg = MinNormSelector(OriginBasePare >> to()) >> 59 >> >> 60 >>> from metrics.TwoClassMetrics import Accuracy, >> Sensitivity >> 61 >>> from metrics.metricSet import >> EvaluatedMetricSet >> 62 >>> met1 = EvaluatedMetricSet.BuildByExpl >> icitValue( >> 63 ... [(Accuracy, 0.7), (Sensitivity, >> 0.9)]) >> 64 >>> met1.SetTitle("Example1") >> >> 65 >>> met1.associatedData = range(5) # property >> set/get >> 66 >>> met2 = EvaluatedMetricSet.BuildByExpl >> icitValue( >> 67 ... [(Accuracy, 0.8), (Sensitivity, >> 0.6)]) >> 68 >>> met2.SetTitle("Example2") >> >> 69 >>> met2.SetAssociatedData("abcdef") # explicit method >> call >> 70 >>> met3 = EvaluatedMetricSet.BuildByExpl >> icitValue( >> 71 ... [(Accuracy, 0.5), (Sensitivity, >> 0.5)]) >> 72 >>> met3.SetTitle("Example3") >> >> 73 >>> met3.associatedData = float >> >> 74 >> >> 75 >>> from metrics.metricSet.converters import >> OptDistConverter >> 76 >> >> 77 >>> ranker = TrialRanker() # pAlg selects >> met1 >> 78 >>> best = ranker.SelectBest((met1,met2,m >> et3), >> 79 ... pAlg, >> OptDistConverter()) >> 80 >>> best.VerboseDescription(True) >> >> 81 >>> str(best) >> >> 82 'Example1: 2 metrics; Accuracy=0.700; >> Sensitivity=0.900' >> 83 >>> best.associatedData >> >> 84 [0, 1, 2, 3, 4] >> >> 85 >> >> 86 >>> pAlg = MinNormSelector(OriginBasePareto(), >> nSelect=2) >> 87 >>> best = ranker.SelectBest((met1,met2,m >> et3), >> 88 ... 
pAlg, >> OptDistConverter()) >> 89 >>> for metSet in best: >> >> 90 ... metSet.VerboseDescription(True >> ) >> 91 ... str(metSet) >> >> 92 ... str(metSet.associatedData) >> >> 93 'Example1: 2 metrics; Accuracy=0.700; >> Sensitivity=0.900' >> 94 '[0, 1, 2, 3, 4]' >> >> 95 'Example2: 2 metrics; Accuracy=0.800; >> Sensitivity=0.600' >> 96 'abcdef' >> >> 97 >> >> 98 >>> from metrics.TwoClassMetrics import >> PositivePredictiveValue >> 99 >>> met4 = EvaluatedMetricSet.BuildByExpl >> icitValue( >> 100 ... [(Accuracy, 0.7), (PositivePredictiveValue, >> 0.5)]) >> 101 >>> best = ranker.SelectBest((met1,met2,m >> et3,met4), >> 102 ... pAlg, >> OptDistConverter()) >> 103 Traceback (most recent call last): >> >> 104 ... >> >> 105 ValueError: Metric sets contain differing >> Metrics. >> >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From piotr.bialecki at hotmail.de Tue Oct 11 08:47:28 2016 From: piotr.bialecki at hotmail.de (Piotr Bialecki) Date: Tue, 11 Oct 2016 12:47:28 +0000 Subject: [scikit-learn] ANN Scikit-learn 0.18 released In-Reply-To: References: <40a62931-bf5a-7c59-7253-69418d53f196@gmail.com> <20160929052856.GA1123098@phare.normalesup.org> <77756fd4-5ad9-e51d-51e1-aa60274d2117@gmail.com> Message-ID: Hi Maciek, thank you very much! Numpy and opencv were indeed the conflicted packages. Apperently my version of opencv was using numpy 1.10, so I uninstalled opencv, updated numpy and updated scikit to 0.18. Thank's for the fast help! Best regards, Piotr On 11.10.2016 14:39, Maciek W?jcikowski wrote: Hi Piotr, I've been there - most probably some package is blocking you to update via numpy dependency. Try to update numpy first and the conflicting package should pop up: "conda update numpy=1.11" ---- Pozdrawiam, | Best regards, Maciek W?jcikowski maciek at wojcikowski.pl 2016-10-11 14:32 GMT+02:00 Piotr Bialecki >: Congratulations to all contributors! I would like to update to the new version using conda, but apparently it is not available: ~$ conda update scikit-learn Fetching package metadata ....... Solving package specifications: .......... # All requested packages already installed. # packages in environment at /home/pbialecki/anaconda2: # scikit-learn 0.17.1 np110py27_2 Should I reinstall scikit? Best regards, Piotr On 03.10.2016 18:23, Raghav R V wrote: Hi Brown, Thanks for the email. There is a working PR here at https://github.com/scikit-learn/scikit-learn/pull/7388 Would you be kind to take a look at it and comment how helpful the proposed API is for your use case? Thanks On Mon, Oct 3, 2016 at 6:05 AM, Brown J.B. > wrote: Hello community, Congratulations on the release of 0.19 ! While I'm merely a casual user and wish I could contribute more often, I thank everyone for their time and efforts! 2016-10-01 1:58 GMT+09:00 Andreas Mueller >: We've got a lot in the works already for 0.19. * multiple metrics for cross validation (#7388 et al.) I've done something like this in my internal model building and selection libraries. 
My solution has been to have -each metric object be able to explain a "distance from optimal" -a metric collection object, which can be built by either explicit instantiation or calculation using data -a pareto curve calculation object -a ranker for the points on the pareto curve, with the ability to select the N-best points. While there are certainly smarter interfaces and implementations, here is an example of one of my doctests that may help get this PR started. My apologies that my old docstring argument notation doesn't match the commonly used standards. Hope this helps, J.B. Brown Kyoto University 26 class TrialRanker(object): 27 """An object for handling the generic mechanism of selecting optimal 28 trials from a colletion of trials.""" 43 def SelectBest(self, metricSets, paretoAlg, 44 preProcessor=None): 45 """Select the best [metricSets] by using the 46 [paretoAlg] pareto selection object. Note that it is actually 47 the [paretoAlg] that specifies how many optimal [metricSets] to 48 select. 49 50 Data may be pre-processed into a form necessary for the [paretoAlg] 51 by using the [preProcessor] that is a MetricSetConverter. 52 53 Return: an EvaluatedMetricSet if [paretoAlg] selects only one 54 metric set, otherwise a list of EvaluatedMetricSet objects. 55 56 >>> from pareto.paretoDecorators import MinNormSelector 57 >>> from pareto import OriginBasePareto 58 >>> pAlg = MinNormSelector(OriginBasePareto()) 59 60 >>> from metrics.TwoClassMetrics import Accuracy, Sensitivity 61 >>> from metrics.metricSet import EvaluatedMetricSet 62 >>> met1 = EvaluatedMetricSet.BuildByExplicitValue( 63 ... [(Accuracy, 0.7), (Sensitivity, 0.9)]) 64 >>> met1.SetTitle("Example1") 65 >>> met1.associatedData = range(5) # property set/get 66 >>> met2 = EvaluatedMetricSet.BuildByExplicitValue( 67 ... [(Accuracy, 0.8), (Sensitivity, 0.6)]) 68 >>> met2.SetTitle("Example2") 69 >>> met2.SetAssociatedData("abcdef") # explicit method call 70 >>> met3 = EvaluatedMetricSet.BuildByExplicitValue( 71 ... [(Accuracy, 0.5), (Sensitivity, 0.5)]) 72 >>> met3.SetTitle("Example3") 73 >>> met3.associatedData = float 74 75 >>> from metrics.metricSet.converters import OptDistConverter 76 77 >>> ranker = TrialRanker() # pAlg selects met1 78 >>> best = ranker.SelectBest((met1,met2,met3), 79 ... pAlg, OptDistConverter()) 80 >>> best.VerboseDescription(True) 81 >>> str(best) 82 'Example1: 2 metrics; Accuracy=0.700; Sensitivity=0.900' 83 >>> best.associatedData 84 [0, 1, 2, 3, 4] 85 86 >>> pAlg = MinNormSelector(OriginBasePareto(), nSelect=2) 87 >>> best = ranker.SelectBest((met1,met2,met3), 88 ... pAlg, OptDistConverter()) 89 >>> for metSet in best: 90 ... metSet.VerboseDescription(True) 91 ... str(metSet) 92 ... str(metSet.associatedData) 93 'Example1: 2 metrics; Accuracy=0.700; Sensitivity=0.900' 94 '[0, 1, 2, 3, 4]' 95 'Example2: 2 metrics; Accuracy=0.800; Sensitivity=0.600' 96 'abcdef' 97 98 >>> from metrics.TwoClassMetrics import PositivePredictiveValue 99 >>> met4 = EvaluatedMetricSet.BuildByExplicitValue( 100 ... [(Accuracy, 0.7), (PositivePredictiveValue, 0.5)]) 101 >>> best = ranker.SelectBest((met1,met2,met3,met4), 102 ... pAlg, OptDistConverter()) 103 Traceback (most recent call last): 104 ... 105 ValueError: Metric sets contain differing Metrics. 
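As a stop-gap until the multiple-metric support of PR #7388 lands, several metrics can already be computed from one set of out-of-fold predictions; a minimal sketch against the 0.18 model_selection API (the dataset, estimator and metrics are arbitrary placeholders):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, recall_score

iris = load_iris()
X, y = iris.data, iris.target

# Fit and predict once per CV fold, then evaluate as many metrics as
# needed on the combined out-of-fold predictions.
y_pred = cross_val_predict(LogisticRegression(), X, y, cv=5)
print('accuracy   :', accuracy_score(y, y_pred))
print('sensitivity:', recall_score(y, y_pred, average='macro'))

Note that this pools the predictions from all folds and scores them once, which is not necessarily identical to averaging per-fold scores.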
_______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From olivier.grisel at ensta.org Tue Oct 11 09:44:08 2016 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Tue, 11 Oct 2016 15:44:08 +0200 Subject: [scikit-learn] HashingVectorizer slow in version 0.18 In-Reply-To: References: Message-ID: That's really weird. I don't have a windows machine handy at the moment. It would be nice if someone else could confirm. Could you please run the Python profiler on this to see where the time is spent on the slow setup? -- Olivier From piotr.bialecki at hotmail.de Tue Oct 11 10:03:29 2016 From: piotr.bialecki at hotmail.de (Piotr Bialecki) Date: Tue, 11 Oct 2016 14:03:29 +0000 Subject: [scikit-learn] HashingVectorizer slow in version 0.18 In-Reply-To: References: Message-ID: I just tested it on my Ubuntu machine and could not see any performance issues (5.68 seconds in scikit-learn 0.17 vs. 6.67 seconds in scikit-learn 0.18) However, on another Windows 10 machine I could indeed see this issue: scikit-learn 0.17.1. Numpy 1.11.1. Python 2.7.12 AMD64 Vectorizing 20newsgroup 11314 documents ('Vectorization completed in ', 5.608999967575073, ' seconds, resulting shape ', (11314, 1048576)) scikit-learn 0.18. Numpy 1.11.1. Python 2.7.12 AMD64 Vectorizing 20newsgroup 11314 documents ('Vectorization completed in ', 27.924000024795532, ' seconds, resulting shape ', (11314, 1048576)) On 11.10.2016 15:44, Olivier Grisel wrote: > That's really weird. I don't have a windows machine handy at the > moment. It would be nice if someone else could confirm. > > Could you please run the Python profiler on this to see where the time > is spent on the slow setup? > From gael.varoquaux at normalesup.org Tue Oct 11 09:49:17 2016 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Tue, 11 Oct 2016 15:49:17 +0200 Subject: [scikit-learn] HashingVectorizer slow in version 0.18 In-Reply-To: References: Message-ID: <20161011134917.GI4179541@phare.normalesup.org> Could it be a case of compilation: it seems to me that we are compiling MKL vs non MKL builds. From mathieu at mblondel.org Tue Oct 11 11:13:30 2016 From: mathieu at mblondel.org (Mathieu Blondel) Date: Wed, 12 Oct 2016 00:13:30 +0900 Subject: [scikit-learn] HashingVectorizer slow in version 0.18 In-Reply-To: <20161011134917.GI4179541@phare.normalesup.org> References: <20161011134917.GI4179541@phare.normalesup.org> Message-ID: On Tue, Oct 11, 2016 at 10:49 PM, Gael Varoquaux < gael.varoquaux at normalesup.org> wrote: > Could it be a case of compilation: it seems to me that we are compiling > MKL vs non MKL builds. > The hashing vectorizer is written in Cython and doesn't use BLAS, though. Mathieu -------------- next part -------------- An HTML attachment was scrubbed... 
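A minimal sketch of the profiling Olivier asks for, using the standard library's cProfile on the same 20newsgroups vectorization (the 'cumulative' sort key and the 20-line cutoff are arbitrary choices):

import cProfile
import pstats

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import HashingVectorizer

data = fetch_20newsgroups(subset='train', shuffle=True,
                          random_state=42).data
vectorizer = HashingVectorizer()

# Profile only the vectorization step and print the hottest calls.
profiler = cProfile.Profile()
profiler.enable()
vectorizer.fit_transform(data)
profiler.disable()
pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)

Comparing the top entries of this report under 0.17 and 0.18 should show which call accounts for the slowdown on the affected Windows machines.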
URL:

From t3kcit at gmail.com Tue Oct 11 14:56:02 2016
From: t3kcit at gmail.com (Andreas Mueller)
Date: Tue, 11 Oct 2016 14:56:02 -0400
Subject: [scikit-learn] HashingVectorizer slow in version 0.18
In-Reply-To: References: Message-ID:

Please open an issue on the issue tracker:
https://github.com/scikit-learn/scikit-learn/issues

On 10/11/2016 08:19 AM, Gabriel Trautmann wrote:
> Thank you for your response. I have Windows 7 Enterprise 64 bit / Intel
> Xeon E5 2640 CPU; same problem on two similar machines.
>
> python-3.5.2-amd64.exe - fresh installation
> numpy-1.11.2+mkl-cp35-cp35m-win_amd64.whl - from Christoph Gohlke
> scipy-0.18.1-cp35-cp35m-win_amd64.whl
> pip install scikit-learn
>
> On the same Python instance, if I downgrade to version 0.17 it is much faster:
>
> pip uninstall scikit-learn
> pip install scikit-learn==0.17
>
> I will open an issue after I test on more machines or if someone else
> can reproduce the problem.
>
> On Tue, Oct 11, 2016 at 3:02 PM, Olivier Grisel wrote:
> > I cannot reproduce such a degradation on my machine:
> >
> > (sklearn-0.17)ogrisel at is146148:~/code/scikit-learn$ python
> > ~/tmp/bench_vectorizer.py
> > scikit-learn 0.17.1. Numpy 1.11.2. Python 3.5.0 x86_64
> > Vectorizing 20newsgroup 11314 documents
> > Vectorization completed in 4.033604383468628 seconds, resulting
> > shape (11314, 1048576)
> >
> > (sklearn-0.18) ogrisel at is146148:~/code/scikit-learn$ python
> > ~/tmp/bench_vectorizer.py
> > scikit-learn 0.18. Numpy 1.11.2. Python 3.5.0 x86_64
> > Vectorizing 20newsgroup 11314 documents
> > Vectorization completed in 4.990509510040283 seconds, resulting
> > shape (11314, 1048576)
> >
> > Which operating system are you using?
> >
> > Please feel free to open an issue on the tracker anyway.
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From joel.nothman at gmail.com Wed Oct 12 08:02:30 2016
From: joel.nothman at gmail.com (Joel Nothman)
Date: Wed, 12 Oct 2016 23:02:30 +1100
Subject: [scikit-learn] Doubt regarding issue timeline
In-Reply-To: References: Message-ID:

If you have a sense that the issue is urgent in some way, then give it up quickly if you've said you'd do it. Otherwise, it's okay to take a few weeks. Yes, it would be kind, if it looks like you won't be able to do it, to say you can't. Sorry there are no hard rules, but thanks for trying to clarify.

On 11 October 2016 at 16:49, Jaques Grobler wrote:
> I'd say a 'standup'-ish approach could work with this - every day or three,
> if you find yourself getting pulled off the issue by other work, life,
> etc., perhaps take a moment at a set time to post on the progress/blocking
> factors if needed - even if it's just 'can't work on this today'. Yes,
> this could potentially get spammy, but it gives nice transparency, and if
> it's urgent to finish the issue soon, like before a release, the community
> can know whether or not it needs to be handed over - or if you believe
> you'll still have time.
> This doesn't have to be a rule - more of a guideline - but the community
> will always have a fairly recent status update, even if the person can't
> touch the issue for weeks.
> Just my thoughts on it :)
>
> On Tuesday, 11 October 2016, Siddharth Gupta wrote:
>> Hello fellas,
>> I have a doubt. Suppose I volunteer to work on an issue but, due to some
>> unavoidable scenario, fail to work on it for some time: when should I let
>> the community know? I guess it depends on the issue/bug, but on average
>> how much time should one take to resolve an issue?
>>
>> Regards Siddharth Gupta,
>> Ph: 9871012292
>> Linkedin | Github | Codechef | Twitter | Facebook
>>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From t3kcit at gmail.com Thu Oct 13 11:36:20 2016
From: t3kcit at gmail.com (Andreas Mueller)
Date: Thu, 13 Oct 2016 11:36:20 -0400
Subject: [scikit-learn] Permission for creating new labels
In-Reply-To: References: <20161012133118.GB1206164@phare.normalesup.org>
Message-ID: <8645000e-3326-94ec-beee-e0ba6dd028dc@gmail.com>

going to the mailing list

On 10/13/2016 01:35 AM, Raghav R V wrote:
> Thanks for the messages {Ga|Jo}el. ;)
>
> > We can use "needs second review" as an alternative to "MRG+1" but I
> > don't see the point of using both.
>
> I see the system of MRG+1 and MRG+2 as a more robust way of tracking
> approvals to see if the PR can be merged (I'm not sure if review
> approvals completely replace this?) and "Needs 2nd Review" as a quick
> way to search... "Needs 2nd Review" could also be used with MRG PRs
> which have already received a solid review and would need a 2nd look
> from those who don't have much time to do a full-fledged review...
>
> > By the way: this discussion should happen on the ML.
>
> Sorry for that. I wasn't sure if this was a very useful/non-trivial
> suggestion and wanted to avoid noise there...
>
> > "Needs triage":
>
> I see that we have the "Stale" label for that.

I just added this to make it easier to find PRs to review. I'm not sure it is not redundant with the "needs contrib" tag on a PR. I used "stale" if I was not sure whether it was worth working on something and the author didn't respond for a while. I haven't used it a lot yet.

I'm ambivalent about adding an "approved by one" (which I think is more explicit than "need one more") tag.

You can search for PRs and issues without comments - I recently did that to make sure everything had at least one ;) I'm not sure you can search for the absence of tags. But I am planning to go through all issues tomorrow to see stuff that I have missed. I'll be catching up today with all notifications that I missed this year because writing my ;)

Maybe having a list of statuses for PRs and issues that covers the common cases would be good; we just kind of had that discussion, right? Issues can be bug|enhancement|new feature with status needs contributor, has PR, or needs confirmation/discussion. It would be nice to see if an issue has a PR; I think there is no way to do that from the search.

PRs need changes or reviews, or are stalled (which is "needs changes" for a long time with no response) and then might "need contributor".

We could use "needs review" on issues and add a "has PR" tag for issues and a "one approval" tag for PRs. I agree with Joel that switching between "needs review" and "needs changes" in a currently active PR is likely to be cumbersome.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From nelle.varoquaux at gmail.com Thu Oct 13 11:41:29 2016 From: nelle.varoquaux at gmail.com (Nelle Varoquaux) Date: Thu, 13 Oct 2016 08:41:29 -0700 Subject: [scikit-learn] Permission for creating new labels In-Reply-To: <8645000e-3326-94ec-beee-e0ba6dd028dc@gmail.com> References: <20161012133118.GB1206164@phare.normalesup.org> <8645000e-3326-94ec-beee-e0ba6dd028dc@gmail.com> Message-ID: On 13 October 2016 at 08:36, Andreas Mueller wrote: > going to the mailing list > > On 10/13/2016 01:35 AM, Raghav R V wrote: > > Thanks for the messages {Ga|Jo}el. ;) > >> We can use "needs second review" as an alternative to "MRG+1" but I don't >> see the point of using both. > > I see the system of MRG+1 and MRG+2 as a more robust way of tracking > approvals to see if the PR can be merged (I'm not sure if review approvals > completely replace this?) and "Needs 2nd Review" as a quick way to search... > "Needs 2nd Review" could also be used with MRG PRs which have already > received a solid review and would need a 2nd look from those who don't have > much time to do a full fledged review... > >> By the way: this discussion should happen on the ML. > > Sorry for that. I wasn't sure if this was a very useful/non-trivial > suggestion and wanted to avoid noise there... > >> "Needs triage": > > I see that we have "Stale" label for that. > > I just added this to make it easier to find PRs to review. > I'm not sure if it is not redundant with the "needs contrib" tag on a PR. > I used "stale" if I was not sure if it's worth working on something and the > author didn't respond for a while. > I haven't used it a lot yet. > > I'm ambivalent about adding a "approved by one" (which I think is more > explicit then "need one more") tag. > > You can search for PRs and issues without comments - I recently did that to > make sure everything had at least one ;) > I'm not sure you can search for the absence of tags. But I am planning to go > through all issues tomorrow to see stuff > that I have missed. I'll be catching up today with all notifications that I > missed this year because writing my ;) > > Maybe having an list of statuses for PRs and issues that covers the common > cases would be good, we just kind of had that discussion, right? > > Issues can be bug|enhancement|new feature with status needs contributor, has > PR or needs confirmation/discussion. It would be nice to see > if a issue has a PR, I think there is no way to do that from the search. > > PRs need changes or reviews or are stalled (which is "needs changes" for a > long time and no response) and then might "need contributor". > > > We could use "needs review" on issues and add a "has PR" tag for issues and > a "one approval" tag for PRs. > > I agree with Joel that switching between "needs review" and "needs changes" > in a currently active PR is likely to be cumbersome. >From my experience on matplotlib that has such a system, it is a not a very good idea? Reviewers rarely change the tag to needs change, and when they do, reviewers ignore it and continue reviewing it (which is slightly annoying in some cases). 
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>

From stuart at stuartreynolds.net  Thu Oct 13 14:14:17 2016
From: stuart at stuartreynolds.net (Stuart Reynolds)
Date: Thu, 13 Oct 2016 11:14:17 -0700
Subject: [scikit-learn] Missing data and decision trees
Message-ID:

I'm looking for a decision tree and RF implementation that supports
missing data (without imputation) -- ideally in Python, Java/Scala or C++.

It seems that scikit's decision tree algorithm doesn't allow this -- which
is disappointing because it's one of the few methods that should be able
to sensibly handle problems with high amounts of missingness.

Are there plans to allow missing data in scikit's decision trees?

Also, is there any particular reason why missing values weren't supported
originally (e.g. integrates poorly with other features)?

Regards
- Stuart
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jmschreiber91 at gmail.com  Thu Oct 13 14:20:34 2016
From: jmschreiber91 at gmail.com (Jacob Schreiber)
Date: Thu, 13 Oct 2016 11:20:34 -0700
Subject: [scikit-learn] Missing data and decision trees
In-Reply-To:
References:
Message-ID:

I think Raghav is working on it in this PR:
https://github.com/scikit-learn/scikit-learn/pull/5974

The reason they weren't initially supported is likely that it involves a
lot of work and design choices to handle missing values appropriately, and
the discussion on the best way to handle it was postponed until there was
a working estimator which could serve most people's needs.

On Thu, Oct 13, 2016 at 11:14 AM, Stuart Reynolds wrote:

> I'm looking for a decision tree and RF implementation that supports
> missing data (without imputation) -- ideally in Python, Java/Scala or C++.
>
> It seems that scikit's decision tree algorithm doesn't allow this --
> which is disappointing because it's one of the few methods that should be
> able to sensibly handle problems with high amounts of missingness.
>
> Are there plans to allow missing data in scikit's decision trees?
>
> Also, is there any particular reason why missing values weren't supported
> originally (e.g. integrates poorly with other features)?
>
> Regards
> - Stuart
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jeffrey.m.allard at gmail.com  Thu Oct 13 14:20:40 2016
From: jeffrey.m.allard at gmail.com (Jeff)
Date: Thu, 13 Oct 2016 14:20:40 -0400
Subject: [scikit-learn] Missing data and decision trees
In-Reply-To:
References:
Message-ID: <7c733120-4f1f-59d4-a14a-3fb15c960598@gmail.com>

I ran into this several times as well with the scikit-learn implementation
of GBM. Look at xgboost if you have not already (is there someone out
there that hasn't? :) - it deals with missing values in the predictor
space in a very eloquent manner.

http://xgboost.readthedocs.io/en/latest/python/python_intro.html

https://arxiv.org/abs/1603.02754

Jeff

On 10/13/2016 2:14 PM, Stuart Reynolds wrote:
> I'm looking for a decision tree and RF implementation that supports
> missing data (without imputation) -- ideally in Python, Java/Scala or
> C++.
> > It seems that scikit's decision tree algorithm doesn't allow this -- > which is disappointing because its one of the few methods that should > be able to sensibly handle problems with high amounts of missingness. > > Are there plans to allow missing data in scikit's decision trees? > > Also, is there any particular reason why missing values weren't > supported originally (e.g. integrates poorly with other features) > > Regards > - Stuart > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From jcrudy at gmail.com Thu Oct 13 14:28:02 2016 From: jcrudy at gmail.com (Jason Rudy) Date: Thu, 13 Oct 2016 11:28:02 -0700 Subject: [scikit-learn] Missing data and decision trees In-Reply-To: <7c733120-4f1f-59d4-a14a-3fb15c960598@gmail.com> References: <7c733120-4f1f-59d4-a14a-3fb15c960598@gmail.com> Message-ID: It's not a decision tree, but py-earth may also do what you need. It handles missingness as described in section 3.4 here: http://media.salford-systems.com/library/MARS_V2_JHF_LCS-108.pdf. Basically, missingness is considered potentially predictive. On Thu, Oct 13, 2016 at 11:20 AM, Jeff wrote: > I ran into this several times as well with scikit-learn implementation of > GBM. Look at xgboost if you have not already (is there someone out there > that hasn't ? :)- it deals with missing values in the predictor space in a > very eloquent manner. > > http://xgboost.readthedocs.io/en/latest/python/python_intro.html > > https://arxiv.org/abs/1603.02754 > > > Jeff > > > > On 10/13/2016 2:14 PM, Stuart Reynolds wrote: > > I'm looking for a decision tree and RF implementation that supports > missing data (without imputation) -- ideally in Python, Java/Scala or C++. > > It seems that scikit's decision tree algorithm doesn't allow this -- > which is disappointing because its one of the few methods that should be > able to sensibly handle problems with high amounts of missingness. > > Are there plans to allow missing data in scikit's decision trees? > > Also, is there any particular reason why missing values weren't supported > originally (e.g. integrates poorly with other features) > > Regards > - Stuart > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From drraph at gmail.com Thu Oct 13 14:33:20 2016 From: drraph at gmail.com (Raphael C) Date: Thu, 13 Oct 2016 19:33:20 +0100 Subject: [scikit-learn] Missing data and decision trees In-Reply-To: References: Message-ID: You can simply make a new binary feature (per feature that might have a missing value) that is 1 if the value is missing and 0 otherwise. The RF can then work out what to do with this information. I don't know how this compares in practice to more sophisticated approaches. Raphael On Thursday, October 13, 2016, Stuart Reynolds wrote: > I'm looking for a decision tree and RF implementation that supports > missing data (without imputation) -- ideally in Python, Java/Scala or C++. 
>
> It seems that scikit's decision tree algorithm doesn't allow this --
> which is disappointing because it's one of the few methods that should
> be able to sensibly handle problems with high amounts of missingness.
>
> Are there plans to allow missing data in scikit's decision trees?
>
> Also, is there any particular reason why missing values weren't
> supported originally (e.g. integrates poorly with other features)?
>
> Regards
> - Stuart
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From Dale.T.Smith at macys.com  Thu Oct 13 14:21:00 2016
From: Dale.T.Smith at macys.com (Dale T Smith)
Date: Thu, 13 Oct 2016 18:21:00 +0000
Subject: [scikit-learn] Missing data and decision trees
In-Reply-To:
References:
Message-ID:

Please define "sensibly". I would be strongly opposed to modifying any
models to incorporate "missingness". No model handles missing data for
you. That is for you to decide based on your individual problem domain.

Take a look at a talk from last winter on missing data by Nina Zumel.
Nina defines "sensibly" in several ways.

https://www.r-bloggers.com/prepping-data-for-analysis-using-r/

__________________________________________________________________________________________
Dale Smith | Macy's Systems and Technology | IFS eCommerce | Data Science
770-658-5176 | 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com

From: scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=macys.com at python.org] On Behalf Of Stuart Reynolds
Sent: Thursday, October 13, 2016 2:14 PM
To: scikit-learn at python.org
Subject: [scikit-learn] Missing data and decision trees

EXT MSG:

I'm looking for a decision tree and RF implementation that supports
missing data (without imputation) -- ideally in Python, Java/Scala or C++.

It seems that scikit's decision tree algorithm doesn't allow this -- which
is disappointing because it's one of the few methods that should be able
to sensibly handle problems with high amounts of missingness.

Are there plans to allow missing data in scikit's decision trees?

Also, is there any particular reason why missing values weren't supported
originally (e.g. integrates poorly with other features)?

Regards
- Stuart

* This is an EXTERNAL EMAIL. Stop and think before clicking a link or
opening attachments.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From ragvrv at gmail.com  Thu Oct 13 16:17:25 2016
From: ragvrv at gmail.com (Raghav R V)
Date: Thu, 13 Oct 2016 22:17:25 +0200
Subject: [scikit-learn] Missing data and decision trees
In-Reply-To:
References:
Message-ID:

Hi Stuart Reynolds,

Like Jacob said, we have an active PR at
https://github.com/scikit-learn/scikit-learn/pull/5974

You could do

git fetch https://github.com/raghavrv/scikit-learn.git missing_values_rf:missing_values_rf
git checkout missing_values_rf
python setup.py install

and try it out. I warn you though, there are some memory leaks I'm trying
to debug. But for the most part it works well and outperforms basic
imputation techniques. Please let us know if it breaks or does not solve
your use case. Your input as a user of that feature would be invaluable!

> I ran into this several times as well with the scikit-learn
> implementation of GBM. Look at xgboost if you have not already (is there
> someone out there that hasn't? :) - it deals with missing values in the
> predictor space in a very eloquent manner.
> http://xgboost.readthedocs.io/en/latest/python/python_intro.html

The PR handles it in a conceptually similar approach.
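In the meantime, the indicator-feature workaround Raphael suggested
earlier in this thread can be sketched in a few lines. This is only a
sketch with made-up data; the median fill and the "_missing" suffix are
illustrative choices, not part of the PR:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# toy feature matrix with missing entries encoded as NaN
X = pd.DataFrame({'a': [1.0, np.nan, 3.0, 4.0],
                  'b': [np.nan, 0.5, 0.25, np.nan]})
y = np.array([0, 1, 0, 1])

# one binary indicator column per feature that may be missing
indicators = X.isnull().astype(int).add_suffix('_missing')

# fill the original columns so the trees can still split on their values
X_filled = X.fillna(X.median())

X_augmented = pd.concat([X_filled, indicators], axis=1)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_augmented, y)

The forest can then pick up splits on the indicator columns wherever the
missingness itself is informative.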
The PR is currently implemented for DecisionTreeClassifier. After reviews
and integration, DecisionTreeRegressor will also support missing values.
Once that happens, enabling it in gradient boosting will be possible.

Thanks for the interest!!

On Thu, Oct 13, 2016 at 8:33 PM, Raphael C wrote:

> You can simply make a new binary feature (per feature that might have a
> missing value) that is 1 if the value is missing and 0 otherwise. The RF
> can then work out what to do with this information.
>
> I don't know how this compares in practice to more sophisticated
> approaches.
>
> Raphael
>
> On Thursday, October 13, 2016, Stuart Reynolds wrote:
>
>> I'm looking for a decision tree and RF implementation that supports
>> missing data (without imputation) -- ideally in Python, Java/Scala or C++.
>>
>> It seems that scikit's decision tree algorithm doesn't allow this --
>> which is disappointing because it's one of the few methods that should be
>> able to sensibly handle problems with high amounts of missingness.
>>
>> Are there plans to allow missing data in scikit's decision trees?
>>
>> Also, is there any particular reason why missing values weren't supported
>> originally (e.g. integrates poorly with other features)?
>>
>> Regards
>> - Stuart
>>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From anael.bonneton at gmail.com  Fri Oct 14 09:27:01 2016
From: anael.bonneton at gmail.com (Anaël Bonneton)
Date: Fri, 14 Oct 2016 15:27:01 +0200
Subject: [scikit-learn] Silhouette example - performance issue
Message-ID:

Hi,

In the silhouette example
(http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html#sphx-glr-auto-examples-cluster-plot-kmeans-silhouette-analysis-py),
the silhouette values of each sample are computed twice: once with
silhouette_score and once with silhouette_samples. The call to
silhouette_score can be easily avoided by computing the average of the
result of silhouette_samples.

Do you think we should remove the call to silhouette_score to improve the
performance? Or is it better to keep the two functions to show how to use
them?

Anaël Bonneton
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From ragvrv at gmail.com  Fri Oct 14 09:38:55 2016
From: ragvrv at gmail.com (Raghav R V)
Date: Fri, 14 Oct 2016 15:38:55 +0200
Subject: [scikit-learn] Silhouette example - performance issue
In-Reply-To:
References:
Message-ID:

On Fri, Oct 14, 2016 at 3:27 PM, Anaël Bonneton wrote:

> Hi,
>
> In the silhouette example
> (http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html#sphx-glr-auto-examples-cluster-plot-kmeans-silhouette-analysis-py),
> the silhouette values of each sample are computed twice: once with
> silhouette_score and once with silhouette_samples. The call to
> silhouette_score can be easily avoided by computing the average of the
> result of silhouette_samples.
>
> Do you think we should remove the call to silhouette_score to improve
> the performance? Or is it better to keep the two functions to show how
> to use them?
>
Hi,

When I wrote it, I intended it to be demonstrative of the two methods.
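For reference, the equivalence Anaël describes is easy to check directly.
A minimal sketch (with toy data, not the example's exact setup):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, silhouette_samples

X, _ = make_blobs(n_samples=500, centers=4, random_state=1)
labels = KMeans(n_clusters=4, random_state=1).fit_predict(X)

avg = silhouette_score(X, labels)           # mean silhouette coefficient
per_sample = silhouette_samples(X, labels)  # one coefficient per sample
assert np.isclose(avg, per_sample.mean())

Both helpers compute the pairwise distances, which is why the example
currently pays that cost twice.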
Not sure if we should worry about performance issues there.

--
Raghav RV
https://github.com/raghavrv
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From michael.eickenberg at gmail.com  Fri Oct 14 09:55:25 2016
From: michael.eickenberg at gmail.com (Michael Eickenberg)
Date: Fri, 14 Oct 2016 15:55:25 +0200
Subject: [scikit-learn] Silhouette example - performance issue
In-Reply-To:
References:
Message-ID:

Dear Anaël,

if you wish, you could add a line to the example verifying this
correspondence, e.g. by moving the print call to after the two silhouette
evaluations, also computing that average, and printing it in parentheses.

Probably not necessary though. A comment would do also. Or nothing :)

Michael

On Fri, Oct 14, 2016 at 3:38 PM, Raghav R V wrote:

> On Fri, Oct 14, 2016 at 3:27 PM, Anaël Bonneton wrote:
>
>> Hi,
>>
>> In the silhouette example
>> (http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html#sphx-glr-auto-examples-cluster-plot-kmeans-silhouette-analysis-py),
>> the silhouette values of each sample are computed twice: once with
>> silhouette_score and once with silhouette_samples. The call to
>> silhouette_score can be easily avoided by computing the average of the
>> result of silhouette_samples.
>>
>> Do you think we should remove the call to silhouette_score to improve
>> the performance? Or is it better to keep the two functions to show how
>> to use them?
>>
> Hi,
>
> When I wrote it, I intended it to be demonstrative of the two methods.
>
> Not sure if we should worry about performance issues there
>
> --
> Raghav RV
> https://github.com/raghavrv
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From tom.to.the.k at gmail.com  Mon Oct 17 07:36:32 2016
From: tom.to.the.k at gmail.com (Tomas Karasek)
Date: Mon, 17 Oct 2016 14:36:32 +0300
Subject: [scikit-learn] Search in results for optimal parameter subsets
Message-ID:

Hey,

I have a dataframe with test results (10k rows). Each result (row) has ~6
parameters, plus some output metrics. I would like to find combinations of
the parameters which have a reasonable mean, std and support-count (number
of results in the configuration).

E.g. if there are parameters "k" and "c", each in range(100), and the
result metric has a good mean and std for "k in [4..12] and c in [90..95]"
(the support-count for this would be 8*5 = 40), and then maybe
"k in [34..41] and c in [10..13]" (s-c is 7*3 = 21), then I would like to
have the algorithm return something like

k       c       mean   std    support-count   total_score
4..12   90..95  12.1   1.23   40              9.3
34..41  10..13  11.1   1.13   21              6.2

I understand I will first have to define a function that will reduce the
mean, std and count to the total_score. I can do that somehow. But I don't
know what kind of math task finding the local maxima of parameter
configuration subsets is. Is this an optimization task?

Can you please point me to something in sklearn or scipy that would give
me some direction?

Cheers,
Tomas

From joel.nothman at gmail.com  Tue Oct 18 05:36:38 2016
From: joel.nothman at gmail.com (Joel Nothman)
Date: Tue, 18 Oct 2016 20:36:38 +1100
Subject: [scikit-learn] Silhouette example - performance issue
In-Reply-To:
References:
Message-ID:

And we can reduce any substantial performance issues by merging
https://github.com/scikit-learn/scikit-learn/pull/7177 ... :)

On 15 October 2016 at 00:55, Michael Eickenberg wrote:

> Dear Anaël,
>
> if you wish, you could add a line to the example verifying this
> correspondence, e.g. by moving the print call to after the two silhouette
> evaluations, also computing that average, and printing it in parentheses.
>
> Probably not necessary though. A comment would do also. Or nothing :)
>
> Michael
>
> On Fri, Oct 14, 2016 at 3:38 PM, Raghav R V wrote:
>
>> On Fri, Oct 14, 2016 at 3:27 PM, Anaël Bonneton wrote:
>>
>>> Hi,
>>>
>>> In the silhouette example
>>> (http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html#sphx-glr-auto-examples-cluster-plot-kmeans-silhouette-analysis-py),
>>> the silhouette values of each sample are computed twice: once with
>>> silhouette_score and once with silhouette_samples. The call to
>>> silhouette_score can be easily avoided by computing the average of the
>>> result of silhouette_samples.
>>>
>>> Do you think we should remove the call to silhouette_score to improve
>>> the performance? Or is it better to keep the two functions to show how
>>> to use them?
>>>
>> Hi,
>>
>> When I wrote it, I intended it to be demonstrative of the two methods.
>>
>> Not sure if we should worry about performance issues there
>>
>> --
>> Raghav RV
>> https://github.com/raghavrv
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From joel.nothman at gmail.com  Wed Oct 19 10:42:31 2016
From: joel.nothman at gmail.com (Joel Nothman)
Date: Thu, 20 Oct 2016 01:42:31 +1100
Subject: [scikit-learn] Towards 0.18.1
Message-ID:

Due to a few substantial bugs in 0.18.0, we're hoping to release 0.18.1
around the end of the month. Help solving (and reviewing) the issues
listed at https://github.com/scikit-learn/scikit-learn/milestone/22 is
welcome. In particular, an easy documentation issue at
https://github.com/scikit-learn/scikit-learn/pull/7659 is waiting to be
picked up.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From brookm291 at gmail.com  Sat Oct 22 13:32:04 2016
From: brookm291 at gmail.com (KevNo)
Date: Sun, 23 Oct 2016 02:32:04 +0900
Subject: [scikit-learn] Recurrent States with Decision Tree
Message-ID: <580BA294.30406@gmail.com>

Hello,

Just wondering, how can we set up path-dependent input states for Random
Forest / Decision Tree?

This is similar to a Recurrent Network, where the input
X_t = (x_0, ..., x_i, ..., y_{t-1}, y_{t-2}) depends on past output states
Y_t. If we put in the exact values of the states Y_t, it obviously creates
a bias in the training, so we should put in some estimate of Y_t instead.

Does the concept of a recurrent tree make sense?

Thanks for your insight.
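One common way to approximate this with scikit-learn's trees is to train
on lagged outputs as plain features (an autoregressive setup rather than
a true recurrent tree). A rough sketch on a toy series (the series and the
lag count are illustrative):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
y = np.sin(np.linspace(0, 20, 300)) + 0.1 * rng.randn(300)

n_lags = 3
# row t holds (y[t], y[t+1], y[t+2]); the target is y[t+3]
X = np.column_stack([y[i:len(y) - n_lags + i] for i in range(n_lags)])
target = y[n_lags:]

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, target)

At prediction time the model's own outputs can be fed back into the lag
columns step by step; whether to train on true past values or on estimates
of them is exactly the bias question raised above.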
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: compose-unknown-contact.jpg
Type: image/jpeg
Size: 770 bytes
Desc: not available
URL:

From rdslater at gmail.com  Sun Oct 23 18:37:30 2016
From: rdslater at gmail.com (Robert Slater)
Date: Sun, 23 Oct 2016 17:37:30 -0500
Subject: [scikit-learn] Random Forest with Mean Absolute Error
Message-ID:

I searched the archives to see if this was a known issue, but could not
seem to find anyone else having the problem.

Nevertheless, in the latest version (0.18) Random Forest Regressor has the
new option of 'mae' for criterion. However, it appears to run
disproportionately longer than with the 'mse' criterion. For example:

from sklearn.ensemble import RandomForestRegressor
rf_tree=50
rf_depth=5
rf=RandomForestRegressor(n_estimators=rf_tree, criterion='mae',
                         max_depth=rf_depth,
                         min_samples_split=4, min_samples_leaf=2,
                         max_features=0.5,
                         max_leaf_nodes=5,
                         oob_score=True, n_jobs=1, random_state=0,
                         verbose=1)

from sklearn.ensemble import ExtraTreesRegressor
et_tree=100
et=ExtraTreesRegressor(n_estimators=et_tree, max_depth=5, min_samples_split=4,
                       min_samples_leaf=2, max_features=0.5, verbose=1, n_jobs=4)

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
X_train, X_test, y_train, y_test = train_test_split(train, loss,
                                                    test_size=0.2, random_state=42)

et.fit(X_train, y_train)
rf.fit(X_train, y_train)

rf_pred=rf.predict(X_test)
et_pred=et.predict(X_test)

print(mean_absolute_error(y_test, rf_pred))
print(mean_absolute_error(y_test, et_pred))

I was using these two for a recent Kaggle competition. If I use
"criterion='mse'" in the Random forest it takes around 1 min to build.
Switching to 'mae' causes 100% CPU usage and at least 30 minutes of wait
time before I kill my kernel.

Not sure if the problem is on my end or if there is a real issue, so I
wanted to reach out and see if others are seeing it too.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From nfliu at uw.edu  Sun Oct 23 18:45:09 2016
From: nfliu at uw.edu (Nelson Liu)
Date: Sun, 23 Oct 2016 15:45:09 -0700
Subject: [scikit-learn] Random Forest with Mean Absolute Error
In-Reply-To:
References:
Message-ID:

Hi Robert,

Thanks for the report. This is definitely not something just on your end;
MAE does run longer than MSE, especially on larger datasets, due to the
need to find the median of the data for MAE (expensive) vs the mean of the
data for MSE (not so expensive). We've used a variety of tricks to try to
make it faster for growing trees, but it still seems like it is quite slow
for these larger datasets.

I've been working on a patch to speed it up by using a binary mask to
further reduce the amount of computation MAE needs per split, but I've
been bogged down with real life recently and haven't had a chance to wrap
it up.

Nelson Liu

On Sun, Oct 23, 2016 at 3:37 PM, Robert Slater wrote:

> I searched the archives to see if this was a known issue, but could not
> seem to find anyone else having the problem.
>
> Nevertheless, in the latest version (0.18) Random Forest Regressor has
> the new option of 'mae' for criterion. However, it appears to run
> disproportionately longer than with the 'mse' criterion. For example:
>
> from sklearn.ensemble import RandomForestRegressor
> rf_tree=50
> rf_depth=5
> rf=RandomForestRegressor(n_estimators=rf_tree, criterion='mae',
>                          max_depth=rf_depth,
>                          min_samples_split=4, min_samples_leaf=2,
>                          max_features=0.5,
>                          max_leaf_nodes=5,
>                          oob_score=True, n_jobs=1, random_state=0,
>                          verbose=1)
>
> from sklearn.ensemble import ExtraTreesRegressor
> et_tree=100
> et=ExtraTreesRegressor(n_estimators=et_tree, max_depth=5, min_samples_split=4,
>                        min_samples_leaf=2, max_features=0.5, verbose=1, n_jobs=4)
>
> from sklearn.model_selection import train_test_split
> from sklearn.metrics import mean_absolute_error
> X_train, X_test, y_train, y_test = train_test_split(train, loss,
>                                                     test_size=0.2, random_state=42)
>
> et.fit(X_train, y_train)
> rf.fit(X_train, y_train)
>
> rf_pred=rf.predict(X_test)
> et_pred=et.predict(X_test)
>
> print(mean_absolute_error(y_test, rf_pred))
> print(mean_absolute_error(y_test, et_pred))
>
> I was using these two for a recent Kaggle competition. If I use
> "criterion='mse'" in the Random forest it takes around 1 min to build.
> Switching to 'mae' causes 100% CPU usage and at least 30 minutes of wait
> time before I kill my kernel.
>
> Not sure if the problem is on my end or if there is a real issue, so I
> wanted to reach out and see if others are seeing it too.
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From aakash at klugtek.co.in  Mon Oct 24 09:53:02 2016
From: aakash at klugtek.co.in (Aakash Agarwal)
Date: Mon, 24 Oct 2016 19:23:02 +0530
Subject: [scikit-learn] NER Tagged Data
Message-ID:

Hi All,

I am trying to implement an NER algorithm using a CRF. Can anyone point me
to some tagged data which I can use? I am not able to use the CoNLL (2003)
data: I got the tagged data but the words are missing.
I downloaded the RCV1 dataset but still could not generate training and
testing data. I would be grateful if anybody can help me. Thanks in
advance!

Aakash
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From greg315 at hotmail.fr  Mon Oct 24 10:18:11 2016
From: greg315 at hotmail.fr (greg g)
Date: Mon, 24 Oct 2016 14:18:11 +0000
Subject: [scikit-learn] tree visualization with class names in leaves
Message-ID:

Hi,
I have just begun with scikit-learn and would like to visualize a
classification tree with class names displayed in the leaves, as shown in
the sklearn.tree documentation
http://scikit-learn.org/stable/modules/tree.html where we find
class='virginica' etc.
I made a tree providing a 2D array X (n1 samples, n2 features) and a 1D
array Y (n1 corresponding classes) such that Y(i) is the class of the
sample X(i, ...).
After that I have correct predictions using predict().
Then I use the function
export_graphviz(clf, out_file=dot_data, feature_names=FEATURES)
with FEATURES being the array of my n2 feature names in the same order as
in X.
I obtain the tree .png but can't find a way to have the correct class
names in the leaves.
In export_graphviz() should I use the class_names optional parameter, and
how?
Thanks for any help.

Gregory, Toulouse FRANCE

From se.raschka at gmail.com  Mon Oct 24 11:47:23 2016
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Mon, 24 Oct 2016 11:47:23 -0400
Subject: [scikit-learn] tree visualization with class names in leaves
In-Reply-To:
References:
Message-ID:

Hi, Greg,
if you provide the `class_names` argument, a 'class' label of the majority
class will be added at the bottom of each node. For instance, if you have
the Iris dataset, with class labels 0, 1, 2, you can provide the
`class_names` as ['setosa', 'versicolor', 'virginica'], where 0 ->
'setosa', 1 -> 'versicolor', 2 -> 'virginica'.

Best,
Sebastian

> On Oct 24, 2016, at 10:18 AM, greg g wrote:
>
> Hi,
> I have just begun with scikit-learn and would like to visualize a
> classification tree with class names displayed in the leaves, as shown in
> the sklearn.tree documentation http://scikit-learn.org/stable/modules/tree.html
> where we find class='virginica' etc.
> I made a tree providing a 2D array X (n1 samples, n2 features) and a 1D
> array Y (n1 corresponding classes) such that Y(i) is the class of the
> sample X(i, ...).
> After that I have correct predictions using predict().
> Then I use the function
> export_graphviz(clf, out_file=dot_data, feature_names=FEATURES)
> with FEATURES being the array of my n2 feature names in the same order as in X.
> I obtain the tree .png but can't find a way to have the correct class
> names in the leaves.
> In export_graphviz() should I use the class_names optional parameter, and how?
> Thanks for any help.
>
> Gregory, Toulouse FRANCE
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From blrstartuphire at gmail.com  Tue Oct 25 02:41:08 2016
From: blrstartuphire at gmail.com (Startup Hire)
Date: Tue, 25 Oct 2016 12:11:08 +0530
Subject: [scikit-learn] tree visualization with class names in leaves
In-Reply-To:
References:
Message-ID:

Hi all,

Thanks for the suggestion.

I have a related question on tree visualization.

I have 2 classes to predict: 0 and 1 (it comes up as a numeric field when
I load the dataset).

I have given the class_names as "NotPresent" and "Ispresent", which I
believe will map to 0 and 1. Is that correct?

How do I interpret the nodes and the values present in each node in the
accompanying diagram?

Regards,
Sanant

On Mon, Oct 24, 2016 at 9:17 PM, Sebastian Raschka wrote:

> Hi, Greg,
> if you provide the `class_names` argument, a 'class' label of the majority
> class will be added at the bottom of each node. For instance, if you have
> the Iris dataset, with class labels 0, 1, 2, you can provide the
> `class_names` as ['setosa', 'versicolor', 'virginica'], where 0 ->
> 'setosa', 1 -> 'versicolor', 2 -> 'virginica'.
>
> Best,
> Sebastian
>
> > On Oct 24, 2016, at 10:18 AM, greg g wrote:
> >
> > Hi,
> > I have just begun with scikit-learn and would like to visualize a
> > classification tree with class names displayed in the leaves, as shown
> > in the sklearn.tree documentation
> > http://scikit-learn.org/stable/modules/tree.html where we find
> > class='virginica' etc.
> > I made a tree providing a 2D array X (n1 samples, n2 features) and a 1D
> > array Y (n1 corresponding classes) such that Y(i) is the class of the
> > sample X(i, ...).
> > After that I have correct predictions using predict().
> > Then I use the function
> > export_graphviz(clf, out_file=dot_data, feature_names=FEATURES)
> > with FEATURES being the array of my n2 feature names in the same order
> > as in X.
> > I obtain the tree .png but can't find a way to have the correct class
> > names in the leaves.
> > In export_graphviz() should I use the class_names optional parameter,
> > and how?
> > Thanks for any help.
> >
> > Gregory, Toulouse FRANCE
> >
> > _______________________________________________
> > scikit-learn mailing list
> > scikit-learn at python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Extra Decision tree.png
Type: image/png
Size: 145232 bytes
Desc: not available
URL:

From greg315 at hotmail.fr  Tue Oct 25 03:00:09 2016
From: greg315 at hotmail.fr (greg g)
Date: Tue, 25 Oct 2016 07:00:09 +0000
Subject: [scikit-learn] tree visualization with class names in leaves
In-Reply-To:
References:
Message-ID:

Hi Sebastian,
Thanks for your answer.
I don't use the iris dataset. My classes are distributed in my Y array.
It seems that I can get the classes in alphabetical order with
> clf.classes_
where clf is my tree.
And with
> export_graphviz(clf, out_file=dot_data, feature_names=FEATURES, class_names=clf.classes_)
the nodes of the graphical tree seem to be filled with the predominant
class and the sample repartition in a vector with the classes in
alphabetical order (the same order as in clf.classes_).
I have to confirm that with more classes.

Regards
Gregory

________________________________
De : scikit-learn de la part de Sebastian Raschka
Envoyé : lundi 24 octobre 2016 17:47
À : Scikit-learn user and developer mailing list
Objet : Re: [scikit-learn] tree visualization with class names in leaves

Hi, Greg,
if you provide the `class_names` argument, a 'class' label of the majority
class will be added at the bottom of each node. For instance, if you have
the Iris dataset, with class labels 0, 1, 2, you can provide the
`class_names` as ['setosa', 'versicolor', 'virginica'], where 0 ->
'setosa', 1 -> 'versicolor', 2 -> 'virginica'.

Best,
Sebastian

> On Oct 24, 2016, at 10:18 AM, greg g wrote:
>
> Hi,
> I have just begun with scikit-learn and would like to visualize a
> classification tree with class names displayed in the leaves, as shown in
> the sklearn.tree documentation http://scikit-learn.org/stable/modules/tree.html
> where we find class='virginica' etc.
> I made a tree providing a 2D array X (n1 samples, n2 features) and a 1D
> array Y (n1 corresponding classes) such that Y(i) is the class of the
> sample X(i, ...).
> After that I have correct predictions using predict().
> Then I use the function
> export_graphviz(clf, out_file=dot_data, feature_names=FEATURES)
> with FEATURES being the array of my n2 feature names in the same order as in X.
> I obtain the tree .png but can't find a way to have the correct class
> names in the leaves.
> In export_graphviz() should I use the class_names optional parameter, and how?
> Thanks for any help.
>
> Gregory, Toulouse FRANCE
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From piotr.bialecki at hotmail.de  Tue Oct 25 05:32:21 2016
From: piotr.bialecki at hotmail.de (Piotr Bialecki)
Date: Tue, 25 Oct 2016 09:32:21 +0000
Subject: [scikit-learn] tree visualization with class names in leaves
In-Reply-To:
References:
Message-ID:

Hi Sanant,

the values represent the thresholds at the current feature (node), which
are used to classify the next sample.

You can see an example here:
http://scikit-learn.org/stable/modules/tree.html

The first node uses the feature "petal length (cm)" with a threshold of 2.45.

If your future sample has a petal length <= 2.45cm, it will be pushed into
the left branch and therefore will be classified as class = setosa.
However, if the petal length is > 2.45cm, it will be pushed into the right
branch and the next node (feature) is evaluated.
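A minimal end-to-end version of that iris example, for reference (the
output filename is an illustrative choice):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# class_names maps the class indices 0, 1, 2 to setosa/versicolor/virginica
export_graphviz(clf, out_file='iris_tree.dot',
                feature_names=iris.feature_names,
                class_names=iris.target_names,
                filled=True)

# render with graphviz on the command line:
#   dot -Tpng iris_tree.dot -o iris_tree.png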
Best, Sebastian > On Oct 24, 2016, at 10:18 AM, greg g <greg315 at hotmail.fr> wrote: > > bLaf1ox-forefront-antispam-report: EFV:NLI; SFV:NSPM; SFS:(10019020)(98900003); > DIR:OUT; SFP:1102; SCL:1; SRVR:DB5EUR03HT168; > H:DB3PR04MB0780.eurprd04.prod.outlook.com; FPR:; SPF:None; LANG:en; > x-ms-office365-filtering-correlation-id: 319900b9-973c-49bb-8e9a-08d3fc1895c4 > x-microsoft-antispam: UriScan:; BCL:0; PCL:0; > RULEID:(1601124038)(1603103081)(1601125047); SRVR:DB5EUR03HT168; > x-exchange-antispam-report-cfa-test: BCL:0; PCL:0; > RULEID:(432015012)(82015046); SRVR:DB5EUR03HT168; BCL:0; PCL:0; RULEID:; > SRVR:DB5EUR03HT168; > x-forefront-prvs: 0105DAA385 > X-OriginatorOrg: outlook.com > X-MS-Exchange-CrossTenant-originalarrivaltime: 24 Oct 2016 14:18:11.0102 (UTC) > X-MS-Exchange-CrossTenant-fromentityheader: Internet > X-MS-Exchange-CrossTenant-id: 84df9e7f-e9f6-40af-b435-aaaaaaaaaaaa > X-MS-Exchange-Transport-CrossTenantHeadersStamped: DB5EUR03HT168 > > > Hi, > I just begin with scikit-learn and would like to visualize a classification tree with class names displayed in the leaves as shown in the SCIKITLEARN.TREE documentation http://scikit-learn.org/stable/modules/tree.html where we find class=?virginica? etc? > I made a tree providing a 2D array X (n1 samples , n2 features) and 1D array Y (n1 corresponding classes ) such that Y(i) is the class of the sample X(i, ?) > After that I have correct predictions using predict() > Then I use the function > export_graphviz(clf, out_file=dot_data,feature_names=FEATURES) > with FEATURES being the array of my n2 features names in the same order as in X > I obtain the tree .png but can?t find a way to have the correct class names in the leaves? > In export_graphviz() should I use the class_names optional parameter and how ? > Thanks for any help > > Gregory, Toulouse FRANCE > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From blrstartuphire at gmail.com Tue Oct 25 08:15:15 2016 From: blrstartuphire at gmail.com (Startup Hire) Date: Tue, 25 Oct 2016 17:45:15 +0530 Subject: [scikit-learn] tree visualization with class names in leaves In-Reply-To: References: Message-ID: Hi Piotr, Thanks for the reply. I understand the thresholds at the current node. I was referring to this: Consider the node: Duration <= 0.5 having gini = 0.3386 and samples = 327510 What is meant by this: value = [216974.9673, 59743.3314] Regards, Sanant On Tue, Oct 25, 2016 at 3:02 PM, Piotr Bialecki wrote: > Hi Sanant, > > the values represent the thresholds at the current feature (node), which > are used to classify the next sample. > > You can see an example here: > http://scikit-learn.org/stable/modules/tree.html > > The first node uses the feature "petal length (cm)" with a threshold of > 2.45. > > If your future sample as a petal length <= 2.45cm it will be pushed into > the left branch and therefore will be classifies as class = setosa. > However, if the petal length is > 2.45cm, it will be pushed into the right > branch and the next node (feature) is evalueted. 
> > I hope I understood your question correct. > > > Best regards, > Piotr > > > > > On 25.10.2016 08:41, Startup Hire wrote: > > Hi all, > > Thanks for the suggestion. > > I have a related question on tree visualization > > I have 2 classes to predict: 0 and 1 (it comes up as a numeric field when > I load the dataset) > > I have given the class_names as "NotPresent" and "Ispresent" which I > believe it will map to 0 and 1. is that correct? > > > How do I interpret the nodes and value present in each nodes in the > accompanying diagram? > > Regards, > Sanant > > > > > On Mon, Oct 24, 2016 at 9:17 PM, Sebastian Raschka < > se.raschka at gmail.com> wrote: > >> Hi, Greg, >> if you provide the `class_names` argument, a ?class? label of the >> majority class will be added at the bottom of each node. For instance, if >> you have the Iris dataset, with class labels 0, 1, 2, you can provide the >> `class_names` as ['setosa', 'versicolor', 'virginica?], where 0 -> >> ?setosa?, 1 -> ?versicolor?, 2 -> ?virginica?. >> >> Best, >> Sebastian >> >> > On Oct 24, 2016, at 10:18 AM, greg g < >> greg315 at hotmail.fr> wrote: >> > >> > bLaf1ox-forefront-antispam-report: EFV:NLI; SFV:NSPM; >> SFS:(10019020)(98900003); >> > DIR:OUT; SFP:1102; SCL:1; SRVR:DB5EUR03HT168; >> > H:DB3PR04MB0780.eurprd04.prod.outlook.com; FPR:; SPF:None; LANG:en; >> > x-ms-office365-filtering-correlation-id: 319900b9-973c-49bb-8e9a-08d3fc >> 1895c4 >> > x-microsoft-antispam: UriScan:; BCL:0; PCL:0; >> > RULEID:(1601124038)(1603103081)(1601125047); SRVR:DB5EUR03HT168; >> > x-exchange-antispam-report-cfa-test: BCL:0; PCL:0; >> > RULEID:(432015012)(82015046); SRVR:DB5EUR03HT168; BCL:0; PCL:0; RULEID:; >> > SRVR:DB5EUR03HT168; >> > x-forefront-prvs: 0105DAA385 >> > X-OriginatorOrg: outlook.com >> > X-MS-Exchange-CrossTenant-originalarrivaltime: 24 Oct 2016 >> 14:18:11.0102 (UTC) >> > X-MS-Exchange-CrossTenant-fromentityheader: Internet >> > X-MS-Exchange-CrossTenant-id: 84df9e7f-e9f6-40af-b435-aaaaaaaaaaaa >> > X-MS-Exchange-Transport-CrossTenantHeadersStamped: DB5EUR03HT168 >> > >> > >> > Hi, >> > I just begin with scikit-learn and would like to visualize a >> classification tree with class names displayed in the leaves as shown in >> the SCIKITLEARN.TREE documentation http://scikit-learn.org/stable >> /modules/tree.html where we find class=?virginica? etc? >> > I made a tree providing a 2D array X (n1 samples , n2 features) and 1D >> array Y (n1 corresponding classes ) such that Y(i) is the class of the >> sample X(i, ?) >> > After that I have correct predictions using predict() >> > Then I use the function >> > export_graphviz(clf, out_file=dot_data,feature_names=FEATURES) >> > with FEATURES being the array of my n2 features names in the same order >> as in X >> > I obtain the tree .png but can?t find a way to have the correct class >> names in the leaves? >> > In export_graphviz() should I use the class_names optional parameter >> and how ? 
>> > Thanks for any help.
>> >
>> > Gregory, Toulouse FRANCE
>> >
>> > _______________________________________________
>> > scikit-learn mailing list
>> > scikit-learn at python.org
>> > https://mail.python.org/mailman/listinfo/scikit-learn
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mail at sebastianraschka.com  Tue Oct 25 08:45:45 2016
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Tue, 25 Oct 2016 08:45:45 -0400
Subject: [scikit-learn] tree visualization with class names in leaves
In-Reply-To:
References:
Message-ID:

Hi, Gregory,

> I don't use the iris dataset. My classes are distributed in my Y array.

Yeah, I just used this as a simple example :).

> the nodes of the graphical tree seem to be filled with the predominant class

I think that's right, it gets the class name of the majority class at each
node via "class_name = class_names[np.argmax(value)]"
(https://github.com/scikit-learn/scikit-learn/blob/3a106fc792eb8e70e1fd078e351ba42487d3214d/sklearn/tree/export.py#L286)

> in a vector with the classes in alphabetical order (the same order as in clf.classes_)

yes, it should be in ascending alphanumerical order. Not sure if this is
still a general recommendation in sklearn 0.18, but I typically convert
string class labels to integers before I feed them to a classifier (but it
seems to work either way now):

-> from sklearn.preprocessing import LabelEncoder
-> le = LabelEncoder()
-> y = le.fit_transform(labels)
-> le.classes_
array(['Setosa', 'Versicolor', 'Virginica'], dtype='<U10')
-> import numpy as np
-> np.bincount(y)
array([50, 50, 50])

Best,
Sebastian

> On Oct 25, 2016, at 3:00 AM, greg g wrote:
>
> Hi Sebastian,
> Thanks for your answer.
> I don't use the iris dataset. My classes are distributed in my Y array.
> It seems that I can get the classes in alphabetical order with
> > clf.classes_
> where clf is my tree.
> And with
> > export_graphviz(clf, out_file=dot_data, feature_names=FEATURES, class_names=clf.classes_)
> the nodes of the graphical tree seem to be filled with the predominant
> class and the sample repartition in a vector with the classes in
> alphabetical order (the same order as in clf.classes_).
> I have to confirm that with more classes.
>
> Regards
> Gregory
>
> De : scikit-learn de la part de Sebastian Raschka
> Envoyé : lundi 24 octobre 2016 17:47
> À : Scikit-learn user and developer mailing list
> Objet : Re: [scikit-learn] tree visualization with class names in leaves
>
> Hi, Greg,
> if you provide the `class_names` argument, a 'class' label of the majority
> class will be added at the bottom of each node. For instance, if you have
> the Iris dataset, with class labels 0, 1, 2, you can provide the
> `class_names` as ['setosa', 'versicolor', 'virginica'], where 0 ->
> 'setosa', 1 -> 'versicolor', 2 -> 'virginica'.
>
> Best,
> Sebastian
>
> > On Oct 24, 2016, at 10:18 AM, greg g wrote:
> >
> > Hi,
> > I have just begun with scikit-learn and would like to visualize a
> > classification tree with class names displayed in the leaves, as shown
> > in the sklearn.tree documentation
> > http://scikit-learn.org/stable/modules/tree.html where we find
> > class='virginica' etc.
> > I made a tree providing a 2D array X (n1 samples, n2 features) and a 1D
> > array Y (n1 corresponding classes) such that Y(i) is the class of the
> > sample X(i, ...).
> > After that I have correct predictions using predict().
> > Then I use the function
> > export_graphviz(clf, out_file=dot_data, feature_names=FEATURES)
> > with FEATURES being the array of my n2 feature names in the same order
> > as in X.
> > I obtain the tree .png but can't find a way to have the correct class
> > names in the leaves.
> > In export_graphviz() should I use the class_names optional parameter,
> > and how?
> > Thanks for any help.
> >
> > Gregory, Toulouse FRANCE
> >
> > _______________________________________________
> > scikit-learn mailing list
> > scikit-learn at python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From yafc18 at gmail.com  Tue Oct 25 22:42:40 2016
From: yafc18 at gmail.com (颜发才 (Yan Facai))
Date: Wed, 26 Oct 2016 10:42:40 +0800
Subject: [scikit-learn] The implementation of `gradient_boost.py:BinomialDeviance`?
Message-ID:

Hi,
which paper or book is the foundation of the implementation of
`gradient_boost.py:BinomialDeviance`?

I recently read the paper:
Friedman: greedy function approximation - a gradient boosting machine.
From surangakas at gmail.com Wed Oct 26 11:26:20 2016
From: surangakas at gmail.com (Suranga Kasthurirathne)
Date: Wed, 26 Oct 2016 11:26:20 -0400
Subject: [scikit-learn] Calculating prediction probability per each predicted outcome
Message-ID: 

Hi everyone,

I'm currently using scikit-learn to train and test multiple neural networks.

My issue - I'm breaking my dataset into 90/10, training on the 90%, and testing on the 10%.

For the 10% of held-out test data, I get outcomes as follows:

predicted = neural_network.predict(test_data)

Here, the predicted variable is basically either 1 or 0, which is what I'm feeding in as the outcome.

But how can I get the prediction probability per each predicted outcome? Back in the day when I used Weka, it produced a single prediction, followed by a prediction probability between 0 and 1 for each outcome.

-- 
Best Regards,
Suranga
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From piotr.bialecki at hotmail.de Wed Oct 26 11:35:25 2016
From: piotr.bialecki at hotmail.de (Piotr Bialecki)
Date: Wed, 26 Oct 2016 15:35:25 +0000
Subject: [scikit-learn] Calculating prediction probability per each predicted outcome
In-Reply-To: 
References: 
Message-ID: 

Hi Suranga,

if you are using the MLPClassifier class, it should have a predict_proba() method.

Try:

predicted = neural_network.predict_proba(test_data)

Best regards,
Piotr

On 26.10.2016 17:26, Suranga Kasthurirathne wrote:

Hi everyone,

I'm currently using scikit-learn to train and test multiple neural networks.

My issue - I'm breaking my dataset into 90/10, training on the 90%, and testing on the 10%.

For the 10% of held-out test data, I get outcomes as follows:

predicted = neural_network.predict(test_data)

Here, the predicted variable is basically either 1 or 0, which is what I'm feeding in as the outcome.

But how can I get the prediction probability per each predicted outcome? Back in the day when I used Weka, it produced a single prediction, followed by a prediction probability between 0 and 1 for each outcome.

-- 
Best Regards,
Suranga

_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
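A runnable sketch of the predict() vs. predict_proba() distinction Piotr describes; the toy data stands in for Suranga's 90/10 split:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier  # sklearn >= 0.18

    X, y = make_classification(n_samples=1000, random_state=0)  # illustrative data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.1, random_state=0)

    clf = MLPClassifier(random_state=0).fit(X_train, y_train)

    predicted = clf.predict(X_test)     # hard 0/1 labels
    proba = clf.predict_proba(X_test)   # shape (n_samples, n_classes)

    # columns follow clf.classes_, so proba[:, 1] is P(y == 1);
    # probability of the *predicted* label per sample (assuming the labels
    # are 0..n_classes-1, so they double as column indices):
    proba_of_prediction = proba[np.arange(len(predicted)), predicted]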
From surangakas at gmail.com Fri Oct 28 08:13:31 2016
From: surangakas at gmail.com (Suranga Kasthurirathne)
Date: Fri, 28 Oct 2016 08:13:31 -0400
Subject: [scikit-learn] Calculating prediction probability per each predicted outcome
In-Reply-To: 
References: 
Message-ID: 

Thanks Piotr, this was indeed the case. Works for me now :)

On Wed, Oct 26, 2016 at 11:26 AM, Suranga Kasthurirathne <surangakas at gmail.com> wrote:

> Hi everyone,
>
> I'm currently using scikit-learn to train and test multiple neural
> networks.
>
> My issue - I'm breaking my dataset into 90/10, training on the 90%, and
> testing on the 10%.
>
> For the 10% of held-out test data, I get outcomes as follows:
>
> predicted = neural_network.predict(test_data)
>
> Here, the predicted variable is basically either 1 or 0, which is what I'm
> feeding in as the outcome.
>
> But how can I get the prediction probability per each predicted outcome?
> Back in the day when I used Weka, it produced a single prediction, followed
> by a prediction probability between 0 and 1 for each outcome.
>
> -- 
> Best Regards,
> Suranga

-- 
Best Regards,
Suranga
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From ragvrv at gmail.com Sun Oct 30 09:17:56 2016
From: ragvrv at gmail.com (Raghav R V)
Date: Sun, 30 Oct 2016 14:17:56 +0100
Subject: [scikit-learn] Module Level Labels?
Message-ID: 

Hi all,

Should we have module level labels?

"mod: tree"
"mod: model_selection"
"mod: linear_models"
"mod: ..."

I know it will blow up our label count, but I think it will help filter issues / PRs to review.

Sometimes I like to look into issues / PRs that concern my two fav. modules "model_selection" and "tree" and I have to resort to some complex searches to cover all possible keywords. (Sometimes the OP does not use the right keywords in their issue / PR)

(From time to time I may keep popping some crazy suggestions, please feel free to shoot them down)

Have a good weekend!!

Raghav RV
https://github.com/raghavrv
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From nelle.varoquaux at gmail.com Sun Oct 30 11:52:42 2016
From: nelle.varoquaux at gmail.com (Nelle Varoquaux)
Date: Sun, 30 Oct 2016 08:52:42 -0700
Subject: [scikit-learn] Module Level Labels?
In-Reply-To: 
References: 
Message-ID: 

Hello,

I personally don't think it is useful, and it clutters the UI with information.
I am actually trying to reduce matplotlib's number of labels right now, as we have so many that they are useless.

Cheers,
N

On 30 October 2016 at 06:17, Raghav R V wrote:
> Hi all,
>
> Should we have module level labels?
>
> "mod: tree"
> "mod: model_selection"
> "mod: linear_models"
> "mod: ..."
>
> I know it will blow up our label count, but I think it will help filter
> issues / PRs to review.
>
> Sometimes I like to look into issues / PRs that concern my two fav. modules
> "model_selection" and "tree" and I have to resort to some complex searches
> to cover all possible keywords. (Sometimes the OP does not use the right
> keywords in their issue / PR)
>
> (From time to time I may keep popping some crazy suggestions, please feel
> free to shoot them down)
>
> Have a good weekend!!
>
> Raghav RV
> https://github.com/raghavrv
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>

From surangakas at gmail.com Sun Oct 30 15:24:12 2016
From: surangakas at gmail.com (Suranga Kasthurirathne)
Date: Sun, 30 Oct 2016 12:24:12 -0700
Subject: [scikit-learn] Problem using boxplots to compare significance of model performance
Message-ID: 

Hi folks!

I'm using scikit-learn to build two neural networks using 10% holdout, and compare their performance using precision. To compare statistical significance in the variance of precision, I'm using matplotlib's boxplots.

My problem is twofold -

1) The standard deviation in the precision of the two models (obtained using precision.std()) is always 0.0. I'm assuming that's a problem.
2) My boxplot is meant to display boxes for the two models, but always displays only the first model (nn01)

My outcomes for this dataset are binary (0 or 1); since the models assume average=binary by default, is that a problem?

For those who'd like to look, my source code can be seen at http://pastebin.com/yvE2T1Sw

The code produces the following plot - which is of course only ONE of the boxes that I need :(

-- 
Best Regards,
Suranga
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screen Shot 2016-10-30 at 12.17.22 PM.png
Type: image/png
Size: 45270 bytes
Desc: not available
URL: 
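One way to get an actual distribution of precision values per model, rather than a single hold-out number whose std() is trivially 0.0, is to score over repeated splits. A sketch, where X, y and the two MLP configurations stand in for Suranga's data and nn01/nn02:

    import matplotlib.pyplot as plt
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=500, random_state=0)  # illustrative data

    nn01 = MLPClassifier(hidden_layer_sizes=(10,), random_state=0)
    nn02 = MLPClassifier(hidden_layer_sizes=(50,), random_state=0)

    # one precision value per fold, so .std() is no longer degenerate
    scores01 = cross_val_score(nn01, X, y, cv=10, scoring='precision')
    scores02 = cross_val_score(nn02, X, y, cv=10, scoring='precision')
    print(scores01.std(), scores02.std())

    # boxplot wants a list with one array of values per box
    plt.boxplot([scores01, scores02], labels=['nn01', 'nn02'])
    plt.ylabel('precision')
    plt.show()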
From se.raschka at gmail.com Sun Oct 30 15:56:21 2016
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Sun, 30 Oct 2016 15:56:21 -0400
Subject: [scikit-learn] Problem using boxplots to compare significance of model performance
In-Reply-To: 
References: 
Message-ID: 

Hi, Suranga,

> 1) The standard deviation in the precision of the two models (obtained using precision.std()) is always 0.0. I'm assuming that's a problem.

That's weird. Are you sure that "precision" has more than one value? E.g.,

>>> np.array([0.89]).std()
0.0

> 2) My boxplot is meant to display boxes for the two models, but always displays only the first model (nn01)

Also here, your input array or list for the boxplot function may not be formatted correctly. What you want is

two_models = [1Darray_of_model1_results,
              1Darray_of_model2_results]

plt.boxplot(two_models,
            notch=False,  # box instead of notch shape
            sym='rs',     # red squares for outliers
            vert=True)    # vertical box alignment

PS: If you are comparing specifically 2 neural network models, have you considered McNemar's test? E.g., see https://github.com/rasbt/mlxtend/blob/master/docs/sources/user_guide/evaluate/mcnemar.ipynb

Best,
Sebastian

> On Oct 30, 2016, at 3:24 PM, Suranga Kasthurirathne wrote:
>
> Hi folks!
>
> I'm using scikit-learn to build two neural networks using 10% holdout, and compare their performance using precision. To compare statistical significance in the variance of precision, I'm using matplotlib's boxplots.
>
> My problem is twofold -
>
> 1) The standard deviation in the precision of the two models (obtained using precision.std()) is always 0.0. I'm assuming that's a problem.
> 2) My boxplot is meant to display boxes for the two models, but always displays only the first model (nn01)
>
> My outcomes for this dataset are binary (0 or 1); since the models assume average=binary by default, is that a problem?
>
> For those who'd like to look, my source code can be seen at http://pastebin.com/yvE2T1Sw
>
> The code produces the following plot - which is of course only ONE of the boxes that I need :(
>
> -- 
> Best Regards,
> Suranga
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
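A sketch of the McNemar route Sebastian mentions, using the mcnemar_table/mcnemar helpers described in the linked mlxtend user guide (function names and signatures are assumptions based on that notebook; check your installed version):

    import numpy as np
    from mlxtend.evaluate import mcnemar, mcnemar_table

    # y_true: true labels of the 10% test set; y_nn01 / y_nn02: the two
    # models' predictions on that same set (all values illustrative)
    y_true = np.array([0, 0, 1, 1, 1, 1, 0, 1])
    y_nn01 = np.array([0, 1, 1, 1, 0, 1, 0, 1])
    y_nn02 = np.array([0, 0, 1, 1, 0, 1, 1, 1])

    # 2x2 contingency table of which samples each model got right/wrong
    tb = mcnemar_table(y_target=y_true, y_model1=y_nn01, y_model2=y_nn02)
    chi2, p = mcnemar(ary=tb, corrected=True)
    print('chi-squared: %.3f, p-value: %.3f' % (chi2, p))
    # a small p-value would suggest the two models' error patterns differ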
From surangakas at gmail.com Sun Oct 30 16:43:13 2016
From: surangakas at gmail.com (Suranga Kasthurirathne)
Date: Sun, 30 Oct 2016 13:43:13 -0700
Subject: [scikit-learn] Problem using boxplots to compare significance of model performance
Message-ID: 

Hi Sebastian!

Thank you, you might be onto something here ;)

So, I may have to compare more than 2 models, so McNemar's may not be an option :(

In regard to your second comment, in building my boxplots, this is how I input results.

plt.boxplot(results)

So what does "results" look like?

[0.85433808345719897, 0.8976733724549345]

These are the two precision values calculated for each neural network. Exactly what should 1Darray_of_model1_results look like? Is it one value per model or....

-- 
Best Regards,
Suranga
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From se.raschka at gmail.com Sun Oct 30 17:38:18 2016
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Sun, 30 Oct 2016 17:38:18 -0400
Subject: [scikit-learn] Problem using boxplots to compare significance of model performance
In-Reply-To: 
References: 
Message-ID: <818ED03F-6F00-4F2C-9FBB-1B79E3E2ED34@gmail.com>

Hi, Suranga,

> So, I may have to compare more than 2 models, so McNemar's may not be an option :(

Sure, but there are many other hypothesis tests; it was just a suggestion since I thought you wanted to compare 2 models :)

> plt.boxplot(results)
> So what does "results" look like?
>
> [0.85433808345719897, 0.8976733724549345]

You can't do a boxplot based on 1 single value.

> These are the two precision values calculated for each neural network. Exactly what should 1Darray_of_model1_results look like? Is it one value per model or....

This should work:

model_1 = [0.85,  # experiment 1
           0.84]  # experiment 2

model_2 = [0.84,  # experiment 1
           0.83]  # experiment 2

plt.boxplot([model_1, model_2])

However, a boxplot based on 2 values only doesn't make sense imho; you could just plot the range.

Best,
Sebastian

> On Oct 30, 2016, at 4:43 PM, Suranga Kasthurirathne wrote:
>
> Hi Sebastian!
>
> Thank you, you might be onto something here ;)
>
> So, I may have to compare more than 2 models, so McNemar's may not be an option :(
>
> In regard to your second comment, in building my boxplots, this is how I input results.
>
> plt.boxplot(results)
> So what does "results" look like?
>
> [0.85433808345719897, 0.8976733724549345]
>
> These are the two precision values calculated for each neural network. Exactly what should 1Darray_of_model1_results look like? Is it one value per model or....
>
> -- 
> Best Regards,
> Suranga
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From yafc18 at gmail.com Sun Oct 30 20:02:34 2016
From: yafc18 at gmail.com (颜发才 (Yan Facai))
Date: Mon, 31 Oct 2016 08:02:34 +0800
Subject: [scikit-learn] The implementation of `gradient_boost.py:BinomialDeviance`?
In-Reply-To: 
References: 
Message-ID: 

Can anyone help me?
Thanks.

On Wed, Oct 26, 2016 at 10:42 AM, 颜发才 (Yan Facai) wrote:

> Hi,
> which paper or book is the foundation of the implementation of
> `gradient_boost.py:BinomialDeviance`?
>
> I recently read the paper: Friedman, "Greedy function approximation: a
> gradient boosting machine." I believe that L2_TreeBoost in the paper should
> be equivalent to BinomialDeviance in scikit-learn, but their
> implementations are different, for example:
>
> + negative_gradient:
>   - in scikit: \tilde{y} = y - expit(pred.ravel())
>              = y - \frac{1}{1 + exp(-F)}
>   - in paper:  \tilde{y} = \frac{2 y}{1 + exp(2 y F)}
>
> Can anyone help me?
> Thanks.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
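Following up on Yan's question: a quick numerical check of the factor-of-two relationship sketched earlier in the thread (illustrative only):

    import numpy as np
    from scipy.special import expit

    rng = np.random.RandomState(0)
    y01 = rng.randint(0, 2, size=10)   # scikit-learn's {0, 1} labels
    ypm = 2 * y01 - 1                  # Friedman's {-1, +1} labels
    F_prime = rng.randn(10)            # full log-odds, as stored by scikit-learn
    F = F_prime / 2.0                  # Friedman's half log-odds

    grad_sklearn = y01 - expit(F_prime)
    grad_paper = 2 * ypm / (1 + np.exp(2 * ypm * F))

    print(np.allclose(grad_paper, 2 * grad_sklearn))  # True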
From t3kcit at gmail.com Mon Oct 31 11:32:16 2016
From: t3kcit at gmail.com (Andreas Mueller)
Date: Mon, 31 Oct 2016 11:32:16 -0400
Subject: [scikit-learn] Module Level Labels?
In-Reply-To: 
References: 
Message-ID: 

I think it would be more helpful to lower the issue count by solving issues and closing non-helpful ones ;)

On 10/30/2016 11:52 AM, Nelle Varoquaux wrote:
> Hello,
>
> I personally don't think it is useful, and it clutters the UI with information.
> I am actually trying to reduce matplotlib's number of labels right
> now, as we have so many that they are useless.
>
> Cheers,
> N
>
> On 30 October 2016 at 06:17, Raghav R V wrote:
>> Hi all,
>>
>> Should we have module level labels?
>>
>> "mod: tree"
>> "mod: model_selection"
>> "mod: linear_models"
>> "mod: ..."
>>
>> I know it will blow up our label count, but I think it will help filter
>> issues / PRs to review.
>>
>> Sometimes I like to look into issues / PRs that concern my two fav. modules
>> "model_selection" and "tree" and I have to resort to some complex searches
>> to cover all possible keywords. (Sometimes the OP does not use the right
>> keywords in their issue / PR)
>>
>> (From time to time I may keep popping some crazy suggestions, please feel
>> free to shoot them down)
>>
>> Have a good weekend!!
>>
>> Raghav RV
>> https://github.com/raghavrv
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From ragvrv at gmail.com Mon Oct 31 12:04:04 2016
From: ragvrv at gmail.com (Raghav R V)
Date: Mon, 31 Oct 2016 17:04:04 +0100
Subject: [scikit-learn] Module Level Labels?
In-Reply-To: 
References: 
Message-ID: 

Okay! Thanks for the replies Nelle and Andy!

On Mon, Oct 31, 2016 at 4:32 PM, Andreas Mueller wrote:

> I think it would be more helpful to lower the issue count by solving
> issues and closing non-helpful ones ;)
>
> On 10/30/2016 11:52 AM, Nelle Varoquaux wrote:
>
>> Hello,
>>
>> I personally don't think it is useful, and it clutters the UI with
>> information.
>> I am actually trying to reduce matplotlib's number of labels right
>> now, as we have so many that they are useless.
>>
>> Cheers,
>> N
>>
>> On 30 October 2016 at 06:17, Raghav R V wrote:
>>
>>> Hi all,
>>>
>>> Should we have module level labels?
>>>
>>> "mod: tree"
>>> "mod: model_selection"
>>> "mod: linear_models"
>>> "mod: ..."
>>>
>>> I know it will blow up our label count, but I think it will help filter
>>> issues / PRs to review.
>>>
>>> Sometimes I like to look into issues / PRs that concern my two fav.
>>> modules
>>> "model_selection" and "tree" and I have to resort to some complex
>>> searches
>>> to cover all possible keywords. (Sometimes the OP does not use the right
>>> keywords in their issue / PR)
>>>
>>> (From time to time I may keep popping some crazy suggestions, please feel
>>> free to shoot them down)
>>>
>>> Have a good weekend!!
>>>
>>> Raghav RV
>>> https://github.com/raghavrv
>>>
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

-- 
Raghav RV
https://github.com/raghavrv
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From sumeet.k.sandhu at gmail.com Mon Oct 31 16:28:43 2016
From: sumeet.k.sandhu at gmail.com (Sumeet Sandhu)
Date: Mon, 31 Oct 2016 13:28:43 -0700
Subject: [scikit-learn] creating a custom scoring function for cross-validation of classification
Message-ID: 

Hi,

I've been staring at various doc pages for a while to create a custom scorer that uses the predict_proba output of a multi-class SGDClassifier:

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score
http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html#sklearn.metrics.make_scorer

I got the impression I could customize the "scoring" parameter in cross_val_score directly, but that didn't work. Then I tried customizing the "score_func" parameter in make_scorer, but that didn't work either. Both errors are ValueErrors:

Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
    accuracy = mean(cross_val_score(LRclassifier, trainPatentVecs, trainLabelVecs, cv=10, scoring = 'topNscorer'))
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1425, in cross_val_score
    scorer = check_scoring(estimator, scoring=scoring)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/metrics/scorer.py", line 238, in check_scoring
    return get_scorer(scoring)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/metrics/scorer.py", line 197, in get_scorer
    % (scoring, sorted(SCORERS.keys())))
ValueError: 'topNscorer' is not a valid scoring value. Valid options are ['accuracy', 'adjusted_rand_score', 'average_precision', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'log_loss', 'mean_absolute_error', 'mean_squared_error', 'median_absolute_error', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc']

At a high level, I want to find out if the true label was found in the top N multi-class labels coming out of an SGD classifier. Built-in scores like "accuracy" only look at N=1.

Here is the code using make_scorer:

LRclassifier = SGDClassifier(loss='log')
topNscorer = make_scorer(topNscoring, greater_is_better=True, needs_proba=True)
accuracyN = mean(cross_val_score(LRclassifier, Data, Labels, scoring = 'topNscorer'))

Here is the code for the custom scoring function:

def topNscoring(y, yp):
    ## Inputs: y = true label per sample, yp = predict_proba probabilities of all labels per sample
    N = 5
    foundN = []
    for ii in xrange(0, shape(yp)[0]):
        indN = [ w[0] for w in sorted(enumerate(list(yp[ii,:])), key=lambda w: w[1], reverse=True)[0:N] ]
        if y[ii] in indN:
            foundN.append(1)
        else:
            foundN.append(0)
    return mean(foundN)

Any help will be greatly appreciated.

best regards,
Sumeet
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
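The traceback suggests the problem is that the scorer is passed as the *string* 'topNscorer': string values of `scoring` are looked up in scikit-learn's registry of built-in names (the list in the ValueError), while a custom scorer has to be passed as the callable returned by make_scorer. A sketch of the fix, reusing Sumeet's topNscoring function and variable names as defined above:

    import numpy as np
    from sklearn.linear_model import SGDClassifier
    from sklearn.metrics import make_scorer
    from sklearn.model_selection import cross_val_score

    top_n_scorer = make_scorer(topNscoring,          # the function defined above
                               greater_is_better=True,
                               needs_proba=True)

    LRclassifier = SGDClassifier(loss='log')
    # pass the scorer object itself, not the string 'topNscorer':
    scores = cross_val_score(LRclassifier, Data, Labels, cv=10,
                             scoring=top_n_scorer)
    accuracyN = np.mean(scores)

With needs_proba=True, make_scorer calls topNscoring(y_true, y_proba), matching the (y, yp) signature above; inside topNscoring, the bare shape and mean calls assume a pylab-style namespace, so yp.shape[0] and np.mean would be the more portable spellings, and the inner loop could be simplified to indN = np.argsort(yp[ii])[::-1][:N].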