From s.atasever at gmail.com Mon Jul 3 10:09:44 2017
From: s.atasever at gmail.com (Sema Atasever)
Date: Mon, 3 Jul 2017 17:09:44 +0300
Subject: [scikit-learn] Construct the microclusters using a CF-Tree
In-Reply-To: <75468a69-ba3a-ca8a-7b1c-b477f7d6f08e@gmail.com>
References: <75468a69-ba3a-ca8a-7b1c-b477f7d6f08e@gmail.com>
Message-ID: 

Dear Roman,

When I try the code with the original data (*data.dat*) as you suggested, I get the following error: *Memory Error* --> (*error.png*). How can I overcome this problem? Thank you so much in advance.

data.dat

On Fri, Jun 30, 2017 at 5:42 PM, Roman Yurchak wrote:
> Hello Sema,
>
> On 30/06/17 17:14, Sema Atasever wrote:
>
>> I want to cluster them using the Birch clustering algorithm.
>> Does this method have a 'precomputed' option?
>
> No it doesn't, see
> http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Birch.html
> so you would need to provide it with the original features matrix (not the
> precomputed distance matrix). Since your dataset is fairly small, there is
> no reason to precompute it anyway.
>
>> I need to train an SVM on the centroids of the microclusters, so
>> *how can I get the centroids of the microclusters?*
>
> By "microclusters" do you mean sub-clusters? If you are interested in the
> leaves subclusters, see the Birch.subcluster_centers_ parameter.
>
> Otherwise, if you want all the centroids in the hierarchy of subclusters,
> you can browse the hierarchical tree via the Birch.root_ attribute and
> then look at _CFSubcluster.centroid_ for each subcluster.
>
> Hope this helps,
> --
> Roman
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: error.png
Type: image/png
Size: 74377 bytes
Desc: not available
URL: 

From betatim at gmail.com Mon Jul 3 10:11:40 2017
From: betatim at gmail.com (Tim Head)
Date: Mon, 03 Jul 2017 14:11:40 +0000
Subject: [scikit-learn] Scikit-learn workshop and sprint at EuroScipy 2017 in Erlangen
In-Reply-To: References: Message-ID: 

Hey,

On Wed, Jun 28, 2017 at 9:42 AM Olivier Grisel wrote:
> Do you have any suggestions? The workshop duration is 90 min.

Looks like a good setup. Two thoughts: should we construct an example that uses a pipeline, to illustrate the point that you should put your whole pipeline into your grid search/CV? Start with intro-to-scikit-learn slides, then a live demo, and if there is time left, the what's new. 90 minutes isn't very long :-/

T

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From rth.yurchak at gmail.com Mon Jul 3 16:46:03 2017
From: rth.yurchak at gmail.com (Roman Yurchak)
Date: Mon, 3 Jul 2017 23:46:03 +0300
Subject: [scikit-learn] Construct the microclusters using a CF-Tree
In-Reply-To: References: <75468a69-ba3a-ca8a-7b1c-b477f7d6f08e@gmail.com> Message-ID: 

Hello Sema,

as far as I can tell, in your dataset you have n_samples=65909, n_features=539. Clustering high-dimensional data is problematic for a number of reasons, see
https://en.wikipedia.org/wiki/Clustering_high-dimensional_data#Problems
Besides, the BIRCH implementation doesn't scale well for n_features >> 50 (see for instance the discussion in the second part of
https://github.com/scikit-learn/scikit-learn/pull/8808#issuecomment-300776216).

As a workaround for the memory error, you could try using the out-of-core version of Birch (using `partial_fit` on chunks of the dataset, instead of `fit`), but in any case it might also be better to reduce dimensionality beforehand (e.g. with PCA), if that's acceptable.
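A minimal sketch of that out-of-core route, combining IncrementalPCA with Birch.partial_fit (the chunk size, the number of components, and the random stand-in data are illustrative choices, not taken from this thread):

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.decomposition import IncrementalPCA

rng = np.random.RandomState(0)
X = rng.rand(3000, 100)              # stand-in for the real (65909, 539) data

ipca = IncrementalPCA(n_components=20)
brc = Birch(n_clusters=None, threshold=0.5)  # threshold likely needs tuning

chunk = 500
# Pass 1: fit the dimensionality reduction incrementally, chunk by chunk.
for i in range(0, X.shape[0], chunk):
    ipca.partial_fit(X[i:i + chunk])

# Pass 2: feed the reduced chunks to Birch out-of-core.
for i in range(0, X.shape[0], chunk):
    brc.partial_fit(ipca.transform(X[i:i + chunk]))

centroids = brc.subcluster_centers_   # shape (n_subclusters, 20)
```

Only one chunk is ever held in memory at a time, which is the point of the workaround.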
Also the threshold parameter may need to be increased: since in your dataset it looks like the Euclidean distances are more in the 1-10 range? -- Roman On 03/07/17 17:09, Sema Atasever wrote: > Dear Roman, > > When I try the code with the original data (*data.dat*) as you > suggested, I get the following error : *Memory Error* --> (*error.png*), > how can i overcome this problem, thank you so much in advance. > ? > data.dat > > ? > > On Fri, Jun 30, 2017 at 5:42 PM, Roman Yurchak > wrote: > > Hello Sema, > > On 30/06/17 17:14, Sema Atasever wrote: > > I want to cluster them using Birch clustering algorithm. > Does this method have 'precomputed' option. > > > No it doesn't, see > http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Birch.html > > so you would need to provide it with the original features matrix > (not the precomputed distance matrix). Since your dataset is fairly > small, there is no reason in precomputing it anyway. > > I needed train an SVM on the centroids of the microclusters so > *How can i get the centroids of the microclusters?* > > > By "microclusters" do you mean sub-clusters? If you are interested > in the leaves subclusters see the Birch.subcluster_centers_ parameter. > > Otherwise if you want all the centroids in the hierarchy of > subclusters, you can browse the hierarchical tree via the > Birch.root_ attribute then look at _CFSubcluster.centroid_ for each > subcluster. 
> > Hope this helps, > -- > Roman > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > From goix.nicolas at gmail.com Wed Jul 5 05:06:49 2017 From: goix.nicolas at gmail.com (Nicolas Goix) Date: Wed, 5 Jul 2017 11:06:49 +0200 Subject: [scikit-learn] Machine learning for PU data In-Reply-To: References: Message-ID: Hello, As mentioned by Roman, you can try the one-class scikit-learn algorithms such as OneClassSVM, IsolationForest, LocalOutlierFactor (with the private predict method) or EllipticEnvelope. Hope this helps Nicolas On Fri, Jun 30, 2017 at 3:39 PM, Roman Yurchak wrote: > Hello Ruchika, > > I don't think that scikit-learn currently has algorithms that can train > with positive and unlabeled class labels only. However, you could try one > of the following compatible wrappers, > - http://nktmemo.github.io/jekyll/update/2015/11/07/pu_classif > ication.html > - https://github.com/scikit-learn/scikit-learn/pull/371 > > (haven't tried them myself). > > Also, you could try one class SVM as suggested here > https://stackoverflow.com/questions/25700724/binary-semi- > supervised-classification-with-positive-only-and-unlabeled-data-set > > -- > Roman > > > > > On 30/06/17 16:06, Ruchika Nayyar wrote: > >> Hi All, >> >> I am a scikit-learn user and have a question for the community, if >> anyone has applied any available machine learning algorithms in the >> scikit-learn package for data with positive and unlabeled class only? If >> so would you share some insight with me. I understand this could be a >> broader topic but I am new to analyzing PU data and hence can use some >> help. 
>> Thanks,
>> Ruchika
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From s.atasever at gmail.com Wed Jul 5 06:27:58 2017
From: s.atasever at gmail.com (Sema Atasever)
Date: Wed, 5 Jul 2017 13:27:58 +0300
Subject: [scikit-learn] Construct the microclusters using a CF-Tree
In-Reply-To: References: <75468a69-ba3a-ca8a-7b1c-b477f7d6f08e@gmail.com> Message-ID: 

Hi Roman,

I reduced my original data set with feature selection; it now has n_samples=10467, n_features=23. I tried clustering with the Birch algorithm and this time it worked. I obtained 35 clusters for the reduced dataset in the attachment (data2.dat).

How can I know which cluster member best represents each cluster? For example, Cluster 0 has 5 members, which are rows 1, 2, 3, 28 and 29 in the data set. Which cluster member (1, 2, 3, 28 or 29) best represents Cluster 0?

In the Birch code I use this line:

*centroids = brc.subcluster_centers_*

How do I interpret the output of this line? Thank you so much for your help.

*Birch Code:*

from sklearn.cluster import Birch
from io import StringIO
import numpy as np

X = np.loadtxt(open("C:\data2.dat", "rb"), delimiter=",")
brc = Birch(branching_factor=50, n_clusters=None, threshold=0.5,
            compute_labels=True, copy=True)
brc.fit(X)
centroids = brc.subcluster_centers_
labels = brc.subcluster_labels_
brc.predict(X)

print("\n brc.predict(X)")
print(brc.predict(X))
print("\n centroids")
print(centroids)
print("\n labels")
print(labels)

On Mon, Jul 3, 2017 at 11:46 PM, Roman Yurchak wrote:
> Hello Sema,
>
> as far as I can tell, in your dataset you have n_samples=65909,
> n_features=539.
Clustering high dimensional data is problematic for a > number of reasons, https://en.wikipedia.org/wiki/ > Clustering_high-dimensional_data#Problems > > besides the BIRCH implementation doesn't scale well for n_features >> 50 > (see for instance the discussion in the second part of > https://github.com/scikit-learn/scikit-learn/pull/8808#issue > comment-300776216 also in ). > > As a workaround for the memory error, you could try using the out-of-core > version of Birch (using `partial_fit` on chunks of the dataset, instead of > `fit`) but in any case it might also be better to reduce dimensionality > beforehand (e.g. with PCA), if that's acceptable. Also the threshold > parameter may need to be increased: since in your dataset it looks like the > Euclidean distances are more in the 1-10 range? > > -- > Roman > > > On 03/07/17 17:09, Sema Atasever wrote: > >> Dear Roman, >> >> When I try the code with the original data (*data.dat*) as you >> suggested, I get the following error : *Memory Error* --> (*error.png*), >> how can i overcome this problem, thank you so much in advance. >> ? >> data.dat >> > k/view?usp=drive_web> >> ? >> >> On Fri, Jun 30, 2017 at 5:42 PM, Roman Yurchak > > wrote: >> >> Hello Sema, >> >> On 30/06/17 17:14, Sema Atasever wrote: >> >> I want to cluster them using Birch clustering algorithm. >> Does this method have 'precomputed' option. >> >> >> No it doesn't, see >> http://scikit-learn.org/stable/modules/generated/sklearn. >> cluster.Birch.html >> > cluster.Birch.html> >> so you would need to provide it with the original features matrix >> (not the precomputed distance matrix). Since your dataset is fairly >> small, there is no reason in precomputing it anyway. >> >> I needed train an SVM on the centroids of the microclusters so >> *How can i get the centroids of the microclusters?* >> >> >> By "microclusters" do you mean sub-clusters? If you are interested >> in the leaves subclusters see the Birch.subcluster_centers_ parameter. 
>> >> Otherwise if you want all the centroids in the hierarchy of
>> subclusters, you can browse the hierarchical tree via the
>> Birch.root_ attribute then look at _CFSubcluster.centroid_ for each
>> subcluster.
>>
>> Hope this helps,
>> --
>> Roman
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: screen_shot.png
Type: image/png
Size: 103493 bytes
Desc: not available
URL: 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: data2.dat
Type: application/octet-stream
Size: 18776 bytes
Desc: not available
URL: 

From axelbreuer at yahoo.com Thu Jul 6 05:48:23 2017
From: axelbreuer at yahoo.com (axel breuer)
Date: Thu, 6 Jul 2017 09:48:23 +0000 (UTC)
Subject: [scikit-learn] Typo in online documentation on Matrix Factorization
References: <1582496995.5548742.1499334503648.ref@mail.yahoo.com>
Message-ID: <1582496995.5548742.1499334503648@mail.yahoo.com>

Hi,

First of all, I would like to warmly thank the scikit-learn developer community for providing us with such a high quality ML library: it has really become an amazing piece of scientific software.

I have a comment concerning the online documentation on Matrix Factorization Problems. (I use this mailing list because I could not find, in your online howto, the best channel for reporting documentation issues. Apologies if this email is considered spam on this mailing list!)

On the webpage "2.5. Decomposing signals in components (matrix factorization problems)" of the scikit-learn 0.18.2 documentation, we can read one formula at "2.5.1.5. Sparse principal components analysis" [formula image scrubbed], but a bit further, at "2.5.3.2. Generic dictionary learning", we can read another [formula image scrubbed]. The notations are obviously inconsistent, as U and V have been interchanged somehow.

Two extra (less important) corrections could probably improve the clarity for the reader even further:
1. Sticking to a single upper bound limit (either n_components or n_atoms)
2. Specifying whether V_k are columns or rows (maybe using a notation à la Matlab/Numpy: V_{:,k} or V_{k,:})

Kind regards,

Axel BREUER

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: blob.jpg
Type: image/png
Size: 12836 bytes
Desc: not available
URL: 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: blob.jpg
Type: image/png
Size: 16303 bytes
Desc: not available
URL: 

From olivier.grisel at ensta.org Thu Jul 6 09:11:55 2017
From: olivier.grisel at ensta.org (Olivier Grisel)
Date: Thu, 6 Jul 2017 15:11:55 +0200
Subject: [scikit-learn] Typo in online documentation on Matrix Factorization
In-Reply-To: References: <1582496995.5548742.1499334503648.ref@mail.yahoo.com> <1582496995.5548742.1499334503648@mail.yahoo.com>
Message-ID: 

2017-07-06 15:10 GMT+02:00 Olivier Grisel :
> (and just make sure that the "components" is a synonym for "dictionary
> atoms" in the literature).

Actually I meant: and just make sure that our documentation states explicitly that "components" is a synonym for "dictionary atoms" in the literature.
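Relatedly, the U/V shape convention under discussion can be checked directly against the API (a quick sketch; the matrix sizes are arbitrary):

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

X = np.random.RandomState(0).rand(20, 8)   # (n_samples, n_features)

dl = DictionaryLearning(n_components=5, max_iter=10, random_state=0).fit(X)
U = dl.transform(X)    # the code / activations
V = dl.components_     # the dictionary / components

print(U.shape)  # (20, 5) -> (n_samples, n_components)
print(V.shape)  # (5, 8)  -> (n_components, n_features)
```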
-- Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

From olivier.grisel at ensta.org Thu Jul 6 09:10:26 2017
From: olivier.grisel at ensta.org (Olivier Grisel)
Date: Thu, 6 Jul 2017 15:10:26 +0200
Subject: [scikit-learn] Typo in online documentation on Matrix Factorization
In-Reply-To: <1582496995.5548742.1499334503648@mail.yahoo.com>
References: <1582496995.5548742.1499334503648.ref@mail.yahoo.com> <1582496995.5548742.1499334503648@mail.yahoo.com>
Message-ID: 

I think the documentation is correct. U, a.k.a. "the code" or "the activations", has shape (n_samples, n_components), and V, a.k.a. "the dictionary" or "the components", has shape (n_components, n_features) in both cases.

We could use n_components uniformly instead of n_atoms for consistency's sake (and just make sure that "components" is a synonym for "dictionary atoms" in the literature).

I think V_k is fine because the dimension with size n_components is the first dimension of V.

If you spot issues or other things that are unclear or incomplete in the doc, please feel free to open an issue on github. You can also directly submit a pull request if you are familiar with git. The website is built from the docs that live in the "doc/" subfolder of the repo.

-- Olivier

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From greina at eng.ucsd.edu Thu Jul 6 12:05:38 2017
From: greina at eng.ucsd.edu (G Reina)
Date: Thu, 6 Jul 2017 09:05:38 -0700
Subject: [scikit-learn] Replacing the Boston Housing Prices dataset
Message-ID: 

I'd like to request that the "Boston Housing Prices" dataset in sklearn (sklearn.datasets.load_boston) be replaced with the "Ames Housing Prices" dataset (https://ww2.amstat.org/publications/jse/v19n3/decock.pdf). I am willing to submit the code change if the developers agree.

The Boston dataset has the feature "Bk is the proportion of blacks in town". It is an incredibly racist "feature" to include in any dataset.
I think it is beneath us as data scientists.

I submit that the Ames dataset is a viable alternative for learning regression. The author has shown that the dataset is a more robust replacement for Boston. Ames is a 2011 regression dataset on housing prices and has more than 5 times as many training examples, with over 7 times as many features (none of which are morally questionable).

I welcome the community's thoughts on the matter.

Thanks.
-Tony

Here's an article I wrote on the Boston dataset:
https://www.linkedin.com/pulse/hidden-racism-data-science-g-anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From hershil at gmail.com Thu Jul 6 12:25:28 2017
From: hershil at gmail.com (Vikas Kumar)
Date: Thu, 6 Jul 2017 21:55:28 +0530
Subject: [scikit-learn] Which algorithm is used in sklearn SGDClassifier when modified huber loss is used?
In-Reply-To: References: Message-ID: 

The documentation says:

    The loss function to be used. Defaults to 'hinge', which gives a linear SVM. The 'log' loss gives logistic regression, a probabilistic classifier. 'modified_huber' is another smooth loss that brings tolerance to outliers as well as probability estimates.

When we use the 'modified_huber' loss function, which classification algorithm is used? Is it SVM? If so, how come it is able to give probability estimates, which is something it can't do with hinge loss?

Regards,
Vikas

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From t3kcit at gmail.com Thu Jul 6 12:31:15 2017
From: t3kcit at gmail.com (Andreas Mueller)
Date: Thu, 6 Jul 2017 12:31:15 -0400
Subject: [scikit-learn] Replacing the Boston Housing Prices dataset
In-Reply-To: References: Message-ID: 

Hi Tony.

I don't think it's a good idea to remove the dataset, given how many tutorials and examples rely on it.
I also don't think it's a good idea to ignore racial discrimination, which I guess this feature is trying to capture. I was recently asked to remove an excerpt from a dataset from my slide, as it was "too racist". It was randomly sampled data from the adult census dataset. Unfortunately, economics in the US are not color blind (yet), and the reality is racist. I haven't done an in-depth analysis on whether this feature is actually informative, but I don't think your analysis is conclusive. Including ethnicity in data actually allows us to ensure "fairness" in certain decision making processes. Without collecting this data, it would be impossible to ensure automatic decisions are not influenced by past human biases. Arguably that's not what the authors of this dataset are doing. Check out http://www.fatml.org/ for more on fairness in machine learning and data science. Cheers, Andy On 07/06/2017 12:05 PM, G Reina wrote: > I'd like to request that the "Boston Housing Prices" dataset in > sklearn (sklearn.datasets.load_boston) be replaced with the "Ames > Housing Prices" dataset > (https://ww2.amstat.org/publications/jse/v19n3/decock.pdf). I am > willing to submit the code change if the developers agree. > > The Boston dataset has the feature "Bk is the proportion of blacks in > town". It is an incredibly racist "feature" to include in any dataset. > I think is beneath us as data scientists. > > I submit that the Ames dataset is a viable alternative for learning > regression. The author has shown that the dataset is a more robust > replacement for Boston. Ames is a 2011 regression dataset on housing > prices and has more than 5 times the amount of training examples with > over 7 times as many features (none of which are morally questionable). > > I welcome the community's thoughts on the matter. > > Thanks. 
> -Tony > > Here's an article I wrote on the Boston dataset: > https://www.linkedin.com/pulse/hidden-racism-data-science-g-anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From shane.grigsby at colorado.edu Thu Jul 6 12:32:57 2017 From: shane.grigsby at colorado.edu (Shane Grigsby) Date: Thu, 6 Jul 2017 10:32:57 -0600 Subject: [scikit-learn] Agglomerative Clustering without knowing number of clusters In-Reply-To: References: Message-ID: <20170706163257.zgvwnoih5zjb73io@MacBook-Pro-3.local> This sounds like it may be a problem more amenable to either DBSCAN or OPTICS. Both algorithms don't require a priori knowledge of the number of clusters, and both let you specify a minimum point membership threshold for cluster membership. The OPTICS algorithm will also produce a dendrogram that you can cut for sub clusters if need be. DBSCAN is part of the stable release and has been for some time; OPTICS is pending as a pull request, but it's stable and you can try it if you like: https://github.com/scikit-learn/scikit-learn/pull/1984 Cheers, Shane On 06/30, Ariani A wrote: >I want to perform agglomerative clustering, but I have no idea of number of >clusters before hand. But I want that every cluster has at least 40 data >points in it. How can I apply this to sklearn.agglomerative clustering? >Should I use dendrogram and cut it somehow? I have no idea how to relate >dendrogram to this and cutting it out. Any help will be appreciated! 
>_______________________________________________ >scikit-learn mailing list >scikit-learn at python.org >https://mail.python.org/mailman/listinfo/scikit-learn -- *PhD candidate & Research Assistant* *Cooperative Institute for Research in Environmental Sciences (CIRES)* *University of Colorado at Boulder* From b.noushin7 at gmail.com Thu Jul 6 12:39:05 2017 From: b.noushin7 at gmail.com (Ariani A) Date: Thu, 6 Jul 2017 12:39:05 -0400 Subject: [scikit-learn] Agglomerative Clustering without knowing number of clusters In-Reply-To: <20170706163257.zgvwnoih5zjb73io@MacBook-Pro-3.local> References: <20170706163257.zgvwnoih5zjb73io@MacBook-Pro-3.local> Message-ID: Dear Shane, Thanks for your time. But I have to implement it by agglomerative clustering and cut it when each cluster has at least 40 data points. But I am not sure how to do cut it. I was guessing maybe it can be done by cutting the dandrogram? Is it correct? If so, I do not know how to apply it. Could you give me a point? Best, Ariani On Thu, Jul 6, 2017 at 12:32 PM, Shane Grigsby wrote: > This sounds like it may be a problem more amenable to either DBSCAN or > OPTICS. Both algorithms don't require a priori knowledge of the number of > clusters, and both let you specify a minimum point membership threshold for > cluster membership. The OPTICS algorithm will also produce a dendrogram > that you can cut for sub clusters if need be. > > DBSCAN is part of the stable release and has been for some time; OPTICS is > pending as a pull request, but it's stable and you can try it if you like: > > https://github.com/scikit-learn/scikit-learn/pull/1984 > > Cheers, > Shane > > > On 06/30, Ariani A wrote: > >> I want to perform agglomerative clustering, but I have no idea of number >> of >> clusters before hand. But I want that every cluster has at least 40 data >> points in it. How can I apply this to sklearn.agglomerative clustering? >> Should I use dendrogram and cut it somehow? 
I have no idea how to relate >> dendrogram to this and cutting it out. Any help will be appreciated! >> > > _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > > -- > *PhD candidate & Research Assistant* > *Cooperative Institute for Research in Environmental Sciences (CIRES)* > *University of Colorado at Boulder* > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From greina at eng.ucsd.edu Thu Jul 6 12:41:19 2017 From: greina at eng.ucsd.edu (G Reina) Date: Thu, 6 Jul 2017 09:41:19 -0700 Subject: [scikit-learn] Replacing the Boston Housing Prices dataset In-Reply-To: References: Message-ID: Wow. I completely disagree. The fact that too many tutorials and examples rely on it is not a reason to keep the dataset. New tutorials are written all the time. And, as sklearn evolves some of the existing tutorials will need to be updated anyway to keep up with the changes. Including "ethnicity" is completely illegal in making business decisions in the United States. For example, credit scoring systems bend over backward to expunge even proxy features that could be highly correlated with race (for example, they can't include neighborhood, but can include entire counties). Let's leave the studying of racism to actual scientists who study racism. Not to toy datasets that we use to teach our students about a completely unrelated matter like regression. -Tony On Thu, Jul 6, 2017 at 9:31 AM, Andreas Mueller wrote: > Hi Tony. > > I don't think it's a good idea to remove the dataset, given how many > tutorials and examples rely on it. > I also don't think it's a good idea to ignore racial discrimination, which > I guess this feature is trying to capture. 
> > I was recently asked to remove an excerpt from a dataset from my slide, as > it was "too racist". It was randomly sampled > data from the adult census dataset. Unfortunately, economics in the US are > not color blind (yet), and the reality is racist. > I haven't done an in-depth analysis on whether this feature is actually > informative, but I don't think your analysis is conclusive. > > Including ethnicity in data actually allows us to ensure "fairness" in > certain decision making processes. > Without collecting this data, it would be impossible to ensure automatic > decisions are not influenced > by past human biases. Arguably that's not what the authors of this dataset > are doing. > > Check out http://www.fatml.org/ for more on fairness in machine learning > and data science. > > Cheers, > Andy > > > > On 07/06/2017 12:05 PM, G Reina wrote: > > I'd like to request that the "Boston Housing Prices" dataset in sklearn > (sklearn.datasets.load_boston) be replaced with the "Ames Housing Prices" > dataset (https://ww2.amstat.org/publications/jse/v19n3/decock.pdf). I am > willing to submit the code change if the developers agree. > > The Boston dataset has the feature "Bk is the proportion of blacks in > town". It is an incredibly racist "feature" to include in any dataset. I > think is beneath us as data scientists. > > I submit that the Ames dataset is a viable alternative for learning > regression. The author has shown that the dataset is a more robust > replacement for Boston. Ames is a 2011 regression dataset on housing prices > and has more than 5 times the amount of training examples with over 7 times > as many features (none of which are morally questionable). > > I welcome the community's thoughts on the matter. > > Thanks. 
> -Tony > > Here's an article I wrote on the Boston dataset: > https://www.linkedin.com/pulse/hidden-racism-data- > science-g-anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_ > flagship3_feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D > > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From andrewholmes82 at icloud.com Thu Jul 6 12:19:49 2017 From: andrewholmes82 at icloud.com (Andrew Holmes) Date: Thu, 06 Jul 2017 17:19:49 +0100 Subject: [scikit-learn] Replacing the Boston Housing Prices dataset In-Reply-To: References: Message-ID: But how do social scientists do research into racism without including ethnicity as a feature in the data? Best wishes Andrew Public Profile > On 6 Jul 2017, at 17:05, G Reina wrote: > > I'd like to request that the "Boston Housing Prices" dataset in sklearn (sklearn.datasets.load_boston) be replaced with the "Ames Housing Prices" dataset (https://ww2.amstat.org/publications/jse/v19n3/decock.pdf ). I am willing to submit the code change if the developers agree. > > The Boston dataset has the feature "Bk is the proportion of blacks in town". It is an incredibly racist "feature" to include in any dataset. I think is beneath us as data scientists. > > I submit that the Ames dataset is a viable alternative for learning regression. The author has shown that the dataset is a more robust replacement for Boston. Ames is a 2011 regression dataset on housing prices and has more than 5 times the amount of training examples with over 7 times as many features (none of which are morally questionable). > > I welcome the community's thoughts on the matter. > > Thanks. 
> -Tony > > Here's an article I wrote on the Boston dataset: > https://www.linkedin.com/pulse/hidden-racism-data-science-g-anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From jeffrey.m.allard at gmail.com Thu Jul 6 13:38:02 2017 From: jeffrey.m.allard at gmail.com (jma) Date: Thu, 6 Jul 2017 13:38:02 -0400 Subject: [scikit-learn] Replacing the Boston Housing Prices dataset In-Reply-To: References: Message-ID: I work in the financial services industry and build machine learning models for marketing applications. We put an enormous effort (multiple layers of oversight and governance) into ensuring that our models are free of bias against protected classes etc. Having data describing race and ethnicity (among others) is extremely important to validate this is indeed the case. Without it, you have no such assurance. On 07/06/2017 12:19 PM, Andrew Holmes wrote: > But how do social scientists do research into racism without including > ethnicity as a feature in the data? > > Best wishes > Andrew > > Public Profile > > >> On 6 Jul 2017, at 17:05, G Reina > > wrote: >> >> I'd like to request that the "Boston Housing Prices" dataset in >> sklearn (sklearn.datasets.load_boston) be replaced with the "Ames >> Housing Prices" dataset >> (https://ww2.amstat.org/publications/jse/v19n3/decock.pdf). I am >> willing to submit the code change if the developers agree. >> >> The Boston dataset has the feature "Bk is the proportion of blacks in >> town". It is an incredibly racist "feature" to include in any >> dataset. I think is beneath us as data scientists. >> >> I submit that the Ames dataset is a viable alternative for learning >> regression. 
The author has shown that the dataset is a more robust >> replacement for Boston. Ames is a 2011 regression dataset on housing >> prices and has more than 5 times the amount of training examples with >> over 7 times as many features (none of which are morally questionable). >> >> I welcome the community's thoughts on the matter. >> >> Thanks. >> -Tony >> >> Here's an article I wrote on the Boston dataset: >> https://www.linkedin.com/pulse/hidden-racism-data-science-g-anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Thu Jul 6 14:09:10 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Thu, 6 Jul 2017 14:09:10 -0400 Subject: [scikit-learn] Replacing the Boston Housing Prices dataset In-Reply-To: References: Message-ID: <132fe6c2-a62f-fc72-0a95-c9fac7c440b3@gmail.com> On 07/06/2017 12:41 PM, G Reina wrote: > > The fact that too many tutorials and examples rely on it is not a > reason to keep the dataset. New tutorials are written all the time. > And, as sklearn evolves some of the existing tutorials will need to be > updated anyway to keep up with the changes. No, we try to avoid that as much as possible. Old examples should work for as long as possible, and we actively avoid breaking API unnecessarily. It's one of the core principles of scikit-learn development. And new tutorials can use any dataset they choose. We are working on including an openml fetcher, which allows using more datasets more easily. 
From sean.violante at gmail.com Thu Jul 6 15:08:33 2017 From: sean.violante at gmail.com (Sean Violante) Date: Thu, 6 Jul 2017 21:08:33 +0200 Subject: [scikit-learn] Replacing the Boston Housing Prices dataset In-Reply-To: References: Message-ID: G Reina you make a bizarre argument. You argue that you should not even check racism as a possible factor in house prices? But then you yourself check whether it's relevant. Then you say "but I'd argue that it's more due to the location (near water, near businesses, near restaurants, near parks and recreation) than to the ethnic makeup" Which was basically what the original authors wanted to show too, Harrison, D. and Rubinfeld, D.L. `Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. but unless you measure ethnic make-up you cannot show that it is not a confounder. The term "white flight" refers to affluent white families moving to the suburbs. And clearly a question is whether/how much was racism or avoiding air pollution. On 6 Jul 2017 6:10 pm, "G Reina" wrote: > I'd like to request that the "Boston Housing Prices" dataset in sklearn > (sklearn.datasets.load_boston) be replaced with the "Ames Housing Prices" > dataset (https://ww2.amstat.org/publications/jse/v19n3/decock.pdf). I am > willing to submit the code change if the developers agree. > > The Boston dataset has the feature "Bk is the proportion of blacks in > town". It is an incredibly racist "feature" to include in any dataset. I > think is beneath us as data scientists. > > I submit that the Ames dataset is a viable alternative for learning > regression. The author has shown that the dataset is a more robust > replacement for Boston. Ames is a 2011 regression dataset on housing prices > and has more than 5 times the amount of training examples with over 7 times > as many features (none of which are morally questionable). > > I welcome the community's thoughts on the matter. > > Thanks. 
> -Tony > > Here's an article I wrote on the Boston dataset: > https://www.linkedin.com/pulse/hidden-racism-data- > science-g-anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_ > flagship3_feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jtcunni at gmail.com Thu Jul 6 15:50:42 2017 From: jtcunni at gmail.com (jt cunni) Date: Thu, 6 Jul 2017 14:50:42 -0500 Subject: [scikit-learn] Moving average transformer In-Reply-To: References: Message-ID: First off, I have never contributed to anything before so please have patience with me. I am a data scientist and I have been doing some feature engineering on one of my datasets. In my code, I have a pipeline of several transformers and an estimator. I use my pipeline and randomizedsearchcv to tune my hyper-parameters and my transformer settings. Pretty standard stuff. One thing I was doing was creating a feature that was a moving average of another feature. In a basic example, imagine I want to predict if a team is going to win a baseball game. I create a feature that is the moving average of the last N games of runs scored per game (this is the window size of the moving average). Not knowing what the best window size for the moving average would be, I created a custom transformer that could be put in a pipeline to find the window size that provides the most lift. Is there any interest for this type of contribution? If so, what unittests or anything else do I need to provide? Thanks, Jeremy -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From rth.yurchak at gmail.com Thu Jul 6 15:59:48 2017 From: rth.yurchak at gmail.com (Roman Yurchak) Date: Thu, 6 Jul 2017 22:59:48 +0300 Subject: [scikit-learn] Construct the microclusters using a CF-Tree In-Reply-To: References: <75468a69-ba3a-ca8a-7b1c-b477f7d6f08e@gmail.com> Message-ID: Hello Sema, On 05/07/17 13:27, Sema Atasever wrote: > How can i know which cluster member represents best each cluster? You could try to pick the one that's closest to the cluster centroid.. > In the birch code i use this code line: *centroids = > brc.subcluster_centers_* > How do I interpret this line of code output? It is supposed to give you the centroid of each leaf node (computed in https://github.com/scikit-learn/scikit-learn/blob/ab93d65/sklearn/cluster/birch.py#L472). I would just recompute the centroid from the labels, though, with X[brc.labels_==k, :].mean(axis=0) for k in np.unique(brc.labels_) to be sure of the results... -- Roman From jmschreiber91 at gmail.com Thu Jul 6 16:03:41 2017 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Thu, 6 Jul 2017 13:03:41 -0700 Subject: [scikit-learn] Moving average transformer In-Reply-To: References: Message-ID: Hi Jeremy! Thanks for your offer to contribute. We're always looking for people to add good ideas to the package. Time series data can be tricky to handle appropriately, and so I think we generally try to pass it off to more specialized packages that focus on that. Andreas may have a more detailed perspective on this though. Jacob On Thu, Jul 6, 2017 at 12:50 PM, jt cunni wrote: > First off, I have never contributed to anything before so please have > patience with me. I am a data scientist and I have been working with doing > some feature engineering on one of my datasets. In my code, I have a > pipeline of several transformers and an estimator. I use my pipeline > and randomizedsearchcv to tune my hyper-parameters and my transformer > settings. Pretty standard stuff. 
One thing I was doing was creating a > feature that was a moving average of another feature. In a basic example, > imagine I want to predict if a team is going to win a baseball game. I > create a feature that is the moving average of the last N games of runs > scored per game (this is the window size of the moving average). Not > knowing what the best window size for the moving average, I created a > custom transformer that could be put in a pipeline to find the window size > that provides the most lift. Is there any interest for this type of > contribution? If so, what unittests or anything else do I need to provide? > > > > Thanks, > > Jeremy > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jmschreiber91 at gmail.com Thu Jul 6 16:34:51 2017 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Thu, 6 Jul 2017 13:34:51 -0700 Subject: [scikit-learn] Replacing the Boston Housing Prices dataset In-Reply-To: References: Message-ID: Hi Tony As others have pointed out, I think that you may be misunderstanding the purpose of that "feature." We are in agreement that discrimination against protected classes is not OK, and that even outside complying with the law one should avoid discrimination, in model building or elsewhere. However, I disagree that one does this by eliminating from all datasets any feature that may allude to these protected classes. As Andreas pointed out, there is a growing effort to ensure that machine learning models are fair and benefit the common good (such as FATML, DSSG, etc..), and from my understanding the general consensus isn't necessarily that simply eliminating the feature is sufficient. 
I think we are in agreement that naively learning a model over a feature set containing questionable features and calling it a day is not okay, but as others have pointed out, having these features present and handling them appropriately can help guard against the model implicitly learning unfair biases (even if they are not explicitly exposed to the feature). I would welcome the addition of the Ames dataset to the ones supported by sklearn, but I'm not convinced that the Boston dataset should be removed. As Andreas pointed out, there is a benefit to having canonical examples present so that beginners can easily follow along with the many tutorials that have been written using them. As Sean points out, the paper itself is trying to pull out the connection between house price and clean air in the presence of possible confounding variables. In a more general sense, saying that a feature shouldn't be there because a simple linear regression is unaffected by the results is a bit odd because it is very common for datasets to include irrelevant features, and handling them appropriately is important. In addition, one could argue that having this type of issue arise in a toy dataset has a benefit because it exposes these types of issues to those learning data science earlier on and allows them to keep these issues in mind in the future when the data is more serious. It is important for us all to keep issues of fairness in mind when it comes to data science. I'm glad that you're speaking out in favor of fairness and trying to bring attention to it. Jacob On Thu, Jul 6, 2017 at 12:08 PM, Sean Violante wrote: > G Reina > you make a bizarre argument. You argue that you should not even check > racism as a possible factor in house prices? 
> > But then you yourself check whether its relevant > Then you say > > "but I'd argue that it's more due to the location (near water, near > businesses, near restaurants, near parks and recreation) than to the ethnic > makeup" > > Which was basically what the original authors wanted to show too, > > Harrison, D. and Rubinfeld, D.L. `Hedonic prices and the demand for clean > air', J. Environ. Economics & Management, vol.5, 81-102, 1978. > > but unless you measure ethnic make-up you cannot show that it is not a > confounder. > > The term "white flight" refers to affluent white families moving to the > suburbs.. And clearly a question is whether/how much was racism or avoiding > air pollution. > > > > > > On 6 Jul 2017 6:10 pm, "G Reina" wrote: > >> I'd like to request that the "Boston Housing Prices" dataset in sklearn >> (sklearn.datasets.load_boston) be replaced with the "Ames Housing Prices" >> dataset (https://ww2.amstat.org/publications/jse/v19n3/decock.pdf). I am >> willing to submit the code change if the developers agree. >> >> The Boston dataset has the feature "Bk is the proportion of blacks in >> town". It is an incredibly racist "feature" to include in any dataset. I >> think is beneath us as data scientists. >> >> I submit that the Ames dataset is a viable alternative for learning >> regression. The author has shown that the dataset is a more robust >> replacement for Boston. Ames is a 2011 regression dataset on housing prices >> and has more than 5 times the amount of training examples with over 7 times >> as many features (none of which are morally questionable). >> >> I welcome the community's thoughts on the matter. >> >> Thanks. 
>> -Tony >> >> Here's an article I wrote on the Boston dataset: >> https://www.linkedin.com/pulse/hidden-racism-data-science-g- >> anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_ >> feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Thu Jul 6 18:33:42 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Fri, 7 Jul 2017 08:33:42 +1000 Subject: [scikit-learn] Moving average transformer In-Reply-To: References: Message-ID: I agree that this is best handled with a custom transformer, for the reasons cited by Jacob, but also because it sounds like this transformer does not gather statistics from the training data, and so can be implemented with FunctionTransformer On 7 Jul 2017 6:10 am, "Jacob Schreiber" wrote: Hi Jeremy! Thanks for your offer to contribute. We're always looking for people to add good ideas to the package. Time series data can be tricky to handle appropriately, and so I think we generally try to pass it off to more specialized packages that focus on that. Andreas may have a more detailed perspective on this though. Jacob On Thu, Jul 6, 2017 at 12:50 PM, jt cunni wrote: > First off, I have never contributed to anything before so please have > patience with me. I am a data scientist and I have been working with doing > some feature engineering on one of my datasets. In my code, I have a > pipeline of several transformers and an estimator. I use my pipeline > and randomizedsearchcv to tune my hyper-parameters and my transformer > settings. Pretty standard stuff. 
One thing I was doing was creating a > feature that was a moving average of another feature. In a basic example, > imagine I want to predict if a team is going to win a baseball game. I > create a feature that is the moving average of the last N games of runs > scored per game (this is the window size of the moving average). Not > knowing what the best window size for the moving average, I created a > custom transformer that could be put in a pipeline to find the window size > that provides the most lift. Is there any interest for this type of > contribution? If so, what unittests or anything else do I need to provide? > > > > Thanks, > > Jeremy > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From jni.soma at gmail.com Thu Jul 6 19:36:41 2017 From: jni.soma at gmail.com (Juan Nunez-Iglesias) Date: Fri, 7 Jul 2017 09:36:41 +1000 Subject: [scikit-learn] Replacing the Boston Housing Prices dataset In-Reply-To: References: Message-ID: For what it's worth: I'm sympathetic to the argument that you can't fix the problem if you don't measure it, but I agree with Tony that "many tutorials use it" is an extremely weak argument. We removed Lena from scikit-image because it was the right thing to do. I very much doubt that Boston house prices is in more widespread use than Lena was in image processing. You can argue about whether or not it's morally right or wrong to include the dataset. I see merit to both arguments. But "too many tutorials use it" is very similar in flavour to "the economy of the South would collapse without slavery." 
Regarding fair uses of the feature, I would hope that all sklearn tutorials using the dataset mention such uses. The potential for abuse and misinterpretation is enormous. On 7 Jul 2017, 6:36 AM +1000, Jacob Schreiber , wrote: > Hi Tony > > As others have pointed out, I think that you may be misunderstanding the purpose of that "feature." We are in agreement that discrimination against protected classes is not OK, and that even outside complying with the law one should avoid discrimination, in model building or elsewhere. However, I disagree that one does this by eliminating from all datasets any feature that may allude to these protected classes. As Andreas pointed out, there is a growing effort to ensure that machine learning models are fair and benefit the common good (such as FATML, DSSG, etc..), and from my understanding the general consensus isn't necessarily that simply eliminating the feature is sufficient. I think we are in agreement that naively learning a model over a feature set containing questionable features and calling it a day is not okay, but as others have pointed out, having these features present and handling them appropriately can help guard against the model implicitly learning unfair biases (even if they are not explicitly exposed to the feature). > > I would welcome the addition of the Ames dataset to the ones supported by sklearn, but I'm not convinced that the Boston dataset should be removed. As Andreas pointed out, there is a benefit to having canonical examples present so that beginners can easily follow along with the many tutorials that have been written using them. As Sean points out, the paper itself is trying to pull out the connection between house price and clean air in the presence of possible confounding variables. 
In a more general sense, saying that a feature shouldn't be there because a simple linear regression is unaffected by the results is a bit odd because it is very common for datasets to include irrelevant features, and handling them appropriately is important. In addition, one could argue that having this type of issue arise in a toy dataset has a benefit because it exposes these types of issues to those learning data science earlier on and allows them to keep these issues in mind in the future when the data is more serious. > > It is important for us all to keep issues of fairness in mind when it comes to data science. I'm glad that you're speaking out in favor of fairness and trying to bring attention to it. > > Jacob > > > On Thu, Jul 6, 2017 at 12:08 PM, Sean Violante wrote: > > > G Reina > > > you make a bizarre argument. You argue that you should not even check racism as a possible factor in house prices? > > > > > > But then you yourself check whether its relevant > > > Then you say > > > > > > "but I'd argue that it's more due to the location (near water, near businesses, near restaurants, near parks and recreation) than to the ethnic makeup" > > > > > > Which was basically what the original authors wanted to show too, > > > > > > Harrison, D. and Rubinfeld, D.L. `Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. > > > > > > but unless you measure ethnic make-up you cannot show that it is not a confounder. > > > > > > The term "white flight" refers to affluent white families moving to the suburbs.. And clearly a question is whether/how much was racism or avoiding air pollution. > > > > > > > > > > > > > > > > > > > On 6 Jul 2017 6:10 pm, "G Reina" wrote: > > > > > I'd like to request that the "Boston Housing Prices" dataset in sklearn (sklearn.datasets.load_boston) be replaced with the "Ames Housing Prices" dataset (https://ww2.amstat.org/publications/jse/v19n3/decock.pdf). 
I am willing to submit the code change if the developers agree. > > > > > > > > > > The Boston dataset has the feature "Bk is the proportion of blacks in town". It is an incredibly racist "feature" to include in any dataset. I think is beneath us as data scientists. > > > > > > > > > > I submit that the Ames dataset is a viable alternative for learning regression. The author has shown that the dataset is a more robust replacement for Boston. Ames is a 2011 regression dataset on housing prices and has more than 5 times the amount of training examples with over 7 times as many features (none of which are morally questionable). > > > > > > > > > > I welcome the community's thoughts on the matter. > > > > > > > > > > Thanks. > > > > > -Tony > > > > > > > > > > Here's an article I wrote on the Boston dataset: > > > > > https://www.linkedin.com/pulse/hidden-racism-data-science-g-anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D > > > > > > > > > > > > > > > _______________________________________________ > > > > > scikit-learn mailing list > > > > > scikit-learn at python.org > > > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From se.raschka at gmail.com Thu Jul 6 20:39:13 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Thu, 6 Jul 2017 20:39:13 -0400 Subject: [scikit-learn] Replacing the Boston Housing Prices dataset In-Reply-To: References: Message-ID: <61B34F59-142E-4851-9B27-7DC2A0C2DAF8@gmail.com> I think there can be some middle ground. 
I.e., adding a new, simple dataset to demonstrate regression (maybe auto-mpg, wine quality, or sth like that) and use that for the scikit-learn examples in the main documentation etc. but leave the Boston dataset in the code base for now. Whether it's a weak argument or not, it would be quite destructive to remove the dataset altogether in the next version or so, not only because old tutorials use it but many unit tests in many different projects depend on it. I think it might be better to phase it out by having a good alternative first, and I am sure that the scikit-learn maintainers wouldn't have anything against it if someone would update the examples/tutorials with the use of different datasets Best, Sebastian > On Jul 6, 2017, at 7:36 PM, Juan Nunez-Iglesias wrote: > > For what it's worth: I'm sympathetic to the argument that you can't fix the problem if you don't measure it, but I agree with Tony that "many tutorials use it" is an extremely weak argument. We removed Lena from scikit-image because it was the right thing to do. I very much doubt that Boston house prices is in more widespread use than Lena was in image processing. > > You can argue about whether or not it's morally right or wrong to include the dataset. I see merit to both arguments. But "too many tutorials use it" is very similar in flavour to "the economy of the South would collapse without slavery." > > Regarding fair uses of the feature, I would hope that all sklearn tutorials using the dataset mention such uses. The potential for abuse and misinterpretation is enormous. > > On 7 Jul 2017, 6:36 AM +1000, Jacob Schreiber , wrote: >> Hi Tony >> >> As others have pointed out, I think that you may be misunderstanding the purpose of that "feature." We are in agreement that discrimination against protected classes is not OK, and that even outside complying with the law one should avoid discrimination, in model building or elsewhere. 
However, I disagree that one does this by eliminating from all datasets any feature that may allude to these protected classes. As Andreas pointed out, there is a growing effort to ensure that machine learning models are fair and benefit the common good (such as FATML, DSSG, etc..), and from my understanding the general consensus isn't necessarily that simply eliminating the feature is sufficient. I think we are in agreement that naively learning a model over a feature set containing questionable features and calling it a day is not okay, but as others have pointed out, having these features present and handling them appropriately can help guard against the model implicitly learning unfair biases (even if they are not explicitly exposed to the feature). >> >> I would welcome the addition of the Ames dataset to the ones supported by sklearn, but I'm not convinced that the Boston dataset should be removed. As Andreas pointed out, there is a benefit to having canonical examples present so that beginners can easily follow along with the many tutorials that have been written using them. As Sean points out, the paper itself is trying to pull out the connection between house price and clean air in the presence of possible confounding variables. In a more general sense, saying that a feature shouldn't be there because a simple linear regression is unaffected by the results is a bit odd because it is very common for datasets to include irrelevant features, and handling them appropriately is important. In addition, one could argue that having this type of issue arise in a toy dataset has a benefit because it exposes these types of issues to those learning data science earlier on and allows them to keep these issues in mind in the future when the data is more serious. >> >> It is important for us all to keep issues of fairness in mind when it comes to data science. I'm glad that you're speaking out in favor of fairness and trying to bring attention to it. 
>> >> Jacob >> >> On Thu, Jul 6, 2017 at 12:08 PM, Sean Violante wrote: >> G Reina >> you make a bizarre argument. You argue that you should not even check racism as a possible factor in house prices? >> >> But then you yourself check whether its relevant >> Then you say >> >> "but I'd argue that it's more due to the location (near water, near businesses, near restaurants, near parks and recreation) than to the ethnic makeup" >> >> Which was basically what the original authors wanted to show too, >> >> Harrison, D. and Rubinfeld, D.L. `Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. >> >> but unless you measure ethnic make-up you cannot show that it is not a confounder. >> >> The term "white flight" refers to affluent white families moving to the suburbs.. And clearly a question is whether/how much was racism or avoiding air pollution. >> >> >> >> >> >> On 6 Jul 2017 6:10 pm, "G Reina" wrote: >> I'd like to request that the "Boston Housing Prices" dataset in sklearn (sklearn.datasets.load_boston) be replaced with the "Ames Housing Prices" dataset (https://ww2.amstat.org/publications/jse/v19n3/decock.pdf). I am willing to submit the code change if the developers agree. >> >> The Boston dataset has the feature "Bk is the proportion of blacks in town". It is an incredibly racist "feature" to include in any dataset. I think is beneath us as data scientists. >> >> I submit that the Ames dataset is a viable alternative for learning regression. The author has shown that the dataset is a more robust replacement for Boston. Ames is a 2011 regression dataset on housing prices and has more than 5 times the amount of training examples with over 7 times as many features (none of which are morally questionable). >> >> I welcome the community's thoughts on the matter. >> >> Thanks. 
>> -Tony >> >> Here's an article I wrote on the Boston dataset: >> https://www.linkedin.com/pulse/hidden-racism-data-science-g-anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From ross at cgl.ucsf.edu Thu Jul 6 21:00:49 2017 From: ross at cgl.ucsf.edu (Bill Ross) Date: Thu, 6 Jul 2017 18:00:49 -0700 Subject: [scikit-learn] Replacing the Boston Housing Prices dataset In-Reply-To: <61B34F59-142E-4851-9B27-7DC2A0C2DAF8@gmail.com> References: <61B34F59-142E-4851-9B27-7DC2A0C2DAF8@gmail.com> Message-ID: <32b9ea32-b5dc-dfbe-04ca-36e8db30160e@cgl.ucsf.edu> Unless the data concretely promotes discrimination, it seems discriminatory to exclude it. Bill On 7/6/17 5:39 PM, Sebastian Raschka wrote: > I think there can be some middle ground. I.e., adding a new, simple dataset to demonstrate regression (maybe autmpg, wine quality, or sth like that) and use that for the scikit-learn examples in the main documentation etc but leave the boston dataset in the code base for now. Whether it's a weak argument or not, it would be quite destructive to remove the dataset altogether in the next version or so, not only because old tutorials use it but many unit tests in many different projects depend on it. 
I think it might be better to phase it out by having a good alternative first, and I am sure that the scikit-learn maintainers wouldn't have anything against it if someone would update the examples/tutorials with the use of different datasets > > Best, > Sebastian > >> On Jul 6, 2017, at 7:36 PM, Juan Nunez-Iglesias wrote: >> >> For what it's worth: I'm sympathetic to the argument that you can't fix the problem if you don't measure it, but I agree with Tony that "many tutorials use it" is an extremely weak argument. We removed Lena from scikit-image because it was the right thing to do. I very much doubt that Boston house prices is in more widespread use than Lena was in image processing. >> >> You can argue about whether or not it's morally right or wrong to include the dataset. I see merit to both arguments. But "too many tutorials use it" is very similar in flavour to "the economy of the South would collapse without slavery." >> >> Regarding fair uses of the feature, I would hope that all sklearn tutorials using the dataset mention such uses. The potential for abuse and misinterpretation is enormous. >> >> On 7 Jul 2017, 6:36 AM +1000, Jacob Schreiber , wrote: >>> Hi Tony >>> >>> As others have pointed out, I think that you may be misunderstanding the purpose of that "feature." We are in agreement that discrimination against protected classes is not OK, and that even outside complying with the law one should avoid discrimination, in model building or elsewhere. However, I disagree that one does this by eliminating from all datasets any feature that may allude to these protected classes. As Andreas pointed out, there is a growing effort to ensure that machine learning models are fair and benefit the common good (such as FATML, DSSG, etc..), and from my understanding the general consensus isn't necessarily that simply eliminating the feature is sufficient. 
I think we are in agreement that naively learning a model over a feature set containing questionable features and calling it a day is not okay, but as others have pointed out, having these features present and handling them appropriately can help guard against the model implicitly learning unfair biases (even if they are not explicitly exposed to the feature). >>> I would welcome the addition of the Ames dataset to the ones supported by sklearn, but I'm not convinced that the Boston dataset should be removed. As Andreas pointed out, there is a benefit to having canonical examples present so that beginners can easily follow along with the many tutorials that have been written using them. As Sean points out, the paper itself is trying to pull out the connection between house price and clean air in the presence of possible confounding variables. In a more general sense, saying that a feature shouldn't be there because a simple linear regression is unaffected by the results is a bit odd because it is very common for datasets to include irrelevant features, and handling them appropriately is important. In addition, one could argue that having this type of issue arise in a toy dataset has a benefit because it exposes these types of issues to those learning data science earlier on and allows them to keep these issues in mind in the future when the data is more serious. >>> It is important for us all to keep issues of fairness in mind when it comes to data science. I'm glad that you're speaking out in favor of fairness and trying to bring attention to it. >>> >>> Jacob >>> >>> On Thu, Jul 6, 2017 at 12:08 PM, Sean Violante wrote: >>> G Reina >>> you make a bizarre argument. You argue that you should not even check racism as a possible factor in house prices? 
>>> >>> But then you yourself check whether its relevant >>> Then you say >>> >>> "but I'd argue that it's more due to the location (near water, near businesses, near restaurants, near parks and recreation) than to the ethnic makeup" >>> >>> Which was basically what the original authors wanted to show too, >>> >>> Harrison, D. and Rubinfeld, D.L. `Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. >>> >>> but unless you measure ethnic make-up you cannot show that it is not a confounder. >>> >>> The term "white flight" refers to affluent white families moving to the suburbs.. And clearly a question is whether/how much was racism or avoiding air pollution. >>> >>> >>> >>> >>> >>> On 6 Jul 2017 6:10 pm, "G Reina" wrote: >>> I'd like to request that the "Boston Housing Prices" dataset in sklearn (sklearn.datasets.load_boston) be replaced with the "Ames Housing Prices" dataset (https://ww2.amstat.org/publications/jse/v19n3/decock.pdf). I am willing to submit the code change if the developers agree. >>> >>> The Boston dataset has the feature "Bk is the proportion of blacks in town". It is an incredibly racist "feature" to include in any dataset. I think is beneath us as data scientists. >>> >>> I submit that the Ames dataset is a viable alternative for learning regression. The author has shown that the dataset is a more robust replacement for Boston. Ames is a 2011 regression dataset on housing prices and has more than 5 times the amount of training examples with over 7 times as many features (none of which are morally questionable). >>> >>> I welcome the community's thoughts on the matter. >>> >>> Thanks. 
>>> -Tony >>> >>> Here's an article I wrote on the Boston dataset: >>> https://www.linkedin.com/pulse/hidden-racism-data-science-g-anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From gael.varoquaux at normalesup.org Fri Jul 7 01:35:05 2017 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Fri, 7 Jul 2017 07:35:05 +0200 Subject: [scikit-learn] Replacing the Boston Housing Prices dataset In-Reply-To: References: Message-ID: <20170707053505.GR2257694@phare.normalesup.org> Many people gave great points in this thread, in particular Jacob's well written email. Andy's point about tutorials is an important one. I don't resonate at all with Juan's message. Breaking people's code, even if it is the notes that they use to give a lecture, is a real cost for them. The cost varies on a case to case basis. But there are still books printed out there that demo image processing on Lena, and these will be out for decades. 
More importantly, the replacement of Lena used in scipy (the raccoon) does not make it possible to demonstrate denoising properly (Lena has smooth regions with details in the middle: the eyes), or segmentation. In effect, it has made the examples for the ecosystem less convincing. Of course, by definition, refusing to change anything implies that unfortunate situations, such as discriminatory biases, cannot be fixed. This is why changes should be considered on a case-to-case basis. The problem that we are facing here is that a dataset about society, the Boston housing dataset, can reveal discrimination. However, this is true of all data about society. The classic adult data (extracted from the American census) easily reveals income discrimination. I teach statistics with an IQ dataset where it is easy to show a male vs female IQ difference. This difference disappears after controlling for education (and the purpose of my course is to teach people to control for confounding effects). Data about society reveals its inequalities. Not working on such data is hiding problems, not fixing them. It is true that misuse of such data can attempt to establish inequalities as facts of life and get them accepted. When discussing these issues, we need to educate people about how to run and interpret analyses. No, the Boston data will not go. No, it is not a good thing to pretend that social problems do not exist. Gaël On Fri, Jul 07, 2017 at 09:36:41AM +1000, Juan Nunez-Iglesias wrote: > For what it's worth: I'm sympathetic to the argument that you can't fix the > problem if you don't measure it, but I agree with Tony that "many tutorials use > it" is an extremely weak argument. We removed Lena from scikit-image because it > was the right thing to do. I very much doubt that Boston house prices is in > more widespread use than Lena was in image processing. > You can argue about whether or not it's morally right or wrong to include the > dataset. I see merit to both arguments. 
But "too many tutorials use it" is very > similar in flavour to "the economy of the South would collapse without > slavery." > Regarding fair uses of the feature, I would hope that all sklearn tutorials > using the dataset mention such uses. The potential for abuse and > misinterpretation is enormous. > On 7 Jul 2017, 6:36 AM +1000, Jacob Schreiber , wrote: > Hi Tony > As others have pointed out, I think that you may be misunderstanding the > purpose of that "feature." We are in agreement that discrimination against > protected classes is not OK, and that even outside complying with the law > one should avoid discrimination, in model building or elsewhere. However, I > disagree that one does this by eliminating from all datasets any feature > that may allude to these protected classes. As Andreas pointed out, there > is a growing effort to ensure that machine learning models are fair and > benefit the common good (such as FATML, DSSG, etc..), and from my > understanding the general consensus isn't necessarily that simply > eliminating the feature is sufficient. I think we are in agreement that > naively learning a model over a feature set containing questionable > features and calling it a day is not okay, but as others have pointed out, > having these features present and handling them appropriately can help > guard against the model implicitly learning unfair biases (even if they are > not explicitly exposed to the feature). > I would welcome the addition of the Ames dataset to the ones supported by > sklearn, but I'm not convinced that the Boston dataset should be removed. > As Andreas pointed out, there is a benefit to having canonical examples > present so that beginners can easily follow along with the many tutorials > that have been written using them. As Sean points out, the paper itself is > trying to pull out the connection between house price and clean air in the > presence of possible confounding variables. 
In a more general sense, saying > that a feature shouldn't be there because a simple linear regression is > unaffected by the results is a bit odd because it is very common for > datasets to include irrelevant features, and handling them appropriately is > important. In addition, one could argue that having this type of issue > arise in a toy dataset has a benefit because it exposes these types of > issues to those learning data science earlier on and allows them to keep > these issues in mind in the future when the data is more serious. > It is important for us all to keep issues of fairness in mind when it comes > to data science. I'm glad that you're speaking out in favor of fairness and > trying to bring attention to it. > Jacob > On Thu, Jul 6, 2017 at 12:08 PM, Sean Violante > wrote: > G Reina > you make a bizarre argument. You argue that you should not even check > racism as a possible factor in house prices? > But then you yourself check whether its relevant > Then you say > "but I'd argue that it's more due to the location (near water, near > businesses, near restaurants, near parks and recreation) than to the > ethnic makeup" > Which was basically what the original authors wanted to show too, > Harrison, D. and Rubinfeld, D.L. `Hedonic prices and the demand for > clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. > but unless you measure ethnic make-up you cannot show that it is not a > confounder. > The term "white flight" refers to affluent white families moving to the > suburbs.. And clearly a question is whether/how much was racism or > avoiding air pollution. > On 6 Jul 2017 6:10 pm, "G Reina" wrote: > I'd like to request that the "Boston Housing Prices" dataset in > sklearn (sklearn.datasets.load_boston) be replaced with the "Ames > Housing Prices" dataset (https://ww2.amstat.org/publications/jse/ > v19n3/decock.pdf). I am willing to submit the code change if the > developers agree. 
> The Boston dataset has the feature "Bk is the proportion of blacks > in town". It is an incredibly racist "feature" to include in any > dataset. I think is beneath us as data scientists. > I submit that the Ames dataset is a viable alternative for learning > regression. The author has shown that the dataset is a more robust > replacement for Boston. Ames is a 2011 regression dataset on > housing prices and has more than 5 times the amount of training > examples with over 7 times as many features (none of which are > morally questionable). > I welcome the community's thoughts on the matter. > Thanks. > -Tony > Here's an article I wrote on the Boston dataset: > https://www.linkedin.com/pulse/hidden-racism-data-science-g- > anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_ > feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From olivier.grisel at ensta.org Fri Jul 7 09:24:32 2017 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Fri, 7 Jul 2017 15:24:32 +0200 Subject: [scikit-learn] Which algorithm is used in sklearn SGDClassifier when modified huber loss is used? 
In-Reply-To: References: Message-ID: The name of the algorithm / model would be "L2-penalized linear model with modified Huber loss trained with Stochastic Gradient Descent". SVM is traditionally used to describe models that use the hinge loss only (or sometimes the squared hinge loss too). Only the log loss leads to a probabilistic linear binary classifier in scikit-learn. -- Olivier From b.noushin7 at gmail.com Fri Jul 7 12:18:34 2017 From: b.noushin7 at gmail.com (Ariani A) Date: Fri, 7 Jul 2017 12:18:34 -0400 Subject: [scikit-learn] Help with NLP Message-ID: Dear all, I need urgent help with NLP; do you happen to know anyone who knows nltk or NLP modules? Has anybody read this paper? "Template-Based Information Extraction without the Templates." I am looking forward to hearing from you soon! Best, -Ariani -------------- next part -------------- An HTML attachment was scrubbed... URL: From noflaco at gmail.com Fri Jul 7 12:23:16 2017 From: noflaco at gmail.com (Carlton Banks) Date: Fri, 7 Jul 2017 18:23:16 +0200 Subject: [scikit-learn] Help with NLP In-Reply-To: References: Message-ID: <1694DCBE-443C-4EB0-B2F5-2A0FCC67D5FB@gmail.com> NLP as in Natural language processing? > Den 7. jul. 2017 kl. 18.18 skrev Ariani A : > > Dear all, > I need an urgent help with NLP, do you happen to know anyone who knows nltk or NLP modules? Have anybody of you read this paper? > "Template-Based Information Extraction without the Templates." > I am looking forward to hearirng from you soon! > Best, > -Ariani > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... 
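[Editor's note: the model Olivier describes above can be written down in a few lines. This is an illustrative sketch, not part of the original email; the toy dataset and parameters are arbitrary.]

```python
# Illustrative sketch (not from the original email): an L2-penalized linear
# classifier with the modified Huber loss, trained with stochastic gradient
# descent, as described in the reply above. The toy dataset is arbitrary.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# A small synthetic binary classification problem, purely for illustration.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

clf = SGDClassifier(loss="modified_huber", penalty="l2", random_state=0)
clf.fit(X, y)

# The fitted object is a plain linear model: one weight per feature.
print(clf.coef_.shape)  # → (1, 20) for a binary problem
```

As Olivier notes, calling this an "SVM" would be a misnomer, since that term conventionally implies the hinge loss.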
URL: From b.noushin7 at gmail.com Fri Jul 7 12:24:22 2017 From: b.noushin7 at gmail.com (Ariani A) Date: Fri, 7 Jul 2017 12:24:22 -0400 Subject: [scikit-learn] Help with NLP In-Reply-To: <1694DCBE-443C-4EB0-B2F5-2A0FCC67D5FB@gmail.com> References: <1694DCBE-443C-4EB0-B2F5-2A0FCC67D5FB@gmail.com> Message-ID: Yes, it is. Regards On Fri, Jul 7, 2017 at 12:23 PM, Carlton Banks wrote: > NLP as is Natural language processing? > > Den 7. jul. 2017 kl. 18.18 skrev Ariani A : > > Dear all, > I need an urgent help with NLP, do you happen to know anyone who knows > nltk or NLP modules? Have anybody of you read this paper? > "Template-Based Information Extraction without the Templates." > I am looking forward to hearirng from you soon! > Best, > -Ariani > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From noflaco at gmail.com Fri Jul 7 12:50:58 2017 From: noflaco at gmail.com (Carlton Banks) Date: Fri, 7 Jul 2017 18:50:58 +0200 Subject: [scikit-learn] Help with NLP In-Reply-To: References: <1694DCBE-443C-4EB0-B2F5-2A0FCC67D5FB@gmail.com> Message-ID: I am still not sure I quite understand. What aspect of NLP are you involved in? Speech recognition? > Den 7. jul. 2017 kl. 18.24 skrev Ariani A : > > Yes , it is. > regards > > On Fri, Jul 7, 2017 at 12:23 PM, Carlton Banks > wrote: > NLP as is Natural language processing? > >> Den 7. jul. 2017 kl. 18.18 skrev Ariani A >: >> >> Dear all, >> I need an urgent help with NLP, do you happen to know anyone who knows nltk or NLP modules? Have anybody of you read this paper? >> "Template-Based Information Extraction without the Templates." 
>> I am looking forward to hearirng from you soon! >> Best, >> -Ariani >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From jmschreiber91 at gmail.com Fri Jul 7 12:52:15 2017 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Fri, 07 Jul 2017 16:52:15 +0000 Subject: [scikit-learn] Help with NLP In-Reply-To: References: <1694DCBE-443C-4EB0-B2F5-2A0FCC67D5FB@gmail.com> Message-ID: The scikit-learn mailing list is probably not the best place to be asking for help with another module. On Fri, Jul 7, 2017 at 9:28 AM Ariani A wrote: > Yes , it is. > regards > > On Fri, Jul 7, 2017 at 12:23 PM, Carlton Banks wrote: > >> NLP as is Natural language processing? >> >> Den 7. jul. 2017 kl. 18.18 skrev Ariani A : >> >> Dear all, >> I need an urgent help with NLP, do you happen to know anyone who knows >> nltk or NLP modules? Have anybody of you read this paper? >> "Template-Based Information Extraction without the Templates." >> I am looking forward to hearirng from you soon! 
>> Best, >> -Ariani >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From b.noushin7 at gmail.com Fri Jul 7 13:13:28 2017 From: b.noushin7 at gmail.com (Ariani A) Date: Fri, 7 Jul 2017 13:13:28 -0400 Subject: [scikit-learn] Help with NLP In-Reply-To: References: <1694DCBE-443C-4EB0-B2F5-2A0FCC67D5FB@gmail.com> Message-ID: Dear Jacob, I know, but I am just asking to get help! @Carlton, I want to do text processing; can I email you directly so as not to bother the others? Best, -Ariani On Fri, Jul 7, 2017 at 12:52 PM, Jacob Schreiber wrote: > The scikit-learn mailing list is probably not the best place to be asking > for help with another module. > > On Fri, Jul 7, 2017 at 9:28 AM Ariani A wrote: > >> Yes , it is. >> regards >> >> On Fri, Jul 7, 2017 at 12:23 PM, Carlton Banks wrote: >> >>> NLP as is Natural language processing? >>> >>> Den 7. jul. 2017 kl. 18.18 skrev Ariani A : >>> >>> Dear all, >>> I need an urgent help with NLP, do you happen to know anyone who knows >>> nltk or NLP modules? Have anybody of you read this paper? >>> "Template-Based Information Extraction without the Templates." >>> I am looking forward to hearirng from you soon! 
>>> Best, >>> -Ariani >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From olivier.grisel at ensta.org Fri Jul 7 14:51:26 2017 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Fri, 7 Jul 2017 20:51:26 +0200 Subject: [scikit-learn] Help with NLP In-Reply-To: References: <1694DCBE-443C-4EB0-B2F5-2A0FCC67D5FB@gmail.com> Message-ID: Please use this mailing list only for questions targeted at scikit-learn. Otherwise, you would be better off asking a specific question on an NLP or data science community platform such as: https://datascience.stackexchange.com/questions/tagged/nlp or, if you have a programming-related question about NLTK or a related library: https://stackoverflow.com/questions/tagged/nlp Also, in any case, I would advise you not to ask for "urgent help". Instead, ask specific questions. Otherwise you are unlikely to ever get a useful answer to your question. If you do not know where to start, read tutorials and introductory books on NLTK or NLP in general instead. 
-- Olivier From jni.soma at gmail.com Fri Jul 7 22:44:38 2017 From: jni.soma at gmail.com (Juan Nunez-Iglesias) Date: Sat, 8 Jul 2017 12:44:38 +1000 Subject: [scikit-learn] Replacing the Boston Housing Prices dataset In-Reply-To: <20170707053505.GR2257694@phare.normalesup.org> References: <20170707053505.GR2257694@phare.normalesup.org> Message-ID: <79f8556c-d6ea-4bed-8f5a-3f4d1a5bda3e@Spark> Just to clarify a couple of things about my position. First, thanks Gaël for a thoughtful response. I fully respect your decision to keep the Boston dataset, and I agree that it can be a useful "teaching moment." (As I suggested in my earlier post.) With regards to breaking tutorials, however, I totally disagree. The whole value of tutorials is that they teach general principles, not analysis of specific datasets. Changing a tutorial dataset is thus different from changing an API. This isn't the right forum for a discussion about the ethics of the Lena image, so I won't go into that, but to suggest that it is a uniquely effective picture, the natural image equivalent of a standard test pattern, is ludicrous. Maybe the replacement wasn't as good, but that is a criticism of the choice of replacement, not of the decision to replace it. There clearly exist millions or billions of images with similarly good teaching characteristics. Finally, yes, removing and deprecating datasets incurs (and inflicts) a real cost, but cost should be at best a minor consideration when dealing with ethical questions. History, and daily life, are replete with unethical decisions made under the excuse that it would cost too much to do what's right. Ultimately the costs are usually found to have been exaggerated. With regards to this dataset, I cede the argument to maintainers, contributors, and users of the dataset, but I will point out that none of the existing tutorials in the library mention this feature, let alone address the ethics of it. 
The DESCR field mentions it entirely nonchalantly, like it is a natural thing to want to measure if one wants to predict house prices. I think I would certainly have a WTF moment, at least, if I was a black student reading through that description. Juan. On 7 Jul 2017, 3:36 PM +1000, Gael Varoquaux , wrote: > Many people gave great points in this thread, in particular Jacob's well > written email. > > Andy's point about tutorials is an important one. I don't resonate at > all with Juan's message. Breaking people's code, even if it is the notes > that they use to give a lecture, is a real cost for them. The cost varies > on a case to case basis. But there are still books printed out there > that demo image processing on Lena, and these will be out for decades. > More importantly, the replacement of Lena used in scipy (the raccoon) > does not allow to demonstrate denoising properly (Lena has smooth regions > with details in the middle: the eyes), or segmentation. In effect, it has > made the examples for the ecosystem less convincing. > > > Of course, by definition, refusing to change anything implies that > unfortunate situations, such as discriminatory biases, cannot be fixed. > This is why changes should be considered on a case-to-case basis. > > The problem that we are facing here is that a dataset about society, the > Boston housing dataset, can reveal discrimination. However, this is true > of every data about society. The classic adult data (extracted from the > American census) easily reveals income discrimination. I teach statistics > with an IQ dataset where it is easy to show a male vs female IQ > difference. This difference disappears after controlling for education > (and the purpose of my course is to teach people to control for > confounding effects). > > Data about society reveals its inequalities. Not working on such data is > hiding problems, not fixing them. 
It is true that misuse of such data can > attempt to establish inequalities as facts of life and get them accepted. > When discussing these issues, we need to educate people about how to run > and interpret analyses. > > > No the Boston data will not go. No it is not a good thing to pretend that > social problems do not exist. > > > Ga?l > > On Fri, Jul 07, 2017 at 09:36:41AM +1000, Juan Nunez-Iglesias wrote: > > For what it's worth: I'm sympathetic to the argument that you can't fix the > > problem if you don't measure it, but I agree with Tony that "many tutorials use > > it" is an extremely weak argument. We removed Lena from scikit-image because it > > was the right thing to do. I very much doubt that Boston house prices is in > > more widespread use than Lena was in image processing. > > > You can argue about whether or not it's morally right or wrong to include the > > dataset. I see merit to both arguments. But "too many tutorials use it" is very > > similar in flavour to "the economy of the South would collapse without > > slavery." > > > Regarding fair uses of the feature, I would hope that all sklearn tutorials > > using the dataset mention such uses. The potential for abuse and > > misinterpretation is enormous. > > > On 7 Jul 2017, 6:36 AM +1000, Jacob Schreiber , wrote: > > > Hi Tony > > > As others have pointed out, I think that you may be misunderstanding the > > purpose of that "feature." We are in agreement that discrimination against > > protected classes is not OK, and that even outside complying with the law > > one should avoid discrimination, in model building or elsewhere. However, I > > disagree that one does this by eliminating from all datasets any feature > > that may allude to these protected classes. 
As Andreas pointed out, there > > is a growing effort to ensure that machine learning models are fair and > > benefit the common good (such as FATML, DSSG, etc..), and from my > > understanding the general consensus isn't necessarily that simply > > eliminating the feature is sufficient. I think we are in agreement that > > naively learning a model over a feature set containing questionable > > features and calling it a day is not okay, but as others have pointed out, > > having these features present and handling them appropriately can help > > guard against the model implicitly learning unfair biases (even if they are > > not explicitly exposed to the feature). > > > I would welcome the addition of the Ames dataset to the ones supported by > > sklearn, but I'm not convinced that the Boston dataset should be removed. > > As Andreas pointed out, there is a benefit to having canonical examples > > present so that beginners can easily follow along with the many tutorials > > that have been written using them. As Sean points out, the paper itself is > > trying to pull out the connection between house price and clean air in the > > presence of possible confounding variables. In a more general sense, saying > > that a feature shouldn't be there because a simple linear regression is > > unaffected by the results is a bit odd because it is very common for > > datasets to include irrelevant features, and handling them appropriately is > > important. In addition, one could argue that having this type of issue > > arise in a toy dataset has a benefit because it exposes these types of > > issues to those learning data science earlier on and allows them to keep > > these issues in mind in the future when the data is more serious. > > > It is important for us all to keep issues of fairness in mind when it comes > > to data science. I'm glad that you're speaking out in favor of fairness and > > trying to bring attention to it. 
> > > Jacob > > > On Thu, Jul 6, 2017 at 12:08 PM, Sean Violante > wrote: > > > G Reina > > you make a bizarre argument. You argue that you should not even check > > racism as a possible factor in house prices? > > > But then you yourself check whether its relevant > > Then you say > > > "but I'd argue that it's more due to the location (near water, near > > businesses, near restaurants, near parks and recreation) than to the > > ethnic makeup" > > > Which was basically what the original authors wanted to show too, > > > Harrison, D. and Rubinfeld, D.L. `Hedonic prices and the demand for > > clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. > > > but unless you measure ethnic make-up you cannot show that it is not a > > confounder. > > > The term "white flight" refers to affluent white families moving to the > > suburbs.. And clearly a question is whether/how much was racism or > > avoiding air pollution. > > > > > > > On 6 Jul 2017 6:10 pm, "G Reina" wrote: > > > I'd like to request that the "Boston Housing Prices" dataset in > > sklearn (sklearn.datasets.load_boston) be replaced with the "Ames > > Housing Prices" dataset (https://ww2.amstat.org/publications/jse/ > > v19n3/decock.pdf). I am willing to submit the code change if the > > developers agree. > > > The Boston dataset has the feature "Bk is the proportion of blacks > > in town". It is an incredibly racist "feature" to include in any > > dataset. I think is beneath us as data scientists. > > > I submit that the Ames dataset is a viable alternative for learning > > regression. The author has shown that the dataset is a more robust > > replacement for Boston. Ames is a 2011 regression dataset on > > housing prices and has more than 5 times the amount of training > > examples with over 7 times as many features (none of which are > > morally questionable). > > > I welcome the community's thoughts on the matter. > > > Thanks. 
> > -Tony > > > Here's an article I wrote on the Boston dataset: > > https://www.linkedin.com/pulse/hidden-racism-data-science-g- > > anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_ > > feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > -- > Gael Varoquaux > Researcher, INRIA Parietal > NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France > Phone: ++ 33-1-69-08-79-68 > http://gael-varoquaux.info http://twitter.com/GaelVaroquaux > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From jmschreiber91 at gmail.com Sat Jul 8 00:26:43 2017 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Fri, 7 Jul 2017 21:26:43 -0700 Subject: [scikit-learn] Replacing the Boston Housing Prices dataset In-Reply-To: <79f8556c-d6ea-4bed-8f5a-3f4d1a5bda3e@Spark> References: <20170707053505.GR2257694@phare.normalesup.org> <79f8556c-d6ea-4bed-8f5a-3f4d1a5bda3e@Spark> Message-ID: We would welcome a pull request amending the documentation to include a neutral discussion of the issues you've brought up. 
Optimally, it would include many of the points brought up in this discussion as to why it was ultimately kept despite the issues being raised. On Fri, Jul 7, 2017 at 7:44 PM, Juan Nunez-Iglesias wrote: > Just to clarify a couple of things about my position. > > First, thanks Ga?l for a thoughtful response. I fully respect your > decision to keep the Boston dataset, and I agree that it can be a useful > "teaching moment." (As I suggested in my earlier post.) > > With regards to breaking tutorials, however, I totally disagree. The whole > value of tutorials is that they teach general principles, not analysis of > specific datasets. Changing a tutorial dataset is thus different from > changing an API. This isn't the right forum for a discussion about the > ethics of the Lena image, so I won't go into that, but to suggest that it > is a uniquely effective picture, the natural image equivalent of a standard > test pattern, is ludicrous. Maybe the replacement wasn't as good, but that > is a criticism of the choice of replacement, not of the decision to replace > it. There clearly exist millions or billions of images with similarly good > teaching characteristics. > > Finally, yes, removing and deprecating datasets incurs (and inflicts) a > real cost, but cost should be at best a minor consideration when dealing > with ethical questions. History, and daily life, are replete with unethical > decisions made under the excuse that it would cost too much to do what's > right. Ultimately the costs are usually found to have been exaggerated. > > With regards to this dataset, I cede the argument to maintainers, > contributors, and users of the dataset, but I will point out that none of > the existing tutorials > > in the library mention this feature, let alone addresses the ethics of it. > The DESCR field mentions it entirely nonchalantly, like it is a natural > thing to want to measure if one wants to predict house prices. 
I think I > would certainly have a WTF moment, at least, if I was a black student > reading through that description. > > Juan. > > On 7 Jul 2017, 3:36 PM +1000, Gael Varoquaux < > gael.varoquaux at normalesup.org>, wrote: > > Many people gave great points in this thread, in particular Jacob's well > written email. > > Andy's point about tutorials is an important one. I don't resonate at > all with Juan's message. Breaking people's code, even if it is the notes > that they use to give a lecture, is a real cost for them. The cost varies > on a case to case basis. But there are still books printed out there > that demo image processing on Lena, and these will be out for decades. > More importantly, the replacement of Lena used in scipy (the raccoon) > does not allow to demonstrate denoising properly (Lena has smooth regions > with details in the middle: the eyes), or segmentation. In effect, it has > made the examples for the ecosystem less convincing. > > > Of course, by definition, refusing to change anything implies that > unfortunate situations, such as discriminatory biases, cannot be fixed. > This is why changes should be considered on a case-to-case basis. > > The problem that we are facing here is that a dataset about society, the > Boston housing dataset, can reveal discrimination. However, this is true > of every data about society. The classic adult data (extracted from the > American census) easily reveals income discrimination. I teach statistics > with an IQ dataset where it is easy to show a male vs female IQ > difference. This difference disappears after controlling for education > (and the purpose of my course is to teach people to control for > confounding effects). > > Data about society reveals its inequalities. Not working on such data is > hiding problems, not fixing them. It is true that misuse of such data can > attempt to establish inequalities as facts of life and get them accepted. 
> When discussing these issues, we need to educate people about how to run > and interpret analyses. > > > No, the Boston data will not go. No, it is not a good thing to pretend that > social problems do not exist. > > > Gaël > > On Fri, Jul 07, 2017 at 09:36:41AM +1000, Juan Nunez-Iglesias wrote: > > For what it's worth: I'm sympathetic to the argument that you can't fix the > problem if you don't measure it, but I agree with Tony that "many > tutorials use > it" is an extremely weak argument. We removed Lena from scikit-image > because it > was the right thing to do. I very much doubt that Boston house prices is in > more widespread use than Lena was in image processing. > > > You can argue about whether or not it's morally right or wrong to include > the > dataset. I see merit to both arguments. But "too many tutorials use it" is > very > similar in flavour to "the economy of the South would collapse without > slavery." > > > Regarding fair uses of the feature, I would hope that all sklearn tutorials > using the dataset mention such uses. The potential for abuse and > misinterpretation is enormous. > > > On 7 Jul 2017, 6:36 AM +1000, Jacob Schreiber , > wrote: > > > Hi Tony > > > As others have pointed out, I think that you may be misunderstanding the > purpose of that "feature." We are in agreement that discrimination against > protected classes is not OK, and that even outside complying with the law > one should avoid discrimination, in model building or elsewhere. However, I > disagree that one does this by eliminating from all datasets any feature > that may allude to these protected classes. As Andreas pointed out, there > is a growing effort to ensure that machine learning models are fair and > benefit the common good (such as FATML, DSSG, etc.), and from my > understanding the general consensus isn't necessarily that simply > eliminating the feature is sufficient.
I think we are in agreement that > naively learning a model over a feature set containing questionable > features and calling it a day is not okay, but as others have pointed out, > having these features present and handling them appropriately can help > guard against the model implicitly learning unfair biases (even if they are > not explicitly exposed to the feature). > > > I would welcome the addition of the Ames dataset to the ones supported by > sklearn, but I'm not convinced that the Boston dataset should be removed. > As Andreas pointed out, there is a benefit to having canonical examples > present so that beginners can easily follow along with the many tutorials > that have been written using them. As Sean points out, the paper itself is > trying to pull out the connection between house price and clean air in the > presence of possible confounding variables. In a more general sense, saying > that a feature shouldn't be there because a simple linear regression is > unaffected by the results is a bit odd because it is very common for > datasets to include irrelevant features, and handling them appropriately is > important. In addition, one could argue that having this type of issue > arise in a toy dataset has a benefit because it exposes these types of > issues to those learning data science earlier on and allows them to keep > these issues in mind in the future when the data is more serious. > > > It is important for us all to keep issues of fairness in mind when it comes > to data science. I'm glad that you're speaking out in favor of fairness and > trying to bring attention to it. > > > Jacob > > > On Thu, Jul 6, 2017 at 12:08 PM, Sean Violante wrote: > > > G Reina > you make a bizarre argument. You argue that you should not even check > racism as a possible factor in house prices? 
> > > But then you yourself check whether its relevant > Then you say > > > "but I'd argue that it's more due to the location (near water, near > businesses, near restaurants, near parks and recreation) than to the > ethnic makeup" > > > Which was basically what the original authors wanted to show too, > > > Harrison, D. and Rubinfeld, D.L. `Hedonic prices and the demand for > clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. > > > but unless you measure ethnic make-up you cannot show that it is not a > confounder. > > > The term "white flight" refers to affluent white families moving to the > suburbs.. And clearly a question is whether/how much was racism or > avoiding air pollution. > > > > > > > On 6 Jul 2017 6:10 pm, "G Reina" wrote: > > > I'd like to request that the "Boston Housing Prices" dataset in > sklearn (sklearn.datasets.load_boston) be replaced with the "Ames > Housing Prices" dataset (https://ww2.amstat.org/publications/jse/ > v19n3/decock.pdf). I am willing to submit the code change if the > developers agree. > > > The Boston dataset has the feature "Bk is the proportion of blacks > in town". It is an incredibly racist "feature" to include in any > dataset. I think is beneath us as data scientists. > > > I submit that the Ames dataset is a viable alternative for learning > regression. The author has shown that the dataset is a more robust > replacement for Boston. Ames is a 2011 regression dataset on > housing prices and has more than 5 times the amount of training > examples with over 7 times as many features (none of which are > morally questionable). > > > I welcome the community's thoughts on the matter. > > > Thanks. 
> -Tony > > > Here's an article I wrote on the Boston dataset: > https://www.linkedin.com/pulse/hidden-racism-data-science-g- > anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_ > feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > -- > Gael Varoquaux > Researcher, INRIA Parietal > NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France > Phone: ++ 33-1-69-08-79-68 <+33%201%2069%2008%2079%2068> > http://gael-varoquaux.info http://twitter.com/GaelVaroquaux > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mathieu at mblondel.org Sat Jul 8 02:22:31 2017 From: mathieu at mblondel.org (Mathieu Blondel) Date: Sat, 8 Jul 2017 15:22:31 +0900 Subject: [scikit-learn] Fwd: Feedback on scikit-learn.org In-Reply-To: References: Message-ID: Someone had this to say about eigenfaces. 
---------- Forwarded message ---------- From: Frances Liu Date: Sat, Jul 8, 2017 at 6:57 AM Subject: Feedback on scikit-learn.org To: mathieu at mblondel.org Hi Mathieu, I found your email on your personal website, which is linked at the top of the authors list for scikit-learn.org. I just want to submit a small complaint -- the page for dimensionality reduction: http://scikit-learn.org/stable/modules/decomposition.html#decompositions uses faces as examples. The generated faces are wayyyyyyy too scary. Considering that minors and people with health conditions may visit the website, could you use some less horrifying examples please? Thank you! Best, Frances -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Sat Jul 8 03:12:53 2017 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Sat, 8 Jul 2017 09:12:53 +0200 Subject: [scikit-learn] Fwd: Feedback on scikit-learn.org In-Reply-To: References: Message-ID: <20170708071253.GP2257694@phare.normalesup.org> The way that I would think about such a question is that a website like that of the New York Times, or of Le Monde, has pictures that are much more scary. We are probably on the safe end. Cheers, Gaël On Sat, Jul 08, 2017 at 03:22:31PM +0900, Mathieu Blondel wrote: > Someone had this to say about eigenfaces. > ---------- Forwarded message ---------- > From: Frances Liu > Date: Sat, Jul 8, 2017 at 6:57 AM > Subject: Feedback on scikit-learn.org > To: mathieu at mblondel.org > Hi Mathieu, > I found your email on your personal website, which is linked at the top of the > authors list for scikit-learn.org. I just want to submit a small complaint -- > the page for dimensionality reduction: http://scikit-learn.org/stable/modules/decomposition.html#decompositions > uses faces as examples. The generated faces are wayyyyyyy too scary.
Considering > that minors and people with health conditions may visit the website, could you > use some less horrifying examples please? > Thank you! > Best, > Frances > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From g.lemaitre58 at gmail.com Sat Jul 8 04:50:57 2017 From: g.lemaitre58 at gmail.com (Guillaume Lemaitre) Date: Sat, 08 Jul 2017 10:50:57 +0200 Subject: [scikit-learn] Fwd: Feedback on scikit-learn.org In-Reply-To: <20170708071253.GP2257694@phare.normalesup.org> References: <20170708071253.GP2257694@phare.normalesup.org> Message-ID: <20170708085057.4870225.70142.35245@gmail.com> In the same line, we should stop publishing faces generated by GANs. They are even worse :-) Guillaume?Lemaitre? INRIA?Saclay?Ile-de-France?/?Equipe?PARIETAL guillaume.lemaitre at inria.fr?-?https://glemaitre.github.io/ From warren.weckesser at gmail.com Sat Jul 8 05:13:54 2017 From: warren.weckesser at gmail.com (Warren Weckesser) Date: Sat, 8 Jul 2017 05:13:54 -0400 Subject: [scikit-learn] Fwd: Feedback on scikit-learn.org In-Reply-To: References: Message-ID: Obligatory meme: https://imgur.com/a/BLimp Warren On Sat, Jul 8, 2017 at 2:22 AM, Mathieu Blondel wrote: > Someone had this to say about eigenfaces. > > ---------- Forwarded message ---------- > From: Frances Liu > Date: Sat, Jul 8, 2017 at 6:57 AM > Subject: Feedback on scikit-learn.org > To: mathieu at mblondel.org > > > Hi Mathieu, > > I found your email on your personal website, which is linked at the top of > the authors list for scikit-learn.org. 
I just want to submit a small > complaint -- the page for dimensionality reduction: http://scikit-learn.org/stable/modules/decomposition.html#decompositions > uses faces as examples. The generated faces are wayyyyyyy too scary. > Considering that minors and people with health conditions may visit the > website, could you use some less horrifying examples please? > > Thank you! > > Best, > Frances > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From valia.rodriguez at gmail.com Sat Jul 8 07:00:56 2017 From: valia.rodriguez at gmail.com (Valia Rodriguez) Date: Sat, 8 Jul 2017 12:00:56 +0100 Subject: [scikit-learn] Replacing the Boston Housing Prices dataset In-Reply-To: References: <20170707053505.GR2257694@phare.normalesup.org> <79f8556c-d6ea-4bed-8f5a-3f4d1a5bda3e@Spark> Message-ID: Hello everybody, I just subscribed to this list to let you know what I think about this topic, as the black woman I am. My husband, who is on this list, told me about the discussion going on, and I wanted to share my thoughts with all of you: There is nothing wrong or racist in counting how many black people there are in a given population, just as it is not racist to count how many Asian or white people there are. First, in many epidemiologic, demographic and sociologic studies we need to take into account -- and do counts on the basis of -- ethnicity, skin color or race, depending on where in the world we are doing the study and on the population we are counting. There is no other way to address these topics if you do not count how many blacks, whites, Asians and so on. Any teaching should simulate real conditions, so a dataset including this is fine. It is valid to count on the basis of skin color because if we don't, how can we then study the distribution of wealth, or even racism itself?
Second: there is nothing wrong with the word 'black'. That word should not raise a flag. I am black, and it is fine for me and for any other person like me to be called black, because we are -- depending on the context, of course. Just as there is nothing wrong with being white and being part of a count of 'number of whites' for a specific study. It would be very bad, however, if the dataset said 'number of coloured people' to refer to black people; that would be very racist. Valia On Sat, Jul 8, 2017 at 10:31 AM, Matthew Brett wrote: > > Forwarded conversation > Subject: [scikit-learn] Replacing the Boston Housing Prices dataset > ------------------------ > > From: G Reina > Date: Thu, Jul 6, 2017 at 5:05 PM > To: scikit-learn at python.org > > > I'd like to request that the "Boston Housing Prices" dataset in sklearn > (sklearn.datasets.load_boston) be replaced with the "Ames Housing Prices" > dataset (https://ww2.amstat.org/publications/jse/v19n3/decock.pdf). I am > willing to submit the code change if the developers agree. > > The Boston dataset has the feature "Bk is the proportion of blacks in town". > It is an incredibly racist "feature" to include in any dataset. I think it is > beneath us as data scientists. > > I submit that the Ames dataset is a viable alternative for learning > regression. The author has shown that the dataset is a more robust > replacement for Boston. Ames is a 2011 regression dataset on housing prices > and has more than 5 times the number of training examples with over 7 times > as many features (none of which are morally questionable). > > I welcome the community's thoughts on the matter. > > Thanks.
> -Tony > > Here's an article I wrote on the Boston dataset: > https://www.linkedin.com/pulse/hidden-racism-data-science-g-anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > ---------- > From: Andreas Mueller > Date: Thu, Jul 6, 2017 at 5:31 PM > To: scikit-learn at python.org > > > Hi Tony. > > I don't think it's a good idea to remove the dataset, given how many > tutorials and examples rely on it. > I also don't think it's a good idea to ignore racial discrimination, which I > guess this feature is trying to capture. > > I was recently asked to remove an excerpt from a dataset from my slide, as > it was "too racist". It was randomly sampled > data from the adult census dataset. Unfortunately, economics in the US are > not color blind (yet), and the reality is racist. > I haven't done an in-depth analysis on whether this feature is actually > informative, but I don't think your analysis is conclusive. > > Including ethnicity in data actually allows us to ensure "fairness" in > certain decision making processes. > Without collecting this data, it would be impossible to ensure automatic > decisions are not influenced > by past human biases. Arguably that's not what the authors of this dataset > are doing. > > Check out http://www.fatml.org/ for more on fairness in machine learning and > data science. 
> > Cheers, > Andy > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > ---------- > From: G Reina > Date: Thu, Jul 6, 2017 at 5:41 PM > To: Scikit-learn user and developer mailing list > > > Wow. I completely disagree. > > The fact that too many tutorials and examples rely on it is not a reason to > keep the dataset. New tutorials are written all the time. And, as sklearn > evolves some of the existing tutorials will need to be updated anyway to > keep up with the changes. > > Including "ethnicity" is completely illegal in making business decisions in > the United States. For example, credit scoring systems bend over backward to > expunge even proxy features that could be highly correlated with race (for > example, they can't include neighborhood, but can include entire counties). > > Let's leave the studying of racism to actual scientists who study racism. > Not to toy datasets that we use to teach our students about a completely > unrelated matter like regression. > > -Tony > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > ---------- > From: Andrew Holmes > Date: Thu, Jul 6, 2017 at 5:19 PM > To: Scikit-learn user and developer mailing list > > > But how do social scientists do research into racism without including > ethnicity as a feature in the data? 
> > Best wishes > Andrew > > Public Profile > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > ---------- > From: jma > Date: Thu, Jul 6, 2017 at 6:38 PM > To: scikit-learn at python.org > > > I work in the financial services industry and build machine learning models > for marketing applications. We put an enormous effort (multiple layers of > oversight and governance) into ensuring that our models are free of bias > against protected classes etc. Having data describing race and ethnicity > (among others) is extremely important to validate this is indeed the case. > Without it, you have no such assurance. > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > ---------- > From: Andreas Mueller > Date: Thu, Jul 6, 2017 at 7:09 PM > To: scikit-learn at python.org > > > > > On 07/06/2017 12:41 PM, G Reina wrote: >> >> >> The fact that too many tutorials and examples rely on it is not a reason >> to keep the dataset. New tutorials are written all the time. And, as sklearn >> evolves some of the existing tutorials will need to be updated anyway to >> keep up with the changes. > > No, we try to avoid that as much as possible. > Old examples should work for as long as possible, and we actively avoid > breaking API unnecessarily. It's one of the core principles of scikit-learn > development. > > And new tutorials can use any dataset they choose. We are working on > including an openml fetcher, which allows using more datasets more easily. 
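The validation jma describes -- checking that decisions are not biased against a protected class, which is only possible if class membership is recorded -- can be made concrete with a simple selection-rate screen. Below is a minimal pure-Python sketch of the "four-fifths" disparate-impact ratio used as a first-pass check in US employment law; the group labels and approval numbers are invented for illustration:

```python
from collections import defaultdict

def selection_rates(decisions):
    """Per-group approval rate from (group, approved) pairs."""
    approved = defaultdict(int)
    total = defaultdict(int)
    for group, ok in decisions:
        total[group] += 1
        approved[group] += bool(ok)
    return {g: approved[g] / total[g] for g in total}

def disparate_impact(decisions, protected, reference):
    """Ratio of the protected group's approval rate to the reference
    group's; values below 0.8 fail the common 'four-fifths' screen."""
    rates = selection_rates(decisions)
    return rates[protected] / rates[reference]

# Hypothetical decisions: 50% approval for group "x", 90% for group "y".
decisions = [("x", i < 5) for i in range(10)] + [("y", i < 9) for i in range(10)]
ratio = disparate_impact(decisions, protected="x", reference="y")
print(round(ratio, 3))  # 0.556 -- well below the 0.8 threshold
```

Note that without the group column in `decisions`, neither rate can be computed, which is exactly the point being made about dropping such features from datasets.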
> > ---------- > From: Sean Violante > Date: Thu, Jul 6, 2017 at 8:08 PM > To: Scikit-learn user and developer mailing list > > > G Reina > you make a bizarre argument. You argue that you should not even check racism > as a possible factor in house prices? > > But then you yourself check whether its relevant > Then you say > > "but I'd argue that it's more due to the location (near water, near > businesses, near restaurants, near parks and recreation) than to the ethnic > makeup" > > Which was basically what the original authors wanted to show too, > > Harrison, D. and Rubinfeld, D.L. `Hedonic prices and the demand for clean > air', J. Environ. Economics & Management, vol.5, 81-102, 1978. > > but unless you measure ethnic make-up you cannot show that it is not a > confounder. > > The term "white flight" refers to affluent white families moving to the > suburbs.. And clearly a question is whether/how much was racism or avoiding > air pollution. > > > > > >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > ---------- > From: Jacob Schreiber > Date: Thu, Jul 6, 2017 at 9:34 PM > To: Scikit-learn user and developer mailing list > > > Hi Tony > > As others have pointed out, I think that you may be misunderstanding the > purpose of that "feature." We are in agreement that discrimination against > protected classes is not OK, and that even outside complying with the law > one should avoid discrimination, in model building or elsewhere. However, I > disagree that one does this by eliminating from all datasets any feature > that may allude to these protected classes. 
As Andreas pointed out, there is > a growing effort to ensure that machine learning models are fair and benefit > the common good (such as FATML, DSSG, etc..), and from my understanding the > general consensus isn't necessarily that simply eliminating the feature is > sufficient. I think we are in agreement that naively learning a model over a > feature set containing questionable features and calling it a day is not > okay, but as others have pointed out, having these features present and > handling them appropriately can help guard against the model implicitly > learning unfair biases (even if they are not explicitly exposed to the > feature). > > I would welcome the addition of the Ames dataset to the ones supported by > sklearn, but I'm not convinced that the Boston dataset should be removed. As > Andreas pointed out, there is a benefit to having canonical examples present > so that beginners can easily follow along with the many tutorials that have > been written using them. As Sean points out, the paper itself is trying to > pull out the connection between house price and clean air in the presence of > possible confounding variables. In a more general sense, saying that a > feature shouldn't be there because a simple linear regression is unaffected > by the results is a bit odd because it is very common for datasets to > include irrelevant features, and handling them appropriately is important. > In addition, one could argue that having this type of issue arise in a toy > dataset has a benefit because it exposes these types of issues to those > learning data science earlier on and allows them to keep these issues in > mind in the future when the data is more serious. > > It is important for us all to keep issues of fairness in mind when it comes > to data science. I'm glad that you're speaking out in favor of fairness and > trying to bring attention to it. 
> > Jacob > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > ---------- > From: Juan Nunez-Iglesias > Date: Fri, Jul 7, 2017 at 12:36 AM > To: Scikit-learn user and developer mailing list > > > For what it's worth: I'm sympathetic to the argument that you can't fix the > problem if you don't measure it, but I agree with Tony that "many tutorials > use it" is an extremely weak argument. We removed Lena from scikit-image > because it was the right thing to do. I very much doubt that Boston house > prices is in more widespread use than Lena was in image processing. > > You can argue about whether or not it's morally right or wrong to include > the dataset. I see merit to both arguments. But "too many tutorials use it" > is very similar in flavour to "the economy of the South would collapse > without slavery." > > Regarding fair uses of the feature, I would hope that all sklearn tutorials > using the dataset mention such uses. The potential for abuse and > misinterpretation is enormous. > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > ---------- > From: Sebastian Raschka > Date: Fri, Jul 7, 2017 at 1:39 AM > To: Scikit-learn user and developer mailing list > > > I think there can be some middle ground. I.e., adding a new, simple dataset > to demonstrate regression (maybe autmpg, wine quality, or sth like that) and > use that for the scikit-learn examples in the main documentation etc but > leave the boston dataset in the code base for now. Whether it's a weak > argument or not, it would be quite destructive to remove the dataset > altogether in the next version or so, not only because old tutorials use it > but many unit tests in many different projects depend on it. 
I think it > might be better to phase it out by having a good alternative first, and I am > sure that the scikit-learn maintainers wouldn't have anything against it if > someone would update the examples/tutorials with the use of different > datasets > > Best, > Sebastian > > ---------- > From: Bill Ross > Date: Fri, Jul 7, 2017 at 2:00 AM > To: scikit-learn at python.org > > > Unless the data concretely promotes discrimination, it seems discriminatory > to exclude it. > > Bill > > On 7/6/17 5:39 PM, Sebastian Raschka wrote: >> >> I think there can be some middle ground. I.e., adding a new, simple >> dataset to demonstrate regression (maybe autmpg, wine quality, or sth like >> that) and use that for the scikit-learn examples in the main documentation >> etc but leave the boston dataset in the code base for now. Whether it's a >> weak argument or not, it would be quite destructive to remove the dataset >> altogether in the next version or so, not only because old tutorials use it >> but many unit tests in many different projects depend on it. I think it >> might be better to phase it out by having a good alternative first, and I am >> sure that the scikit-learn maintainers wouldn't have anything against it if >> someone would update the examples/tutorials with the use of different >> datasets >> >> Best, >> Sebastian >> >>> On Jul 6, 2017, at 7:36 PM, Juan Nunez-Iglesias >>> wrote: >>> >>> For what it's worth: I'm sympathetic to the argument that you can't fix >>> the problem if you don't measure it, but I agree with Tony that "many >>> tutorials use it" is an extremely weak argument. We removed Lena from >>> scikit-image because it was the right thing to do. I very much doubt that >>> Boston house prices is in more widespread use than Lena was in image >>> processing. >>> >>> You can argue about whether or not it's morally right or wrong to include >>> the dataset. I see merit to both arguments. 
But "too many tutorials use it" >>> is very similar in flavour to "the economy of the South would collapse >>> without slavery." >>> >>> Regarding fair uses of the feature, I would hope that all sklearn >>> tutorials using the dataset mention such uses. The potential for abuse and >>> misinterpretation is enormous. >>> >>> On 7 Jul 2017, 6:36 AM +1000, Jacob Schreiber , >>> wrote: >>>> >>>> Hi Tony >>>> >>>> As others have pointed out, I think that you may be misunderstanding the >>>> purpose of that "feature." We are in agreement that discrimination against >>>> protected classes is not OK, and that even outside complying with the law >>>> one should avoid discrimination, in model building or elsewhere. However, I >>>> disagree that one does this by eliminating from all datasets any feature >>>> that may allude to these protected classes. As Andreas pointed out, there is >>>> a growing effort to ensure that machine learning models are fair and benefit >>>> the common good (such as FATML, DSSG, etc..), and from my understanding the >>>> general consensus isn't necessarily that simply eliminating the feature is >>>> sufficient. I think we are in agreement that naively learning a model over a >>>> feature set containing questionable features and calling it a day is not >>>> okay, but as others have pointed out, having these features present and >>>> handling them appropriately can help guard against the model implicitly >>>> learning unfair ! >> >> biases (e >> ven if they are not explicitly exposed to the feature). >>>> >>>> I would welcome the addition of the Ames dataset to the ones supported >>>> by sklearn, but I'm not convinced that the Boston dataset should be removed. >>>> As Andreas pointed out, there is a benefit to having canonical examples >>>> present so that beginners can easily follow along with the many tutorials >>>> that have been written using them. 
As Sean points out, the paper itself is >>>> trying to pull out the connection between house price and clean air in the >>>> presence of possible confounding variables. In a more general sense, saying >>>> that a feature shouldn't be there because a simple linear regression is >>>> unaffected by the results is a bit odd because it is very common for >>>> datasets to include irrelevant features, and handling them appropriately is >>>> important. In addition, one could argue that having this type of issue arise >>>> in a toy dataset has a benefit because it exposes these types of issues to >>>> those learning data science earlier on and allows them to keep these issues >>>> in mind in the future! > > > ---------- > From: Gael Varoquaux > Date: Fri, Jul 7, 2017 at 6:35 AM > To: Scikit-learn user and developer mailing list > > > Many people gave great points in this thread, in particular Jacob's well > written email. > > Andy's point about tutorials is an important one. I don't resonate at > all with Juan's message. Breaking people's code, even if it is the notes > that they use to give a lecture, is a real cost for them. The cost varies > on a case to case basis. But there are still books printed out there > that demo image processing on Lena, and these will be out for decades. > More importantly, the replacement of Lena used in scipy (the raccoon) > does not allow to demonstrate denoising properly (Lena has smooth regions > with details in the middle: the eyes), or segmentation. In effect, it has > made the examples for the ecosystem less convincing. > > > Of course, by definition, refusing to change anything implies that > unfortunate situations, such as discriminatory biases, cannot be fixed. > This is why changes should be considered on a case-to-case basis. > > The problem that we are facing here is that a dataset about society, the > Boston housing dataset, can reveal discrimination. However, this is true > of every data about society. 
The classic adult data (extracted from the > American census) easily reveals income discrimination. I teach statistics > with an IQ dataset where it is easy to show a male vs female IQ > difference. This difference disappears after controlling for education > (and the purpose of my course is to teach people to control for > confounding effects). > > Data about society reveals its inequalities. Not working on such data is > hiding problems, not fixing them. It is true that misuse of such data can > attempt to establish inequalities as facts of life and get them accepted. > When discussing these issues, we need to educate people about how to run > and interpret analyses. > > > No, the Boston data will not go. No, it is not a good thing to pretend that > social problems do not exist. > > > Gaël > -- > Gael Varoquaux > Researcher, INRIA Parietal > NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France > Phone: ++ 33-1-69-08-79-68 > http://gael-varoquaux.info http://twitter.com/GaelVaroquaux > > ---------- > From: Juan Nunez-Iglesias > Date: Sat, Jul 8, 2017 at 3:44 AM > To: Scikit-learn user and developer mailing list > > > Just to clarify a couple of things about my position. > > First, thanks Gaël for a thoughtful response. I fully respect your decision > to keep the Boston dataset, and I agree that it can be a useful "teaching > moment." (As I suggested in my earlier post.) > > With regards to breaking tutorials, however, I totally disagree. The whole > value of tutorials is that they teach general principles, not analysis of > specific datasets. Changing a tutorial dataset is thus different from > changing an API. This isn't the right forum for a discussion about the > ethics of the Lena image, so I won't go into that, but to suggest that it is > a uniquely effective picture, the natural image equivalent of a standard > test pattern, is ludicrous.
Maybe the replacement wasn't as good, but that > is a criticism of the choice of replacement, not of the decision to replace > it. There clearly exist millions or billions of images with similarly good > teaching characteristics. > > Finally, yes, removing and deprecating datasets incurs (and inflicts) a real > cost, but cost should be at best a minor consideration when dealing with > ethical questions. History, and daily life, are replete with unethical > decisions made under the excuse that it would cost too much to do what's > right. Ultimately the costs are usually found to have been exaggerated. > > With regards to this dataset, I cede the argument to maintainers, > contributors, and users of the dataset, but I will point out that none of > the existing tutorials in the library mention this feature, let alone > address the ethics of it. The DESCR field mentions it entirely > nonchalantly, as if it were a natural thing to want to measure if one wants to > predict house prices. I think I would certainly have a WTF moment, at least, > if I were a black student reading through that description. > > Juan. > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > ---------- > From: Jacob Schreiber > Date: Sat, Jul 8, 2017 at 5:26 AM > To: Scikit-learn user and developer mailing list > > > We would welcome a pull request amending the documentation to include a > neutral discussion of the issues you've brought up. Optimally, it would > include many of the points brought up in this discussion as to why it was > ultimately kept despite the issues being raised. > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > -- Valia Rodriguez, MD PhD Neurophysiology Lecturer.
School of Life and Health Sciences, Aston University Professor of Clinical Neurophysiology, Cuban Neuroscience Center From ross at cgl.ucsf.edu Sun Jul 9 20:13:47 2017 From: ross at cgl.ucsf.edu (Bill Ross) Date: Sun, 9 Jul 2017 17:13:47 -0700 Subject: [scikit-learn] Replacing the Boston Housing Prices dataset In-Reply-To: <32b9ea32-b5dc-dfbe-04ca-36e8db30160e@cgl.ucsf.edu> References: <61B34F59-142E-4851-9B27-7DC2A0C2DAF8@gmail.com> <32b9ea32-b5dc-dfbe-04ca-36e8db30160e@cgl.ucsf.edu> Message-ID: <65531062-c9d6-ce7f-b712-0a1abd3cd935@cgl.ucsf.edu> Possibly of interest: Race and ethnicity Imputation from Disease history with Deep LEarning https://github.com/jisungk/riddle Bill On 7/6/17 6:00 PM, Bill Ross wrote: > Unless the data concretely promotes discrimination, it seems > discriminatory to exclude it. > > Bill > > On 7/6/17 5:39 PM, Sebastian Raschka wrote: >> I think there can be some middle ground, i.e., adding a new, simple >> dataset to demonstrate regression (maybe auto-mpg, wine quality, or something >> like that), using that for the scikit-learn examples in the main >> documentation etc., but leaving the Boston dataset in the code base for >> now. Whether it's a weak argument or not, it would be quite >> destructive to remove the dataset altogether in the next version or >> so, not only because old tutorials use it but also because many unit tests in many >> different projects depend on it. I think it might be better to phase >> it out by having a good alternative first, and I am sure that the >> scikit-learn maintainers wouldn't have anything against it if someone >> updated the examples/tutorials to use different datasets. >> >> Best, >> Sebastian >> >>> On Jul 6, 2017, at 7:36 PM, Juan Nunez-Iglesias >>> wrote: >>> >>> For what it's worth: I'm sympathetic to the argument that you can't >>> fix the problem if you don't measure it, but I agree with Tony that >>> "many tutorials use it" is an extremely weak argument.
We removed >>> Lena from scikit-image because it was the right thing to do. I very >>> much doubt that the Boston house prices dataset is in more widespread use than >>> Lena was in image processing. >>> >>> You can argue about whether or not it's morally right or wrong to >>> include the dataset. I see merit to both arguments. But "too many >>> tutorials use it" is very similar in flavour to "the economy of the >>> South would collapse without slavery." >>> >>> Regarding fair uses of the feature, I would hope that all sklearn >>> tutorials using the dataset mention such uses. The potential for >>> abuse and misinterpretation is enormous. >>> >>> On 7 Jul 2017, 6:36 AM +1000, Jacob Schreiber >>> , wrote: >>>> Hi Tony >>>> >>>> As others have pointed out, I think that you may be >>>> misunderstanding the purpose of that "feature." We are in agreement >>>> that discrimination against protected classes is not OK, and that >>>> even outside complying with the law one should avoid >>>> discrimination, in model building or elsewhere. However, I disagree >>>> that one does this by eliminating from all datasets any feature >>>> that may allude to these protected classes. As Andreas pointed out, >>>> there is a growing effort to ensure that machine learning models >>>> are fair and benefit the common good (such as FATML, DSSG, etc.), >>>> and from my understanding the general consensus isn't necessarily >>>> that simply eliminating the feature is sufficient. I think we are >>>> in agreement that naively learning a model over a feature set >>>> containing questionable features and calling it a day is not okay, >>>> but as others have pointed out, having these features present and >>>> handling them appropriately can help guard against the model >>>> implicitly learning unfair biases (even if they are not explicitly exposed to the feature).
>>>> I would welcome the addition of the Ames dataset to the ones >>>> supported by sklearn, but I'm not convinced that the Boston dataset >>>> should be removed. As Andreas pointed out, there is a benefit to >>>> having canonical examples present so that beginners can easily >>>> follow along with the many tutorials that have been written using >>>> them. As Sean points out, the paper itself is trying to pull out >>>> the connection between house price and clean air in the presence of >>>> possible confounding variables. In a more general sense, saying >>>> that a feature shouldn't be there because a simple linear >>>> regression is unaffected by the results is a bit odd, because it is >>>> very common for datasets to include irrelevant features, and >>>> handling them appropriately is important. In addition, one could >>>> argue that having this type of issue arise in a toy dataset has a >>>> benefit, because it exposes these types of issues to those learning >>>> data science earlier on and allows them to keep these issues in >>>> mind in the future, when the data is more serious. >>>> It is important for us all to keep issues of fairness in mind when >>>> it comes to data science. I'm glad that you're speaking out in >>>> favor of fairness and trying to bring attention to it. >>>> >>>> Jacob >>>> >>>> On Thu, Jul 6, 2017 at 12:08 PM, Sean Violante >>>> wrote: >>>> G Reina, >>>> you make a bizarre argument. You argue that you should not even >>>> check racism as a possible factor in house prices? >>>> >>>> But then you yourself check whether it's relevant. >>>> Then you say: >>>> >>>> "but I'd argue that it's more due to the location (near water, near >>>> businesses, near restaurants, near parks and recreation) than to >>>> the ethnic makeup" >>>> >>>> Which was basically what the original authors wanted to show too: >>>> >>>> Harrison, D. and Rubinfeld, D.L. `Hedonic prices and the demand for >>>> clean air', J. Environ.
Economics & Management, vol. 5, 81-102, 1978. >>>> >>>> But unless you measure ethnic make-up, you cannot show that it is >>>> not a confounder. >>>> >>>> The term "white flight" refers to affluent white families moving to >>>> the suburbs. And clearly a question is whether, and how much, this was racism >>>> or avoiding air pollution. >>>> >>>> >>>> On 6 Jul 2017 6:10 pm, "G Reina" wrote: >>>> I'd like to request that the "Boston Housing Prices" dataset in >>>> sklearn (sklearn.datasets.load_boston) be replaced with the "Ames >>>> Housing Prices" dataset >>>> (https://ww2.amstat.org/publications/jse/v19n3/decock.pdf). I am >>>> willing to submit the code change if the developers agree. >>>> >>>> The Boston dataset has the feature "Bk is the proportion of blacks >>>> in town". It is an incredibly racist "feature" to include in any >>>> dataset. I think it is beneath us as data scientists. >>>> >>>> I submit that the Ames dataset is a viable alternative for learning >>>> regression. The author has shown that the dataset is a more robust >>>> replacement for Boston. Ames is a 2011 regression dataset on >>>> housing prices and has more than 5 times the number of training >>>> examples, with over 7 times as many features (none of which are >>>> morally questionable). >>>> >>>> I welcome the community's thoughts on the matter. >>>> >>>> Thanks.
>>>> -Tony >>>> >>>> Here's an article I wrote on the Boston dataset: >>>> https://www.linkedin.com/pulse/hidden-racism-data-science-g-anthony-reina?trk=v-feed&lipi=urn%3Ali%3Apage%3Ad_flagship3_feed%3Bmu67f2GSzj5xHMpSD6M00A%3D%3D > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed...
URL: From ross at cgl.ucsf.edu Sun Jul 9 20:53:02 2017 From: ross at cgl.ucsf.edu (Bill Ross) Date: Sun, 9 Jul 2017 17:53:02 -0700 Subject: [scikit-learn] Replacing the Boston Housing Prices dataset In-Reply-To: <65531062-c9d6-ce7f-b712-0a1abd3cd935@cgl.ucsf.edu> References: <61B34F59-142E-4851-9B27-7DC2A0C2DAF8@gmail.com> <32b9ea32-b5dc-dfbe-04ca-36e8db30160e@cgl.ucsf.edu> <65531062-c9d6-ce7f-b712-0a1abd3cd935@cgl.ucsf.edu> Message-ID: <87933999-964e-0684-2c67-0ec105748250@cgl.ucsf.edu> And more to the point the discussion on Reddit: https://www.reddit.com/r/MachineLearning/comments/6m8tp0/p_deep_learning_for_estimating_race_and_ethnicity/ Bill On 7/9/17 5:13 PM, Bill Ross wrote: > Possibly of interest: > > Race and ethnicity Imputation from Disease history with Deep LEarning > > https://github.com/jisungk/riddle > > Bill _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed...
URL: From grhanceylan at gmail.com Mon Jul 10 10:58:52 2017 From: grhanceylan at gmail.com (Gürhan Ceylan) Date: Mon, 10 Jul 2017 17:58:52 +0300 Subject: [scikit-learn] Contribution Message-ID: Hi everyone, I am wondering: how can I use external optimization algorithms with scikit-learn (for instance, with the neural network models), instead of the built-in algorithms (Stochastic Gradient Descent, Adam, or L-BFGS)? Furthermore, I want to introduce a new unconstrained optimization algorithm to scikit-learn; the implementation of the algorithm and the related paper can be found here. I couldn't find any explanation about this situation. Do you have a defined procedure for making this kind of contribution? If this is not the case, how should I start to make such a proposal/contribution? Kind regards, Gürhan C. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jmschreiber91 at gmail.com Mon Jul 10 12:01:39 2017 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Mon, 10 Jul 2017 09:01:39 -0700 Subject: [scikit-learn] Contribution In-Reply-To: References: Message-ID: Howdy This question and the one right after it in the FAQ are probably relevant re: inclusion of new algorithms: http://scikit-learn.org/stable/faq.html#what-are-the-inclusion-criteria-for-new-algorithms. The gist is that we only include well-established algorithms, and there is no end to those. I think it is unlikely that a PR will get merged with a cutting-edge new algorithm, as the scope of scikit-learn isn't necessarily "the latest" so much as "the classics." You may also consider writing a scikit-contrib package that basically creates what you're interested in, in scikit-learn format, but external to the project. We'd be more than happy to link to it. If the algorithm becomes a smashing success over time, we'd reconsider adding it to the main code base. As to your first question, you should check out how the current optimizers are written for the algorithm you're interested in.
I don't think there's a plug-and-play way to drop in your own optimizer like many deep learning packages support, unfortunately. You'd probably have to modify the code directly to support your own. Let me know if you have any other questions. Jacob On Mon, Jul 10, 2017 at 7:58 AM, Gürhan Ceylan wrote: > Hi everyone, > > I am wondering: how can I use external optimization algorithms with > scikit-learn, instead of the built-in algorithms (Stochastic Gradient > Descent, Adam, or L-BFGS)? [...] > > Kind regards, > > Gürhan C. _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From vaggi.federico at gmail.com Mon Jul 10 12:10:09 2017 From: vaggi.federico at gmail.com (federico vaggi) Date: Mon, 10 Jul 2017 16:10:09 +0000 Subject: [scikit-learn] Contribution In-Reply-To: References: Message-ID: Hey Gurhan, sklearn doesn't really neatly separate optimizers from the models they optimize at the level of the API (except in a few cases). In order to make the package friendlier to new users, each model has excellent optimizer defaults that you can use, and only in a few cases does it make sense to tweak the optimization routines (for example, SAGA if you have a very large dataset when doing logistic regression).
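To make that concrete: in scikit-learn the optimizer is chosen with a string argument on the estimator rather than passed in as an object, so swapping solvers looks like this (a minimal sketch using the public scikit-learn API, on a toy problem; the printed numbers are just training accuracy):

```python
# In scikit-learn the optimizer is a string flag on the estimator,
# not a pluggable object: you pick one of the built-in solver names.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

for solver in ("lbfgs", "liblinear", "sag", "saga"):
    clf = LogisticRegression(solver=solver, max_iter=1000)
    clf.fit(X, y)
    print(solver, round(clf.score(X, y), 3))
```

Passing a name outside the supported list raises a ValueError, which is why dropping in an external optimizer means modifying the estimator's code rather than its parameters.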
There is a fantastic library called lightning where the optimization routines are first-class citizens: http://contrib.scikit-learn.org/lightning/ - you can take a look there. However, lightning focuses on convex optimization, so most algorithms have provable convergence rates. Good luck! On Mon, 10 Jul 2017 at 09:05 Jacob Schreiber wrote: > Howdy > > This question and the one right after it in the FAQ are probably relevant re: > inclusion of new algorithms: > http://scikit-learn.org/stable/faq.html#what-are-the-inclusion-criteria-for-new-algorithms. [...] _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From uri at goren4u.com Mon Jul 10 13:32:42 2017 From: uri at goren4u.com (Uri Goren) Date: Mon, 10 Jul 2017 20:32:42 +0300 Subject: [scikit-learn] Contribution In-Reply-To: References: Message-ID: Hi, I'd like to implement the Markov clustering algorithm. Any objections? On Jul 10, 2017 7:10 PM, "federico vaggi" wrote: Hey Gurhan, sklearn doesn't really neatly separate optimizers from the models they optimize at the level of the API (except in a few cases). [...]
_______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From zephyr14 at gmail.com Mon Jul 10 13:37:12 2017 From: zephyr14 at gmail.com (Vlad Niculae) Date: Mon, 10 Jul 2017 13:37:12 -0400 Subject: [scikit-learn] Contribution In-Reply-To: References: Message-ID: <20170710173712.bt6nigii5icmihgl@vladn-desktop> On Mon, Jul 10, 2017 at 04:10:09PM +0000, federico vaggi wrote: > There is a fantastic library called lightning where the optimization > routines are first-class citizens: > http://contrib.scikit-learn.org/lightning/ - you can take a look there. > However, lightning focuses on convex optimization, so most algorithms have > provable convergence rates. Hi, I fully agree that lightning is fantastic :) but it might not be what Gürhan wants. It's true that lightning's API is designed around optimizers rather than around models. So where in scikit-learn we usually have, e.g., LogisticRegression(solver='sag'), in lightning you would have SAGClassifier(loss='log') to achieve something close. But neither library has the OO-style separation between freeform models and optimizers that you might find in deep learning frameworks. So, for instance, it's relatively easy to add a new loss function to the lightning SAGClassifier, but you would still only be able to use it with a linear model.
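To illustrate the kind of separation in question, here is a hypothetical sketch (not the actual API of scikit-learn, lightning, or any deep learning framework) in which the model is just a loss function and the optimizer is a swappable argument, delegated to scipy.optimize:

```python
# Hypothetical sketch of "optimizer as a first-class citizen":
# the model is a plain loss function, and any scipy.optimize method
# name can be plugged in without touching the model code.
import numpy as np
from scipy.optimize import minimize

def logistic_loss(w, X, y):
    # y in {-1, +1}; mean log(1 + exp(-y * Xw)) plus a small L2 term
    z = y * (X @ w)
    return np.mean(np.logaddexp(0.0, -z)) + 0.5 * 1e-3 * np.dot(w, w)

def fit(X, y, method="L-BFGS-B"):
    # `method` is the swappable optimizer (any scipy.optimize method name)
    w0 = np.zeros(X.shape[1])
    return minimize(logistic_loss, w0, args=(X, y), method=method).x

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = np.sign(X @ true_w + 0.1 * rng.normal(size=200))

w_lbfgs = fit(X, y)                    # quasi-Newton solver
w_powell = fit(X, y, method="Powell")  # derivative-free solver, same model code
```

Neither library exposes this style for the efficiency reasons discussed in this thread: committing to linear models lets the solvers exploit sparsity and closed-form gradients.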
This is by design in both scikit-learn and lightning, at least at the moment: by making these kinds of assumptions about the models, implementations can be much more efficient in terms of computation and storage, especially when sparse data is involved. Yours, Vlad > > Good luck! > > On Mon, 10 Jul 2017 at 09:05 Jacob Schreiber > wrote: > > > Howdy > > > > This question and the one right after in the FAQ are probably relevant re: > > inclusion of new algorithms: > > http://scikit-learn.org/stable/faq.html#what-are-the-inclusion-criteria-for-new-algorithms. > > The gist is that we only include well established algorithms, and there are > > no end to those. I think it is unlikely that a PR will get merged with a > > cutting edge new algorithm, as the scope of scikit-learn isn't necessary > > "the latest" as opposed to "the classics." You may also consider writing a > > scikit-contrib package that basically creates what you're interested in in > > scikit-learn format, but external to the project. We'd be more than happy > > to link to it. If the algorithm becomes a smashing success over time, we'd > > reconsider adding it to the main code base. > > > > As to your first question, you should check out how the current optimizers > > are written for the algorithm you're interested in. I don't think there's a > > plug and play way to drop in your own optimizer like many deep learning > > packages support, unfortunately. You'd probably have to modify the code > > directly to support your own. > > > > Let me know if you have any other questions. > > > > Jacob > > > > On Mon, Jul 10, 2017 at 7:58 AM, G?rhan Ceylan > > wrote: > > > >> Hi everyone, > >> > >> I am wondering, How can I use external optimization algorithms with scikit-learn, > >> for instance neural network > >> > >> , instead of defined algorithms ( Stochastic Gradient Descent, Adam, or > >> L-BFGS). 
> >> > >> Furthermore, I want to introduce a new unconstrained optimization > >> algorithm to scikit-learn, implementation of the algorithm and related paper > >> can be found here . > >> > >> I couldn't find any explanation > >> , about the > >> situation. Do you have defined procedure to make such kind of > >> contributions? If this is not the case, How should I start to make such a > >> proposal/contribution ? > >> > >> > >> Kind regards, > >> > >> G?rhan C. > >> > >> > >> _______________________________________________ > >> scikit-learn mailing list > >> scikit-learn at python.org > >> https://mail.python.org/mailman/listinfo/scikit-learn > >> > >> > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From uri at goren4u.com Mon Jul 10 17:40:19 2017 From: uri at goren4u.com (Uri Goren) Date: Tue, 11 Jul 2017 00:40:19 +0300 Subject: [scikit-learn] Contribution - Markov Clustering Message-ID: Hi, I've been advised to contact you before working on an implementation of a new feature. I am thinking of implementing the Markov clustering and add it to sklearn.cluster module. See: https://micans.org/mcl/ https://gist.github.com/urigoren/1f76567f3af56ed8c33f076537768a60 Do you know if anyone else has started working on it ? Would you advise against it for some reason ? Thank you, Uri -------------- next part -------------- An HTML attachment was scrubbed... 
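For readers unfamiliar with the method Uri proposes, here is a bare-bones numpy sketch of the MCL iteration (expansion then inflation on a column-stochastic matrix); it is illustrative only, not the reference implementation from micans.org, and the helper name `mcl` is hypothetical:

```python
import numpy as np

def mcl(A, expansion=2, inflation=2.0, max_iter=100, tol=1e-6):
    """Toy Markov clustering: alternate expansion and inflation steps."""
    M = A.astype(float) + np.eye(len(A))       # add self-loops
    M /= M.sum(axis=0, keepdims=True)          # make columns stochastic
    for _ in range(max_iter):
        prev = M.copy()
        M = np.linalg.matrix_power(M, expansion)  # expansion: spread flow
        M **= inflation                           # inflation: favor strong flow
        M /= M.sum(axis=0, keepdims=True)
        if np.abs(M - prev).max() < tol:
            break
    clusters = []                  # nonzero rows are attractors; their
    for row in M:                  # support gives the cluster members
        members = frozenset(np.flatnonzero(row > 1e-8))
        if members and members not in clusters:
            clusters.append(members)
    return clusters

# two disjoint 3-cliques should come out as two clusters
A = np.zeros((6, 6))
A[:3, :3] = 1
A[3:, 3:] = 1
np.fill_diagonal(A, 0)
print(mcl(A))
```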
URL: From alexandre.gramfort at telecom-paristech.fr Mon Jul 10 23:00:45 2017 From: alexandre.gramfort at telecom-paristech.fr (Alexandre Gramfort) Date: Tue, 11 Jul 2017 05:00:45 +0200 Subject: [scikit-learn] Contribution - Markov Clustering In-Reply-To: References: Message-ID: hi, did you have a look at: http://scikit-learn.org/stable/faq.html#what-are-the-inclusion-criteria-for-new-algorithms Alex -------------- next part -------------- An HTML attachment was scrubbed... URL: From uri at goren4u.com Tue Jul 11 00:36:30 2017 From: uri at goren4u.com (Uri Goren) Date: Tue, 11 Jul 2017 07:36:30 +0300 Subject: [scikit-learn] Contribution - Markov Clustering In-Reply-To: References: Message-ID: I have. The only criterion that I am unsure about is the number of citations. In the literature, Markov clustering is usually compared to affinity propagation, which also has a similar number of citations. I have attached my implementation in my GitHub account for you to review. Do I have your approval to make it a pull request? On Jul 11, 2017 6:00 AM, "Alexandre Gramfort" < alexandre.gramfort at telecom-paristech.fr> wrote: > hi, > > did you have a look at : > > http://scikit-learn.org/stable/faq.html#what-are-the- > inclusion-criteria-for-new-algorithms > > Alex > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jmschreiber91 at gmail.com Tue Jul 11 12:03:34 2017 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Tue, 11 Jul 2017 09:03:34 -0700 Subject: [scikit-learn] Contribution - Markov Clustering In-Reply-To: References: Message-ID: You don't need our permission to submit a PR, go ahead! We welcome PRs. On Mon, Jul 10, 2017 at 9:36 PM, Uri Goren wrote: > I have. > The only criterion that I am unsure about is the number of citations.
> > In the literature Markov clustering is usually compared to affinity > prolongation, which also has a similar number of citations. > > I have attached my implementation in my github account for you to review. > > Do I have your approval to make it a pull request? > > > > > On Jul 11, 2017 6:00 AM, "Alexandre Gramfort" paristech.fr> wrote: > >> hi, >> >> did you have a look at : >> >> http://scikit-learn.org/stable/faq.html#what-are-the-inclusi >> on-criteria-for-new-algorithms >> >> Alex >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From b.noushin7 at gmail.com Tue Jul 11 12:42:15 2017 From: b.noushin7 at gmail.com (Ariani A) Date: Tue, 11 Jul 2017 12:42:15 -0400 Subject: [scikit-learn] Agglomerative clustering problem Message-ID: Hi all, I want to perform agglomerative clustering, but I have no idea of number of clusters before hand. But I want that every cluster has at least 40 data points in it. How can I apply this to sklearn.agglomerative clustering? Should I use dendrogram and cut it somehow? I have no idea how to relate dendrogram to this and cutting it out. Any help will be appreciated! I have to use agglomerative clustering! Thanks, -Ariani -------------- next part -------------- An HTML attachment was scrubbed... URL: From uri at goren4u.com Tue Jul 11 13:54:12 2017 From: uri at goren4u.com (Uri Goren) Date: Tue, 11 Jul 2017 20:54:12 +0300 Subject: [scikit-learn] Agglomerative clustering problem In-Reply-To: References: Message-ID: Take a look at scipy's fcluster function. If M is a matrix of all of your feature vectors, this code snippet should work. 
You need to figure out what metric and algorithm work for you

from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import squareform
from scipy.cluster import hierarchy
X = pairwise_distances(M, metric=metric)
Z = hierarchy.linkage(squareform(X), method=algo)
C = hierarchy.fcluster(Z, threshold, criterion="distance")

Best, Uri Goren On Tue, Jul 11, 2017 at 7:42 PM, Ariani A wrote: > Hi all, > I want to perform agglomerative clustering, but I have no idea of number > of clusters before hand. But I want that every cluster has at least 40 > data points in it. How can I apply this to sklearn.agglomerative clustering? > Should I use dendrogram and cut it somehow? I have no idea how to relate > dendrogram to this and cutting it out. Any help will be appreciated! > I have to use agglomerative clustering! > Thanks, > -Ariani > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- *Uri Goren, Software innovator* *Phone: +972-507-649-650* *EMail: uri at goren4u.com * *Linkedin: il.linkedin.com/in/ugoren/ * -------------- next part -------------- An HTML attachment was scrubbed... URL: From b.noushin7 at gmail.com Tue Jul 11 14:22:43 2017 From: b.noushin7 at gmail.com (Ariani A) Date: Tue, 11 Jul 2017 14:22:43 -0400 Subject: [scikit-learn] Agglomerative clustering problem In-Reply-To: References: Message-ID: Dear Uri, Thanks. I just have a pairwise distance matrix and I want to implement it so that each cluster has at least 40 data points (in agglomerative clustering). Does it work? Thanks, -Ariani On Tue, Jul 11, 2017 at 1:54 PM, Uri Goren wrote: > Take a look at scipy's fcluster function. > If M is a matrix of all of your feature vectors, this code snippet should > work.
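Ariani's minimum-cluster-size constraint is not a built-in option of either fcluster or scikit-learn's agglomerative clustering; one workaround, sketched here on synthetic data, is to try successively coarser cuts of the dendrogram until the smallest cluster is large enough:

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from sklearn.metrics import pairwise_distances

rng = np.random.RandomState(0)
pts = np.vstack([rng.randn(60, 3), rng.randn(60, 3) + 5])

D = pairwise_distances(pts)      # stand-in for Ariani's precomputed matrix
D = (D + D.T) / 2.0              # guard against tiny float asymmetry
np.fill_diagonal(D, 0.0)         # squareform wants an exact zero diagonal

Z = hierarchy.linkage(squareform(D), method='average')

# cut at the largest cluster count whose smallest cluster has >= 40 points
labels = None
for k in range(len(pts) // 40, 0, -1):
    cand = hierarchy.fcluster(Z, t=k, criterion='maxclust')
    if np.bincount(cand)[1:].min() >= 40:
        labels = cand
        break
print(np.bincount(labels)[1:])   # resulting cluster sizes
```

The loop always terminates: at k=1 every point falls into one cluster, which trivially satisfies the size floor.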
> > You need to figure out what metric and algorithm work for you > > from sklearn.metrics import pairwise_distance > from scipy.cluster import hierarchy > X = pairwise_distance(M, metric=metric) > Z = hierarchy.linkage(X, algo, metric=metric) > C = hierarchy.fcluster(Z,threshold, criterion="distance") > > Best, > Uri Goren > > On Tue, Jul 11, 2017 at 7:42 PM, Ariani A wrote: > >> Hi all, >> I want to perform agglomerative clustering, but I have no idea of number >> of clusters before hand. But I want that every cluster has at least 40 >> data points in it. How can I apply this to sklearn.agglomerative clusteri >> ng? >> Should I use dendrogram and cut it somehow? I have no idea how to relate >> dendrogram to this and cutting it out. Any help will be appreciated! >> I have to use agglomerative clustering! >> Thanks, >> -Ariani >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > -- > > > *Uri Goren,Software innovator* > > *Phone: +972-507-649-650* > > *EMail: uri at goren4u.com * > *Linkedin: il.linkedin.com/in/ugoren/ * > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From olivier.grisel at ensta.org Tue Jul 11 17:04:14 2017 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Tue, 11 Jul 2017 23:04:14 +0200 Subject: [scikit-learn] Contribution - Markov Clustering In-Reply-To: References: Message-ID: If this is the first time you contribute, please make sure to carefully read the contributors guide till the end: http://scikit-learn.org/stable/developers/contributing.html In particular, make sure to follow the estimators API conventions for your PR to get a chance to be reviewed. 
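The estimator conventions Olivier points to can be illustrated with a deliberately trivial clusterer (hypothetical class and toy logic; the authoritative checklist is in the contributing guide):

```python
import numpy as np
from sklearn.base import BaseEstimator, ClusterMixin
from sklearn.utils import check_array

class ToyClusterer(BaseEstimator, ClusterMixin):
    """Illustrative only: bins samples into `n_clusters` groups by norm."""

    def __init__(self, n_clusters=2):
        # convention: __init__ only stores hyper-parameters, unvalidated
        self.n_clusters = n_clusters

    def fit(self, X, y=None):
        X = check_array(X)
        norms = np.linalg.norm(X, axis=1)
        # convention: attributes learned in fit get a trailing underscore
        edges = np.linspace(norms.min(), norms.max(),
                            self.n_clusters + 1)[1:-1]
        self.labels_ = np.digitize(norms, edges)
        return self  # convention: fit returns self
```

Following these conventions is what lets an estimator work inside `GridSearchCV`, `Pipeline`, and `clone`.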
In particular the gist you linked to is not compatible with the scikit-learn estimators API. Personally I have never heard of Markov clustering, so it's hard for me to assess whether it should be included in the project or not. It would really help if you could demonstrate its performance on a publicly available dataset where it does significantly better than all the other clustering algorithms already implemented in scikit-learn (both in terms of training speed and in terms of cluster quality / stability, although this latter point is very domain dependent). As a side note, if this is the first time you contribute to the project, it's probably best to have a look at how other pull requests are being reviewed (by reading the comment threads of other PRs) and maybe start with a small pull request to fix a small bug (with a non-regression test) or tackle some documentation issues. Adding new estimators takes a lot of effort to review (we need tests, docs, updated examples) and assumes some familiarity with the existing code base. -- Olivier From uri at goren4u.com Wed Jul 12 00:47:57 2017 From: uri at goren4u.com (Uri Goren) Date: Wed, 12 Jul 2017 07:47:57 +0300 Subject: [scikit-learn] Contribution - Markov Clustering In-Reply-To: References: Message-ID: I've added this PR, and I addressed in the comments some of your concerns (publications, comparison to affinity propagation, etc.): https://github.com/scikit-learn/scikit-learn/pull/9329 I'd love for you to review it, since this is my first PR in the scikit-learn repository. On Wed, Jul 12, 2017 at 12:04 AM, Olivier Grisel wrote: > If this is the first time you contribute, please make sure to > carefully read the contributors guide till the end: > > http://scikit-learn.org/stable/developers/contributing.html > > In particular, make sure to follow the estimators API conventions for > your PR to get a chance to be reviewed. In particular the gist you > linked to is not compatible with the scikit-learn estimators API.
> > Personally I have never heard of Markov clustering, so it's hard for > me to assess whether it should be included in the project or not. It > would really help if you could demonstrate its performance on a > publicly available dataset where it does significantly better than all > the other clustering algorithms already implemented in scikit-learn > (both in terms of training speed and in terms of cluster quality / > stability, although this latter point is very domain dependent). > > As a side note, if this is the first time you contribute to the > project, it's probably best to have a look at how other pull requests > are being reviewed (by reading the comment threads of other PRs) and > maybe start with a small pull request to fix a small bug (with a > non-regression test) or tackle some documentation issues. Adding new > estimators takes a lot of effort to review (we need tests, docs, > updated examples) and assumes some familiarity with the existing code > base. > > -- > Olivier > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- *Uri Goren, Software innovator* *Phone: +972-507-649-650* *EMail: uri at goren4u.com * *Linkedin: il.linkedin.com/in/ugoren/ * -------------- next part -------------- An HTML attachment was scrubbed... URL: From b.noushin7 at gmail.com Thu Jul 13 15:42:33 2017 From: b.noushin7 at gmail.com (Ariani A) Date: Thu, 13 Jul 2017 15:42:33 -0400 Subject: [scikit-learn] Agglomerative Clustering without knowing number of clusters In-Reply-To: <20170706163257.zgvwnoih5zjb73io@MacBook-Pro-3.local> References: <20170706163257.zgvwnoih5zjb73io@MacBook-Pro-3.local> Message-ID: Dear Shane, Thanks for your answer. Does DBSCAN work with a distance matrix? I have a distance matrix (a symmetric matrix which contains pairwise distances). Can you help me? I did not find the DBSCAN code in that link.
Best, -Ariani On Thu, Jul 6, 2017 at 12:32 PM, Shane Grigsby wrote: > This sounds like it may be a problem more amenable to either DBSCAN or > OPTICS. Both algorithms don't require a priori knowledge of the number of > clusters, and both let you specify a minimum point membership threshold for > cluster membership. The OPTICS algorithm will also produce a dendrogram > that you can cut for sub clusters if need be. > > DBSCAN is part of the stable release and has been for some time; OPTICS is > pending as a pull request, but it's stable and you can try it if you like: > > https://github.com/scikit-learn/scikit-learn/pull/1984 > > Cheers, > Shane > > > On 06/30, Ariani A wrote: > >> I want to perform agglomerative clustering, but I have no idea of number >> of >> clusters before hand. But I want that every cluster has at least 40 data >> points in it. How can I apply this to sklearn.agglomerative clustering? >> Should I use dendrogram and cut it somehow? I have no idea how to relate >> dendrogram to this and cutting it out. Any help will be appreciated! >> > > _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > > -- > *PhD candidate & Research Assistant* > *Cooperative Institute for Research in Environmental Sciences (CIRES)* > *University of Colorado at Boulder* > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From shane.grigsby at colorado.edu Thu Jul 13 17:38:37 2017 From: shane.grigsby at colorado.edu (Shane Grigsby) Date: Thu, 13 Jul 2017 16:38:37 -0500 Subject: [scikit-learn] Agglomerative Clustering without knowing number of clusters In-Reply-To: References: <20170706163257.zgvwnoih5zjb73io@MacBook-Pro-3.local> Message-ID: <20170713213837.fx7ubmlgzcjex6uv@MacBook-Pro-3.local> Hi Ariani, Yes, you can use a distance matrix-- I think that what you want is metric='precomputed', and then X would be your N by N distance matrix. Hope that helps, ~Shane On 07/13, Ariani A wrote: >Dear Shane, >Thanks for your answer. >Does DBSCAN works with distance matrix/? I have a distance matrix >(symmetric matrix which contains pairwise distances). Can you help me? I >did not find DBSCAN code in that link. >Best, >-Ariani > >On Thu, Jul 6, 2017 at 12:32 PM, Shane Grigsby >wrote: > >> This sounds like it may be a problem more amenable to either DBSCAN or >> OPTICS. Both algorithms don't require a priori knowledge of the number of >> clusters, and both let you specify a minimum point membership threshold for >> cluster membership. The OPTICS algorithm will also produce a dendrogram >> that you can cut for sub clusters if need be. >> >> DBSCAN is part of the stable release and has been for some time; OPTICS is >> pending as a pull request, but it's stable and you can try it if you like: >> >> https://github.com/scikit-learn/scikit-learn/pull/1984 >> >> Cheers, >> Shane >> >> >> On 06/30, Ariani A wrote: >> >>> I want to perform agglomerative clustering, but I have no idea of number >>> of >>> clusters before hand. But I want that every cluster has at least 40 data >>> points in it. How can I apply this to sklearn.agglomerative clustering? >>> Should I use dendrogram and cut it somehow? I have no idea how to relate >>> dendrogram to this and cutting it out. Any help will be appreciated! 
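Shane's metric='precomputed' suggestion might look like this in practice (synthetic data and illustrative parameter values):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

rng = np.random.RandomState(0)
# two well-separated blobs; afterwards only their distance matrix is used
pts = np.vstack([rng.randn(30, 2), rng.randn(30, 2) + 10])
D = pairwise_distances(pts)               # symmetric N x N distance matrix

db = DBSCAN(eps=2.0, min_samples=5, metric='precomputed').fit(D)
print(sorted(set(db.labels_)))            # cluster ids; -1 would mark noise
```

Note that `eps` is now expressed in the units of the precomputed distances, so it has to be tuned against the actual matrix.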
>>> >> >> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> >> >> -- >> *PhD candidate & Research Assistant* >> *Cooperative Institute for Research in Environmental Sciences (CIRES)* >> *University of Colorado at Boulder* >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >_______________________________________________ >scikit-learn mailing list >scikit-learn at python.org >https://mail.python.org/mailman/listinfo/scikit-learn -- *PhD candidate & Research Assistant* *Cooperative Institute for Research in Environmental Sciences (CIRES)* *University of Colorado at Boulder* From b.noushin7 at gmail.com Thu Jul 13 19:03:32 2017 From: b.noushin7 at gmail.com (Ariani A) Date: Thu, 13 Jul 2017 19:03:32 -0400 Subject: [scikit-learn] Agglomerative Clustering without knowing number of clusters In-Reply-To: <20170713213837.fx7ubmlgzcjex6uv@MacBook-Pro-3.local> References: <20170706163257.zgvwnoih5zjb73io@MacBook-Pro-3.local> <20170713213837.fx7ubmlgzcjex6uv@MacBook-Pro-3.local> Message-ID: Dear Shane, Thanks for your prompt answer. Do you mean that for DBSCAN there is no need to feed other parameters? Do I just call the function or I have to manipulate the code? P.S. I was not able to find the DBSCAN code on github. Looking forward to hearing from you. Best, -Noushin On Thu, Jul 13, 2017 at 5:38 PM, Shane Grigsby wrote: > Hi Ariani, > Yes, you can use a distance matrix-- I think that what you want is > metric='precomputed', and then X would be your N by N distance matrix. > Hope that helps, > ~Shane > > > On 07/13, Ariani A wrote: > >> Dear Shane, >> Thanks for your answer. >> Does DBSCAN works with distance matrix/? I have a distance matrix >> (symmetric matrix which contains pairwise distances). Can you help me? 
I >> did not find DBSCAN code in that link. >> Best, >> -Ariani >> >> On Thu, Jul 6, 2017 at 12:32 PM, Shane Grigsby < >> shane.grigsby at colorado.edu> >> wrote: >> >> This sounds like it may be a problem more amenable to either DBSCAN or >>> OPTICS. Both algorithms don't require a priori knowledge of the number of >>> clusters, and both let you specify a minimum point membership threshold >>> for >>> cluster membership. The OPTICS algorithm will also produce a dendrogram >>> that you can cut for sub clusters if need be. >>> >>> DBSCAN is part of the stable release and has been for some time; OPTICS >>> is >>> pending as a pull request, but it's stable and you can try it if you >>> like: >>> >>> https://github.com/scikit-learn/scikit-learn/pull/1984 >>> >>> Cheers, >>> Shane >>> >>> >>> On 06/30, Ariani A wrote: >>> >>> I want to perform agglomerative clustering, but I have no idea of number >>>> of >>>> clusters before hand. But I want that every cluster has at least 40 data >>>> points in it. How can I apply this to sklearn.agglomerative clustering? >>>> Should I use dendrogram and cut it somehow? I have no idea how to relate >>>> dendrogram to this and cutting it out. Any help will be appreciated! 
>>>> >>>> >>> _______________________________________________ >>> >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> -- >>> *PhD candidate & Research Assistant* >>> *Cooperative Institute for Research in Environmental Sciences (CIRES)* >>> *University of Colorado at Boulder* >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> > _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > > -- > *PhD candidate & Research Assistant* > *Cooperative Institute for Research in Environmental Sciences (CIRES)* > *University of Colorado at Boulder* > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From b.noushin7 at gmail.com Thu Jul 13 19:21:41 2017 From: b.noushin7 at gmail.com (Ariani A) Date: Thu, 13 Jul 2017 19:21:41 -0400 Subject: [scikit-learn] Agglomerative Clustering without knowing number of clusters In-Reply-To: References: <20170706163257.zgvwnoih5zjb73io@MacBook-Pro-3.local> <20170713213837.fx7ubmlgzcjex6uv@MacBook-Pro-3.local> Message-ID: Dear Shane, Sorry bothering you! Is the "precomputed" and "distance matrix" you are talking about, are about "DBSCAN" ? Thanks, Best. On Thu, Jul 13, 2017 at 7:03 PM, Ariani A wrote: > Dear Shane, > Thanks for your prompt answer. > Do you mean that for DBSCAN there is no need to feed other parameters? Do > I just call the function or I have to manipulate the code? > P.S. I was not able to find the DBSCAN code on github. > Looking forward to hearing from you. 
> Best, > -Noushin > > On Thu, Jul 13, 2017 at 5:38 PM, Shane Grigsby > wrote: > >> Hi Ariani, >> Yes, you can use a distance matrix-- I think that what you want is >> metric='precomputed', and then X would be your N by N distance matrix. >> Hope that helps, >> ~Shane >> >> >> On 07/13, Ariani A wrote: >> >>> Dear Shane, >>> Thanks for your answer. >>> Does DBSCAN works with distance matrix/? I have a distance matrix >>> (symmetric matrix which contains pairwise distances). Can you help me? I >>> did not find DBSCAN code in that link. >>> Best, >>> -Ariani >>> >>> On Thu, Jul 6, 2017 at 12:32 PM, Shane Grigsby < >>> shane.grigsby at colorado.edu> >>> wrote: >>> >>> This sounds like it may be a problem more amenable to either DBSCAN or >>>> OPTICS. Both algorithms don't require a priori knowledge of the number >>>> of >>>> clusters, and both let you specify a minimum point membership threshold >>>> for >>>> cluster membership. The OPTICS algorithm will also produce a dendrogram >>>> that you can cut for sub clusters if need be. >>>> >>>> DBSCAN is part of the stable release and has been for some time; OPTICS >>>> is >>>> pending as a pull request, but it's stable and you can try it if you >>>> like: >>>> >>>> https://github.com/scikit-learn/scikit-learn/pull/1984 >>>> >>>> Cheers, >>>> Shane >>>> >>>> >>>> On 06/30, Ariani A wrote: >>>> >>>> I want to perform agglomerative clustering, but I have no idea of number >>>>> of >>>>> clusters before hand. But I want that every cluster has at least 40 >>>>> data >>>>> points in it. How can I apply this to sklearn.agglomerative clustering? >>>>> Should I use dendrogram and cut it somehow? I have no idea how to >>>>> relate >>>>> dendrogram to this and cutting it out. Any help will be appreciated! 
>>>>> >>>>> >>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> -- >>>> *PhD candidate & Research Assistant* >>>> *Cooperative Institute for Research in Environmental Sciences (CIRES)* >>>> *University of Colorado at Boulder* > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From grhanceylan at gmail.com Fri Jul 14 02:46:21 2017 From: grhanceylan at gmail.com (=?UTF-8?Q?G=C3=BCrhan_Ceylan?=) Date: Fri, 14 Jul 2017 09:46:21 +0300 Subject: [scikit-learn] Contribution In-Reply-To: <20170710173712.bt6nigii5icmihgl@vladn-desktop> References: <20170710173712.bt6nigii5icmihgl@vladn-desktop> Message-ID: @Jacob, I understand your concern about new algorithms: coding, updating, and documenting an unsuccessful algorithm would be wasted effort. Thanks for the tips. @federico, the lightning library is close to what I have in mind, but not the same. I think there should be an easy way to see how optimizers affect learning algorithms. Thanks for the link. @Vlad, thank you for the clarification.
Best, G?rhan 2017-07-10 20:37 GMT+03:00 Vlad Niculae : > On Mon, Jul 10, 2017 at 04:10:09PM +0000, federico vaggi wrote: > > There is a fantastic library called lightning where the optimization > > routines are first class citizens: > > http://contrib.scikit-learn.org/lightning/ - you can take a look there. > > However, lightning focuses on convex optimization, so most algorithms > have > > provable convergence rates. > > Hi, > > I fully agree that lightning is fantastic :) but it might not be what > G?rhan > wants. > > It's true that lightning's api is designed around optimizers rather > than around models. So where in scikit-learn we usually have, e.g., > > LogisticRegression(solver='sag') > > in lightning you would have > > SAGClassifier(loss='log') > > to achieve something close. But neither library has the oo-style > separation between freeform models and optimizers such as you might > find in deep learning frameworks. So, for instance, it's relatively > easy to add a new loss function to the lightning SAGClassifier, but > you would still be able to only use it with a linear model. > > This is by design in both scikit-learn and lightning, at least at the > moment: by making these kinds of assumptions about the models, > implementations can be much more efficient in terms of computation and > storage, especially when sparse data is involved. > > Yours, > Vlad > > > > > Good luck! > > > > On Mon, 10 Jul 2017 at 09:05 Jacob Schreiber > > wrote: > > > > > Howdy > > > > > > This question and the one right after in the FAQ are probably relevant > re: > > > inclusion of new algorithms: > > > http://scikit-learn.org/stable/faq.html#what-are-the- > inclusion-criteria-for-new-algorithms. > > > The gist is that we only include well established algorithms, and > there are > > > no end to those. 
I think it is unlikely that a PR will get merged with > a > > > cutting edge new algorithm, as the scope of scikit-learn isn't > necessary > > > "the latest" as opposed to "the classics." You may also consider > writing a > > > scikit-contrib package that basically creates what you're interested > in in > > > scikit-learn format, but external to the project. We'd be more than > happy > > > to link to it. If the algorithm becomes a smashing success over time, > we'd > > > reconsider adding it to the main code base. > > > > > > As to your first question, you should check out how the current > optimizers > > > are written for the algorithm you're interested in. I don't think > there's a > > > plug and play way to drop in your own optimizer like many deep learning > > > packages support, unfortunately. You'd probably have to modify the code > > > directly to support your own. > > > > > > Let me know if you have any other questions. > > > > > > Jacob > > > > > > On Mon, Jul 10, 2017 at 7:58 AM, G?rhan Ceylan > > > wrote: > > > > > >> Hi everyone, > > >> > > >> I am wondering, How can I use external optimization algorithms with > scikit-learn, > > >> for instance neural network > > >> networks_supervised.html#algorithms> > > >> , instead of defined algorithms ( Stochastic Gradient Descent, Adam, > or > > >> L-BFGS). > > >> > > >> Furthermore, I want to introduce a new unconstrained optimization > > >> algorithm to scikit-learn, implementation of the algorithm and > related paper > > >> can be found here . > > >> > > >> I couldn't find any explanation > > >> , about > the > > >> situation. Do you have defined procedure to make such kind of > > >> contributions? If this is not the case, How should I start to make > such a > > >> proposal/contribution ? > > >> > > >> > > >> Kind regards, > > >> > > >> G?rhan C. 
> > >> > > >> > > >> _______________________________________________ > > >> scikit-learn mailing list > > >> scikit-learn at python.org > > >> https://mail.python.org/mailman/listinfo/scikit-learn > > >> > > >> > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From seralouk at hotmail.com Fri Jul 14 08:08:14 2017 From: seralouk at hotmail.com (serafeim loukas) Date: Fri, 14 Jul 2017 12:08:14 +0000 Subject: [scikit-learn] Line graph of weighted graph Message-ID: Dear scikit-learn users, I would like to know if there is any function that returns the line graph of a weighted graph. I am aware of the linegraph function (http://igraph.org/python/doc/igraph.GraphBase-class.html#linegraph) but I would like to take the weights into consideration. Thank you, Makis -------------- next part -------------- An HTML attachment was scrubbed... URL: From SebastianFlennerhag at hotmail.com Fri Jul 14 09:49:48 2017 From: SebastianFlennerhag at hotmail.com (Sebastian) Date: Fri, 14 Jul 2017 13:49:48 +0000 Subject: [scikit-learn] Inquiry third-party package affiliation Message-ID: Hi, First off, thanks for a great package! A while ago I needed a package for building general-purpose ensembles combining any set of Scikit-learn transformers and estimators. I couldn't find any so I set out to develop such an extension and recently released the result as ML-Ensemble, http://mlens.readthedocs.io/en/latest/. 
I am contacting you to ask if you would consider adding this library to your reference list of related packages? It would be hugely appreciated to have a small mention on your site as an ensemble wrapper around Scikit-learn. A bit about ML-Ensemble: It is written in Python following the Scikit-learn API; it uses joblib with memmapping to achieve scalable parallelization, and any ensemble estimator can pass as a Scikit-learn estimator. The library is unit tested on Linux, Mac and Windows for Python 2.7, 3.5 and 3.6, and has been downloaded about four thousand times a month since launch. All the best, Sebastian Flennerhag -------------- next part -------------- An HTML attachment was scrubbed... URL: From jmschreiber91 at gmail.com Sat Jul 15 11:16:18 2017 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Sat, 15 Jul 2017 08:16:18 -0700 Subject: [scikit-learn] Agglomerative clustering problem In-Reply-To: References: Message-ID: Typically when I think of limiting the number of points in a cluster I think of KD trees. I suppose that wouldn't work? On Tue, Jul 11, 2017 at 11:22 AM, Ariani A wrote: > Dear Uri, > Thanks. I just have a pairwise distance matrix and I want to implement it > so that each cluster has at least 40 data points. (in Agglomerative). > Does it work? > Thanks, > -Ariani > > On Tue, Jul 11, 2017 at 1:54 PM, Uri Goren wrote: >> Take a look at scipy's fcluster function. >> If M is a matrix of all of your feature vectors, this code snippet should >> work.
>> >> You need to figure out what metric and algorithm work for you >> >> from sklearn.metrics import pairwise_distances >> from scipy.cluster import hierarchy >> from scipy.spatial.distance import squareform >> X = pairwise_distances(M, metric=metric) >> # linkage expects a condensed distance matrix, hence squareform >> Z = hierarchy.linkage(squareform(X), algo) >> C = hierarchy.fcluster(Z, threshold, criterion="distance") >> >> Best, >> Uri Goren >> >> On Tue, Jul 11, 2017 at 7:42 PM, Ariani A wrote: >> >>> Hi all, >>> I want to perform agglomerative clustering, but I have no idea of the number >>> of clusters beforehand. But I want every cluster to have at least 40 >>> data points in it. How can I apply this to sklearn.agglomerative >>> clustering? >>> Should I use a dendrogram and cut it somehow? I have no idea how to relate >>> the dendrogram to this and how to cut it. Any help will be appreciated! >>> I have to use agglomerative clustering! >>> Thanks, >>> -Ariani >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> >> -- >> >> >> *Uri Goren,Software innovator* >> >> *Phone: +972-507-649-650* >> >> *EMail: uri at goren4u.com * >> *Linkedin: il.linkedin.com/in/ugoren/ * >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From olivier.grisel at ensta.org Mon Jul 17 08:49:51 2017 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Mon, 17 Jul 2017 14:49:51 +0200 Subject: [scikit-learn] scikit-learn 0.19b2 is available for testing Message-ID: The new release is coming and we are seeking feedback from beta testers!
pip install scikit-learn==0.19b2 conda-forge packages should follow in the coming hours / days. Note that many models have changed behaviors and some things have been deprecated, see the full changelog at: http://scikit-learn.org/dev/whats_new.html#version-0-19 As usual please report any regression or other bugs as an issue on github. Thanks to anyone who contributed to the release! -- Olivier http://twitter.com/ogrisel - http://github.com/ogrisel From stuart at stuartreynolds.net Mon Jul 17 12:41:37 2017 From: stuart at stuartreynolds.net (Stuart Reynolds) Date: Mon, 17 Jul 2017 09:41:37 -0700 Subject: [scikit-learn] Max f1 score for soft classifier? Message-ID: Does scikit have a function to find the maximum f1 score (and decision threshold) for a (soft) classifier? - Stuart From gael.varoquaux at normalesup.org Mon Jul 17 15:37:13 2017 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Mon, 17 Jul 2017 21:37:13 +0200 Subject: [scikit-learn] scikit-learn 0.19b2 is available for testing In-Reply-To: References: Message-ID: <20170717193713.GD1845013@phare.normalesup.org> Great job! This will be a great release, with a lot of new features and improvements G On Mon, Jul 17, 2017 at 02:49:51PM +0200, Olivier Grisel wrote: > The new release is coming and we are seeking feedback from beta testers! > pip install scikit-learn==0.19b2 > conda-forge packages should follow in the coming hours / days. > Note that many models have changed behaviors and some things have been > deprecated, see the full changelog at: > http://scikit-learn.org/dev/whats_new.html#version-0-19 > As usual please report any regression or other bugs as an issue on github. > Thanks to anyone who contributed to the release! 
-- Gael Varoquaux Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From alexandre.gramfort at telecom-paristech.fr Mon Jul 17 16:08:18 2017 From: alexandre.gramfort at telecom-paristech.fr (Alexandre Gramfort) Date: Mon, 17 Jul 2017 22:08:18 +0200 Subject: [scikit-learn] scikit-learn 0.19b2 is available for testing In-Reply-To: <20170717193713.GD1845013@phare.normalesup.org> References: <20170717193713.GD1845013@phare.normalesup.org> Message-ID: great team work as usual ! congrats everyone Alex From stuart at stuartreynolds.net Mon Jul 17 16:12:53 2017 From: stuart at stuartreynolds.net (Stuart Reynolds) Date: Mon, 17 Jul 2017 13:12:53 -0700 Subject: [scikit-learn] Max f1 score for soft classifier? In-Reply-To: References: Message-ID: And... with that in mind -- are there methods that explicitly try to optimize the f1 score? On Mon, Jul 17, 2017 at 9:41 AM, Stuart Reynolds wrote: > Does scikit have a function to find the maximum f1 score (and decision > threshold) for a (soft) classifier? > > - Stuart From ashimb9 at gmail.com Mon Jul 17 16:14:49 2017 From: ashimb9 at gmail.com (Ashim Bhattarai) Date: Mon, 17 Jul 2017 15:14:49 -0500 Subject: [scikit-learn] PR review request Message-ID: Hi -- I was wondering if somebody could review the pull request at https://github.com/scikit-learn/scikit-learn/pull/9348 in which I have worked on adding euclidean distance calculation in the presence of NaNs. Thanks in advance. Best, Ashim -------------- next part -------------- An HTML attachment was scrubbed... 
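[For context on the PR above: a minimal NumPy sketch of one common convention for a NaN-aware euclidean distance — skip any coordinate where either vector has a NaN, then rescale by the fraction of coordinates kept. This is only an illustration; the implementation in the PR itself may differ.]

```python
import numpy as np

def nan_euclidean(u, v):
    """Euclidean distance ignoring coordinates where either input is NaN.

    The squared distance over the usable coordinates is scaled up by
    n_total / n_usable so vectors with many NaNs remain comparable.
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    mask = ~(np.isnan(u) | np.isnan(v))   # coordinates present in both
    if not mask.any():
        return np.nan                     # nothing to compare
    sq = np.sum((u[mask] - v[mask]) ** 2)
    return np.sqrt(u.size / mask.sum() * sq)

print(nan_euclidean([0.0, np.nan, 4.0], [3.0, 1.0, 0.0]))  # sqrt(3/2 * 25) ~ 6.124
```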
URL: From bertrand.thirion at inria.fr Mon Jul 17 16:15:20 2017 From: bertrand.thirion at inria.fr (bthirion) Date: Mon, 17 Jul 2017 22:15:20 +0200 Subject: [scikit-learn] scikit-learn 0.19b2 is available for testing In-Reply-To: References: <20170717193713.GD1845013@phare.normalesup.org> Message-ID: <0b442988-2037-b970-2c73-b9f09d77f221@inria.fr> Great work indeed ! Thx, Bertrand On 17/07/2017 22:08, Alexandre Gramfort wrote: > great team work as usual ! > > congrats everyone > > Alex > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From se.raschka at gmail.com Mon Jul 17 16:19:15 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Mon, 17 Jul 2017 16:19:15 -0400 Subject: [scikit-learn] Max f1 score for soft classifier? In-Reply-To: References: Message-ID: <3B9ED23F-C2A5-42D2-AFFA-96F32256A69B@gmail.com> >> Does scikit have a function to find the maximum f1 score (and decision >> threshold) for a (soft) classifier? Hm, I don't think so. F1-score is typically used as evaluation metric; hence, it's something optimized via hyperparameter tuning. There's an interesting publication though, where the authors modified the F1 score so that it's differentiable and can be used as a cost function for optimization/training: Maximum F1-Score Discriminative Training Criterion for Automatic Mispronunciation Detection: http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7055841 Best, Sebastian > On Jul 17, 2017, at 4:12 PM, Stuart Reynolds wrote: > > And... with that in mind -- are there methods that explicitly try to > optimize the f1 score? > > On Mon, Jul 17, 2017 at 9:41 AM, Stuart Reynolds > wrote: >> Does scikit have a function to find the maximum f1 score (and decision >> threshold) for a (soft) classifier? 
>> >> - Stuart > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From joel.nothman at gmail.com Mon Jul 17 19:58:37 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Tue, 18 Jul 2017 09:58:37 +1000 Subject: [scikit-learn] Max f1 score for soft classifier? In-Reply-To: References: Message-ID: I suppose it would not be hard to build a wrapper that does this, if all we are doing is choosing a threshold. Although a global maximum is not guaranteed without some kind of interpolation over the precision-recall curve. On 18 July 2017 at 02:41, Stuart Reynolds wrote: > Does scikit have a function to find the maximum f1 score (and decision > threshold) for a (soft) classifier? > > - Stuart > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stuart at stuartreynolds.net Mon Jul 17 20:06:30 2017 From: stuart at stuartreynolds.net (Stuart Reynolds) Date: Tue, 18 Jul 2017 00:06:30 +0000 Subject: [scikit-learn] Max f1 score for soft classifier? In-Reply-To: References: Message-ID: That was also my thinking. Similarly it's also useful to try and choose a threshold that achieves some tpr or fpr, so that methods can be approximately compared to published results. It's not obvious what to do though when an increment in the threshold results in several changes in classification. On Mon, Jul 17, 2017 at 5:00 PM Joel Nothman wrote: > I suppose it would not be hard to build a wrapper that does this, if all > we are doing is choosing a threshold. Although a global maximum is not > guaranteed without some kind of interpolation over the precision-recall > curve. 
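[The threshold-choosing wrapper Joel describes can be sketched directly from the precision-recall curve. This is an illustration, not an existing scikit-learn API; it only searches the thresholds that precision_recall_curve evaluates.]

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true, y_score):
    """Return (threshold, f1) maximizing F1 over the candidate thresholds."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # The final precision/recall pair (1., 0.) has no threshold; drop it.
    p, r = precision[:-1], recall[:-1]
    f1 = 2 * p * r / np.maximum(p + r, 1e-12)  # guard against 0/0
    best = np.argmax(f1)
    return thresholds[best], f1[best]

y_true = np.array([0, 0, 1, 1, 1])
y_score = np.array([0.1, 0.6, 0.35, 0.8, 0.9])
thr, f1 = best_f1_threshold(y_true, y_score)
print(thr, f1)  # the F1-optimal decision threshold and its F1 score
```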
> > On 18 July 2017 at 02:41, Stuart Reynolds > wrote: > >> Does scikit have a function to find the maximum f1 score (and decision >> threshold) for a (soft) classifier? >> >> - Stuart >> > _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Tue Jul 18 12:49:42 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Tue, 18 Jul 2017 12:49:42 -0400 Subject: [scikit-learn] Max f1 score for soft classifier? In-Reply-To: References: Message-ID: <3ab6ae32-8278-62a7-7987-f8308a4805ef@gmail.com> Feature request for a slightly more general solution here: https://github.com/scikit-learn/scikit-learn/issues/8614 On 07/17/2017 04:12 PM, Stuart Reynolds wrote: > And... with that in mind -- are there methods that explicitly try to > optimize the f1 score? > > On Mon, Jul 17, 2017 at 9:41 AM, Stuart Reynolds > wrote: >> Does scikit have a function to find the maximum f1 score (and decision >> threshold) for a (soft) classifier? >> >> - Stuart > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From ruchika.work at gmail.com Thu Jul 20 11:23:12 2017 From: ruchika.work at gmail.com (Ruchika Nayyar) Date: Thu, 20 Jul 2017 11:23:12 -0400 Subject: [scikit-learn] merging the predicted labels with original dataframe Message-ID: Hi Scikit-learn Users, I am analyzing some proxy logs to use Machine learning to classify the events recorded as either "OBSERVED" or "BLOCKED". This is a little snippet of my code: The input file is a csv with tokenized string fields. 
************** # imports assumed by this snippet (added for completeness): import pandas as pd from scipy.sparse import hstack, csr_matrix from sklearn.feature_extraction.text import TfidfVectorizer from sklearn.model_selection import train_test_split as tts from sklearn.preprocessing import StandardScaler from sklearn.neural_network import MLPClassifier # load the file M = pd.read_csv("output100k.csv").fillna('') # define the fields to use min_df = 0.001 max_df = .7 TxtCols = ['request__tokens', 'requestClientApplication__tokens', 'destinationZoneURI__tokens','cs-categories__tokens', 'fileType__tokens', 'requestMethod__tokens','tcp_status1', 'app','tcp_status2','dhost' ] NumCols = ['rt', 'out', 'in', 'time-taken','rt_length', 'dt_length'] # vectorize the fields TfidfModels = [TfidfVectorizer(min_df = min_df, max_df=max_df).fit(M[t]) for t in TxtCols] # define the columns of sparse matrix X = hstack([m.transform(M[n].fillna('')) for m,n in zip(TfidfModels, TxtCols)] + \ [csr_matrix(pd.to_numeric(M[n]).fillna(-1).values).T for n in NumCols]) # target variable Y = M.act.values ## Define train/test parts and scale them X_train, X_test, y_train, y_test = tts(X, Y, test_size=0.2) scaler = StandardScaler(with_mean=False, with_std=True) scaler.fit(X_train) X_train=scaler.transform(X_train) X_test=scaler.transform(X_test) # define the model and train clf = MLPClassifier(activation='logistic', solver='lbfgs').fit(X_train,y_train) # use the model to predict on X_test and convert into a data frame df=pd.DataFrame(clf.predict(X_test)) ** 199845 OBSERVED 199846 OBSERVED [199847 rows x 1 columns]> ** Now at the end I have a DataFrame with 20K entries with just one column "Label", how do I connect it to the main dataframe M, since I want to do some investigations on this outcome? Any help? Thanks, Ruchika -------------- next part -------------- An HTML attachment was scrubbed...
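[A sketch of one way to make the alignment trivial from the start — since X above is a scipy sparse matrix with no pandas index of its own, split the row positions together with the data, then write predictions back by position. Toy data and a dummy classifier stand in for the real ones:]

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# toy stand-in for the dataframe M in the post above
M = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0],
                  "act": ["OBSERVED", "BLOCKED", "OBSERVED", "BLOCKED"]})
X = M[["x"]].values
y = M["act"].values

# split the row positions alongside the data so test rows stay traceable
idx = np.arange(len(M))
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, y, idx, test_size=0.5, random_state=0)

clf = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# write predictions back onto the exact rows of M they came from
M.loc[idx_test, "prediction"] = clf.predict(X_test)
print(M)  # test rows carry a prediction, train rows are NaN
```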
URL: From julio at esbet.es Thu Jul 20 11:37:58 2017 From: julio at esbet.es (Julio Antonio Soto de Vicente) Date: Thu, 20 Jul 2017 17:37:58 +0200 Subject: [scikit-learn] merging the predicted labels with original dataframe In-Reply-To: References: Message-ID: <1FE71505-E709-4D57-98A5-4877CE0168D5@esbet.es> Hi Ruchika, The predictions outputted by all sklearn models are just 1-d Numpy arrays, so it should be trivial to add it to any existing DataFrame: your_df["prediction"] = clf.predict(X_test) -- Julio > El 20 jul 2017, a las 17:23, Ruchika Nayyar escribi?: > > Hi Scikit-learn Users, > > I am analyzing some proxy logs to use Machine learning to classify the events recorded as either "OBSERVED" or "BLOCKED". This is a little snippet of my code: > The input file is a csv with tokenized string fields. > > ************** > # load the file > M = pd.read_csv("output100k.csv").fillna('') > > # define the fields to use > min_df = 0.001 > max_df = .7 > TxtCols = ['request__tokens', 'requestClientApplication__tokens', > 'destinationZoneURI__tokens','cs-categories__tokens', > 'fileType__tokens', 'requestMethod__tokens','tcp_status1', > 'app','tcp_status2','dhost' > ] > NumCols = ['rt', 'out', 'in', 'time-taken','rt_length', 'dt_length'] > > # vectorize the fields > TfidfModels = [TfidfVectorizer(min_df = min_df, max_df=max_df).fit(M[t]) for t in TxtCols] > > # define the columns of sparse matrix > X = hstack([m.transform(M[n].fillna('')) for m,n in zip(TfidfModels, TxtCols)] + \ > [csr_matrix(pd.to_numeric(M[n]).fillna(-1).values).T for n in NumCols]) > > # target variable > Y = M.act.values > > ## Define train/test parts and scale them > X_train, X_test, y_train, y_test = tts(X, Y, test_size=0.2) > scaler = StandardScaler(with_mean=False, with_std=True) > scaler.fit(X_train) > X_train=scaler.transform(X_train) > X_test=scaler.transform(X_test) > > > # define the model and train > clf = MLPClassifier(activation='logistic', solver='lbfgs').fit(X_train,y_train) > # use the 
model to predict on X_test and convert into a data frame > df=pd.DataFrame(clf.predict(X_test)) > > ** > 199845 OBSERVED > 199846 OBSERVED > [199847 rows x 1 columns]> > ** > Now at the end I have a DataFrame with 20K entries with just one column > "Label", how do I connect it to the main dataframe M, since I want to do some > investigations on this outcome? > > Any help? > > Thanks, > Ruchika > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ruchika.work at gmail.com Thu Jul 20 12:04:00 2017 From: ruchika.work at gmail.com (Ruchika Nayyar) Date: Thu, 20 Jul 2017 12:04:00 -0400 Subject: [scikit-learn] merging the predicted labels with original dataframe In-Reply-To: <1FE71505-E709-4D57-98A5-4877CE0168D5@esbet.es> References: <1FE71505-E709-4D57-98A5-4877CE0168D5@esbet.es> Message-ID: The original dataset contains both training/testing, I have predictions only on the testing dataset. If I do what you suggest, will it preserve indexing? Thanks, Ruchika On Thu, Jul 20, 2017 at 11:37 AM, Julio Antonio Soto de Vicente < julio at esbet.es> wrote: > Hi Ruchika, > > The predictions outputted by all sklearn models are just 1-d Numpy arrays, > so it should be trivial to add it to any existing DataFrame: > > your_df["prediction"] = clf.predict(X_test) > > -- > Julio > > El 20 jul 2017, a las 17:23, Ruchika Nayyar > escribió: > > Hi Scikit-learn Users, > > I am analyzing some proxy logs to use Machine learning to classify the > events recorded as either "OBSERVED" or "BLOCKED". This is a little snippet > of my code: > The input file is a csv with tokenized string fields.
> > ************** > # load the file > M = pd.read_csv("output100k.csv").fillna('') > > # define the fields to use > min_df = 0.001 > max_df = .7 > TxtCols = ['request__tokens', 'requestClientApplication__tokens', > 'destinationZoneURI__tokens','cs-categories__tokens', > 'fileType__tokens', 'requestMethod__tokens','tcp_status1', > 'app','tcp_status2','dhost' > ] > NumCols = ['rt', 'out', 'in', 'time-taken','rt_length', 'dt_length'] > > # vectorize the fields > TfidfModels = [TfidfVectorizer(min_df = min_df, max_df=max_df).fit(M[t]) > for t in TxtCols] > > # define the columns of sparse matrix > X = hstack([m.transform(M[n].fillna('')) for m,n in zip(TfidfModels, > TxtCols)] + \ > [csr_matrix(pd.to_numeric(M[n]).fillna(-1).values).T for n > in NumCols]) > > # target variable > Y = M.act.values > > ## Define train/test parts and scale them > X_train, X_test, y_train, y_test = tts(X, Y, test_size=0.2) > scaler = StandardScaler(with_mean=False, with_std=True) > scaler.fit(X_train) > X_train=scaler.transform(X_train) > X_test=scaler.transform(X_test) > > > # define the model and train > clf = MLPClassifier(activation='logistic', solver='lbfgs').fit(X_train,y_ > train) > # use the model to predict on X_test and convert into a data frame > df=pd.DataFrame(clf.predict(X_test)) > > ** > > 199845 OBSERVED > 199846 OBSERVED > > [199847 rows x 1 columns]> > > ** > > Now at the end I have a DataFrame with 20K entries with just one column > "Label", how di I connect it to the main dataframe M, since I want to do > some > investigations on this outcome ? > > Any help? 
> > Thanks, > Ruchika > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tom.augspurger88 at gmail.com Thu Jul 20 12:19:47 2017 From: tom.augspurger88 at gmail.com (Tom Augspurger) Date: Thu, 20 Jul 2017 11:19:47 -0500 Subject: [scikit-learn] merging the predicted labels with original dataframe In-Reply-To: References: <1FE71505-E709-4D57-98A5-4877CE0168D5@esbet.es> Message-ID: Something like your_df['prediction'] = pd.Series(clf.predict(X_test), index=X_test.index) should handle all the alignment. On Thu, Jul 20, 2017 at 11:04 AM, Ruchika Nayyar wrote: > The original dataset contains both trainng/testing, I have predictions > only on testing dataset. If I do what you suggest > will it preserve indexing? > > Thanks, > Ruchika > > > On Thu, Jul 20, 2017 at 11:37 AM, Julio Antonio Soto de Vicente < > julio at esbet.es> wrote: > >> Hi Ruchika, >> >> The predictions outputted by all sklearn models are just 1-d Numpy >> arrays, so it should be trivial to add it to any existing DataFrame: >> >> your_df["prediction"] = clf.predict(X_test) >> >> -- >> Julio >> >> El 20 jul 2017, a las 17:23, Ruchika Nayyar >> escribi?: >> >> Hi Scikit-learn Users, >> >> I am analyzing some proxy logs to use Machine learning to classify the >> events recorded as either "OBSERVED" or "BLOCKED". This is a little snippet >> of my code: >> The input file is a csv with tokenized string fields. 
>> >> ************** >> # load the file >> M = pd.read_csv("output100k.csv").fillna('') >> >> # define the fields to use >> min_df = 0.001 >> max_df = .7 >> TxtCols = ['request__tokens', 'requestClientApplication__tokens', >> 'destinationZoneURI__tokens','cs-categories__tokens', >> 'fileType__tokens', 'requestMethod__tokens','tcp_status1', >> 'app','tcp_status2','dhost' >> ] >> NumCols = ['rt', 'out', 'in', 'time-taken','rt_length', 'dt_length'] >> >> # vectorize the fields >> TfidfModels = [TfidfVectorizer(min_df = min_df, max_df=max_df).fit(M[t]) >> for t in TxtCols] >> >> # define the columns of sparse matrix >> X = hstack([m.transform(M[n].fillna('')) for m,n in zip(TfidfModels, >> TxtCols)] + \ >> [csr_matrix(pd.to_numeric(M[n]).fillna(-1).values).T for >> n in NumCols]) >> >> # target variable >> Y = M.act.values >> >> ## Define train/test parts and scale them >> X_train, X_test, y_train, y_test = tts(X, Y, test_size=0.2) >> scaler = StandardScaler(with_mean=False, with_std=True) >> scaler.fit(X_train) >> X_train=scaler.transform(X_train) >> X_test=scaler.transform(X_test) >> >> >> # define the model and train >> clf = MLPClassifier(activation='logistic', solver='lbfgs').fit(X_train,y_ >> train) >> # use the model to predict on X_test and convert into a data frame >> df=pd.DataFrame(clf.predict(X_test)) >> >> ** >> >> 199845 OBSERVED >> 199846 OBSERVED >> >> [199847 rows x 1 columns]> >> >> ** >> >> Now at the end I have a DataFrame with 20K entries with just one column >> "Label", how di I connect it to the main dataframe M, since I want to do >> some >> investigations on this outcome ? >> >> Any help? 
>> >> Thanks, >> Ruchika >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ruchika.work at gmail.com Thu Jul 20 12:30:24 2017 From: ruchika.work at gmail.com (Ruchika Nayyar) Date: Thu, 20 Jul 2017 12:30:24 -0400 Subject: [scikit-learn] merging the predicted labels with original dataframe In-Reply-To: References: <1FE71505-E709-4D57-98A5-4877CE0168D5@esbet.es> Message-ID: Hi Tom, This was also the first thing that came to my mind, but I thought since your_df is X_train+X_test it may complain that values do not match with the given indices. Thanks, Ruchika On Thu, Jul 20, 2017 at 12:19 PM, Tom Augspurger wrote: > Something like > > your_df['prediction'] = pd.Series(clf.predict(X_test), > index=X_test.index) > > should handle all the alignment. > > On Thu, Jul 20, 2017 at 11:04 AM, Ruchika Nayyar > wrote: > >> The original dataset contains both training/testing, I have predictions >> only on testing dataset. If I do what you suggest >> will it preserve indexing?
>> >> Thanks, >> Ruchika >> >> >> On Thu, Jul 20, 2017 at 11:37 AM, Julio Antonio Soto de Vicente < >> julio at esbet.es> wrote: >> >>> Hi Ruchika, >>> >>> The predictions outputted by all sklearn models are just 1-d Numpy >>> arrays, so it should be trivial to add it to any existing DataFrame: >>> >>> your_df["prediction"] = clf.predict(X_test) >>> >>> -- >>> Julio >>> >>> El 20 jul 2017, a las 17:23, Ruchika Nayyar >>> escribi?: >>> >>> Hi Scikit-learn Users, >>> >>> I am analyzing some proxy logs to use Machine learning to classify the >>> events recorded as either "OBSERVED" or "BLOCKED". This is a little snippet >>> of my code: >>> The input file is a csv with tokenized string fields. >>> >>> ************** >>> # load the file >>> M = pd.read_csv("output100k.csv").fillna('') >>> >>> # define the fields to use >>> min_df = 0.001 >>> max_df = .7 >>> TxtCols = ['request__tokens', 'requestClientApplication__tokens', >>> 'destinationZoneURI__tokens','cs-categories__tokens', >>> 'fileType__tokens', 'requestMethod__tokens','tcp_status1', >>> 'app','tcp_status2','dhost' >>> ] >>> NumCols = ['rt', 'out', 'in', 'time-taken','rt_length', 'dt_length'] >>> >>> # vectorize the fields >>> TfidfModels = [TfidfVectorizer(min_df = min_df, max_df=max_df).fit(M[t]) >>> for t in TxtCols] >>> >>> # define the columns of sparse matrix >>> X = hstack([m.transform(M[n].fillna('')) for m,n in zip(TfidfModels, >>> TxtCols)] + \ >>> [csr_matrix(pd.to_numeric(M[n]).fillna(-1).values).T for >>> n in NumCols]) >>> >>> # target variable >>> Y = M.act.values >>> >>> ## Define train/test parts and scale them >>> X_train, X_test, y_train, y_test = tts(X, Y, test_size=0.2) >>> scaler = StandardScaler(with_mean=False, with_std=True) >>> scaler.fit(X_train) >>> X_train=scaler.transform(X_train) >>> X_test=scaler.transform(X_test) >>> >>> >>> # define the model and train >>> clf = MLPClassifier(activation='logistic', >>> solver='lbfgs').fit(X_train,y_train) >>> # use the model to predict on X_test 
and convert into a data frame >>> df=pd.DataFrame(clf.predict(X_test)) >>> >>> ** >>> >>> 199845 OBSERVED >>> 199846 OBSERVED >>> >>> [199847 rows x 1 columns]> >>> >>> ** >>> >>> Now at the end I have a DataFrame with 20K entries with just one column >>> "Label", how di I connect it to the main dataframe M, since I want to do >>> some >>> investigations on this outcome ? >>> >>> Any help? >>> >>> Thanks, >>> Ruchika >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From raga.markely at gmail.com Fri Jul 21 11:11:20 2017 From: raga.markely at gmail.com (Raga Markely) Date: Fri, 21 Jul 2017 11:11:20 -0400 Subject: [scikit-learn] Classifiers for dataset with categorical features Message-ID: Hello, I am wondering if there are some classifiers that perform better for datasets with categorical features (converted into sparse input matrix with pd.get_dummies())? The data for the categorical features are nominal (order doesn't matter, e.g. country, occupation, etc). If you could provide me some references (papers, books, website, etc), that would be great. Thank you very much! Raga -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jmschreiber91 at gmail.com Fri Jul 21 14:27:50 2017 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Fri, 21 Jul 2017 11:27:50 -0700 Subject: [scikit-learn] Classifiers for dataset with categorical features In-Reply-To: References: Message-ID: Traditionally tree based methods are very good when it comes to categorical variables and can handle them appropriately. There is a current WIP PR to add this support to sklearn. I'm not exactly sure what you mean that "perform better" though. Estimators that ignore the categorical aspect of these variables and treat them as discrete will likely perform worse than those that treat them appropriately. On Fri, Jul 21, 2017 at 8:11 AM, Raga Markely wrote: > Hello, > > I am wondering if there are some classifiers that perform better for > datasets with categorical features (converted into sparse input matrix with > pd.get_dummies())? The data for the categorical features are nominal (order > doesn't matter, e.g. country, occupation, etc). > > If you could provide me some references (papers, books, website, etc), > that would be great. > > Thank you very much! > Raga > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From raga.markely at gmail.com Fri Jul 21 14:37:25 2017 From: raga.markely at gmail.com (Raga Markely) Date: Fri, 21 Jul 2017 14:37:25 -0400 Subject: [scikit-learn] Classifiers for dataset with categorical features In-Reply-To: References: Message-ID: Thank you, Jacob. Appreciate it. Regarding 'perform better', I was referring to better accuracy, precision, recall, F1 score, etc. Thanks, Raga On Fri, Jul 21, 2017 at 2:27 PM, Jacob Schreiber wrote: > Traditionally tree based methods are very good when it comes to > categorical variables and can handle them appropriately. There is a current > WIP PR to add this support to sklearn. I'm not exactly sure what you mean > that "perform better" though. Estimators that ignore the categorical aspect > of these variables and treat them as discrete will likely perform worse > than those that treat them appropriately. > > On Fri, Jul 21, 2017 at 8:11 AM, Raga Markely > wrote: > >> Hello, >> >> I am wondering if there are some classifiers that perform better for >> datasets with categorical features (converted into sparse input matrix with >> pd.get_dummies())?
The data for the categorical features are nominal (order >> doesn't matter, e.g. country, occupation, etc). >> >> If you could provide me some references (papers, books, website, etc), >> that would be great. >> >> Thank you very much! >> Raga >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From se.raschka at gmail.com Fri Jul 21 14:52:11 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Fri, 21 Jul 2017 14:52:11 -0400 Subject: [scikit-learn] Classifiers for dataset with categorical features In-Reply-To: References: Message-ID: <8434D3C4-503B-4D6B-A4FB-ABB684B7DD71@gmail.com> Just to throw some additional ideas in here. Based on a conversation with a colleague some time ago, I think learning classifier systems (https://en.wikipedia.org/wiki/Learning_classifier_system) are particularly useful when working with large, sparse binary vectors (like from a one-hot encoding). I am really not into LCS's, and only know the basics (read through the first chapters of the Intro to Learning Classifier Systems draft; the print version will be out later this year). Also, I saw an interesting poster on a Set Covering Machine algorithm once, which they benchmarked against SVMs, random forests and the like for categorical (genomics data). Looked promising. Best, Sebastian > On Jul 21, 2017, at 2:37 PM, Raga Markely wrote: > > Thank you, Jacob. Appreciate it. > > Regarding 'perform better', I was referring to better accuracy, precision, recall, F1 score, etc. 
> > Thanks, > Raga > > On Fri, Jul 21, 2017 at 2:27 PM, Jacob Schreiber wrote: > Traditionally tree based methods are very good when it comes to categorical variables and can handle them appropriately. There is a current WIP PR to add this support to sklearn. I'm not exactly sure what you mean that "perform better" though. Estimators that ignore the categorical aspect of these variables and treat them as discrete will likely perform worse than those that treat them appropriately. > > On Fri, Jul 21, 2017 at 8:11 AM, Raga Markely wrote: > Hello, > > I am wondering if there are some classifiers that perform better for datasets with categorical features (converted into sparse input matrix with pd.get_dummies())? The data for the categorical features are nominal (order doesn't matter, e.g. country, occupation, etc). > > If you could provide me some references (papers, books, website, etc), that would be great. > > Thank you very much! > Raga > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From se.raschka at gmail.com Fri Jul 21 14:57:57 2017 From: se.raschka at gmail.com (Sebastian Raschka) Date: Fri, 21 Jul 2017 14:57:57 -0400 Subject: [scikit-learn] Classifiers for dataset with categorical features In-Reply-To: <8434D3C4-503B-4D6B-A4FB-ABB684B7DD71@gmail.com> References: <8434D3C4-503B-4D6B-A4FB-ABB684B7DD71@gmail.com> Message-ID: > Traditionally tree based methods are very good when it comes to categorical variables and can handle them appropriately. There is a current WIP PR to add this support to sklearn. 
I think it's also important to distinguish between nominal and ordinal; it can make a huge difference imho. I.e., treating ordinal variables like continuous variables probably makes more sense than one-hot encoding them.

Looking forward to the PR :)

> On Jul 21, 2017, at 2:52 PM, Sebastian Raschka wrote:
>
> Just to throw some additional ideas in here. Based on a conversation with a colleague some time ago, I think learning classifier systems (https://en.wikipedia.org/wiki/Learning_classifier_system) are particularly useful when working with large, sparse binary vectors (like from a one-hot encoding). I am really not into LCS's, and only know the basics (read through the first chapters of the Intro to Learning Classifier Systems draft; the print version will be out later this year).
> Also, I saw an interesting poster on a Set Covering Machine algorithm once, which they benchmarked against SVMs, random forests and the like for categorical (genomics data). Looked promising.
>
> Best,
> Sebastian
>
>
>> On Jul 21, 2017, at 2:37 PM, Raga Markely wrote:
>>
>> Thank you, Jacob. Appreciate it.
>>
>> Regarding 'perform better', I was referring to better accuracy, precision, recall, F1 score, etc.
>>
>> Thanks,
>> Raga
>>
>> On Fri, Jul 21, 2017 at 2:27 PM, Jacob Schreiber wrote:
>> Traditionally tree based methods are very good when it comes to categorical variables and can handle them appropriately. There is a current WIP PR to add this support to sklearn. I'm not exactly sure what you mean that "perform better" though. Estimators that ignore the categorical aspect of these variables and treat them as discrete will likely perform worse than those that treat them appropriately.
>>
>> On Fri, Jul 21, 2017 at 8:11 AM, Raga Markely wrote:
>> Hello,
>>
>> I am wondering if there are some classifiers that perform better for datasets with categorical features (converted into sparse input matrix with pd.get_dummies())?
The data for the categorical features are nominal (order doesn't matter, e.g. country, occupation, etc). >> >> If you could provide me some references (papers, books, website, etc), that would be great. >> >> Thank you very much! >> Raga >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From raga.markely at gmail.com Fri Jul 21 14:59:40 2017 From: raga.markely at gmail.com (Raga Markely) Date: Fri, 21 Jul 2017 14:59:40 -0400 Subject: [scikit-learn] Classifiers for dataset with categorical features In-Reply-To: <8434D3C4-503B-4D6B-A4FB-ABB684B7DD71@gmail.com> References: <8434D3C4-503B-4D6B-A4FB-ABB684B7DD71@gmail.com> Message-ID: Sounds good, Sebastian. Thank you! Raga On Fri, Jul 21, 2017 at 2:52 PM, Sebastian Raschka wrote: > Just to throw some additional ideas in here. Based on a conversation with > a colleague some time ago, I think learning classifier systems ( > https://en.wikipedia.org/wiki/Learning_classifier_system) are > particularly useful when working with large, sparse binary vectors (like > from a one-hot encoding). I am really not into LCS's, and only know the > basics (read through the first chapters of the Intro to Learning Classifier > Systems draft; the print version will be out later this year). 
> Also, I saw an interesting poster on a Set Covering Machine algorithm > once, which they benchmarked against SVMs, random forests and the like for > categorical (genomics data). Looked promising. > > Best, > Sebastian > > > > On Jul 21, 2017, at 2:37 PM, Raga Markely > wrote: > > > > Thank you, Jacob. Appreciate it. > > > > Regarding 'perform better', I was referring to better accuracy, > precision, recall, F1 score, etc. > > > > Thanks, > > Raga > > > > On Fri, Jul 21, 2017 at 2:27 PM, Jacob Schreiber < > jmschreiber91 at gmail.com> wrote: > > Traditionally tree based methods are very good when it comes to > categorical variables and can handle them appropriately. There is a current > WIP PR to add this support to sklearn. I'm not exactly sure what you mean > that "perform better" though. Estimators that ignore the categorical aspect > of these variables and treat them as discrete will likely perform worse > than those that treat them appropriately. > > > > On Fri, Jul 21, 2017 at 8:11 AM, Raga Markely > wrote: > > Hello, > > > > I am wondering if there are some classifiers that perform better for > datasets with categorical features (converted into sparse input matrix with > pd.get_dummies())? The data for the categorical features are nominal (order > doesn't matter, e.g. country, occupation, etc). > > > > If you could provide me some references (papers, books, website, etc), > that would be great. > > > > Thank you very much! 
> > Raga > > > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stuart at stuartreynolds.net Fri Jul 21 15:01:47 2017 From: stuart at stuartreynolds.net (Stuart Reynolds) Date: Fri, 21 Jul 2017 12:01:47 -0700 Subject: [scikit-learn] Classifiers for dataset with categorical features In-Reply-To: <8434D3C4-503B-4D6B-A4FB-ABB684B7DD71@gmail.com> References: <8434D3C4-503B-4D6B-A4FB-ABB684B7DD71@gmail.com> Message-ID: +1 LCS and its many many variants seem very practical and adaptable. I'm not sure why they haven't gotten traction. Overshadowed by GBM & random forests? On Fri, Jul 21, 2017 at 11:52 AM, Sebastian Raschka wrote: > Just to throw some additional ideas in here. Based on a conversation with a colleague some time ago, I think learning classifier systems (https://en.wikipedia.org/wiki/Learning_classifier_system) are particularly useful when working with large, sparse binary vectors (like from a one-hot encoding). I am really not into LCS's, and only know the basics (read through the first chapters of the Intro to Learning Classifier Systems draft; the print version will be out later this year). 
> Also, I saw an interesting poster on a Set Covering Machine algorithm once, which they benchmarked against SVMs, random forests and the like for categorical (genomics data). Looked promising. > > Best, > Sebastian > > >> On Jul 21, 2017, at 2:37 PM, Raga Markely wrote: >> >> Thank you, Jacob. Appreciate it. >> >> Regarding 'perform better', I was referring to better accuracy, precision, recall, F1 score, etc. >> >> Thanks, >> Raga >> >> On Fri, Jul 21, 2017 at 2:27 PM, Jacob Schreiber wrote: >> Traditionally tree based methods are very good when it comes to categorical variables and can handle them appropriately. There is a current WIP PR to add this support to sklearn. I'm not exactly sure what you mean that "perform better" though. Estimators that ignore the categorical aspect of these variables and treat them as discrete will likely perform worse than those that treat them appropriately. >> >> On Fri, Jul 21, 2017 at 8:11 AM, Raga Markely wrote: >> Hello, >> >> I am wondering if there are some classifiers that perform better for datasets with categorical features (converted into sparse input matrix with pd.get_dummies())? The data for the categorical features are nominal (order doesn't matter, e.g. country, occupation, etc). >> >> If you could provide me some references (papers, books, website, etc), that would be great. >> >> Thank you very much! 
>> Raga
>>
>>
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

From se.raschka at gmail.com Fri Jul 21 19:09:03 2017
From: se.raschka at gmail.com (Sebastian Raschka)
Date: Fri, 21 Jul 2017 19:09:03 -0400
Subject: [scikit-learn] Classifiers for dataset with categorical features
In-Reply-To: References: <8434D3C4-503B-4D6B-A4FB-ABB684B7DD71@gmail.com>
Message-ID:

Maybe because they are genetic algorithms, which are -- for some reason -- not very popular in the ML field in general :P. (People in bioinformatics seem to use them a lot, though.) Also, the name "Learning Classifier Systems" is a bit weird, I must say: I remember that when Ryan introduced me to those, I was like "ah yeah, sure, I know machine learning classifiers" ;)

> On Jul 21, 2017, at 3:01 PM, Stuart Reynolds wrote:
>
> +1
> LCS and its many many variants seem very practical and adaptable. I'm
> not sure why they haven't gotten traction.
> Overshadowed by GBM & random forests?
>
>
> On Fri, Jul 21, 2017 at 11:52 AM, Sebastian Raschka
> wrote:
>> Just to throw some additional ideas in here. Based on a conversation with a colleague some time ago, I think learning classifier systems (https://en.wikipedia.org/wiki/Learning_classifier_system) are particularly useful when working with large, sparse binary vectors (like from a one-hot encoding).
I am really not into LCS's, and only know the basics (read through the first chapters of the Intro to Learning Classifier Systems draft; the print version will be out later this year). >> Also, I saw an interesting poster on a Set Covering Machine algorithm once, which they benchmarked against SVMs, random forests and the like for categorical (genomics data). Looked promising. >> >> Best, >> Sebastian >> >> >>> On Jul 21, 2017, at 2:37 PM, Raga Markely wrote: >>> >>> Thank you, Jacob. Appreciate it. >>> >>> Regarding 'perform better', I was referring to better accuracy, precision, recall, F1 score, etc. >>> >>> Thanks, >>> Raga >>> >>> On Fri, Jul 21, 2017 at 2:27 PM, Jacob Schreiber wrote: >>> Traditionally tree based methods are very good when it comes to categorical variables and can handle them appropriately. There is a current WIP PR to add this support to sklearn. I'm not exactly sure what you mean that "perform better" though. Estimators that ignore the categorical aspect of these variables and treat them as discrete will likely perform worse than those that treat them appropriately. >>> >>> On Fri, Jul 21, 2017 at 8:11 AM, Raga Markely wrote: >>> Hello, >>> >>> I am wondering if there are some classifiers that perform better for datasets with categorical features (converted into sparse input matrix with pd.get_dummies())? The data for the categorical features are nominal (order doesn't matter, e.g. country, occupation, etc). >>> >>> If you could provide me some references (papers, books, website, etc), that would be great. >>> >>> Thank you very much! 
>>> Raga >>> >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From jmschreiber91 at gmail.com Sat Jul 22 15:07:15 2017 From: jmschreiber91 at gmail.com (Jacob Schreiber) Date: Sat, 22 Jul 2017 12:07:15 -0700 Subject: [scikit-learn] scikit-learn hits 20k github stars Message-ID: [image: Inline image 1] Many thanks to everyone who has worked on and contributed to the project for the past decade to make it such a great tool! Also a special thanks to Joel Nothman, who has been on top of answering issues and reviewing PRs for years now. ?? -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image.png Type: image/png Size: 9294 bytes Desc: not available URL: From joel.nothman at gmail.com Sat Jul 22 23:33:42 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Sun, 23 Jul 2017 13:33:42 +1000 Subject: [scikit-learn] scikit-learn hits 20k github stars In-Reply-To: References: Message-ID: oh, thanks. but the last year isn't years! There has been plenty of great work and personal example, dedication, experience and expertise to build on. 
I'm not naming names. Congratulations everyone! It's no small feat for software to remain relevant for a decade, let alone to keep accruing stars in a domain with so much change as machine learning. that also means processing contributions constantly, almost entirely on volunteered time. We appreciate every bit of help sorting through them, ensuring we stay relevant and high quality. I certainly can't do that by myself. On 23 Jul 2017 5:13 am, "Jacob Schreiber" wrote: > [image: Inline image 1] > > Many thanks to everyone who has worked on and contributed to the project > for the past decade to make it such a great tool! Also a special thanks to > Joel Nothman, who has been on top of answering issues and reviewing PRs for > years now. > > ?? > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image.png Type: image/png Size: 9294 bytes Desc: not available URL: From sambarnett95 at gmail.com Mon Jul 24 08:57:25 2017 From: sambarnett95 at gmail.com (Sam Barnett) Date: Mon, 24 Jul 2017 13:57:25 +0100 Subject: [scikit-learn] Fwd: Custom transformer failing check_estimator test In-Reply-To: References: Message-ID: Dear scikit-learn developers, I am developing a transformer, named Sqizer, that has the ultimate goal of modifying a kernel for use with the sklearn.svm package. When given an input data array X, Sqizer.transform(X) should have as its output the Gram matrix for X using the modified version of the kernel. 
Here is the code for the class so far:

class Sqizer(BaseEstimator, TransformerMixin):

    def __init__(self, C=1.0, kernel='rbf', degree=3, gamma=1,
                 coef0=0.0, cut_ord_pair=(2,1)):
        self.C = C
        self.kernel = kernel
        self.degree = degree
        self.gamma = gamma
        self.coef0 = coef0
        self.cut_ord_pair = cut_ord_pair

    def fit(self, X, y=None):
        # Check that X and y have correct shape
        X, y = check_X_y(X, y)
        # Store the classes seen during fit
        self.classes_ = unique_labels(y)

        self.X_ = X
        self.y_ = y
        return self

    def transform(self, X):

        X = check_array(X, warn_on_dtype=True)

        """Returns Gram matrix corresponding to X, once sqized."""
        def kPolynom(x, y):
            return (self.coef0 + self.gamma*np.inner(x, y))**self.degree
        def kGauss(x, y):
            return np.exp(-self.gamma*np.sum(np.square(x-y)))
        def kLinear(x, y):
            return np.inner(x, y)
        def kSigmoid(x, y):
            return np.tanh(self.gamma*np.inner(x, y) + self.coef0)

        def kernselect(kername):
            switcher = {
                'linear': kPolynom,
                'rbf': kGauss,
                'sigmoid': kLinear,
                'poly': kSigmoid,
            }
            return switcher.get(kername, "nothing")

        cut_off = self.cut_ord_pair[0]
        order = self.cut_ord_pair[1]

        from SeqKernel import hiSeqKernEval

        def getGram(Y):
            gram_matrix = np.zeros((Y.shape[0], Y.shape[0]))
            for row1ind in range(Y.shape[0]):
                for row2ind in range(X.shape[0]):
                    gram_matrix[row1ind, row2ind] = \
                        hiSeqKernEval(Y[row1ind], Y[row2ind],
                                      kernselect(self.kernel),
                                      cut_off, order)
            return gram_matrix

        return getGram(X)

However, when I run the check_estimator method on Sqizer, I get an error with the following check:

# raises error on malformed input for transform
if hasattr(X, 'T'):
    # If it's not an array, it does not have a 'T' property
    assert_raises(ValueError, transformer.transform, X.T)

How do I alter my code to pass this test? Could my estimator trip up on any further tests? I have attached the relevant .py files if you require a bigger picture. This particular snippet comes from the OptimalKernel.py file.
Many thanks, Sam Barnett -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: OptimalKernel.py Type: text/x-python-script Size: 8516 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: SeqKernel.py Type: text/x-python-script Size: 4199 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: TensorTools.py Type: text/x-python-script Size: 983 bytes Desc: not available URL: From joel.nothman at gmail.com Mon Jul 24 19:54:32 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Tue, 25 Jul 2017 09:54:32 +1000 Subject: [scikit-learn] Fwd: Custom transformer failing check_estimator test In-Reply-To: References: Message-ID: what is the failing test? please provide the full traceback. On 24 Jul 2017 10:58 pm, "Sam Barnett" wrote: > Dear scikit-learn developers, > > I am developing a transformer, named Sqizer, that has the ultimate goal > of modifying a kernel for use with the sklearn.svm package. When given an > input data array X, Sqizer.transform(X) should have as its output the > Gram matrix for X using the modified version of the kernel. 
Here is the > code for the class so far: > > class Sqizer(BaseEstimator, TransformerMixin): > > def __init__(self, C=1.0, kernel='rbf', degree=3, gamma=1, > coef0=0.0, cut_ord_pair=(2,1)): > self.C = C > self.kernel = kernel > self.degree = degree > self.gamma = gamma > self.coef0 = coef0 > self.cut_ord_pair = cut_ord_pair > > def fit(self, X, y=None): > # Check that X and y have correct shape > X, y = check_X_y(X, y) > # Store the classes seen during fit > self.classes_ = unique_labels(y) > > self.X_ = X > self.y_ = y > return self > > def transform(self, X): > > X = check_array(X, warn_on_dtype=True) > > """Returns Gram matrix corresponding to X, once sqized.""" > def kPolynom(x,y): > return (self.coef0+self.gamma*np.inner(x,y))**self.degree > def kGauss(x,y): > return np.exp(-self.gamma*np.sum(np.square(x-y))) > def kLinear(x,y): > return np.inner(x,y) > def kSigmoid(x,y): > return np.tanh(self.gamma*np.inner(x,y) +self.coef0) > > def kernselect(kername): > switcher = { > 'linear': kPolynom, > 'rbf': kGauss, > 'sigmoid': kLinear, > 'poly': kSigmoid, > } > return switcher.get(kername, "nothing") > > cut_off = self.cut_ord_pair[0] > order = self.cut_ord_pair[1] > > from SeqKernel import hiSeqKernEval > > def getGram(Y): > gram_matrix = np.zeros((Y.shape[0], Y.shape[0])) > for row1ind in range(Y.shape[0]): > for row2ind in range > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > ... -------------- next part -------------- An HTML attachment was scrubbed... 
URL: 

From sambarnett95 at gmail.com Tue Jul 25 04:41:06 2017
From: sambarnett95 at gmail.com (Sam Barnett)
Date: Tue, 25 Jul 2017 09:41:06 +0100
Subject: [scikit-learn] Fwd: Custom transformer failing check_estimator test
In-Reply-To: References: Message-ID:

This is the Traceback I get:

AssertionErrorTraceback (most recent call last)
 in ()
----> 1 check_estimator(OK.Sqizer)

/Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/estimator_checks.pyc in check_estimator(Estimator)
    253     check_parameters_default_constructible(name, Estimator)
    254     for check in _yield_all_checks(name, Estimator):
--> 255         check(name, Estimator)
    256
    257

/Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/testing.pyc in wrapper(*args, **kwargs)
    353         with warnings.catch_warnings():
    354             warnings.simplefilter("ignore", self.category)
--> 355             return fn(*args, **kwargs)
    356
    357     return wrapper

/Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/estimator_checks.pyc in check_transformer_general(name, Transformer)
    578     X = StandardScaler().fit_transform(X)
    579     X -= X.min()
--> 580     _check_transformer(name, Transformer, X, y)
    581     _check_transformer(name, Transformer, X.tolist(), y.tolist())
    582

/Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/estimator_checks.pyc in _check_transformer(name, Transformer, X, y)
    671     if hasattr(X, 'T'):
    672         # If it's not an array, it does not have a 'T' property
--> 673         assert_raises(ValueError, transformer.transform, X.T)
    674
    675

/Users/Sam/anaconda/lib/python2.7/unittest/case.pyc in assertRaises(self, excClass, callableObj, *args, **kwargs)
    471             return context
    472         with context:
--> 473             callableObj(*args, **kwargs)
    474
    475     def _getAssertEqualityFunc(self, first, second):

/Users/Sam/anaconda/lib/python2.7/unittest/case.pyc in __exit__(self, exc_type, exc_value, tb)
    114                 exc_name = str(self.expected)
    115             raise self.failureException(
--> 116                 "{0} not raised".format(exc_name))
    117         if not issubclass(exc_type, self.expected):
    118             # let
unexpected exceptions pass through AssertionError: ValueError not raised On Tue, Jul 25, 2017 at 12:54 AM, Joel Nothman wrote: > what is the failing test? please provide the full traceback. > > On 24 Jul 2017 10:58 pm, "Sam Barnett" wrote: > >> Dear scikit-learn developers, >> >> I am developing a transformer, named Sqizer, that has the ultimate goal >> of modifying a kernel for use with the sklearn.svm package. When given >> an input data array X, Sqizer.transform(X) should have as its output the >> Gram matrix for X using the modified version of the kernel. Here is the >> code for the class so far: >> >> class Sqizer(BaseEstimator, TransformerMixin): >> >> def __init__(self, C=1.0, kernel='rbf', degree=3, gamma=1, >> coef0=0.0, cut_ord_pair=(2,1)): >> self.C = C >> self.kernel = kernel >> self.degree = degree >> self.gamma = gamma >> self.coef0 = coef0 >> self.cut_ord_pair = cut_ord_pair >> >> def fit(self, X, y=None): >> # Check that X and y have correct shape >> X, y = check_X_y(X, y) >> # Store the classes seen during fit >> self.classes_ = unique_labels(y) >> >> self.X_ = X >> self.y_ = y >> return self >> >> def transform(self, X): >> >> X = check_array(X, warn_on_dtype=True) >> >> """Returns Gram matrix corresponding to X, once sqized.""" >> def kPolynom(x,y): >> return (self.coef0+self.gamma*np.inner(x,y))**self.degree >> def kGauss(x,y): >> return np.exp(-self.gamma*np.sum(np.square(x-y))) >> def kLinear(x,y): >> return np.inner(x,y) >> def kSigmoid(x,y): >> return np.tanh(self.gamma*np.inner(x,y) +self.coef0) >> >> def kernselect(kername): >> switcher = { >> 'linear': kPolynom, >> 'rbf': kGauss, >> 'sigmoid': kLinear, >> 'poly': kSigmoid, >> } >> return switcher.get(kername, "nothing") >> >> cut_off = self.cut_ord_pair[0] >> order = self.cut_ord_pair[1] >> >> from SeqKernel import hiSeqKernEval >> >> def getGram(Y): >> gram_matrix = np.zeros((Y. >> >> ... 
> > [Message clipped]
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From sambarnett95 at gmail.com Tue Jul 25 08:15:28 2017
From: sambarnett95 at gmail.com (Sam Barnett)
Date: Tue, 25 Jul 2017 13:15:28 +0100
Subject: [scikit-learn] Fwd: Custom transformer failing check_estimator test
In-Reply-To: References: Message-ID:

Apologies: I've since worked out what the problem was and have resolved this issue. This was what I was missing in my code:

# Check that the input is of the same shape as the one passed
# during fit.
if X.shape != self.input_shape_:
    raise ValueError('Shape of input is different from what was seen '
                     'in `fit`')

On Tue, Jul 25, 2017 at 9:41 AM, Sam Barnett wrote:

> This is the Traceback I get:
>
>
> AssertionErrorTraceback (most recent call last)
>  in ()
> ----> 1 check_estimator(OK.Sqizer)
>
> /Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/estimator_checks.pyc in check_estimator(Estimator)
>     253     check_parameters_default_constructible(name, Estimator)
>     254     for check in _yield_all_checks(name, Estimator):
> --> 255         check(name, Estimator)
>     256
>     257
>
> /Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/testing.pyc in wrapper(*args, **kwargs)
>     353         with warnings.catch_warnings():
>     354             warnings.simplefilter("ignore", self.category)
> --> 355             return fn(*args, **kwargs)
>     356
>     357     return wrapper
>
> /Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/estimator_checks.pyc in check_transformer_general(name, Transformer)
>     578     X = StandardScaler().fit_transform(X)
>     579     X -= X.min()
> --> 580     _check_transformer(name, Transformer, X, y)
>     581     _check_transformer(name, Transformer, X.tolist(), y.tolist())
>     582
>
> /Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/estimator_checks.pyc in
_check_transformer(name, Transformer, X, y) > 671 if hasattr(X, 'T'): > 672 # If it's not an array, it does not have a 'T' property > --> 673 assert_raises(ValueError, transformer.transform, X.T) > 674 > 675 > > /Users/Sam/anaconda/lib/python2.7/unittest/case.pyc in assertRaises(self, > excClass, callableObj, *args, **kwargs) > 471 return context > 472 with context: > --> 473 callableObj(*args, **kwargs) > 474 > 475 def _getAssertEqualityFunc(self, first, second): > > /Users/Sam/anaconda/lib/python2.7/unittest/case.pyc in __exit__(self, > exc_type, exc_value, tb) > 114 exc_name = str(self.expected) > 115 raise self.failureException( > --> 116 "{0} not raised".format(exc_name)) > 117 if not issubclass(exc_type, self.expected): > 118 # let unexpected exceptions pass through > > AssertionError: ValueError not raised > > > On Tue, Jul 25, 2017 at 12:54 AM, Joel Nothman > wrote: > >> what is the failing test? please provide the full traceback. >> >> On 24 Jul 2017 10:58 pm, "Sam Barnett" wrote: >> >>> Dear scikit-learn developers, >>> >>> I am developing a transformer, named Sqizer, that has the ultimate goal >>> of modifying a kernel for use with the sklearn.svm package. When given >>> an input data array X, Sqizer.transform(X) should have as its output >>> the Gram matrix for X using the modified version of the kernel. 
Here is >>> the code for the class so far: >>> >>> class Sqizer(BaseEstimator, TransformerMixin): >>> >>> def __init__(self, C=1.0, kernel='rbf', degree=3, gamma=1, >>> coef0=0.0, cut_ord_pair=(2,1)): >>> self.C = C >>> self.kernel = kernel >>> self.degree = degree >>> self.gamma = gamma >>> self.coef0 = coef0 >>> self.cut_ord_pair = cut_ord_pair >>> >>> def fit(self, X, y=None): >>> # Check that X and y have correct shape >>> X, y = check_X_y(X, y) >>> # Store the classes seen during fit >>> self.classes_ = unique_labels(y) >>> >>> self.X_ = X >>> self.y_ = y >>> return self >>> >>> def transform(self, X): >>> >>> X = check_array(X, warn_on_dtype=True) >>> >>> """Returns Gram matrix corresponding to X, once sqized.""" >>> def kPolynom(x,y): >>> return (self.coef0+self.gamma*np.inner(x,y))**self.degree >>> def kGauss(x,y): >>> return np.exp(-self.gamma*np.sum(np.square(x-y))) >>> def kLinear(x,y): >>> return np.inner(x,y) >>> def kSigmoid(x,y): >>> return np.tanh(self.gamma*np.inner(x,y) +self.coef0) >>> >>> def kernselect(kername): >>> switcher = { >>> 'linear': kPolynom, >>> 'rbf': kGauss, >>> 'sigmoid': kLinear, >>> 'poly': kSigmoid, >>> } >>> return switcher.get(kername, "nothing") >>> >>> cut_off = self.cut_ord_pair[0] >>> order = self.cut_ord_pair[1] >>> >>> from SeqKernel import hiSeqKernEval >>> >>> def getGram(Y): >>> gram_matrix = np.zeros((Y. >>> >>> ... >> >> [Message clipped] >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From t3kcit at gmail.com Tue Jul 25 16:31:40 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Tue, 25 Jul 2017 16:31:40 -0400 Subject: [scikit-learn] Fwd: Custom transformer failing check_estimator test In-Reply-To: References: Message-ID: <42dff767-f7d8-b1c1-84f3-58ea4e4cab16@gmail.com> Indeed, it makes sure that the transform is applied to data with the same number of samples as the input. PR welcome to provide a better error message on this! On 07/25/2017 08:15 AM, Sam Barnett wrote: > Apologies: I've since worked out what the problem was and have > resolved this issue. This was what I was missing in my code: > > > # Check that the input is of the same shape as the one passed > # during fit. > if X.shape != self.input_shape_: > raise ValueError('Shape of input is different from what > was seen' > 'in `fit`') > > > On Tue, Jul 25, 2017 at 9:41 AM, Sam Barnett > wrote: > > This is the Traceback I get: > > > AssertionErrorTraceback (most recent call last) > in () > ----> 1 check_estimator(OK.Sqizer) > > /Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/estimator_checks.pyc > in check_estimator(Estimator) > 253 check_parameters_default_constructible(name, Estimator) > 254 for check in _yield_all_checks(name, Estimator): > --> 255 check(name, Estimator) > 256 > 257 > > /Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/testing.pyc > in wrapper(*args, **kwargs) > 353 with warnings.catch_warnings(): > 354 warnings.simplefilter("ignore", self.category) > --> 355 return fn(*args, **kwargs) > 356 > 357 return wrapper > > /Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/estimator_checks.pyc > in check_transformer_general(name, Transformer) > 578 X = StandardScaler().fit_transform(X) > 579 X -= X.min() > --> 580 _check_transformer(name, Transformer, X, y) > 581 _check_transformer(name, Transformer, X.tolist(), y.tolist()) > 582 > > /Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/estimator_checks.pyc > in 
_check_transformer(name, Transformer, X, y) > 671 if hasattr(X, 'T'): > 672 # If it's not an array, it does not have a 'T' > property > --> 673 assert_raises(ValueError, transformer.transform, X.T) > 674 > 675 > > /Users/Sam/anaconda/lib/python2.7/unittest/case.pyc in > assertRaises(self, excClass, callableObj, *args, **kwargs) > 471 return context > 472 with context: > --> 473 callableObj(*args, **kwargs) > 474 > 475 def _getAssertEqualityFunc(self, first, second): > > /Users/Sam/anaconda/lib/python2.7/unittest/case.pyc in > __exit__(self, exc_type, exc_value, tb) > 114 exc_name = str(self.expected) > 115 raise self.failureException( > --> 116 "{0} not raised".format(exc_name)) > 117 if not issubclass(exc_type, self.expected): > 118 # let unexpected exceptions pass through > > AssertionError: ValueError not raised > > > On Tue, Jul 25, 2017 at 12:54 AM, Joel Nothman > > wrote: > > what is the failing test? please provide the full traceback. > > On 24 Jul 2017 10:58 pm, "Sam Barnett" > wrote: > > Dear scikit-learn developers, > > I am developing a transformer, named |Sqizer|, that has > the ultimate goal of modifying a kernel for use with the > |sklearn.svm| package. When given an input data array |X|, > |Sqizer.transform(X)| should have as its output the Gram > matrix for |X| using the modified version of the kernel. 
> Here is the code for the class so far:
>
>     class Sqizer(BaseEstimator, TransformerMixin):
>
>         def __init__(self, C=1.0, kernel='rbf', degree=3, gamma=1,
>                      coef0=0.0, cut_ord_pair=(2,1)):
>             self.C = C
>             self.kernel = kernel
>             self.degree = degree
>             self.gamma = gamma
>             self.coef0 = coef0
>             self.cut_ord_pair = cut_ord_pair
>
>         def fit(self, X, y=None):
>             # Check that X and y have correct shape
>             X, y = check_X_y(X, y)
>             # Store the classes seen during fit
>             self.classes_ = unique_labels(y)
>
>             self.X_ = X
>             self.y_ = y
>             return self
>
>         def transform(self, X):
>
>             X = check_array(X, warn_on_dtype=True)
>
>             """Returns Gram matrix corresponding to X, once sqized."""
>             def kPolynom(x, y):
>                 return (self.coef0 + self.gamma*np.inner(x, y))**self.degree
>             def kGauss(x, y):
>                 return np.exp(-self.gamma*np.sum(np.square(x - y)))
>             def kLinear(x, y):
>                 return np.inner(x, y)
>             def kSigmoid(x, y):
>                 return np.tanh(self.gamma*np.inner(x, y) + self.coef0)
>
>             def kernselect(kername):
>                 switcher = {
>                     'linear': kPolynom,
>                     'rbf': kGauss,
>                     'sigmoid': kLinear,
>                     'poly': kSigmoid,
>                 }
>                 return switcher.get(kername, "nothing")
>
>             cut_off = self.cut_ord_pair[0]
>             order = self.cut_ord_pair[1]
>
>             from SeqKernel import hiSeqKernEval
>
>             def getGram(Y):
>                 gram_matrix = np.zeros((Y.
>
> ...
>
> [Message clipped]
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From joel.nothman at gmail.com Tue Jul 25 19:58:35 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Wed, 26 Jul 2017 09:58:35 +1000 Subject: [scikit-learn] Fwd: Custom transformer failing check_estimator test In-Reply-To: <42dff767-f7d8-b1c1-84f3-58ea4e4cab16@gmail.com> References: <42dff767-f7d8-b1c1-84f3-58ea4e4cab16@gmail.com> Message-ID: One advantage of moving to pytest is that we can put messages into pytest.raises, and we should emphasise this in moving the check_estimator assertions to pytest. But I'm also not sure how we do the deprecation of nosetests for check_estimator in a way that is friendly to our contributors... On 26 July 2017 at 06:31, Andreas Mueller wrote: > Indeed, it makes sure that the transform is applied to data with the same > number of samples as the input. > PR welcome to provide a better error message on this! > > On 07/25/2017 08:15 AM, Sam Barnett wrote: > > Apologies: I've since worked out what the problem was and have resolved > this issue. This was what I was missing in my code: > > > # Check that the input is of the same shape as the one passed > # during fit.
> if X.shape != self.input_shape_: > raise ValueError('Shape of input is different from what was > seen' > 'in `fit`') > > > On Tue, Jul 25, 2017 at 9:41 AM, Sam Barnett > wrote: > >> This is the Traceback I get: >> >> >> AssertionErrorTraceback (most recent call last) >> in () >> ----> 1 check_estimator(OK.Sqizer) >> >> /Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/util >> s/estimator_checks.pyc in check_estimator(Estimator) >> 253 check_parameters_default_constructible(name, Estimator) >> 254 for check in _yield_all_checks(name, Estimator): >> --> 255 check(name, Estimator) >> 256 >> 257 >> >> /Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/testing.pyc >> in wrapper(*args, **kwargs) >> 353 with warnings.catch_warnings(): >> 354 warnings.simplefilter("ignore", self.category) >> --> 355 return fn(*args, **kwargs) >> 356 >> 357 return wrapper >> >> /Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/estimator_checks.pyc >> in check_transformer_general(name, Transformer) >> 578 X = StandardScaler().fit_transform(X) >> 579 X -= X.min() >> --> 580 _check_transformer(name, Transformer, X, y) >> 581 _check_transformer(name, Transformer, X.tolist(), y.tolist()) >> 582 >> >> /Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/estimator_checks.pyc >> in _check_transformer(name, Transformer, X, y) >> 671 if hasattr(X, 'T'): >> 672 # If it's not an array, it does not have a 'T' >> property >> --> 673 assert_raises(ValueError, transformer.transform, X.T) >> 674 >> 675 >> >> /Users/Sam/anaconda/lib/python2.7/unittest/case.pyc in assertRaises(self, >> excClass, callableObj, *args, **kwargs) >> 471 return context >> 472 with context: >> --> 473 callableObj(*args, **kwargs) >> 474 >> 475 def _getAssertEqualityFunc(self, first, second): >> >> /Users/Sam/anaconda/lib/python2.7/unittest/case.pyc in __exit__(self, >> exc_type, exc_value, tb) >> 114 exc_name = str(self.expected) >> 115 raise self.failureException( >> --> 116 "{0} not 
raised".format(exc_name)) >> 117 if not issubclass(exc_type, self.expected): >> 118 # let unexpected exceptions pass through >> >> AssertionError: ValueError not raised >> >> >> On Tue, Jul 25, 2017 at 12:54 AM, Joel Nothman >> wrote: >> >>> what is the failing test? please provide the full traceback. >>> >>> On 24 Jul 2017 10:58 pm, "Sam Barnett" wrote: >>> >>>> Dear scikit-learn developers, >>>> >>>> I am developing a transformer, named Sqizer, that has the ultimate >>>> goal of modifying a kernel for use with the sklearn.svm package. When >>>> given an input data array X, Sqizer.transform(X) should have as its >>>> output the Gram matrix for X using the modified version of the kernel. >>>> Here is the code for the class so far: >>>> >>>> class Sqizer(BaseEstimator, TransformerMixin): >>>> >>>> def __init__(self, C=1.0, kernel='rbf', degree=3, gamma=1, >>>> coef0=0.0, cut_ord_pair=(2,1)): >>>> self.C = C >>>> self.kernel = kernel >>>> self.degree = degree >>>> self.gamma = gamma >>>> self.coef0 = coef0 >>>> self.cut_ord_pair = cut_ord_pair >>>> >>>> def fit(self, X, y=None): >>>> # Check that X and y have correct shape >>>> X, y = check_X_y(X, y) >>>> # Store the classes seen during fit >>>> self.classes_ = unique_labels(y) >>>> >>>> self.X_ = X >>>> self.y_ = y >>>> return self >>>> >>>> def transform(self, X): >>>> >>>> X = check_array(X, warn_on_dtype=True) >>>> >>>> """Returns Gram matrix corresponding to X, once sqized.""" >>>> def kPolynom(x,y): >>>> return (self.coef0+self.gamma*np.inner(x,y))**self.degree >>>> def kGauss(x,y): >>>> return np.exp(-self.gamma*np.sum(np.square(x-y))) >>>> def kLinear(x,y): >>>> return np.inner(x,y) >>>> def kSigmoid(x,y): >>>> return np.tanh(self.gamma*np.inner(x,y) +self.coef0) >>>> >>>> def kernselect(kername): >>>> switcher = { >>>> 'linear': kPolynom, >>>> 'rbf': kGauss, >>>> 'sigmoid': kLinear, >>>> 'poly': kSigmoid, >>>> } >>>> return switcher.get(kername, "nothing") >>>> >>>> cut_off = self.cut_ord_pair[0] >>>> order 
= self.cut_ord_pair[1] >>>> >>>> from SeqKernel import hiSeqKernEval >>>> >>>> def getGram(Y): >>>> gram_matrix = np.zeros((Y. >>>> >>>> ... >>> >>> [Message clipped] >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> > > > _______________________________________________ > scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Wed Jul 26 03:02:28 2017 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Wed, 26 Jul 2017 09:02:28 +0200 Subject: [scikit-learn] Classifiers for dataset with categorical features In-Reply-To: References: <8434D3C4-503B-4D6B-A4FB-ABB684B7DD71@gmail.com> Message-ID: <20170726070228.GO3579441@phare.normalesup.org> The right thing to do would probably be to write a scikit-learn-contrib package for them and see if they gather traction. If they perform well on e.g. Kaggle competitions, we know that we need them in :). Cheers, Gaël On Fri, Jul 21, 2017 at 07:09:03PM -0400, Sebastian Raschka wrote: > Maybe because they are genetic algorithms, which are -- for some reason -- not very popular in the ML field in general :P. (People in bioinformatics seem to use them a lot, though.) Also, the name "Learning Classifier Systems" is also a bit weird, I must say: I remember that when Ryan introduced me to those, I was like "ah yeah, sure, I know machine learning classifiers" ;) > > On Jul 21, 2017, at 3:01 PM, Stuart Reynolds wrote: > > +1 > > LCS and its many many variants seem very practical and adaptable. I'm > > not sure why they haven't gotten traction.
> > Overshadowed by GBM & random forests? > > On Fri, Jul 21, 2017 at 11:52 AM, Sebastian Raschka > > wrote: > >> Just to throw some additional ideas in here. Based on a conversation with a colleague some time ago, I think learning classifier systems (https://en.wikipedia.org/wiki/Learning_classifier_system) are particularly useful when working with large, sparse binary vectors (like from a one-hot encoding). I am really not into LCS's, and only know the basics (read through the first chapters of the Intro to Learning Classifier Systems draft; the print version will be out later this year). > >> Also, I saw an interesting poster on a Set Covering Machine algorithm once, which they benchmarked against SVMs, random forests and the like for categorical (genomics data). Looked promising. > >> Best, > >> Sebastian > >>> On Jul 21, 2017, at 2:37 PM, Raga Markely wrote: > >>> Thank you, Jacob. Appreciate it. > >>> Regarding 'perform better', I was referring to better accuracy, precision, recall, F1 score, etc. > >>> Thanks, > >>> Raga > >>> On Fri, Jul 21, 2017 at 2:27 PM, Jacob Schreiber wrote: > >>> Traditionally tree based methods are very good when it comes to categorical variables and can handle them appropriately. There is a current WIP PR to add this support to sklearn. I'm not exactly sure what you mean that "perform better" though. Estimators that ignore the categorical aspect of these variables and treat them as discrete will likely perform worse than those that treat them appropriately. > >>> On Fri, Jul 21, 2017 at 8:11 AM, Raga Markely wrote: > >>> Hello, > >>> I am wondering if there are some classifiers that perform better for datasets with categorical features (converted into sparse input matrix with pd.get_dummies())? The data for the categorical features are nominal (order doesn't matter, e.g. country, occupation, etc). > >>> If you could provide me some references (papers, books, website, etc), that would be great. > >>> Thank you very much! 
> >>> Raga > >>> _______________________________________________ > >>> scikit-learn mailing list > >>> scikit-learn at python.org > >>> https://mail.python.org/mailman/listinfo/scikit-learn > >>> _______________________________________________ > >>> scikit-learn mailing list > >>> scikit-learn at python.org > >>> https://mail.python.org/mailman/listinfo/scikit-learn > >>> _______________________________________________ > >>> scikit-learn mailing list > >>> scikit-learn at python.org > >>> https://mail.python.org/mailman/listinfo/scikit-learn > >> _______________________________________________ > >> scikit-learn mailing list > >> scikit-learn at python.org > >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From t3kcit at gmail.com Wed Jul 26 10:54:49 2017 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 26 Jul 2017 10:54:49 -0400 Subject: [scikit-learn] Fwd: Custom transformer failing check_estimator test In-Reply-To: References: <42dff767-f7d8-b1c1-84f3-58ea4e4cab16@gmail.com> Message-ID: <130377b1-0557-698a-0ef3-b71a201cb2aa@gmail.com> Hm, it would be nice to do this in a way that relies less on pytest, but I guess that would be tricky. One way would be to use assert_raise_message to make clear what the expected error is. But that would make the current test more strict - not necessarily that bad, I guess? It looks like all asserts in unittest have a "msg" argument... 
apart from assertRaises: https://docs.python.org/2/library/unittest.html#unittest.TestCase.assertRaises That has been fixed in Python 3.3, though: https://docs.python.org/3/library/unittest.html#unittest.TestCase.assertRaises So maybe we should just do a backport for assert_raises and assert_raises_regex? On 07/25/2017 07:58 PM, Joel Nothman wrote: > One advantage of moving to pytest is that we can put messages into > pytest.raises, and we should emphasise this in moving the > check_estimator assertions to pytest. But I'm also not sure how we do > the deprecation of nosetests for check_estimator in a way that is > friendly to our contribbers... > > On 26 July 2017 at 06:31, Andreas Mueller > wrote: > > Indeed, it makes sure that the transform is applied to data with > the same number of samples as the input. > PR welcome to provide a better error message on this! > > On 07/25/2017 08:15 AM, Sam Barnett wrote: >> Apologies: I've since worked out what the problem was and have >> resolved this issue. This was what I was missing in my code: >> >> >> # Check that the input is of the same shape as the one passed >> # during fit. 
>> if X.shape != self.input_shape_: >> raise ValueError('Shape of input is different from what was seen' >> 'in `fit`') >> >> >> On Tue, Jul 25, 2017 at 9:41 AM, Sam Barnett >> > wrote: >> >> This is the Traceback I get: >> >> >> AssertionErrorTraceback (most recent call last) >> in () >> ----> 1 check_estimator(OK.Sqizer) >> >> /Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/estimator_checks.pyc >> in check_estimator(Estimator) >> 253 check_parameters_default_constructible(name, Estimator) >> 254 for check in _yield_all_checks(name, Estimator): >> --> 255 check(name, Estimator) >> 256 >> 257 >> >> /Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/testing.pyc >> in wrapper(*args, **kwargs) >> 353 with warnings.catch_warnings(): >> 354 warnings.simplefilter("ignore", >> self.category) >> --> 355 return fn(*args, **kwargs) >> 356 >> 357 return wrapper >> >> /Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/estimator_checks.pyc >> in check_transformer_general(name, Transformer) >> 578 X = StandardScaler().fit_transform(X) >> 579 X -= X.min() >> --> 580 _check_transformer(name, Transformer, X, y) >> 581 _check_transformer(name, Transformer, X.tolist(), y.tolist()) >> 582 >> >> /Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/estimator_checks.pyc >> in _check_transformer(name, Transformer, X, y) >> 671 if hasattr(X, 'T'): >> 672 # If it's not an array, it does not have a >> 'T' property >> --> 673 assert_raises(ValueError, transformer.transform, X.T) >> 674 >> 675 >> >> /Users/Sam/anaconda/lib/python2.7/unittest/case.pyc in >> assertRaises(self, excClass, callableObj, *args, **kwargs) >> 471 return context >> 472 with context: >> --> 473 callableObj(*args, **kwargs) >> 474 >> 475 def _getAssertEqualityFunc(self, first, second): >> >> /Users/Sam/anaconda/lib/python2.7/unittest/case.pyc in >> __exit__(self, exc_type, exc_value, tb) >> 114 exc_name = str(self.expected) >> 115 raise self.failureException( >> --> 116 "{0} 
not raised".format(exc_name))
>> 117 if not issubclass(exc_type, self.expected):
>> 118 # let unexpected exceptions pass through
>>
>> AssertionError: ValueError not raised
>>
>> On Tue, Jul 25, 2017 at 12:54 AM, Joel Nothman wrote:
>>
>> what is the failing test? please provide the full traceback.
>>
>> On 24 Jul 2017 10:58 pm, "Sam Barnett" wrote:
>>
>> Dear scikit-learn developers,
>>
>> I am developing a transformer, named Sqizer, that has the
>> ultimate goal of modifying a kernel for use with the
>> sklearn.svm package. When given an input data array X,
>> Sqizer.transform(X) should have as its output the Gram
>> matrix for X using the modified version of the kernel.
>> Here is the code for the class so far:
>>
>>     class Sqizer(BaseEstimator, TransformerMixin):
>>
>>         def __init__(self, C=1.0, kernel='rbf', degree=3, gamma=1,
>>                      coef0=0.0, cut_ord_pair=(2,1)):
>>             self.C = C
>>             self.kernel = kernel
>>             self.degree = degree
>>             self.gamma = gamma
>>             self.coef0 = coef0
>>             self.cut_ord_pair = cut_ord_pair
>>
>>         def fit(self, X, y=None):
>>             # Check that X and y have correct shape
>>             X, y = check_X_y(X, y)
>>             # Store the classes seen during fit
>>             self.classes_ = unique_labels(y)
>>
>>             self.X_ = X
>>             self.y_ = y
>>             return self
>>
>>         def transform(self, X):
>>
>>             X = check_array(X, warn_on_dtype=True)
>>
>>             """Returns Gram matrix corresponding to X, once sqized."""
>>             def kPolynom(x, y):
>>                 return (self.coef0 + self.gamma*np.inner(x, y))**self.degree
>>             def kGauss(x, y):
>>                 return np.exp(-self.gamma*np.sum(np.square(x - y)))
>>             def kLinear(x, y):
>>                 return np.inner(x, y)
>>             def kSigmoid(x, y):
>>                 return np.tanh(self.gamma*np.inner(x, y) + self.coef0)
>>
>>             def kernselect(kername):
>>                 switcher = {
>>                     'linear': kPolynom,
>>                     'rbf': kGauss,
>>                     'sigmoid': kLinear,
>>                     'poly': kSigmoid,
>>                 }
>>                 return switcher.get(kername, "nothing")
>>
>>             cut_off = self.cut_ord_pair[0]
>>             order = self.cut_ord_pair[1]
>>
>>             from SeqKernel import hiSeqKernEval
>>
>>             def getGram(Y):
>>                 gram_matrix = np.zeros((Y.
>>
>> ...
>> >> [Message clipped] >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From renato.deleone at gmail.com Wed Jul 26 12:26:11 2017 From: renato.deleone at gmail.com (Renato De Leone) Date: Wed, 26 Jul 2017 18:26:11 +0200 Subject: [scikit-learn] Showing loss value in MLPClassifier Message-ID: Is it possible to show additional information, such as the current value of the loss function, etc., in MLPClassifier? Apparently verbose=True does not make any difference. Thanks -- Renato -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Wed Jul 26 18:04:58 2017 From: joel.nothman at gmail.com (Joel Nothman) Date: Thu, 27 Jul 2017 08:04:58 +1000 Subject: [scikit-learn] Fwd: Custom transformer failing check_estimator test In-Reply-To: <130377b1-0557-698a-0ef3-b71a201cb2aa@gmail.com> References: <42dff767-f7d8-b1c1-84f3-58ea4e4cab16@gmail.com> <130377b1-0557-698a-0ef3-b71a201cb2aa@gmail.com> Message-ID: The difference is the functional form versus the context manager. You can't add extra parameters to the function, only to the context manager. On 27 Jul 2017 12:56 am, "Andreas Mueller" wrote: > Hm, it would be nice to do this in a way that relies less on pytest, but I > guess that would be tricky.
> One way would be to use assert_raise_message to make clear what the > expected error is. > But that would make the current test more strict - not necessarily that > bad, I guess? > > It looks like all asserts in unittest have a "msg" argument... apart from > assertRaises: > https://docs.python.org/2/library/unittest.html# > unittest.TestCase.assertRaises > > That has been fixed in Python 3.3, though: > https://docs.python.org/3/library/unittest.html# > unittest.TestCase.assertRaises > > So maybe we should just do a backport for assert_raises and > assert_raises_regex? > > > On 07/25/2017 07:58 PM, Joel Nothman wrote: > > One advantage of moving to pytest is that we can put messages into > pytest.raises, and we should emphasise this in moving the check_estimator > assertions to pytest. But I'm also not sure how we do the deprecation of > nosetests for check_estimator in a way that is friendly to our > contributors... > > On 26 July 2017 at 06:31, Andreas Mueller wrote: > >> Indeed, it makes sure that the transform is applied to data with the same >> number of samples as the input. >> PR welcome to provide a better error message on this! >> >> On 07/25/2017 08:15 AM, Sam Barnett wrote: >> >> Apologies: I've since worked out what the problem was and have resolved >> this issue. This was what I was missing in my code: >> >> >> # Check that the input is of the same shape as the one passed >> # during fit.
>> if X.shape != self.input_shape_: >> raise ValueError('Shape of input is different from what was >> seen' >> 'in `fit`') >> >> >> On Tue, Jul 25, 2017 at 9:41 AM, Sam Barnett >> wrote: >> >>> This is the Traceback I get: >>> >>> >>> AssertionErrorTraceback (most recent call last) >>> in () >>> ----> 1 check_estimator(OK.Sqizer) >>> >>> /Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/util >>> s/estimator_checks.pyc in check_estimator(Estimator) >>> 253 check_parameters_default_constructible(name, Estimator) >>> 254 for check in _yield_all_checks(name, Estimator): >>> --> 255 check(name, Estimator) >>> 256 >>> 257 >>> >>> /Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/testing.pyc >>> in wrapper(*args, **kwargs) >>> 353 with warnings.catch_warnings(): >>> 354 warnings.simplefilter("ignore", self.category) >>> --> 355 return fn(*args, **kwargs) >>> 356 >>> 357 return wrapper >>> >>> /Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/estimator_checks.pyc >>> in check_transformer_general(name, Transformer) >>> 578 X = StandardScaler().fit_transform(X) >>> 579 X -= X.min() >>> --> 580 _check_transformer(name, Transformer, X, y) >>> 581 _check_transformer(name, Transformer, X.tolist(), >>> y.tolist()) >>> 582 >>> >>> /Users/Sam/anaconda/lib/python2.7/site-packages/sklearn/utils/estimator_checks.pyc >>> in _check_transformer(name, Transformer, X, y) >>> 671 if hasattr(X, 'T'): >>> 672 # If it's not an array, it does not have a 'T' >>> property >>> --> 673 assert_raises(ValueError, transformer.transform, X.T >>> ) >>> 674 >>> 675 >>> >>> /Users/Sam/anaconda/lib/python2.7/unittest/case.pyc in assertRaises(self, >>> excClass, callableObj, *args, **kwargs) >>> 471 return context >>> 472 with context: >>> --> 473 callableObj(*args, **kwargs) >>> 474 >>> 475 def _getAssertEqualityFunc(self, first, second): >>> >>> /Users/Sam/anaconda/lib/python2.7/unittest/case.pyc in __exit__(self, >>> exc_type, exc_value, tb) >>> 114 exc_name = 
str(self.expected) >>> 115 raise self.failureException( >>> --> 116 "{0} not raised".format(exc_name)) >>> 117 if not issubclass(exc_type, self.expected): >>> 118 # let unexpected exceptions pass through >>> >>> AssertionError: ValueError not raised >>> >>> >>> On Tue, Jul 25, 2017 at 12:54 AM, Joel Nothman >>> wrote: >>> >>>> what is the failing test? please provide the full traceback. >>>> >>>> On 24 Jul 2017 10:58 pm, "Sam Barnett" wrote: >>>> >>>>> Dear scikit-learn developers, >>>>> >>>>> I am developing a transformer, named Sqizer, that has the ultimate >>>>> goal of modifying a kernel for use with the sklearn.svm package. When >>>>> given an input data array X, Sqizer.transform(X) should have as its >>>>> output the Gram matrix for X using the modified version of the >>>>> kernel. Here is the code for the class so far: >>>>> >>>>> class Sqizer(BaseEstimator, TransformerMixin): >>>>> >>>>> def __init__(self, C=1.0, kernel='rbf', degree=3, gamma=1, >>>>> coef0=0.0, cut_ord_pair=(2,1)): >>>>> self.C = C >>>>> self.kernel = kernel >>>>> self.degree = degree >>>>> self.gamma = gamma >>>>> self.coef0 = coef0 >>>>> self.cut_ord_pair = cut_ord_pair >>>>> >>>>> def fit(self, X, y=None): >>>>> # Check that X and y have correct shape >>>>> X, y = check_X_y(X, y) >>>>> # Store the classes seen during fit >>>>> self.classes_ = unique_labels(y) >>>>> >>>>> self.X_ = X >>>>> self.y_ = y >>>>> return self >>>>> >>>>> def transform(self, X): >>>>> >>>>> X = check_array(X, warn_on_dtype=True) >>>>> >>>>> """Returns Gram matrix corresponding to X, once sqized.""" >>>>> def kPolynom(x,y): >>>>> return (self.coef0+self.gamma*np.inner(x,y))**self.degree >>>>> def kGauss(x,y): >>>>> return np.exp(-self.gamma*np.sum(np.square(x-y))) >>>>> def kLinear( >>>>> >>>>> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > ... 
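[Editorial note: the functional-versus-context-manager distinction discussed above can be sketched with plain unittest. This is a minimal standalone example, not sklearn's actual check code; the `transform` below is a hypothetical stand-in that always raises.]

```python
import unittest

def transform(X):
    # Hypothetical stand-in for a transformer that validates its input,
    # rejecting it the way check_estimator expects for transposed data.
    raise ValueError('Shape of input is different from what was seen in `fit`')

class TestTransform(unittest.TestCase):
    def test_functional_form(self):
        # Functional form: extra positional arguments are forwarded to the
        # callable, so there is no slot for a custom assertion message.
        self.assertRaises(ValueError, transform, [[1, 2]])

    def test_context_manager_form(self):
        # Context-manager form: since Python 3.3 it also accepts a `msg`
        # argument, shown if the expected exception is *not* raised.
        with self.assertRaises(ValueError, msg='transform should reject bad input'):
            transform([[1, 2]])

# Run both tests programmatically.
result = unittest.TextTestRunner(verbosity=0).run(
    unittest.defaultTestLoader.loadTestsFromTestCase(TestTransform))
```

Both tests pass here; the two forms assert the same thing and differ only in where a custom message can be attached.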
-------------- next part -------------- An HTML attachment was scrubbed... URL: From saladi at caltech.edu Thu Jul 27 14:47:33 2017 From: saladi at caltech.edu (Shyam Saladi) Date: Thu, 27 Jul 2017 11:47:33 -0700 Subject: [scikit-learn] Maximum Dissimilarity Sampling Message-ID: Hello all, I'm looking to sample a large dataset for a subset that best covers the space. One way of doing this would be maximum dissimilarity, say as implemented in R as part of caret::maxDissim. Is anyone aware of similar functionality available as part of a common Python package, perhaps in scikit-learn? Many thanks in advance, Shyam -------------- next part -------------- An HTML attachment was scrubbed... URL: From masa.kondo5 at gmail.com Thu Jul 27 16:16:06 2017 From: masa.kondo5 at gmail.com (Masanari Kondo) Date: Thu, 27 Jul 2017 16:16:06 -0400 Subject: [scikit-learn] Question about the Library of "sklearn.neural_network.BernoulliRBM" that Creates Highly Correlated Features. Message-ID: Dear all, I'm using the sklearn library to generate new features of a dataset using a Restricted Boltzmann Machine (RBM, sklearn.neural_network.BernoulliRBM). I use the following environment: python 3.5.0 numpy==1.11.1 scikit-learn==0.18 I have already tried a large number of iterations (n_iter=6000) and a low learning rate (0.0001) for all training data (373 samples). However, the new features that are generated by the RBM are all highly correlated. Can anyone explain why this happens?
Below is a MWE:

import numpy as np
import csv
from sklearn.neural_network import BernoulliRBM

# train data
train_data = np.array(
    [[0.0326086956522,0.0,0.0,0.0200400801603,0.0674157303371,0.000805152979066,0.00200803212851,0.243243243243,0.0123456790123,0.55,0.0233428760185,0.0,0.0,0.0,0.444444444,0.0,0.0,0.157556270138,0.0188679245283,0.0983652512615],
    [0.0108695652174,0.2,0.0,0.00200400801603,0.0112359550562,0.0,0.0,0.027027027027,0.0123456790123,1.0,0.00154151068047,0.0,0.0,1.0,1.0,0.0,0.0,0.0289389067571,0.0,0.0],
    [0.0869565217391,0.0,0.152542372881,0.0260521042084,0.0749063670412,0.00322061191626,0.0180722891566,0.108108108108,0.0987654320988,0.4,0.022241796961,0.2,0.0909090909091,0.0,0.40625,0.0,0.0,0.053054662388,0.0188679245283,0.129097937384],
    [0.0326086956522,0.2,0.0847457627119,0.0140280561122,0.0149812734082,0.000268384326355,0.0120481927711,0.027027027027,0.0246913580247,0.25,0.00352345298392,1.0,0.0,0.75,0.555555556,0.0,0.0,0.0192926045047,0.0188679245283,0.0983652512615],
    [0.0978260869565,0.0,0.0,0.0100200400802,0.0711610486891,0.00214707461084,0.00803212851406,0.027027027027,0.111111111111,0.265625,0.0262056815679,1.0,0.0,0.0,0.518518519,0.0,0.0,0.0568060021635,0.0566037735849,0.213107498008],
    [0.0760869565217,0.8,0.0,0.0180360721443,0.0936329588015,0.0,0.0120481927711,0.0810810810811,0.0864197530864,0.3333333335,0.0561550319313,0.0,0.0,0.863636364,0.342857143,0.5,0.333333333333,0.168121267841,0.169811320755,0.463705037033],
    [0.0978260869565,1.0,0.0,0.0100200400802,0.063670411985,0.00697799248524,0.0,0.135135135135,0.0740740740741,0.4166666665,0.0156353226162,0.0,0.0,0.949367089,0.333333333,0.25,0.266666666667,0.0316184351626,0.0566037735849,0.163932249402],
    [0.0326086956522,0.2,0.0,0.0380761523046,0.0374531835206,0.000805152979066,0.0281124497992,0.135135135135,0.037037037037,1.0,0.00836820083682,0.0,0.0,0.923076923,0.583333333,0.0,0.0,0.0562700964881,0.0188679245283,0.0491752486057],
    [0.0108695652174,0.0,0.0,0.0200400801603,0.00374531835206,0.0,0.0160642570281,0.0540540540541,0.0123456790123,1.0,0.000220215811495,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0188679245283,0.147540499867],
    [0.217391304348,0.0,0.0,0.0140280561122,0.295880149813,0.0365002683843,0.0100401606426,0.135135135135,0.123456790123,0.4487534625,0.183880202599,1.0,0.0909090909091,0.0,0.19375,0.0,0.0,0.191961414822,0.188679245283,0.287703974741],
    [0.0652173913043,0.0,0.0,0.0160320641283,0.0224719101124,0.00402576489533,0.0140562248996,0.027027027027,0.0740740740741,1.0,0.00132129486897,0.0,0.0,0.0,0.444444444,0.0,0.0,0.0,0.0188679245283,0.147540499867],
    [0.0326086956522,0.6,0.0,0.0100200400802,0.0411985018727,0.000268384326355,0.00200803212851,0.108108108108,0.0123456790123,0.25,0.00902884827131,1.0,0.0909090909091,0.971428571,0.75,0.25,0.133333333333,0.0594855305401,0.0566037735849,0.147540499867],
    [0.119565217391,0.2,0.0,0.0140280561122,0.0973782771536,0.0,0.0100401606426,0.0540540540541,0.135802469136,0.29,0.0398590618806,1.0,0.0,0.529411765,0.409090909,0.0,0.0,0.0723472668927,0.0188679245283,0.107306205553],
    [0.0326086956522,0.2,0.0,0.0100200400802,0.0262172284644,0.000268384326355,0.00200803212851,0.108108108108,0.037037037037,0.25,0.00638625853336,1.0,0.0,0.818181818,0.666666667,0.0,0.0,0.0401929260499,0.0188679245283,0.0983652512615],
    [0.173913043478,0.4,0.0,0.0300601202405,0.243445692884,0.020397208803,0.0,0.405405405405,0.16049382716,0.46,0.106364236952,1.0,0.0,0.725490196,0.311111111,0.0,0.0,0.136254019315,0.169811320755,0.230532031043],
    [0.163043478261,0.4,0.0,0.0180360721443,0.153558052434,0.0,0.0,0.243243243243,0.185185185185,0.3392857145,0.044924025545,1.0,0.0909090909091,0.725490196,0.225,0.25,0.133333333333,0.0594855305401,0.0377358490566,0.226223848446],
    [0.152173913043,0.6,0.0508474576271,0.0220440881764,0.10861423221,0.0228126677402,0.00602409638554,0.216216216216,0.135802469136,0.2884615385,0.0237833076415,1.0,0.0909090909091,0.759259259,0.321428571,0.0,0.0,0.0316949931128,0.0754716981132,0.189692820679],
    [0.29347826087,0.4,0.0,0.0160320641283,0.378277153558,0.0421363392378,0.0100401606426,0.0810810810811,0.185185185185,0.4123931625,0.283197533583,0.888888889,0.0909090909091,0.294117647,0.183760684,0.25,0.466666666667,0.220078599537,0.0754716981132,0.163932249402],
    [0.0326086956522,0.0,0.0,0.00400801603206,0.0112359550562,0.000805152979066,0.00401606425703,0.0,0.037037037037,0.75,0.000880863245981,0.0,0.0,0.0,0.666666667,0.0,0.0,0.0,0.0188679245283,0.147540499867],
    [0.597826086957,0.4,0.135593220339,0.0400801603206,0.397003745318,0.352388620505,0.0160642570281,0.324324324324,0.111111111111,0.4782763535,0.249504514424,1.0,0.181818181818,0.406593407,0.195454545,0.0,0.0,0.0922537270084,0.188679245283,0.273613857004]]
    )

# define the RBM model
random_state = 200
model = BernoulliRBM(n_components=10, n_iter=10, random_state=random_state)

# building RBM and creating RBM features
# Each column means one feature, each row means one line of the train data.
RBM_feature_data = model.fit_transform(train_data)
print(RBM_feature_data)

Thank you!
Masanari Kondo

-------------- next part -------------- An HTML attachment was scrubbed... URL: From abhishekraj10 at yahoo.com Fri Jul 28 13:01:25 2017 From: abhishekraj10 at yahoo.com (Abhishek Raj) Date: Fri, 28 Jul 2017 22:31:25 +0530 Subject: [scikit-learn] Are sample weights normalized? Message-ID: Hi, I am using one class svm for binary classification and was just curious what is the range/scale for sample weights? Are they normalized internally? For example -
Sample 1, weight - 1
Sample 2, weight - 10
Sample 3, weight - 100
Does this mean Sample 3 will always be predicted as positive and sample 1 will never be predicted as positive? What about sample 2?
Also, what would happen if I assigned a high weight to the majority of the
samples and low weights to the rest? E.g., if 80% of my samples were weighted
1000 and 20% were weighted 1.

A clarification, or a link to read up on how exactly weights affect the
training process, would be really helpful.

Thanks,
Abhishek
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From michael.eickenberg at gmail.com  Fri Jul 28 13:11:00 2017
From: michael.eickenberg at gmail.com (Michael Eickenberg)
Date: Fri, 28 Jul 2017 10:11:00 -0700
Subject: [scikit-learn] Are sample weights normalized?
In-Reply-To: 
References: 
Message-ID: 

Hi Abhishek,

think of your example as being equivalent to putting 1 copy of sample 1, 10
copies of sample 2, and 100 copies of sample 3 in a dataset and then running
your SVM. This is exactly true for some estimators and approximately true
for others, but it is always a good intuition.

Hope this helps!
Michael

On Fri, Jul 28, 2017 at 10:01 AM, Abhishek Raj via scikit-learn <
scikit-learn at python.org> wrote:

> Hi,
>
> I am using one class svm for binary classification and was just curious
> what is the range/scale for sample weights? Are they normalized internally?
> For example -
>
> Sample 1, weight - 1
> Sample 2, weight - 10
> Sample 3, weight - 100
>
> Does this mean Sample 3 will always be predicted as positive and sample 1
> will never be predicted as positive? What about sample 2?
>
> Also, what would happen if I assign a high weight to majority of the
> samples and low weights to the rest. Eg if 80% of my samples were weighted
> 1000 and 20% were weighted 1.
>
> A clarification or a link to read up on how exactly weights affect the
> training process would be really helpful.
>
> Thanks,
> Abhishek
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From abhishekraj10 at yahoo.com  Fri Jul 28 16:06:49 2017
From: abhishekraj10 at yahoo.com (Abhishek Raj)
Date: Sat, 29 Jul 2017 01:36:49 +0530
Subject: [scikit-learn] Are sample weights normalized?
In-Reply-To: 
References: 
Message-ID: 

Hi Michael, thanks for the response. Based on what you said, is it correct
to assume that weights are relative to the size of the dataset? E.g.:

if my dataset size is 200 and I have 1 of sample 1, 10 of sample 2, and 100
of sample 3, sample 3 will be given a lot of focus during training because
it is in the majority; but if my dataset size were, say, 1 million, these
weights wouldn't really affect much?

Thanks,
Abhishek

On Jul 28, 2017 10:41 PM, "Michael Eickenberg" wrote:

> Hi Abhishek,
>
> think of your example as being equivalent to putting 1 of sample 1, 10 of
> sample 2 and 100 of sample 3 in a dataset and then run your SVM.
> This is exactly true for some estimators and approximately true for
> others, but always a good intuition.
>
> Hope this helps!
> Michael
>
> On Fri, Jul 28, 2017 at 10:01 AM, Abhishek Raj via scikit-learn <
> scikit-learn at python.org> wrote:
>
>> Hi,
>>
>> I am using one class svm for binary classification and was just curious
>> what is the range/scale for sample weights? Are they normalized internally?
>> For example -
>>
>> Sample 1, weight - 1
>> Sample 2, weight - 10
>> Sample 3, weight - 100
>>
>> Does this mean Sample 3 will always be predicted as positive and sample 1
>> will never be predicted as positive? What about sample 2?
>>
>> Also, what would happen if I assign a high weight to majority of the
>> samples and low weights to the rest. Eg if 80% of my samples were weighted
>> 1000 and 20% were weighted 1.
>>
>> A clarification or a link to read up on how exactly weights affect the
>> training process would be really helpful.
>>
>> Thanks,
>> Abhishek
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From michael.eickenberg at gmail.com  Fri Jul 28 16:29:24 2017
From: michael.eickenberg at gmail.com (Michael Eickenberg)
Date: Fri, 28 Jul 2017 13:29:24 -0700
Subject: [scikit-learn] Are sample weights normalized?
In-Reply-To: 
References: 
Message-ID: 

Well, that will depend on how your estimator works. But in general you are
right - if you assume that samples 4 to N are weighted with the same weight
(e.g. 1) in both cases, then sample 3 will be relatively less important in
the larger training set.

On Fri, Jul 28, 2017 at 1:06 PM, Abhishek Raj via scikit-learn <
scikit-learn at python.org> wrote:

> Hi Michael, thanks for the response. Based on what you said, is it correct
> to assume that weights are relative to the size of the data set? Eg
>
> If my dataset size is 200 and I have 1 of sample 1, 10 of sample 2 and 100
> of sample 3, sample 3 will be given a lot of focus during training because
> it exists in majority, but if my dataset size was say 1 million, these
> weights wouldn't really affect much?
>
> Thanks,
> Abhishek
>
> On Jul 28, 2017 10:41 PM, "Michael Eickenberg" <
> michael.eickenberg at gmail.com> wrote:
>
>> Hi Abhishek,
>>
>> think of your example as being equivalent to putting 1 of sample 1, 10 of
>> sample 2 and 100 of sample 3 in a dataset and then run your SVM.
>> This is exactly true for some estimators and approximately true for
>> others, but always a good intuition.
>>
>> Hope this helps!
>> Michael
>>
>>
>> On Fri, Jul 28, 2017 at 10:01 AM, Abhishek Raj via scikit-learn <
>> scikit-learn at python.org> wrote:
>>
>>> Hi,
>>>
>>> I am using one class svm for binary classification and was just curious
>>> what is the range/scale for sample weights? Are they normalized internally?
>>> For example -
>>>
>>> Sample 1, weight - 1
>>> Sample 2, weight - 10
>>> Sample 3, weight - 100
>>>
>>> Does this mean Sample 3 will always be predicted as positive and sample
>>> 1 will never be predicted as positive? What about sample 2?
>>>
>>> Also, what would happen if I assign a high weight to majority of the
>>> samples and low weights to the rest. Eg if 80% of my samples were weighted
>>> 1000 and 20% were weighted 1.
>>>
>>> A clarification or a link to read up on how exactly weights affect the
>>> training process would be really helpful.
>>>
>>> Thanks,
>>> Abhishek
>>>
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>
>>>
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From yrohinkumar at gmail.com  Sun Jul 30 13:38:15 2017
From: yrohinkumar at gmail.com (Rohin Kumar)
Date: Sun, 30 Jul 2017 23:08:15 +0530
Subject: [scikit-learn] Nearest neighbor search with 2 distance measures
Message-ID: 

Dear all,

This is my first post on this forum. Maybe it is a feature request, or maybe
it is something I don't know how to get to work. My question is about the
BallTree algorithm with custom metrics.
I am working with a dataset for which I was calculating the two-point
correlation with one distance metric using the BallTree algorithm. Say:

import numpy as np
from sklearn.neighbors import BallTree

np.random.seed(0)
X = np.random.random((30, 3))
r = np.linspace(0, 1, 5)
tree = BallTree(X, metric='euclidean')
tree.two_point_correlation(X, r)

Now, I want to calculate the two-point correlation based on two different
metrics. Imagine I want to find the correlation based on the distances in
the XZ and YZ planes - grouping neighbors based on two distances instead of
one. Say I want to find the correlation within r1 and r2 bins based on two
different distance metrics, something like:

r1 = np.linspace(0, 1, 5)
r2 = np.linspace(0, 1, 5)
tree = BallTree(X, metric1='euclidean2D', metric2='euclidean2D')
tree.two_point_correlation(X, r1, r2)

How can I go about doing that? The goal is to get a contour plot of the
two-point correlation with r1 and r2 as axes.

Any help on this would be great!

Thanks in advance,
Rohin.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From AlexeyUm at yandex.ru  Sun Jul 30 13:40:21 2017
From: AlexeyUm at yandex.ru (Alexey Umnov)
Date: Sun, 30 Jul 2017 20:40:21 +0300
Subject: [scikit-learn] Nearest neighbor search with 2 distance measures
Message-ID: <379121501436421@mxfront4j.mail.yandex.net>

Hello! I am currently on vacation and will be back on August 15.

-- 
Alexey Umnov

From yrohinkumar at gmail.com  Sun Jul 30 14:18:29 2017
From: yrohinkumar at gmail.com (Rohin Kumar)
Date: Sun, 30 Jul 2017 23:48:29 +0530
Subject: [scikit-learn] Nearest neighbor search with 2 distance measures
In-Reply-To: <379121501436421@mxfront4j.mail.yandex.net>
References: <379121501436421@mxfront4j.mail.yandex.net>
Message-ID: 

*update*

Maybe it doesn't have to be done at the tree creation level. It could be
done using loops and creating two different ball trees.
Something like:

tree1 = BallTree(X, metric='metric1')  # for the x-z plane
tree2 = BallTree(X, metric='metric2')  # for the y-z plane

And then calculate the correlation functions in a loop to get tpcf(X, r1, r2)
using tree1.two_point_correlation(X, r1) and tree2.two_point_correlation(X, r2).
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From jakevdp at cs.washington.edu  Mon Jul 31 10:46:31 2017
From: jakevdp at cs.washington.edu (Jacob Vanderplas)
Date: Mon, 31 Jul 2017 07:46:31 -0700
Subject: [scikit-learn] Nearest neighbor search with 2 distance measures
In-Reply-To: 
References: <379121501436421@mxfront4j.mail.yandex.net>
Message-ID: 

On Sun, Jul 30, 2017 at 11:18 AM, Rohin Kumar wrote:

> *update*
>
> May be it doesn't have to be done at the tree creation level. It could be
> using loops and creating two different balltrees. Something like
>
> tree1=BallTree(X,metric='metric1') #for x-z plane
> tree2=BallTree(X,metric='metric2') #for y-z plane
>
> And then calculate correlation functions in a loop to get tpcf(X,r1,r2)
> using tree1.two_point_correlation(X,r1) and tree2.two_point_correlation(X,r2)
>
Hi Rohin,
It's not exactly clear to me what you wish the tree to do with the two
different metrics, but in any case the ball tree only supports one metric
at a time.
If you can construct your desired result from two ball trees each with its own metric, then that's probably the best way to proceed, Jake > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL:
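The two-tree approach Jake describes can be sketched concretely. This is only a sketch, not part of BallTree's API: instead of a custom "euclidean2D" metric (a hypothetical name from the question), each tree is built on a column projection of X, which gives the same planar distances with the built-in Euclidean metric; and because the two marginal pair counts alone do not determine the joint (r1, r2) counts needed for a contour plot, those are brute-forced with scipy, which is fine for a small dataset:

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.neighbors import BallTree

np.random.seed(0)
X = np.random.random((30, 3))
r = np.linspace(0, 1, 5)

# One tree per metric: project out the ignored coordinate and use an
# ordinary Euclidean tree on the remaining two columns
# (x-z plane and y-z plane, respectively).
X_xz = X[:, [0, 2]]
X_yz = X[:, [1, 2]]
tree_xz = BallTree(X_xz, metric='euclidean')
tree_yz = BallTree(X_yz, metric='euclidean')

# Marginal pair counts, one radius grid per metric.
counts_xz = tree_xz.two_point_correlation(X_xz, r)
counts_yz = tree_yz.two_point_correlation(X_yz, r)

# Joint (r1, r2) pair counts, brute-forced over all pairs.
d_xz = pdist(X_xz)  # condensed pairwise distances in the x-z plane
d_yz = pdist(X_yz)  # distances for the same pairs in the y-z plane
joint, _, _ = np.histogram2d(d_xz, d_yz, bins=[r, r])
# joint[i, j] counts pairs with x-z distance in [r[i], r[i+1]) and
# y-z distance in [r[j], r[j+1]) - the grid to feed a contour plot.
```

For large datasets the pdist step costs O(n^2) time and memory, so the pairwise computation would need chunking (or a dual-tree traversal) there; the two single-metric trees remain useful for the marginal counts either way.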