From yusuke.nishioka.0713 at gmail.com  Mon Nov  6 21:42:45 2017
From: yusuke.nishioka.0713 at gmail.com (Yusuke Nishioka)
Date: Tue, 7 Nov 2017 11:42:45 +0900
Subject: [scikit-learn] Question about dummy coding using DictVectorizer or
 FeatureHasher: generating correlated dimensions
Message-ID: <CAEnbh3Lqu+=PfxW-eCUmsJ5VqAgc18nqm8A4LfDrE71Mag6F8Q@mail.gmail.com>

Hello,

I have a question about dummy coding using DictVectorizer or FeatureHasher.

```
>>> from sklearn.feature_extraction import DictVectorizer, FeatureHasher
>>> D = [{'age': 23, 'gender': 'm'}, {'age': 34, 'gender': 'f'},
...      {'age': 18, 'gender': 'f'}, {'age': 50, 'gender': 'm'}]
>>> m1 = FeatureHasher(n_features=10)
>>> m1.fit_transform(D).toarray()
array([[  0.,   0.,  -1.,   0.,   0.,   0.,   0.,   0.,   0.,  23.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   1.,  34.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   1.,  18.],
       [  0.,   0.,  -1.,   0.,   0.,   0.,   0.,   0.,   0.,  50.]])
>>> m2 = DictVectorizer(sparse=False)
>>> m2.fit_transform(D)
array([[ 23.,   0.,   1.],
       [ 34.,   1.,   0.],
       [ 18.,   1.,   0.],
       [ 50.,   0.,   1.]])
>>> m2.feature_names_
['age', 'gender=f', 'gender=m']
```

Both DictVectorizer and FeatureHasher generate separate dimensions for
'gender=m' and 'gender=f', so these two dimensions are perfectly
(negatively) correlated.
This is because, by default, both classes generate n dimensions for the
n categorical values of a single feature.
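
To make the redundancy concrete, here is a minimal sketch in plain Python
that reuses the DictVectorizer output shown above. Dropping one column by
hand is only one possible workaround that I am assuming here; it is not an
option of DictVectorizer or FeatureHasher as far as I know.

```python
# DictVectorizer output copied from the session above;
# the columns are ['age', 'gender=f', 'gender=m'].
feature_names = ['age', 'gender=f', 'gender=m']
X = [[23., 0., 1.],
     [34., 1., 0.],
     [18., 1., 0.],
     [50., 0., 1.]]

f_col = feature_names.index('gender=f')
m_col = feature_names.index('gender=m')

# The two dummy columns always sum to 1, so either one determines the other.
redundant = all(row[f_col] + row[m_col] == 1.0 for row in X)

# Manual post-processing (an assumption of this sketch): drop one level
# to keep n-1 dummy columns for the n categorical values.
keep = [j for j, name in enumerate(feature_names) if name != 'gender=m']
X_reduced = [[row[j] for j in keep] for row in X]
reduced_names = [feature_names[j] for j in keep]
```

With the reduced matrix, 'gender=f' equal to 0 implicitly encodes 'gender=m'.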

My questions are as follows:

1. I expected them to generate n-1 dimensions for n categorical values.
   Is there any way to do this using DictVectorizer or FeatureHasher?
2. How should I handle these correlated dimensions?
   As I understand it, training on data with collinearity can make
   predictions unstable.
   Would L1 or L2 regularization help with this problem?

If there is any issue or article related to these questions,
would you please tell me the URL? Thank you.


Regards,
Yusuke

From gael.varoquaux at normalesup.org  Thu Nov  9 10:58:46 2017
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Thu, 9 Nov 2017 16:58:46 +0100
Subject: [scikit-learn] New core devs: Hanmin Qin, Guillaume Lemaître,
 and Roman Yurchak
Message-ID: <20171109155846.GF1150313@phare.normalesup.org>

Hi scikit-learn community,

A week ago, we added 3 core developers, but I think that we forgot to
announce it. So let me please welcome on board Hanmin Qin, Guillaume
Lemaître, and Roman Yurchak. They have been very active in the
development of the project, and very helpful in the review process. It's
a pleasure to see the team growing.

Gaël


From olivier.grisel at ensta.org  Thu Nov  9 11:36:22 2017
From: olivier.grisel at ensta.org (Olivier Grisel)
Date: Thu, 9 Nov 2017 17:36:22 +0100
Subject: [scikit-learn] New core devs: Hanmin Qin, Guillaume Lemaître,
 and Roman Yurchak
In-Reply-To: <20171109155846.GF1150313@phare.normalesup.org>
References: <20171109155846.GF1150313@phare.normalesup.org>
Message-ID: <CAFvE7K4G-5PfBk2aa7SXFefTsH75s9STsF4a1E_veYOnuW46OA@mail.gmail.com>

Congrats to all three of you! Thank you very much for your contributions,
and in particular for reviewing contributions by others.

-- 
Olivier

From jmschreiber91 at gmail.com  Fri Nov 10 03:34:56 2017
From: jmschreiber91 at gmail.com (Jacob Schreiber)
Date: Fri, 10 Nov 2017 00:34:56 -0800
Subject: [scikit-learn] New core devs: Hanmin Qin, Guillaume Lemaître,
 and Roman Yurchak
In-Reply-To: <CAFvE7K4G-5PfBk2aa7SXFefTsH75s9STsF4a1E_veYOnuW46OA@mail.gmail.com>
References: <20171109155846.GF1150313@phare.normalesup.org>
 <CAFvE7K4G-5PfBk2aa7SXFefTsH75s9STsF4a1E_veYOnuW46OA@mail.gmail.com>
Message-ID: <CA+ad8EtfTRL6HOjjgi8Kocwm+LPARTsvNmMpWmOc4f_bcMPMcw@mail.gmail.com>

Congrats! Welcome to the team, and thanks for your hard work so far.

On Thu, Nov 9, 2017 at 8:36 AM, Olivier Grisel <olivier.grisel at ensta.org>
wrote:

> Congrats to all three of you! Thank you very much for your contributions
> and in particular in reviewing contributions by others.
>
> --
> Olivier
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>

From shane.grigsby at colorado.edu  Tue Nov 14 18:27:12 2017
From: shane.grigsby at colorado.edu (Shane Grigsby)
Date: Tue, 14 Nov 2017 16:27:12 -0700
Subject: [scikit-learn] Custom Distance Metric / Distance Matrix with
 K-means?
Message-ID: <20171114232712.hkew6wjy2drarl7n@espgs-MacBook-Pro.local>

Hello,
I'd like to cluster data with either KMeans or MiniBatchKMeans on a
toroidal geometry. I know that if I were using DBSCAN I could pass in a
precomputed distance matrix to do this; if I were using OPTICS I could pass
a 'metric' keyword and specify a custom distance metric. Is this possible
for KMeans / MiniBatchKMeans? I don't see distance metrics documented as
keyword arguments... but perhaps they're allowed as **kwargs that are
passed to the underlying distance calculation call?
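
For concreteness, the kind of distance I mean could be sketched as below.
This is plain Python for illustration only; the function names and the
example periods (360 by 180) are made up for this sketch.

```python
import math

def toroidal_distance(p, q, periods):
    """Euclidean distance on a flat torus: each axis wraps at its period."""
    total = 0.0
    for a, b, period in zip(p, q, periods):
        d = abs(a - b) % period
        d = min(d, period - d)      # take the shorter way around the axis
        total += d * d
    return math.sqrt(total)

def distance_matrix(points, periods):
    """Full pairwise matrix, the form a precomputed-distance option expects."""
    return [[toroidal_distance(p, q, periods) for q in points] for p in points]

points = [(0.5, 0.5), (359.5, 0.5), (180.0, 90.0)]
periods = (360.0, 180.0)
D = distance_matrix(points, periods)
```

Here the first two points are 1 unit apart across the wrap-around, not 359.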
Thanks,
Shane

-- 
*PhD candidate & Research Assistant*
*Cooperative Institute for Research in Environmental Sciences (CIRES)*
*University of Colorado at Boulder*

From joel.nothman at gmail.com  Tue Nov 14 18:50:16 2017
From: joel.nothman at gmail.com (Joel Nothman)
Date: Wed, 15 Nov 2017 10:50:16 +1100
Subject: [scikit-learn] Custom Distance Metric / Distance Matrix with
 K-means?
In-Reply-To: <20171114232712.hkew6wjy2drarl7n@espgs-MacBook-Pro.local>
References: <20171114232712.hkew6wjy2drarl7n@espgs-MacBook-Pro.local>
Message-ID: <CAAkaFLU_-gh58f2dOAd_Ky-YEzAsv1LEpiDFEHVpsQdi3m65xg@mail.gmail.com>

No, it's not applicable to KMeans. There are related algorithms that
support custom metrics, e.g. K Medoids (a pull request to scikit-learn is
here https://github.com/scikit-learn/scikit-learn/pull/7694 but
implementations exist in other libraries). Cheers, Joel
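
As an illustration only (this is not the code in the pull request), a naive
alternating k-medoids on a precomputed distance matrix could look roughly
like the sketch below. The function name, the first-k initialisation, and
the toy circular data are assumptions made for this example.

```python
def k_medoids(dist, k, max_iter=100):
    """Naive alternating k-medoids on a precomputed n x n distance matrix.

    Returns (medoid_indices, labels). Starting from the first k points is a
    simplification chosen for this sketch.
    """
    n = len(dist)
    medoids = list(range(k))

    def assign(current):
        # Each point joins the cluster of its nearest medoid.
        return [min(range(k), key=lambda c: dist[i][current[c]])
                for i in range(n)]

    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:          # empty cluster: keep the old medoid
                new_medoids.append(medoids[c])
                continue
            # New medoid: the member with the smallest total distance
            # to all members of its cluster.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(medoids)


# Toy data on a circle of circumference 10, so 0.1 and 9.9 are neighbours.
xs = [0.1, 0.3, 9.8, 9.9, 5.0, 5.2, 4.9]
circ = lambda a, b: min(abs(a - b), 10 - abs(a - b))
dist = [[circ(a, b) for b in xs] for a in xs]
medoids, labels = k_medoids(dist, k=2)
```

Because only the distance matrix is used, any metric (including a toroidal
one) can be plugged in, which is what KMeans itself does not allow.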

From timo.erkkila at gmail.com  Wed Nov 15 00:06:21 2017
From: timo.erkkila at gmail.com (Timo Erkkilä)
Date: Wed, 15 Nov 2017 07:06:21 +0200
Subject: [scikit-learn] Custom Distance Metric / Distance Matrix with
 K-means?
In-Reply-To: <CAAkaFLU_-gh58f2dOAd_Ky-YEzAsv1LEpiDFEHVpsQdi3m65xg@mail.gmail.com>
References: <20171114232712.hkew6wjy2drarl7n@espgs-MacBook-Pro.local>
 <CAAkaFLU_-gh58f2dOAd_Ky-YEzAsv1LEpiDFEHVpsQdi3m65xg@mail.gmail.com>
Message-ID: <CAHVpoQThFhiHfC6tNez9ieyCknUMU0h=yAwCwMZFe5KZmkevfA@mail.gmail.com>

Shall we finish that PR? :) I would have time to work on it again. I recall
the only work left is to ensure the code works with the latest sklearn
version.

-Timo

From joel.nothman at gmail.com  Wed Nov 15 00:17:18 2017
From: joel.nothman at gmail.com (Joel Nothman)
Date: Wed, 15 Nov 2017 16:17:18 +1100
Subject: [scikit-learn] Custom Distance Metric / Distance Matrix with
 K-means?
In-Reply-To: <CAHVpoQThFhiHfC6tNez9ieyCknUMU0h=yAwCwMZFe5KZmkevfA@mail.gmail.com>
References: <20171114232712.hkew6wjy2drarl7n@espgs-MacBook-Pro.local>
 <CAAkaFLU_-gh58f2dOAd_Ky-YEzAsv1LEpiDFEHVpsQdi3m65xg@mail.gmail.com>
 <CAHVpoQThFhiHfC6tNez9ieyCknUMU0h=yAwCwMZFe5KZmkevfA@mail.gmail.com>
Message-ID: <CAAkaFLXiyiVXG8o7oPMpJpV7+CTyfLFTsk7UG-aHR99nHmNyeA@mail.gmail.com>

There was certainly not much more to do on #7694, but that's where Kornel
Kiełczewski had taken on completing your work. I suppose you could take it
back again!

From shiduan at ucdavis.edu  Thu Nov 16 03:18:30 2017
From: shiduan at ucdavis.edu (Shiheng Duan)
Date: Thu, 16 Nov 2017 00:18:30 -0800
Subject: [scikit-learn] Issue with silhouette_samples
Message-ID: <CAJgygqPJQdDa1tkPY_kQC0YwxVixF5GwVr0yQ_+zRd_NouuYUg@mail.gmail.com>

Hi all,

I am doing clustering work and want to use the silhouette score to
determine the number of clusters. But I got a MemoryError when executing
silhouette_samples. I searched for it and found something related to numpy,
but I cannot reproduce the numpy error. Is there any solution to it?

The data is 621*1405*12.

Thanks!

From l.lomasto at innovationengineering.eu  Thu Nov 16 04:14:02 2017
From: l.lomasto at innovationengineering.eu (Luigi Lomasto)
Date: Thu, 16 Nov 2017 10:14:02 +0100
Subject: [scikit-learn] Issue with silhouette_samples
In-Reply-To: <CAJgygqPJQdDa1tkPY_kQC0YwxVixF5GwVr0yQ_+zRd_NouuYUg@mail.gmail.com>
References: <CAJgygqPJQdDa1tkPY_kQC0YwxVixF5GwVr0yQ_+zRd_NouuYUg@mail.gmail.com>
Message-ID: <609B3BDB-B8CA-4C82-A075-F9351FF0627E@innovationengineering.eu>

Hi Shiheng,

You could take a look at this issue: https://github.com/biolab/orange3/issues/1502

You have a 3-D problem, right? For each feature you have 12 values, so your RAM is probably too small. How much RAM does your PC have?
Let me know,

Luigi


From shiduan at ucdavis.edu  Thu Nov 16 13:46:10 2017
From: shiduan at ucdavis.edu (Shiheng Duan)
Date: Thu, 16 Nov 2017 10:46:10 -0800
Subject: [scikit-learn] Issue with silhouette_samples
In-Reply-To: <609B3BDB-B8CA-4C82-A075-F9351FF0627E@innovationengineering.eu>
References: <CAJgygqPJQdDa1tkPY_kQC0YwxVixF5GwVr0yQ_+zRd_NouuYUg@mail.gmail.com>
 <609B3BDB-B8CA-4C82-A075-F9351FF0627E@innovationengineering.eu>
Message-ID: <CAJgygqPs9OHnpF2DRisuh3MNaW=CjhBtBhz+gDYkdWgCcQUywA@mail.gmail.com>

Hi Luigi,

Actually, my data has 621*1405 points, and each point has 12 features. I
reshaped it into a 2-D array, and k-means works well. The last time I ran
it, it used 64 GB of RAM on a cluster. I don't know how much more RAM I can
use.
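
A rough back-of-the-envelope estimate, assuming that silhouette_samples
builds the full pairwise distance matrix in float64 (I have not verified
this in the source, so treat it as an assumption):

```python
n_samples = 621 * 1405            # 872,505 points
n_features = 12
bytes_per_float64 = 8

# Memory for the data itself versus a dense n x n distance matrix.
data_bytes = n_samples * n_features * bytes_per_float64
pairwise_bytes = n_samples ** 2 * bytes_per_float64

print(f"data matrix:     {data_bytes / 1e9:.2f} GB")       # 0.08 GB
print(f"pairwise matrix: {pairwise_bytes / 1e12:.2f} TB")  # 6.09 TB
```

If that assumption holds, the distance matrix alone would be far beyond
64 GB, which would explain the MemoryError even though k-means runs fine.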

BTW, issue 1502 is about Orange. Does the same apply to sklearn?

Thanks.

From joel.nothman at gmail.com  Thu Nov 16 15:44:46 2017
From: joel.nothman at gmail.com (Joel Nothman)
Date: Fri, 17 Nov 2017 07:44:46 +1100
Subject: [scikit-learn] Issue with silhouette_samples
In-Reply-To: <CAJgygqPs9OHnpF2DRisuh3MNaW=CjhBtBhz+gDYkdWgCcQUywA@mail.gmail.com>
References: <CAJgygqPJQdDa1tkPY_kQC0YwxVixF5GwVr0yQ_+zRd_NouuYUg@mail.gmail.com>
 <609B3BDB-B8CA-4C82-A075-F9351FF0627E@innovationengineering.eu>
 <CAJgygqPs9OHnpF2DRisuh3MNaW=CjhBtBhz+gDYkdWgCcQUywA@mail.gmail.com>
Message-ID: <CAAkaFLXemmy0tHxoTTM1Z0t9NkJn3nmuX82Q-xRVp5cNmiR-7A@mail.gmail.com>

https://github.com/scikit-learn/scikit-learn/pull/7177 makes silhouette
more memory-efficient. Try that branch?

From info at orges-leka.de  Sat Nov 25 13:34:42 2017
From: info at orges-leka.de (Orges Leka)
Date: Sat, 25 Nov 2017 19:34:42 +0100
Subject: [scikit-learn] Rapid Outlier Detection via Sampling
Message-ID: <CAFKtZkPWij0+KQ8tx1JWfy1LayPz2UvU+ARiFmVJwfW9TUmPMg@mail.gmail.com>

Dear scikit-learn Developers,

My name is Orges Leka, and I would like to implement
"Rapid Outlier Detection via Sampling" [1] in scikit-learn.
An R implementation by the authors of the method is already available [2].

In Python I have not seen any implementation yet. The method is very simple
yet effective, as the authors show. First, one selects, say, 20 points at
random. Then, for every other point, one computes the shortest distance to
these 20 points; this distance is the outlier score of that point.
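
A minimal sketch of the idea in plain Python (Python 3.8+ for math.dist).
The function name, the default seed, and the handling of the sampled points
themselves are my assumptions for this illustration, not taken from [1] or
[2].

```python
import math
import random

def sampling_outlier_scores(points, sample_size=20, seed=0):
    """Score each point by its distance to the nearest point of a random sample.

    A sampled point is compared with the other sampled points only, so that
    it does not get a trivial score of 0 (an assumption of this sketch).
    Requires sample_size >= 2.
    """
    rng = random.Random(seed)
    sample_idx = rng.sample(range(len(points)), sample_size)
    scores = []
    for i, p in enumerate(points):
        scores.append(min(math.dist(p, points[j])
                          for j in sample_idx if j != i))
    return scores

# 20 inliers on a small grid plus one obvious outlier at (10, 10).
points = [(x / 4, y / 3) for x in range(5) for y in range(4)] + [(10.0, 10.0)]
scores = sampling_outlier_scores(points, sample_size=5)
most_outlying = max(range(len(points)), key=scores.__getitem__)
```

Swapping math.dist for another distance function would give the other
metrics mentioned below.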

It would be nice to implement this with different metrics / distances
(Euclidean, Manhattan, or other metrics).

How would I start the implementation? I have already git-cloned
scikit-learn on my pc. Do I need to write object oriented or are functions
also ok?

If this succeeds, I would also like to extend the "example-outliers" doc
with the above method.

Kind regards
Dipl. Math. Orges Leka

[1]
https://papers.nips.cc/paper/5127-rapid-distance-based-outlier-detection-via-sampling.pdf
[2] https://github.com/mahito-sugiyama/sampling-outlier-detection

From gael.varoquaux at normalesup.org  Sat Nov 25 14:28:24 2017
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Sat, 25 Nov 2017 20:28:24 +0100
Subject: [scikit-learn] Rapid Outlier Detection via Sampling
In-Reply-To: <CAFKtZkPWij0+KQ8tx1JWfy1LayPz2UvU+ARiFmVJwfW9TUmPMg@mail.gmail.com>
References: <CAFKtZkPWij0+KQ8tx1JWfy1LayPz2UvU+ARiFmVJwfW9TUmPMg@mail.gmail.com>
Message-ID: <20171125192824.GG3969112@phare.normalesup.org>

Dear Orges,

I can see only 33 citations on Google Scholar for this paper.

As detailed in the inclusion criteria of scikit-learn:
http://scikit-learn.org/stable/faq.html#what-are-the-inclusion-criteria-for-new-algorithms
I am afraid that we need many more citations to include this algorithm.

However, you could submit it for inclusion to scikit-learn-contrib:
http://contrib.scikit-learn.org/

Best,

Gaël

-- 
    Gael Varoquaux
    Researcher, INRIA Parietal
    NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
    Phone:  ++ 33-1-69-08-79-68
    http://gael-varoquaux.info            http://twitter.com/GaelVaroquaux

From olivier.grisel at ensta.org  Mon Nov 27 03:45:22 2017
From: olivier.grisel at ensta.org (Olivier Grisel)
Date: Mon, 27 Nov 2017 09:45:22 +0100
Subject: [scikit-learn] Rapid Outlier Detection via Sampling
In-Reply-To: <20171125192824.GG3969112@phare.normalesup.org>
References: <CAFKtZkPWij0+KQ8tx1JWfy1LayPz2UvU+ARiFmVJwfW9TUmPMg@mail.gmail.com>
 <20171125192824.GG3969112@phare.normalesup.org>
Message-ID: <CAFvE7K7xzpqfJgCU6OCZ8octnF9N0XUM9CDQ0+mDfmLbyKWrBw@mail.gmail.com>

> Do I need to write object oriented or are functions also ok?

If you want to contribute an implementation as a new project on scikit-learn
contrib, you should be careful to follow the scikit-learn estimators API:

http://scikit-learn.org/dev/developers/contributing.html#apis-of-scikit-learn-objects

For outlier detection in particular, you should make sure your new
estimator is consistent with the API conventions of other methods already
in scikit-learn:

http://scikit-learn.org/dev/modules/outlier_detection.html

One of the primary goals of the scikit-learn ecosystem is to provide a
simple homogeneous API to a very heterogeneous set of methods.
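
As a rough sketch of these conventions (plain Python with no scikit-learn
import; a real contribution would presumably subclass
sklearn.base.BaseEstimator), an outlier detector could be shaped like this.
The class name and its parameters are hypothetical.

```python
import math
import random

class SamplingOutlierDetector:
    """Hypothetical estimator following the fit / decision_function / predict pattern."""

    def __init__(self, n_samples=20, threshold=1.0, random_state=0):
        # __init__ only stores hyper-parameters; no work is done here.
        self.n_samples = n_samples
        self.threshold = threshold
        self.random_state = random_state

    def fit(self, X, y=None):
        rng = random.Random(self.random_state)
        # Attributes learned from data end with an underscore.
        self.sample_ = rng.sample(list(X), min(self.n_samples, len(X)))
        return self                  # fit returns self so calls can be chained

    def decision_function(self, X):
        # Positive for inliers, negative for outliers.
        return [self.threshold - min(math.dist(x, s) for s in self.sample_)
                for x in X]

    def predict(self, X):
        # +1 for inliers, -1 for outliers, as in the existing outlier detectors.
        return [1 if score >= 0 else -1 for score in self.decision_function(X)]


train = [(x / 4, y / 3) for x in range(5) for y in range(4)]
detector = SamplingOutlierDetector(n_samples=5, threshold=1.0).fit(train)
preds = detector.predict([(0.5, 0.5), (10.0, 10.0)])
```

So to answer the question: a class with this shape, rather than a bare
function, is what fits in with the other estimators.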

-- 
Olivier

From jeff1evesque at yahoo.com  Mon Nov 27 18:26:26 2017
From: jeff1evesque at yahoo.com (Jeffrey Levesque)
Date: Mon, 27 Nov 2017 18:26:26 -0500
Subject: [scikit-learn] Jeff Levesque: sklearn + D3JS
Message-ID: <67A950D6-C54E-4353-AE28-A6D03EAE4AEC@yahoo.com>

Hi,
I'm developing an API for sklearn:

- https://github.com/jeff1evesque/machine-learning

I was wondering if anyone had integrated visualization tools, like D3JS, with results from sklearn predictions? If so, would any of you be willing to show how the backend results were piped into JavaScript?
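
To make the question concrete, the kind of hand-off I have in mind is roughly the following. This is only a sketch: the function name and the record fields are hypothetical, and the labels and probabilities would come from something like a fitted model's predictions.

```python
import json

def predictions_to_json(labels, probabilities):
    """Turn per-sample predictions into a JSON array of records for the front end."""
    records = [
        {"index": i, "label": label, "probability": round(prob, 4)}
        for i, (label, prob) in enumerate(zip(labels, probabilities))
    ]
    return json.dumps(records)

# Example values standing in for real model output.
payload = predictions_to_json(["setosa", "virginica"], [0.97, 0.61])
```

The resulting string could then be served from an HTTP endpoint and loaded on the JavaScript side, for example with d3.json.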

PS. If anyone is willing to contribute, or help in any way, the codebase is BSD.

Thank you,

Jeff Levesque
https://github.com/jeff1evesque

From info at orges-leka.de  Tue Nov 28 03:04:07 2017
From: info at orges-leka.de (Orges Leka)
Date: Tue, 28 Nov 2017 09:04:07 +0100
Subject: [scikit-learn] Rapid Outlier Detection via Sampling
Message-ID: <CAFKtZkPV=SXW+y=-vbRpGER4mko8seTY1ss_7jEft03psDOzhg@mail.gmail.com>

Dear Olivier and Gael ,

Thank you for your answer. I started a request for inclusion in
scikit-learn-contrib.
The repo can be found here:
https://github.com/orgesleka/rapid-outlier-detection


Kind regards
Orges Leka


From info at orges-leka.de  Thu Nov 30 03:21:12 2017
From: info at orges-leka.de (Orges Leka)
Date: Thu, 30 Nov 2017 09:21:12 +0100
Subject: [scikit-learn] Webservice that uses scikit-learn
Message-ID: <CAFKtZkN3nXW-4JQpv6CXvZPyR1=nctZHGgi3zN2OdHdKKGSP_g@mail.gmail.com>

Dear scikit-learn developers,

I have developed a small webservice which can hold multiple scikit-learn
models and serve JSON POST requests for prediction.
A model must have model.metadata and must implement
model.transform_predict(newdata). There are two examples:
- BostonModel, where only predict is overridden from WebModel
- IrisModel, where predict and transform are overridden from WebModel

The idea is that, while fitting a model, you may produce some metadata that
is needed for prediction. This metadata is stored as a Python dictionary.
It could hold, for example:
- the version of the model
- when it was created
- additional pandas.DataFrames needed for prediction
- constants needed in the predict computation
- metrics about the model, etc.
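
The contract described above could be sketched as follows. This is a
simplified illustration and not the code in the repo: the request format,
handle_request, and MeanModel are made up for this example.

```python
import json

class WebModel:
    """Simplified base class: a fitted model plus the metadata needed to use it."""

    def __init__(self, model, metadata):
        self.model = model
        self.metadata = metadata     # plain dict: version, creation date, ...

    def transform(self, newdata):
        return newdata               # default: no preprocessing

    def predict(self, features):
        return self.model.predict(features)

    def transform_predict(self, newdata):
        return self.predict(self.transform(newdata))


class MeanModel:
    """Stand-in for a fitted scikit-learn estimator (anything with .predict)."""

    def predict(self, rows):
        return [sum(row) / len(row) for row in rows]


def handle_request(models, body):
    """Dispatch a JSON body of the assumed form {"model": name, "data": rows}."""
    request = json.loads(body)
    web_model = models[request["model"]]
    return json.dumps({
        "version": web_model.metadata["version"],
        "prediction": web_model.transform_predict(request["data"]),
    })

models = {"demo": WebModel(MeanModel(), {"version": "0.1"})}
response = handle_request(models, '{"model": "demo", "data": [[1, 2, 3]]}')
```

A subclass would override transform and/or predict, in the same way that
BostonModel and IrisModel do.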

The repo can be found here:

https://github.com/orgesleka/webscikit

It comes with two examples: iris and boston. The server can load other
models at runtime, in case one is changing the models.

The repo is meant as a proof of concept. If somebody has ideas on how to
improve things or add new features, that would be great.

To get started, see:
https://github.com/orgesleka/webscikit/wiki/Getting-started

Kind regards
Orges Leka