From dmoisset at machinalis.com Mon Aug 1 12:15:14 2016 From: dmoisset at machinalis.com (Daniel Moisset) Date: Mon, 1 Aug 2016 17:15:14 +0100 Subject: [scikit-learn] Is there any official position on PEP484/mypy? In-Reply-To: <20160729195718.GO787902@phare.normalesup.org> References: <20160728164339.GD2110660@phare.normalesup.org> <598b3780-5b3d-2eb8-7e57-da3856026d0b@gmail.com> <014c8cb1-8997-67a9-3d6a-f0b94c63b7ff@gmail.com> <20160729195718.GO787902@phare.normalesup.org> Message-ID: On Fri, Jul 29, 2016 at 8:57 PM, Gael Varoquaux < gael.varoquaux at normalesup.org> wrote: > > Can you summarize once again in very simple terms what would be the big > benefits? > Benefits for regular scikit-learn users 1. Reliable information on method signatures in a standardized way ("reliable" in the sense of "automatically verified") 2. Better integration with tools supporting PEP-484 (editors, documentation tools). This is a small set now, but I expect it to grow (and it's also a chicken-and-egg problem, support has to start somewhere) Benefits for scikit-learn users also using mypy and/or PEP-484 (probably not a large set, but I know a few people :) ) 0. Same as the rest of the users 1. Early detection of errors in own code while writing code based on SKL 2. Making own code more readable/explicit by annotating functions that receive/return SKL types (and verifying those annotations) Benefits for scikit-learn developers 1. Some extra checks that changes keep internal consistency 2. (Future) possible simplification of typing information in docstrings, which would become redundant (this would require updating doc generators) Regarding the cost for contributing, a scenario where you get a CI error due to mypy would be because: * the change in the code somewhat changed the existing accepted/returned types, which is a change in the API and should actually be verified * the change in the code extended the signature of an existing function (what Andreas mentioned); in this situation it's similar to a PR that adds an argument and doesn't update the docstring (only that this is automatically caught). WRT the second issue, the error here might be confusing when using the "one line" syntax because arguments may "misalign" with their signatures. The multiline version (or the python3-only form) is safer in that sense (in fact, adding an argument there will not produce a CI problem because it's unannotated and assumed to be "any type"). Adding new modules/methods without annotations wouldn't produce an error, just an incompleteness in the annotations. A possible source of problems like the one you mention is that the implementation of the annotated methods will be checked, and sometimes you'll get a warning about a local variable if mypy can't infer its type (it happens sometimes when assigning an empty list to a local, where mypy knows that it's a list but doesn't know the element type). But in that case I think the message you get is very obvious. -- Daniel F. Moisset - UK Country Manager www.machinalis.com Skype: @dmoisset -------------- next part -------------- An HTML attachment was scrubbed... URL:
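For illustration, a minimal sketch of the two comment-based annotation styles discussed above (the function and its types are hypothetical, not an actual scikit-learn signature), checkable by mypy on both Python 2.7 and 3.x:

    from typing import List, Optional

    # "One line" style: the whole signature lives in a single type comment, so a
    # newly added argument can silently misalign with the listed types.
    def fit_transform(X, y=None, copy=True):
        # type: (List[List[float]], Optional[List[int]], bool) -> List[List[float]]
        return X

    # Multiline / per-argument style: each parameter carries its own comment, and a
    # parameter added without a comment is simply treated as Any by mypy.
    def fit_transform_safer(X,          # type: List[List[float]]
                            y=None,     # type: Optional[List[int]]
                            copy=True,  # type: bool
                            ):
        # type: (...) -> List[List[float]]
        return X
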
From luizfgoncalves at dcc.ufmg.br Mon Aug 1 15:55:27 2016 From: luizfgoncalves at dcc.ufmg.br (luizfgoncalves at dcc.ufmg.br) Date: Mon, 1 Aug 2016 16:55:27 -0300 Subject: [scikit-learn] Install sklearn into a specific folder to make some changes Message-ID: <12fb1f9a9aeec248ae4e7476879d8da6.squirrel@webmail.dcc.ufmg.br> I'm looking for the best way to install sklearn into a specific folder so I can make changes for my work, without worrying about bugging my main sklearn installation (as I use the default version for some experiments too). I tried to clone the git repository and use "python setup.py install", but I'm afraid it will change my user installation too. Right now, what I want is to edit a file called splitter.pyx (on tree folder), compile/install sklearn so it will work with my changes, and test it. What is the best way to do it without causing problems with my main sklearn installation? Thanks a lot for your attention From t3kcit at gmail.com Mon Aug 1 16:08:44 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Mon, 1 Aug 2016 16:08:44 -0400 Subject: [scikit-learn] Install sklearn into a specific folder to make some changes In-Reply-To: <12fb1f9a9aeec248ae4e7476879d8da6.squirrel@webmail.dcc.ufmg.br> References: <12fb1f9a9aeec248ae4e7476879d8da6.squirrel@webmail.dcc.ufmg.br> Message-ID: <10a9c6f8-62b4-dc26-fb6a-336aaafeb286@gmail.com> Hi. The best is probably to use a virtual environment or conda environment specific for this changed version of scikit-learn. In that environment you could just run an "install" and it would not mess with your other environments. If you don't want to go that way, you can also do ``python setup.py build_ext -i`` to build inplace and then add this path to your python path (PYTHONPATH environment variable or sys.path.insert in the script or many other ways). Best, Andy On 08/01/2016 03:55 PM, luizfgoncalves at dcc.ufmg.br wrote: > I'm looking for the best way to install sklearn into a specific folder so > I can make changes for my work, without worrying about bugging my main > sklearn installation (as I use the default version for some experiments > too). > > I tried to clone the git repository and use "python setup.py install", but > I'm afraid it will change my user installation too. > > Right now, what I want is to edit a file called splitter.pyx (on tree > folder), compile/install sklearn so it will work with my changes, and test > it. > > What is the best way to do it without causing problems with my main > sklearn installation? > > Thanks a lot for your attention > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From michael.eickenberg at gmail.com Mon Aug 1 16:15:30 2016 From: michael.eickenberg at gmail.com (Michael Eickenberg) Date: Mon, 1 Aug 2016 22:15:30 +0200 Subject: [scikit-learn] Install sklearn into a specific folder to make some changes In-Reply-To: <12fb1f9a9aeec248ae4e7476879d8da6.squirrel@webmail.dcc.ufmg.br> References: <12fb1f9a9aeec248ae4e7476879d8da6.squirrel@webmail.dcc.ufmg.br> Message-ID: There are several ways of achieving this. One is to build scikit-learn in place by going into the sklearn clone and typing "make in", or alternatively python setup.py build_ext --inplace # (i think) Then you can use the environment variable PYTHONPATH, set to the github clone, and python will give precedence to the clone whenever the variable is set.
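As a rough sketch of that PYTHONPATH/sys.path route (the clone path below is only a placeholder), after an in-place build you can make Python prefer the checkout over the installed package:

    import sys

    # Put the in-place-built clone ahead of any installed scikit-learn;
    # equivalent to exporting PYTHONPATH=/path/to/scikit-learn before starting Python.
    sys.path.insert(0, "/path/to/scikit-learn")

    import sklearn
    print(sklearn.__version__)  # should report the development version
    print(sklearn.__file__)     # should point into the clone, not site-packages
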
As an alternative, you can install your clone using python setup.py develop and then work on a branch. Checkout master and rebuild whenever you need it. This would entail working on the same clone for master and development (so your builtin default sklearn would be overrridden) hth, Michael On Monday, August 1, 2016, wrote: > I'm looking for the best way to install sklearn into a specific folder so > I can make changes for my work, without worrying about bugging my main > sklearn installation (as I use the default version for some experiments > too). > > I tried to clone the git repository and use "python setup.py install", but > I'm afraid it will change my user installation too. > > Right now, what I want is to edit a file called splitter.pyx (on tree > folder), compile/install sklearn so it will work with my changes, and test > it. > > What is the best way to do it without causing problems with my main > sklearn installation? > > Thanks a lot for your attention > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mail at sebastianraschka.com Mon Aug 1 16:09:39 2016 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Mon, 1 Aug 2016 16:09:39 -0400 Subject: [scikit-learn] Install sklearn into a specific folder to make some changes In-Reply-To: <12fb1f9a9aeec248ae4e7476879d8da6.squirrel@webmail.dcc.ufmg.br> References: <12fb1f9a9aeec248ae4e7476879d8da6.squirrel@webmail.dcc.ufmg.br> Message-ID: Hi, I would highly recommend you to work with virtual environments here. E.g., look into Anaconda/Miniconda (http://conda.pydata.org/miniconda.html, http://conda.pydata.org/docs/using/using.html), which makes this process most convenient in my opinion. Alternatively, I would use Python?s virtualenv (http://docs.python-guide.org/en/latest/dev/virtualenvs/). Best, Sebastian > On Aug 1, 2016, at 3:55 PM, luizfgoncalves at dcc.ufmg.br wrote: > > I'm looking for the best way to install sklearn into a specific folder so > I can make changes for my work, without worrying about bugging my main > sklearn installation (as I use the default version for some experiments > too). > > I tried to clone the git repository and use "python setup.py install", but > I'm afraid it will change my user installation too. > > Right now, what I want is to edit a file called splitter.pyx (on tree > folder), compile/install sklearn so it will work with my changes, and test > it. > > What is the best way to do it without causing problems with my main > sklearn installation? > > Thanks a lot for your attention > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From michael.eickenberg at gmail.com Mon Aug 1 16:17:08 2016 From: michael.eickenberg at gmail.com (Michael Eickenberg) Date: Mon, 1 Aug 2016 22:17:08 +0200 Subject: [scikit-learn] Install sklearn into a specific folder to make some changes In-Reply-To: <10a9c6f8-62b4-dc26-fb6a-336aaafeb286@gmail.com> References: <12fb1f9a9aeec248ae4e7476879d8da6.squirrel@webmail.dcc.ufmg.br> <10a9c6f8-62b4-dc26-fb6a-336aaafeb286@gmail.com> Message-ID: On Monday, August 1, 2016, Andreas Mueller wrote: > Hi. > The best is probably to use a virtual environment or conda environment > specific for this changed version of scikit-learn. 
> In that environment you could just run an "install" and it would not mess > with your other environments. +1! > If you don't want to go that way, you can also do ``python setup.py > build_ext -i`` to build inplace and then add this > path to your python path (PYTONPATH environment variable or > sys.path.insert in the script or many other ways). > > Best, > Andy > > On 08/01/2016 03:55 PM, luizfgoncalves at dcc.ufmg.br wrote: > >> I'm looking for the best way to install sklearn into a specific folder so >> I can make changes for my work, without worrying about bugging my main >> sklearn installation (as I use the default version for some experiments >> too). >> >> I tried to clone the git repository and use "python setup.py install", but >> I'm afraid it will change my user installation too. >> >> Right now, what I want is to edit a file called splitter.pyx (on tree >> folder), compile/install sklearn so it will work with my changes, and test >> it. >> >> What is the best way to do it without causing problems with my main >> sklearn installation? >> >> Thanks a lot for your attention >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Dale.T.Smith at macys.com Tue Aug 2 08:34:34 2016 From: Dale.T.Smith at macys.com (Dale T Smith) Date: Tue, 2 Aug 2016 12:34:34 +0000 Subject: [scikit-learn] Install sklearn into a specific folder to make some changes In-Reply-To: References: <12fb1f9a9aeec248ae4e7476879d8da6.squirrel@webmail.dcc.ufmg.br> <10a9c6f8-62b4-dc26-fb6a-336aaafeb286@gmail.com> Message-ID: I agree with everyone else ? conda environments are specially designed for this situation. I?ve not used virtualenv myself (http://docs.python-guide.org/en/latest/dev/virtualenvs/). I?m an Anaconda user. __________________________________________________________________________________________ Dale Smith | Macy's Systems and Technology | IFS eCommerce | Data Science and Capacity Planning | 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com From: scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=macys.com at python.org] On Behalf Of Michael Eickenberg Sent: Monday, August 1, 2016 4:17 PM To: Scikit-learn user and developer mailing list Subject: Re: [scikit-learn] Install sklearn into a specific folder to make some changes ? EXT MSG: On Monday, August 1, 2016, Andreas Mueller > wrote: Hi. The best is probably to use a virtual environment or conda environment specific for this changed version of scikit-learn. In that environment you could just run an "install" and it would not mess with your other environments. +1! If you don't want to go that way, you can also do ``python setup.py build_ext -i`` to build inplace and then add this path to your python path (PYTONPATH environment variable or sys.path.insert in the script or many other ways). Best, Andy On 08/01/2016 03:55 PM, luizfgoncalves at dcc.ufmg.br wrote: I'm looking for the best way to install sklearn into a specific folder so I can make changes for my work, without worrying about bugging my main sklearn installation (as I use the default version for some experiments too). 
I tried to clone the git repository and use "python setup.py install", but I'm afraid it will change my user installation too. Right now, what I want is to edit a file called splitter.pyx (on tree folder), compile/install sklearn so it will work with my changes, and test it. What is the best way to do it without causing problems with my main sklearn installation? Thanks a lot for your attention _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn * This is an EXTERNAL EMAIL. Stop and think before clicking a link or opening attachments. -------------- next part -------------- An HTML attachment was scrubbed... URL: From dmoisset at machinalis.com Tue Aug 2 09:34:17 2016 From: dmoisset at machinalis.com (Daniel Moisset) Date: Tue, 2 Aug 2016 14:34:17 +0100 Subject: [scikit-learn] Is there any official position on PEP484/mypy? In-Reply-To: References: <20160728164339.GD2110660@phare.normalesup.org> <598b3780-5b3d-2eb8-7e57-da3856026d0b@gmail.com> <014c8cb1-8997-67a9-3d6a-f0b94c63b7ff@gmail.com> <20160729195718.GO787902@phare.normalesup.org> Message-ID: A couple of things I forgot to mention: * One relevant consequence is that, to add annotations on the code, scikit-learn should depend on the "typing"[1] module which contains some of the basic names imported and used in annotations. It's a stdlib module in python 3.5, but the PyPI package backports it to python 2.7 and newer (I'm not sure how it works with Python 2.6, which might be an issue) * As an example of the kind of bugs that mypy can find, someone here already found a documentation bug in the sklearn.svm.SVC() initializer; the "kernel" parameter is described as "string"[2], when it's actually a "string or callable" (which can be read in the "small print" description of the argument). That kind of slips would be automatically prevented if declared as an annotation with mypy on the CI. Also it would be more clear what is the signature of the callable directly instead of looking up additional documentation on kernel functions or digging into the source [1] https://pypi.python.org/pypi/typing [2] http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC On Mon, Aug 1, 2016 at 5:15 PM, Daniel Moisset wrote: > On Fri, Jul 29, 2016 at 8:57 PM, Gael Varoquaux < > gael.varoquaux at normalesup.org> wrote: > >> >> Can you summarize once again in very simple terms what would be the big >> benefits? >> > > Benefits for regular scikit-learn users > > 1. Reliable information on method signatures in a standarized way > ("reliable" in the sense of "automatically verified") > 2. Better integration with tools supporting PEP-484 (editors, > documentation tools). This is a small set now, but I expect it to grow (and > it's also an egg and chicken problem, support has to start somewhere) > > Benefits for scikit-learn users also using mypy and/or PEP-484 (probably > not a large set, but I know a few people :) ) > > 0. Same as the rest of the users > 1. Early detection of errors in own code while writing code based on SKL > 2. Making own code more readable/explicit by annotating functions that > receive/return SKL types (and verifying that annotations) > > Benefits for scikit-learn developers > > 1. Some extra checks that changes keep internal consistency > 2. 
(Future) possible simplification of typing information in docstrings, > which would make themselves redundant (this would require updating doc > generators) > > Regarding the cost for contributing, an scenario where you get a CI error > due to mypy would be because: > > * the change in the code somewhat changed the existing accepted/returned > types, which is a change in the API and should actually be verified > * the change in the code extended the signature of an existing function > (what Andreas mentioned); in this situation it's similar to a PR that adds > an argument and doesn't update the docstring (only that this is > automatically caught). > > WRT to the second issue, the error here might be confusing when using the > "one line" syntax because arguments may "misalign" with their signatures. > The multiline version (or the python3-only form) is safer in that sense (in > fact, adding an argument there will not produce a CI problem because its > unannotated and assumed to be "any type"). > > Adding new modules/methods without no annotations wouldn't produce an > error, just an incompleteness in the annotations > > A possible source of problems like the one you mention is that the > implementation of the annotated methods will be checked, and sometimes > you'll get a warning about a local variable if mypy can't infer its type > (it happens sometimes when assigning an empty list to a local, where mypy > knows that it's a list but doesn't know the element type). But in that case > I think the message you get is very obvious. > > -- > Daniel F. Moisset - UK Country Manager > www.machinalis.com > Skype: @dmoisset > -- Daniel F. Moisset - UK Country Manager www.machinalis.com Skype: @dmoisset -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Tue Aug 2 10:06:02 2016 From: joel.nothman at gmail.com (Joel Nothman) Date: Wed, 3 Aug 2016 00:06:02 +1000 Subject: [scikit-learn] Is there any official position on PEP484/mypy? In-Reply-To: References: <20160728164339.GD2110660@phare.normalesup.org> <598b3780-5b3d-2eb8-7e57-da3856026d0b@gmail.com> <014c8cb1-8997-67a9-3d6a-f0b94c63b7ff@gmail.com> <20160729195718.GO787902@phare.normalesup.org> Message-ID: I certainly see the benefit, and think we would benefit also from finding test coverage holes wrt input type. But I think without ndarray/sparse matrix type support, we're not going to be able to annotate most of our code in sufficient detail. On 2 August 2016 at 23:34, Daniel Moisset wrote: > A couple of things I forgot to mention: > > * One relevant consequence is that, to add annotations on the code, > scikit-learn should depend on the "typing"[1] module which contains some of > the basic names imported and used in annotations. It's a stdlib module in > python 3.5, but the PyPI package backports it to python 2.7 and newer (I'm > not sure how it works with Python 2.6, which might be an issue) > * As an example of the kind of bugs that mypy can find, someone here > already found a documentation bug in the sklearn.svm.SVC() initializer; the > "kernel" parameter is described as "string"[2], when it's actually a > "string or callable" (which can be read in the "small print" description of > the argument). That kind of slips would be automatically prevented if > declared as an annotation with mypy on the CI. 
Also it would be more clear > what is the signature of the callable directly instead of looking up > additional documentation on kernel functions or digging into the source > > [1] https://pypi.python.org/pypi/typing > [2] > http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC > > > On Mon, Aug 1, 2016 at 5:15 PM, Daniel Moisset > wrote: > >> On Fri, Jul 29, 2016 at 8:57 PM, Gael Varoquaux < >> gael.varoquaux at normalesup.org> wrote: >> >>> >>> Can you summarize once again in very simple terms what would be the big >>> benefits? >>> >> >> Benefits for regular scikit-learn users >> >> 1. Reliable information on method signatures in a standarized way >> ("reliable" in the sense of "automatically verified") >> 2. Better integration with tools supporting PEP-484 (editors, >> documentation tools). This is a small set now, but I expect it to grow (and >> it's also an egg and chicken problem, support has to start somewhere) >> >> Benefits for scikit-learn users also using mypy and/or PEP-484 (probably >> not a large set, but I know a few people :) ) >> >> 0. Same as the rest of the users >> 1. Early detection of errors in own code while writing code based on SKL >> 2. Making own code more readable/explicit by annotating functions that >> receive/return SKL types (and verifying that annotations) >> >> Benefits for scikit-learn developers >> >> 1. Some extra checks that changes keep internal consistency >> 2. (Future) possible simplification of typing information in docstrings, >> which would make themselves redundant (this would require updating doc >> generators) >> >> Regarding the cost for contributing, an scenario where you get a CI error >> due to mypy would be because: >> >> * the change in the code somewhat changed the existing accepted/returned >> types, which is a change in the API and should actually be verified >> * the change in the code extended the signature of an existing function >> (what Andreas mentioned); in this situation it's similar to a PR that adds >> an argument and doesn't update the docstring (only that this is >> automatically caught). >> >> WRT to the second issue, the error here might be confusing when using the >> "one line" syntax because arguments may "misalign" with their signatures. >> The multiline version (or the python3-only form) is safer in that sense (in >> fact, adding an argument there will not produce a CI problem because its >> unannotated and assumed to be "any type"). >> >> Adding new modules/methods without no annotations wouldn't produce an >> error, just an incompleteness in the annotations >> >> A possible source of problems like the one you mention is that the >> implementation of the annotated methods will be checked, and sometimes >> you'll get a warning about a local variable if mypy can't infer its type >> (it happens sometimes when assigning an empty list to a local, where mypy >> knows that it's a list but doesn't know the element type). But in that case >> I think the message you get is very obvious. >> >> -- >> Daniel F. Moisset - UK Country Manager >> www.machinalis.com >> Skype: @dmoisset >> > > > > -- > Daniel F. Moisset - UK Country Manager > www.machinalis.com > Skype: @dmoisset > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From gael.varoquaux at normalesup.org Tue Aug 2 13:48:22 2016 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Tue, 2 Aug 2016 19:48:22 +0200 Subject: [scikit-learn] Is there any official position on PEP484/mypy? In-Reply-To: References: <20160728164339.GD2110660@phare.normalesup.org> <598b3780-5b3d-2eb8-7e57-da3856026d0b@gmail.com> <014c8cb1-8997-67a9-3d6a-f0b94c63b7ff@gmail.com> <20160729195718.GO787902@phare.normalesup.org> Message-ID: <20160802174822.GD1269350@phare.normalesup.org> > * One relevant consequence is that, to add annotations on the code, > scikit-learn should depend on the "typing"[1] module which contains some of the > basic names imported and used in annotations. It's a stdlib module in python > 3.5, but the PyPI package backports it to python 2.7 and newer (I'm not sure > how it works with Python 2.6, which might be an issue) I am afraid that this is going to be a problem: we have a no dependency policy (beyond numpy and scipy). From t3kcit at gmail.com Tue Aug 2 14:12:17 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Tue, 2 Aug 2016 14:12:17 -0400 Subject: [scikit-learn] Is there any official position on PEP484/mypy? In-Reply-To: <20160802174822.GD1269350@phare.normalesup.org> References: <20160728164339.GD2110660@phare.normalesup.org> <598b3780-5b3d-2eb8-7e57-da3856026d0b@gmail.com> <014c8cb1-8997-67a9-3d6a-f0b94c63b7ff@gmail.com> <20160729195718.GO787902@phare.normalesup.org> <20160802174822.GD1269350@phare.normalesup.org> Message-ID: On 08/02/2016 01:48 PM, Gael Varoquaux wrote: >> * One relevant consequence is that, to add annotations on the code, >> scikit-learn should depend on the "typing"[1] module which contains some of the >> basic names imported and used in annotations. It's a stdlib module in python >> 3.5, but the PyPI package backports it to python 2.7 and newer (I'm not sure >> how it works with Python 2.6, which might be an issue) > I am afraid that this is going to be a problem: we have a no dependency > policy (beyond numpy and scipy). I still think this is a point we should discuss further ;) From shee.yu at gmail.com Tue Aug 2 17:02:23 2016 From: shee.yu at gmail.com (Shi Yu) Date: Tue, 2 Aug 2016 16:02:23 -0500 Subject: [scikit-learn] Model trained in 0.17 gives entirely different results in 0.15 Message-ID: Hello, We trained SVM models in scikit-learn 0.17 and saved it as pickle files. When loading the models back in a lower version of scikit-learn 0.15, the outputs are entirely different. Basically for binary classification problem, for the same test data, it swapped the probabilities and gave an opposite prediction. In 0.17 the probability is [0.02668825, 0.97331175] and the prediction is 1. In 0.15 the probability is [0.97331175, 0.02668825] and the prediction is 0. I wonder is anyone seeing the same issue, or it has been notified. I could provide more details for error replication if required. Best, Shi -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Wed Aug 3 14:29:08 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 3 Aug 2016 14:29:08 -0400 Subject: [scikit-learn] Model trained in 0.17 gives entirely different results in 0.15 In-Reply-To: References: Message-ID: Hi Shi. In general, there is no guarantee that models built with one version will work in a different version. In particular, loading in an older version when built in a newer version seems something that's tricky to achieve. We might want to warn the user when doing this. 
The docs are not very explicit about this. Opened an issue: https://github.com/scikit-learn/scikit-learn/issues/7135 Andy On 08/02/2016 05:02 PM, Shi Yu wrote: > Hello, > > We trained SVM models in scikit-learn 0.17 and saved it as pickle > files. When loading the models back in a lower version of scikit-learn > 0.15, the outputs are entirely different. Basically for binary > classification problem, for the same test data, it swapped the > probabilities and gave an opposite prediction. In 0.17 the > probability is [0.02668825, 0.97331175] and the prediction is 1. In > 0.15 the probability is [0.97331175, 0.02668825] and the prediction is 0. > > I wonder is anyone seeing the same issue, or it has been notified. I > could provide more details for error replication if required. > > Best, > > Shi > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From shee.yu at gmail.com Wed Aug 3 15:02:46 2016 From: shee.yu at gmail.com (Shi Yu) Date: Wed, 3 Aug 2016 14:02:46 -0500 Subject: [scikit-learn] Model trained in 0.17 gives entirely different results in 0.15 In-Reply-To: References: Message-ID: Hi Andy, Thanks for the feedback. Indeed, we think it would be a good idea to enforce version persistence, something like Java's serialVersionUID, here. We deployed models trained on our laptop onto our clusters, ran into this issue, and learned a serious lesson from it. Best, Shi ---------- Forwarded message ---------- From: Andreas Mueller Date: Wed, Aug 3, 2016 at 1:29 PM Subject: Re: [scikit-learn] Model trained in 0.17 gives entirely different results in 0.15 To: Scikit-learn user and developer mailing list Hi Shi. In general, there is no guarantee that models built with one version will work in a different version. In particular, loading in an older version when built in a newer version seems something that's tricky to achieve. We might want to warn the user when doing this. The docs are not very explicit about this. Opened an issue: https://github.com/scikit-learn/scikit-learn/issues/7135 Andy On 08/02/2016 05:02 PM, Shi Yu wrote: Hello, We trained SVM models in scikit-learn 0.17 and saved it as pickle files. When loading the models back in a lower version of scikit-learn 0.15, the outputs are entirely different. Basically for binary classification problem, for the same test data, it swapped the probabilities and gave an opposite prediction. In 0.17 the probability is [0.02668825, 0.97331175] and the prediction is 1. In 0.15 the probability is [0.97331175, 0.02668825] and the prediction is 0. I wonder is anyone seeing the same issue, or it has been notified. I could provide more details for error replication if required. Best, Shi _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL:
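A minimal sketch of that kind of version guard on the user side (an illustration only, not a scikit-learn API): store the training-time version next to the pickled estimator and warn when it differs at load time.

    import pickle
    import warnings

    import sklearn

    def dump_model(model, path):
        # Persist the estimator together with the scikit-learn version that produced it.
        with open(path, "wb") as f:
            pickle.dump({"sklearn_version": sklearn.__version__, "model": model}, f)

    def load_model(path):
        with open(path, "rb") as f:
            payload = pickle.load(f)
        if payload["sklearn_version"] != sklearn.__version__:
            warnings.warn("Model pickled with scikit-learn %s, but %s is installed; "
                          "results may differ." % (payload["sklearn_version"],
                                                   sklearn.__version__))
        return payload["model"]
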
From matthieu.brucher at gmail.com Wed Aug 3 15:16:23 2016 From: matthieu.brucher at gmail.com (Matthieu Brucher) Date: Wed, 3 Aug 2016 20:16:23 +0100 Subject: Re: [scikit-learn] Model trained in 0.17 gives entirely different results in 0.15 In-Reply-To: References: Message-ID: More often than not, forward compatibility is not possible. I don't think there are lots of companies doing so, as even backward compatibility is tricky to achieve. Even with serializing the version, if the previous version doesn't know about the additional data structures that have an impact on the model, you are screwed. I don't think there is anything you can expect for forward compatibility... Cheers, 2016-08-03 19:29 GMT+01:00 Andreas Mueller : > Hi Shi. > In general, there is no guarantee that models built with one version will > work in a different version. > In particular, loading in an older version when built in a newer version > seems something that's tricky to achieve. > > We might want to warn the user when doing this. The docs are not very > explicit about this. > > Opened an issue: > https://github.com/scikit-learn/scikit-learn/issues/7135 > > Andy > > > On 08/02/2016 05:02 PM, Shi Yu wrote: > > Hello, > > We trained SVM models in scikit-learn 0.17 and saved it as pickle files. > When loading the models back in a lower version of scikit-learn 0.15, the > outputs are entirely different. Basically for binary classification > problem, for the same test data, it swapped the probabilities and gave an > opposite prediction. In 0.17 the probability is [0.02668825, 0.97331175] > and the prediction is 1. In 0.15 the probability is [0.97331175, > 0.02668825] and the prediction is 0. > > I wonder is anyone seeing the same issue, or it has been notified. I > could provide more details for error replication if required. > > Best, > > Shi > > > _______________________________________________ > scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Information System Engineer, Ph.D. Blog: http://blog.audio-tk.com/ LinkedIn: http://www.linkedin.com/in/matthieubrucher -------------- next part -------------- An HTML attachment was scrubbed... URL: From Dale.T.Smith at macys.com Wed Aug 3 15:09:06 2016 From: Dale.T.Smith at macys.com (Dale T Smith) Date: Wed, 3 Aug 2016 19:09:06 +0000 Subject: [scikit-learn] Model trained in 0.17 gives entirely different results in 0.15 In-Reply-To: References: Message-ID: Use conda or a virtualenv to handle compatibility issues. Then you can control when upgrades occur. I've used conda with good effect to handle version issues such as yours. Otherwise, use PMML. The Data Mining Group maintains a list of PMML producers and consumers. I think there is a Python wrapper for JPMML which is what you can use for a consumer.
http://dmg.org/pmml/products.html __________________________________________________________________________________________ Dale Smith | Macy's Systems and Technology | IFS eCommerce | Data Science and Capacity Planning | 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com From: scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=macys.com at python.org] On Behalf Of Shi Yu Sent: Wednesday, August 3, 2016 3:03 PM To: Scikit-learn user and developer mailing list Subject: Re: [scikit-learn] Model trained in 0.17 gives entirely different results in 0.15 ? EXT MSG: Hi Andy, Thanks for the feedback. Indeed we think it would be a good idea to enforce version persistence something like in serialVersionUID Java here. We deployed models trained on our laptop onto our clusters, and ran into this issue and paid a serious lesson for that. Best, Shi ---------- Forwarded message ---------- From: Andreas Mueller > Date: Wed, Aug 3, 2016 at 1:29 PM Subject: Re: [scikit-learn] Model trained in 0.17 gives entirely different results in 0.15 To: Scikit-learn user and developer mailing list > Hi Shi. In general, there is no guarantee that models built with one version will work in a different version. In particular, loading in an older version when built in a newer version seems something that's tricky to achieve. We might want to warn the user when doing this. The docs are not very explicit about this. Opened an issue: https://github.com/scikit-learn/scikit-learn/issues/7135 Andy On 08/02/2016 05:02 PM, Shi Yu wrote: Hello, We trained SVM models in scikit-learn 0.17 and saved it as pickle files. When loading the models back in a lower version of scikit-learn 0.15, the outputs are entirely different. Basically for binary classification problem, for the same test data, it swapped the probabilities and gave an opposite prediction. In 0.17 the probability is [0.02668825, 0.97331175] and the prediction is 1. In 0.15 the probability is [0.97331175, 0.02668825] and the prediction is 0. I wonder is anyone seeing the same issue, or it has been notified. I could provide more details for error replication if required. Best, Shi _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn * This is an EXTERNAL EMAIL. Stop and think before clicking a link or opening attachments. -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Wed Aug 3 15:38:39 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 3 Aug 2016 15:38:39 -0400 Subject: [scikit-learn] Model trained in 0.17 gives entirely different results in 0.15 In-Reply-To: References: Message-ID: On 08/03/2016 03:16 PM, Matthieu Brucher wrote: > More often than not, forward compatiblity is not possible. I don't > think there are lots of companies doing so, as even backward > compatibility is tricky to achieve. > Even with serializing the version, if the previous version doesn't > know about the additional data structures that have an impact on the > model, you are screwed. I don't think there is anything you can expect > for forward compatibility... 
I think you can expect an error message instead of undefined behavior, though ;) From matthieu.brucher at gmail.com Wed Aug 3 16:13:14 2016 From: matthieu.brucher at gmail.com (Matthieu Brucher) Date: Wed, 3 Aug 2016 21:13:14 +0100 Subject: [scikit-learn] Model trained in 0.17 gives entirely different results in 0.15 In-Reply-To: References: Message-ID: True! 2016-08-03 20:38 GMT+01:00 Andreas Mueller : > > > On 08/03/2016 03:16 PM, Matthieu Brucher wrote: > >> More often than not, forward compatiblity is not possible. I don't think >> there are lots of companies doing so, as even backward compatibility is >> tricky to achieve. >> Even with serializing the version, if the previous version doesn't know >> about the additional data structures that have an impact on the model, you >> are screwed. I don't think there is anything you can expect for forward >> compatibility... >> > I think you can expect an error message instead of undefined behavior, > though ;) > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Information System Engineer, Ph.D. Blog: http://blog.audio-tk.com/ LinkedIn: http://www.linkedin.com/in/matthieubrucher -------------- next part -------------- An HTML attachment was scrubbed... URL: From lukejchang at gmail.com Wed Aug 3 18:47:35 2016 From: lukejchang at gmail.com (Luke Chang) Date: Wed, 3 Aug 2016 18:47:35 -0400 Subject: [scikit-learn] Model trained in 0.17 gives entirely different results in 0.15 In-Reply-To: References: Message-ID: <4034A4A6-338F-44BA-A566-56EB91825845@gmail.com> 1pmish -luke > On Aug 3, 2016, at 4:13 PM, Matthieu Brucher wrote: > > True! > > 2016-08-03 20:38 GMT+01:00 Andreas Mueller : >> >> >>> On 08/03/2016 03:16 PM, Matthieu Brucher wrote: >>> More often than not, forward compatiblity is not possible. I don't think there are lots of companies doing so, as even backward compatibility is tricky to achieve. >>> Even with serializing the version, if the previous version doesn't know about the additional data structures that have an impact on the model, you are screwed. I don't think there is anything you can expect for forward compatibility... >> I think you can expect an error message instead of undefined behavior, though ;) >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > > > -- > Information System Engineer, Ph.D. > Blog: http://blog.audio-tk.com/ > LinkedIn: http://www.linkedin.com/in/matthieubrucher > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Thu Aug 4 00:25:55 2016 From: joel.nothman at gmail.com (Joel Nothman) Date: Thu, 4 Aug 2016 14:25:55 +1000 Subject: [scikit-learn] StackOverflow Documentation Message-ID: StackOverflow has introduced its Documentation space, where scikit-learn is a covered subject: http://stackoverflow.com/documentation/scikit-learn. The project is a little interesting, and otherwise somewhat exasperating/tiring, given the overlap with our own documentation efforts, which we would like to see continually improve and maintain alignment with the codebase. Currently there seem to be two contributors. 
One appears to have been copy-pasting official scikit-learn documentation, while the other has produced original material. From a license perspective, copy-pasted material might be okay with attribution and reference to a BSD licence, with the assumption that it is then double-licensed (BSD and CC-BY-SA) if copied from SO. But I assume that copying without attribution is actually plagiarism and should be reverted, while we should discourage copying with attribution: if SO Documentation for scikit-learn has its place, it should be different to the official reference...? -------------- next part -------------- An HTML attachment was scrubbed... URL: From mail at sebastianraschka.com Thu Aug 4 01:13:25 2016 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Thu, 4 Aug 2016 01:13:25 -0400 Subject: Re: [scikit-learn] StackOverflow Documentation In-Reply-To: References: Message-ID: Hm, that's an "interesting" approach by SO, I guess their idea is to build a collection of code-and-example based snippets for less well-documented libraries -- especially libraries that want to keep their documentation lean. > But I assume that copying without attribution is actually plagiarism and should be reverted, as far as I know, you are right regarding BSD. In this scikit-learn case, it seems more like that these users are merely "farming" for SO points and rep by reposting scikit-learn documentation. In my opinion, the polite way to go about it is to just comment as a scikit-learn dev saying that these reposts are okay under the BSD license but that an attribution to the original source needs to be added since it violates the copyright otherwise -- like you mentioned -- and adding a nice message encouraging these users to make suggestions and improvements to the original docs. (and if nothing changes after xx days, I would report it to SO). > On Aug 4, 2016, at 12:25 AM, Joel Nothman wrote: > > StackOverflow has introduced its Documentation space, where scikit-learn is a covered subject: http://stackoverflow.com/documentation/scikit-learn. The project is a little interesting, and otherwise somewhat exasperating/tiring, given the overlap with our own documentation efforts, which we would like to see continually improve and maintain alignment with the codebase. > > Currently there seem to be two contributors. One appears to have been copy-pasting official scikit-learn documentation, while the other has produced original material. From a license perspective, copy-pasted material might be okay with attribution and reference to a BSD licence, with the assumption that it is then double-licensed (BSD and CC-BY-SA) if copied from SO. > > But I assume that copying without attribution is actually plagiarism and should be reverted, while we should discourage copying with attribution: if SO Documentation for scikit-learn has its place, it should be different to the official reference...? > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From gael.varoquaux at normalesup.org Thu Aug 4 02:29:26 2016 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Thu, 4 Aug 2016 08:29:26 +0200 Subject: [scikit-learn] StackOverflow Documentation In-Reply-To: References: Message-ID: <20160804062926.GB2146765@phare.normalesup.org> > In this scikit-learn case, it seems more like that these users are merely "farming" for SO points and rep by reposting scikit-learn documentation.
In my opinion, the polite way to go about it is to just comment as a scikit-learn dev saying that these reposts are okay under the BSD license but that an attribution to the original source needs to be added since it violates the copyright otherwise -- like you mentioned -- and adding a nice message encouraging these users to make suggestions and improvements to the original docs. (and if nothing changes after xx days, I would report it to SO). +1 From dmoisset at machinalis.com Thu Aug 4 07:40:37 2016 From: dmoisset at machinalis.com (Daniel Moisset) Date: Thu, 4 Aug 2016 12:40:37 +0100 Subject: [scikit-learn] Is there any official position on PEP484/mypy? In-Reply-To: References: <20160728164339.GD2110660@phare.normalesup.org> <598b3780-5b3d-2eb8-7e57-da3856026d0b@gmail.com> <014c8cb1-8997-67a9-3d6a-f0b94c63b7ff@gmail.com> <20160729195718.GO787902@phare.normalesup.org> <20160802174822.GD1269350@phare.normalesup.org> Message-ID: If the dependency is really a showstopper, bundling could be an option. The module is a single, pure python file so that shouldn't complicate things much. @Joel, regarding "without ndarray/sparse matrix type support, we're not going to be able to annotate most of our code in sufficient detail": That shouldn't be a problem, we have already written some working support for numpy at https://github.com/machinalis/mypy-data, so it's possible to annotate ndarrays and matrix types (scipy.sparse is not covered yet, I could take a look into that). Best, D. On Tue, Aug 2, 2016 at 7:12 PM, Andreas Mueller wrote: > > > On 08/02/2016 01:48 PM, Gael Varoquaux wrote: > >> * One relevant consequence is that, to add annotations on the code, >>> scikit-learn should depend on the "typing"[1] module which contains some >>> of the >>> basic names imported and used in annotations. It's a stdlib module in >>> python >>> 3.5, but the PyPI package backports it to python 2.7 and newer (I'm not >>> sure >>> how it works with Python 2.6, which might be an issue) >>> >> I am afraid that this is going to be a problem: we have a no dependency >> policy (beyond numpy and scipy). >> > I still think this is a point we should discuss further ;) > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Daniel F. Moisset - UK Country Manager www.machinalis.com Skype: @dmoisset -------------- next part -------------- An HTML attachment was scrubbed... URL: From vaggi.federico at gmail.com Thu Aug 4 08:11:50 2016 From: vaggi.federico at gmail.com (federico vaggi) Date: Thu, 04 Aug 2016 12:11:50 +0000 Subject: [scikit-learn] Is there any official position on PEP484/mypy? In-Reply-To: References: <20160728164339.GD2110660@phare.normalesup.org> <598b3780-5b3d-2eb8-7e57-da3856026d0b@gmail.com> <014c8cb1-8997-67a9-3d6a-f0b94c63b7ff@gmail.com> <20160729195718.GO787902@phare.normalesup.org> <20160802174822.GD1269350@phare.normalesup.org> Message-ID: Another point about the dependency: the dependency is not required at run time - it is only required to run the type checker. You could easily put it in a try/catch block and people running scikit-learn wouldn't need it. On Thu, 4 Aug 2016 at 13:41 Daniel Moisset wrote: > If the dependency is really a showstopper, bundling could be an option. > The module is a single, pure python file so that shouldn't complicate > things much.
> > @Joel, regarding > ?without ndarray/sparse matrix type support, we're not going to be able > to annotate most of our code in sufficient detail? > > That shouldn't be a problem, we have already written some working support > for numpy at https://github.com/machinalis/mypy-data, so it's possible no > annotate ndarrays and matrix types (scipy.sparse is not covered yet, I > could take a look into that). > > Best, > D. > > On Tue, Aug 2, 2016 at 7:12 PM, Andreas Mueller wrote: > >> >> >> On 08/02/2016 01:48 PM, Gael Varoquaux wrote: >> >>> * One relevant consequence is that, to add annotations on the code, >>>> scikit-learn should depend on the "typing"[1] module which contains >>>> some of the >>>> basic names imported and used in annotations. It's a stdlib module in >>>> python >>>> 3.5, but the PyPI package backports it to python 2.7 and newer (I'm not >>>> sure >>>> how it works with Python 2.6, which might be an issue) >>>> >>> I am afraid that this is going to be a problem: we have a no dependency >>> policy (beyond numpy and scipy). >>> >> I still think this is a point we should discuss further ;) >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > > > -- > Daniel F. Moisset - UK Country Manager > www.machinalis.com > Skype: @dmoisset > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Dale.T.Smith at macys.com Thu Aug 4 08:20:48 2016 From: Dale.T.Smith at macys.com (Dale T Smith) Date: Thu, 4 Aug 2016 12:20:48 +0000 Subject: [scikit-learn] StackOverflow Documentation In-Reply-To: <20160804062926.GB2146765@phare.normalesup.org> References: <20160804062926.GB2146765@phare.normalesup.org> Message-ID: Perhaps a comment to that effect at StackOverflow Documentation would be helpful. I support the SO effort. I think it provides an opportunity to introduce examples and tips that aren't in the tutorials or user's guide. However, my own position is I would like to contribute to the official sklearn site - if I could get clearance on the legal side. Sigh. __________________________________________________________________________________________ Dale Smith | Macy's Systems and Technology | IFS eCommerce | Data Science and Capacity Planning |?5985 State Bridge Road, Johns Creek, GA 30097?|?dale.t.smith at macys.com -----Original Message----- From: scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=macys.com at python.org] On Behalf Of Gael Varoquaux Sent: Thursday, August 4, 2016 2:29 AM To: Scikit-learn user and developer mailing list Subject: Re: [scikit-learn] StackOverflow Documentation ? EXT MSG: > In this scikit-learn case, it seems more like that these users are merely ?farming? for SO points and rep by reposting scikit-learn documentation. In my opinion, the polite way to go about it is to just comment as a scikit-learn dev saying that these reposts are okay under the BSD license but that a contribution to the original source needs to be added since it violates the copyright otherwise ? like you mentioned ? and adding a nice message encouraging these users to make suggestions and improvements to the original docs. (and if nothing changes after xx days, I would report it to SO). 
+1 _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn * This is an EXTERNAL EMAIL. Stop and think before clicking a link or opening attachments. From basilbeirouti at gmail.com Thu Aug 4 13:07:17 2016 From: basilbeirouti at gmail.com (Basil Beirouti) Date: Thu, 4 Aug 2016 12:07:17 -0500 Subject: [scikit-learn] BM25 Pull Request Message-ID: Hi all, Just sending an email for visibility. I've made a pull request to add Bm25 capabilities to complement TFIDF in feature_extraction.text. All tests pass. Sincerely, Basil Beirouti -------------- next part -------------- An HTML attachment was scrubbed... URL: From amisra2 at ucsc.edu Thu Aug 4 17:17:29 2016 From: amisra2 at ucsc.edu (Amita Misra) Date: Thu, 4 Aug 2016 14:17:29 -0700 Subject: [scikit-learn] Supervised anomaly detection in time series Message-ID: Hi, I am currently exploring the problem of speed bump detection using accelerometer time series data. I have extracted some features based on mean, std deviation etc within a time window. Since the dataset is highly skewed ( I have just 5 positive samples for every > 300 samples) I was looking into One ClassSVM covariance.EllipticEnvelope sklearn.ensemble.IsolationForest but I am not sure how to use them. What I get from docs separate the positive examples and train using only negative examples clf.fit(X_train) and then predict the positive examples using clf.predict(X_test) I am not sure what is then the role of positive examples in my training dataset or how can I use them to improve my classifier so that I can predict better on new samples. Can we do something like Cross validation to learn the parameters as in normal binary SVM classification Thanks,? Amita Amita Misra Graduate Student Researcher Natural Language and Dialogue Systems Lab Baskin School of Engineering University of California Santa Cruz -- Amita Misra Graduate Student Researcher Natural Language and Dialogue Systems Lab Baskin School of Engineering University of California Santa Cruz -------------- next part -------------- An HTML attachment was scrubbed... URL: From goix.nicolas at gmail.com Thu Aug 4 19:43:03 2016 From: goix.nicolas at gmail.com (Nicolas Goix) Date: Thu, 4 Aug 2016 19:43:03 -0400 Subject: [scikit-learn] Supervised anomaly detection in time series In-Reply-To: References: Message-ID: Hi, Yes you can use your labeled data (you will need to sub-sample your normal class to have similar proportion normal-abnormal) to learn your hyper-parameters through CV. You can also try to use supervised classification algorithms on `not too highly unbalanced' sub-samples. Nicolas On Thu, Aug 4, 2016 at 5:17 PM, Amita Misra wrote: > Hi, > > I am currently exploring the problem of speed bump detection using > accelerometer time series data. > I have extracted some features based on mean, std deviation etc within a > time window. > > Since the dataset is highly skewed ( I have just 5 positive samples for > every > 300 samples) > I was looking into > > One ClassSVM > covariance.EllipticEnvelope > sklearn.ensemble.IsolationForest > > but I am not sure how to use them. 
> > What I get from docs > separate the positive examples and train using only negative examples > > clf.fit(X_train) > > and then > predict the positive examples using > clf.predict(X_test) > > > I am not sure what is then the role of positive examples in my training > dataset or how can I use them to improve my classifier so that I can > predict better on new samples. > > > Can we do something like Cross validation to learn the parameters as in > normal binary SVM classification > > Thanks,? > Amita > > Amita Misra > Graduate Student Researcher > Natural Language and Dialogue Systems Lab > Baskin School of Engineering > University of California Santa Cruz > > > > > > -- > Amita Misra > Graduate Student Researcher > Natural Language and Dialogue Systems Lab > Baskin School of Engineering > University of California Santa Cruz > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From amisra2 at ucsc.edu Thu Aug 4 19:48:54 2016 From: amisra2 at ucsc.edu (Amita Misra) Date: Thu, 4 Aug 2016 16:48:54 -0700 Subject: [scikit-learn] Supervised anomaly detection in time series In-Reply-To: References: Message-ID: SubSample would remove a lot of information from the negative class. I have more than 500 samples of negative class and just 5 samples of positive class. Amita On Thu, Aug 4, 2016 at 4:43 PM, Nicolas Goix wrote: > Hi, > > Yes you can use your labeled data (you will need to sub-sample your normal > class to have similar proportion normal-abnormal) to learn your > hyper-parameters through CV. > > You can also try to use supervised classification algorithms on `not too > highly unbalanced' sub-samples. > > Nicolas > > On Thu, Aug 4, 2016 at 5:17 PM, Amita Misra wrote: > >> Hi, >> >> I am currently exploring the problem of speed bump detection using >> accelerometer time series data. >> I have extracted some features based on mean, std deviation etc within a >> time window. >> >> Since the dataset is highly skewed ( I have just 5 positive samples for >> every > 300 samples) >> I was looking into >> >> One ClassSVM >> covariance.EllipticEnvelope >> sklearn.ensemble.IsolationForest >> >> but I am not sure how to use them. >> >> What I get from docs >> separate the positive examples and train using only negative examples >> >> clf.fit(X_train) >> >> and then >> predict the positive examples using >> clf.predict(X_test) >> >> >> I am not sure what is then the role of positive examples in my training >> dataset or how can I use them to improve my classifier so that I can >> predict better on new samples. >> >> >> Can we do something like Cross validation to learn the parameters as in >> normal binary SVM classification >> >> Thanks,? 
>> Amita >> >> Amita Misra >> Graduate Student Researcher >> Natural Language and Dialogue Systems Lab >> Baskin School of Engineering >> University of California Santa Cruz >> >> >> >> >> >> -- >> Amita Misra >> Graduate Student Researcher >> Natural Language and Dialogue Systems Lab >> Baskin School of Engineering >> University of California Santa Cruz >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Amita Misra Graduate Student Researcher Natural Language and Dialogue Systems Lab Baskin School of Engineering University of California Santa Cruz -------------- next part -------------- An HTML attachment was scrubbed... URL: From goix.nicolas at gmail.com Thu Aug 4 20:23:28 2016 From: goix.nicolas at gmail.com (Nicolas Goix) Date: Thu, 4 Aug 2016 20:23:28 -0400 Subject: [scikit-learn] Supervised anomaly detection in time series In-Reply-To: References: Message-ID: You can evaluate the accuracy of your hyper-parameters on a few samples. Just don't use the accuracy as your performance measure. For supervised classification, training multiple algorithms on small balanced subsamples usually works well, but 5 anomalies seems indeed to be very little. Nicolas On Aug 4, 2016 7:51 PM, "Amita Misra" wrote: > SubSample would remove a lot of information from the negative class. > I have more than 500 samples of negative class and just 5 samples of > positive class. > > Amita > > On Thu, Aug 4, 2016 at 4:43 PM, Nicolas Goix > wrote: > >> Hi, >> >> Yes you can use your labeled data (you will need to sub-sample your >> normal class to have similar proportion normal-abnormal) to learn your >> hyper-parameters through CV. >> >> You can also try to use supervised classification algorithms on `not too >> highly unbalanced' sub-samples. >> >> Nicolas >> >> On Thu, Aug 4, 2016 at 5:17 PM, Amita Misra wrote: >> >>> Hi, >>> >>> I am currently exploring the problem of speed bump detection using >>> accelerometer time series data. >>> I have extracted some features based on mean, std deviation etc within >>> a time window. >>> >>> Since the dataset is highly skewed ( I have just 5 positive samples for >>> every > 300 samples) >>> I was looking into >>> >>> One ClassSVM >>> covariance.EllipticEnvelope >>> sklearn.ensemble.IsolationForest >>> >>> but I am not sure how to use them. >>> >>> What I get from docs >>> separate the positive examples and train using only negative examples >>> >>> clf.fit(X_train) >>> >>> and then >>> predict the positive examples using >>> clf.predict(X_test) >>> >>> >>> I am not sure what is then the role of positive examples in my training >>> dataset or how can I use them to improve my classifier so that I can >>> predict better on new samples. >>> >>> >>> Can we do something like Cross validation to learn the parameters as in >>> normal binary SVM classification >>> >>> Thanks,? 
>>> Amita >>> >>> Amita Misra >>> Graduate Student Researcher >>> Natural Language and Dialogue Systems Lab >>> Baskin School of Engineering >>> University of California Santa Cruz >>> >>> >>> >>> >>> >>> -- >>> Amita Misra >>> Graduate Student Researcher >>> Natural Language and Dialogue Systems Lab >>> Baskin School of Engineering >>> University of California Santa Cruz >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > -- > Amita Misra > Graduate Student Researcher > Natural Language and Dialogue Systems Lab > Baskin School of Engineering > University of California Santa Cruz > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From amisra2 at ucsc.edu Thu Aug 4 20:42:25 2016 From: amisra2 at ucsc.edu (Amita Misra) Date: Thu, 4 Aug 2016 17:42:25 -0700 Subject: [scikit-learn] Supervised anomaly detection in time series In-Reply-To: References: Message-ID: If I train multiple algorithms on different subsamples, then how do I get the final classifier that predicts unseen data? I have very few positive samples since it is speed bump detection and we have very few speed bumps in a drive. However, I think that unseen new data would be quite similar to what I have in training data hence if I can correctly learn a classifier for these 5, I hope it should work well for unseen speed bumps. Thanks, Amita On Thu, Aug 4, 2016 at 5:23 PM, Nicolas Goix wrote: > You can evaluate the accuracy of your hyper-parameters on a few samples. > Just don't use the accuracy as your performance measure. > > For supervised classification, training multiple algorithms on small > balanced subsamples usually works well, but 5 anomalies seems indeed to be > very little. > > Nicolas > > On Aug 4, 2016 7:51 PM, "Amita Misra" wrote: > >> SubSample would remove a lot of information from the negative class. >> I have more than 500 samples of negative class and just 5 samples of >> positive class. >> >> Amita >> >> On Thu, Aug 4, 2016 at 4:43 PM, Nicolas Goix >> wrote: >> >>> Hi, >>> >>> Yes you can use your labeled data (you will need to sub-sample your >>> normal class to have similar proportion normal-abnormal) to learn your >>> hyper-parameters through CV. >>> >>> You can also try to use supervised classification algorithms on `not too >>> highly unbalanced' sub-samples. >>> >>> Nicolas >>> >>> On Thu, Aug 4, 2016 at 5:17 PM, Amita Misra wrote: >>> >>>> Hi, >>>> >>>> I am currently exploring the problem of speed bump detection using >>>> accelerometer time series data. >>>> I have extracted some features based on mean, std deviation etc within >>>> a time window. >>>> >>>> Since the dataset is highly skewed ( I have just 5 positive samples >>>> for every > 300 samples) >>>> I was looking into >>>> >>>> One ClassSVM >>>> covariance.EllipticEnvelope >>>> sklearn.ensemble.IsolationForest >>>> >>>> but I am not sure how to use them. 
>>>> >>>> What I get from docs >>>> separate the positive examples and train using only negative examples >>>> >>>> clf.fit(X_train) >>>> >>>> and then >>>> predict the positive examples using >>>> clf.predict(X_test) >>>> >>>> >>>> I am not sure what is then the role of positive examples in my training >>>> dataset or how can I use them to improve my classifier so that I can >>>> predict better on new samples. >>>> >>>> >>>> Can we do something like Cross validation to learn the parameters as in >>>> normal binary SVM classification >>>> >>>> Thanks,? >>>> Amita >>>> >>>> Amita Misra >>>> Graduate Student Researcher >>>> Natural Language and Dialogue Systems Lab >>>> Baskin School of Engineering >>>> University of California Santa Cruz >>>> >>>> >>>> >>>> >>>> >>>> -- >>>> Amita Misra >>>> Graduate Student Researcher >>>> Natural Language and Dialogue Systems Lab >>>> Baskin School of Engineering >>>> University of California Santa Cruz >>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> >> -- >> Amita Misra >> Graduate Student Researcher >> Natural Language and Dialogue Systems Lab >> Baskin School of Engineering >> University of California Santa Cruz >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Amita Misra Graduate Student Researcher Natural Language and Dialogue Systems Lab Baskin School of Engineering University of California Santa Cruz -------------- next part -------------- An HTML attachment was scrubbed... URL: From goix.nicolas at gmail.com Thu Aug 4 21:12:40 2016 From: goix.nicolas at gmail.com (Nicolas Goix) Date: Thu, 4 Aug 2016 21:12:40 -0400 Subject: [scikit-learn] Supervised anomaly detection in time series In-Reply-To: References: Message-ID: There are different ways of aggregating estimators. A possibility can be to take the majority vote, or averaging decision functions. On Aug 4, 2016 8:44 PM, "Amita Misra" wrote: > If I train multiple algorithms on different subsamples, then how do I get > the final classifier that predicts unseen data? > > > I have very few positive samples since it is speed bump detection and we > have very few speed bumps in a drive. > However, I think that unseen new data would be quite similar to what I > have in training data hence if I can correctly learn a classifier for these > 5, I hope it should work well for unseen speed bumps. > > Thanks, > Amita > > On Thu, Aug 4, 2016 at 5:23 PM, Nicolas Goix > wrote: > >> You can evaluate the accuracy of your hyper-parameters on a few samples. >> Just don't use the accuracy as your performance measure. >> >> For supervised classification, training multiple algorithms on small >> balanced subsamples usually works well, but 5 anomalies seems indeed to be >> very little. >> >> Nicolas >> >> On Aug 4, 2016 7:51 PM, "Amita Misra" wrote: >> >>> SubSample would remove a lot of information from the negative class. 
>>> I have more than 500 samples of negative class and just 5 samples of >>> positive class. >>> >>> Amita >>> >>> On Thu, Aug 4, 2016 at 4:43 PM, Nicolas Goix >>> wrote: >>> >>>> Hi, >>>> >>>> Yes you can use your labeled data (you will need to sub-sample your >>>> normal class to have similar proportion normal-abnormal) to learn your >>>> hyper-parameters through CV. >>>> >>>> You can also try to use supervised classification algorithms on `not >>>> too highly unbalanced' sub-samples. >>>> >>>> Nicolas >>>> >>>> On Thu, Aug 4, 2016 at 5:17 PM, Amita Misra wrote: >>>> >>>>> Hi, >>>>> >>>>> I am currently exploring the problem of speed bump detection using >>>>> accelerometer time series data. >>>>> I have extracted some features based on mean, std deviation etc >>>>> within a time window. >>>>> >>>>> Since the dataset is highly skewed ( I have just 5 positive samples >>>>> for every > 300 samples) >>>>> I was looking into >>>>> >>>>> One ClassSVM >>>>> covariance.EllipticEnvelope >>>>> sklearn.ensemble.IsolationForest >>>>> >>>>> but I am not sure how to use them. >>>>> >>>>> What I get from docs >>>>> separate the positive examples and train using only negative examples >>>>> >>>>> clf.fit(X_train) >>>>> >>>>> and then >>>>> predict the positive examples using >>>>> clf.predict(X_test) >>>>> >>>>> >>>>> I am not sure what is then the role of positive examples in my >>>>> training dataset or how can I use them to improve my classifier so that I >>>>> can predict better on new samples. >>>>> >>>>> >>>>> Can we do something like Cross validation to learn the parameters as >>>>> in normal binary SVM classification >>>>> >>>>> Thanks,? >>>>> Amita >>>>> >>>>> Amita Misra >>>>> Graduate Student Researcher >>>>> Natural Language and Dialogue Systems Lab >>>>> Baskin School of Engineering >>>>> University of California Santa Cruz >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> -- >>>>> Amita Misra >>>>> Graduate Student Researcher >>>>> Natural Language and Dialogue Systems Lab >>>>> Baskin School of Engineering >>>>> University of California Santa Cruz >>>>> >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> >>> -- >>> Amita Misra >>> Graduate Student Researcher >>> Natural Language and Dialogue Systems Lab >>> Baskin School of Engineering >>> University of California Santa Cruz >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > -- > Amita Misra > Graduate Student Researcher > Natural Language and Dialogue Systems Lab > Baskin School of Engineering > University of California Santa Cruz > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... 
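The two approaches discussed in this thread can be sketched in a few lines. This is only an illustration, not code from the thread: OneClassSVM is one of the estimators Amita lists, the SVC-on-balanced-subsamples ensemble is one possible reading of Nicolas's suggestion, and X_train, X_test, y_train, X_normal and all estimator parameters (nu, gamma, the number and size of subsamples) are placeholders that would need tuning by cross-validation.

    import numpy as np
    from sklearn.svm import OneClassSVM, SVC

    # (a) Novelty detection: fit on normal windows only, then flag anomalies.
    oc = OneClassSVM(nu=0.05, kernel='rbf', gamma='auto')
    oc.fit(X_normal)                        # feature windows with no speed bump
    labels = oc.predict(X_test)             # +1 = normal, -1 = candidate bump
    scores = oc.decision_function(X_test)   # lower = more anomalous

    # (b) Supervised ensemble on balanced subsamples: keep the few positives in
    # every subsample, draw a fresh small set of negatives each time, then
    # aggregate by averaging decision functions (majority vote also works).
    rng = np.random.RandomState(0)
    pos = np.where(y_train == 1)[0]
    neg = np.where(y_train == 0)[0]
    clfs = []
    for _ in range(25):
        sub = rng.choice(neg, size=5 * len(pos), replace=False)
        idx = np.concatenate([pos, sub])
        clfs.append(SVC(kernel='rbf', gamma='auto').fit(X_train[idx], y_train[idx]))
    avg_score = np.mean([c.decision_function(X_test) for c in clfs], axis=0)
    y_pred = (avg_score > 0).astype(int)

The threshold on the averaged score (0 here) can itself be chosen on held-out data rather than taken at face value.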
URL: From Dale.T.Smith at macys.com Fri Aug 5 08:26:01 2016 From: Dale.T.Smith at macys.com (Dale T Smith) Date: Fri, 5 Aug 2016 12:26:01 +0000 Subject: [scikit-learn] Supervised anomaly detection in time series In-Reply-To: References: Message-ID: I don?t think you should treat this as an outlier detection problem. Why not try it as a classification problem? The dataset is highly unbalanced. Try http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html Use sample_weight to tell the fit method about the class imbalance. But be sure to read up about unbalanced classification and the class_weight parameter to ExtraTreesClassifier. You cannot use the accuracy to find the best model, so read up on model validation in the sklearn User?s Guide. And when you do cross-validation to get the best hyperparameters, be sure you pass the sample weights as well. Time series data is a bit different to use with cross-validation. You may want to add features such as minutes since midnight, day of week, weekday/weekend. And make sure your cross-validation folds respect the time series nature of the problem. http://stackoverflow.com/questions/37583263/scikit-learn-cross-validation-custom-splits-for-time-series-data __________________________________________________________________________________________ Dale Smith | Macy's Systems and Technology | IFS eCommerce | Data Science and Capacity Planning | 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com From: scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=macys.com at python.org] On Behalf Of Nicolas Goix Sent: Thursday, August 4, 2016 9:13 PM To: Scikit-learn user and developer mailing list Subject: Re: [scikit-learn] Supervised anomaly detection in time series ? EXT MSG: There are different ways of aggregating estimators. A possibility can be to take the majority vote, or averaging decision functions. On Aug 4, 2016 8:44 PM, "Amita Misra" > wrote: If I train multiple algorithms on different subsamples, then how do I get the final classifier that predicts unseen data? I have very few positive samples since it is speed bump detection and we have very few speed bumps in a drive. However, I think that unseen new data would be quite similar to what I have in training data hence if I can correctly learn a classifier for these 5, I hope it should work well for unseen speed bumps. Thanks, Amita On Thu, Aug 4, 2016 at 5:23 PM, Nicolas Goix > wrote: You can evaluate the accuracy of your hyper-parameters on a few samples. Just don't use the accuracy as your performance measure. For supervised classification, training multiple algorithms on small balanced subsamples usually works well, but 5 anomalies seems indeed to be very little. Nicolas On Aug 4, 2016 7:51 PM, "Amita Misra" > wrote: SubSample would remove a lot of information from the negative class. I have more than 500 samples of negative class and just 5 samples of positive class. Amita On Thu, Aug 4, 2016 at 4:43 PM, Nicolas Goix > wrote: Hi, Yes you can use your labeled data (you will need to sub-sample your normal class to have similar proportion normal-abnormal) to learn your hyper-parameters through CV. You can also try to use supervised classification algorithms on `not too highly unbalanced' sub-samples. Nicolas On Thu, Aug 4, 2016 at 5:17 PM, Amita Misra > wrote: Hi, I am currently exploring the problem of speed bump detection using accelerometer time series data. 
I have extracted some features based on mean, std deviation etc within a time window. Since the dataset is highly skewed ( I have just 5 positive samples for every > 300 samples) I was looking into One ClassSVM covariance.EllipticEnvelope sklearn.ensemble.IsolationForest but I am not sure how to use them. What I get from docs separate the positive examples and train using only negative examples clf.fit(X_train) and then predict the positive examples using clf.predict(X_test) I am not sure what is then the role of positive examples in my training dataset or how can I use them to improve my classifier so that I can predict better on new samples. Can we do something like Cross validation to learn the parameters as in normal binary SVM classification Thanks,? Amita Amita Misra Graduate Student Researcher Natural Language and Dialogue Systems Lab Baskin School of Engineering University of California Santa Cruz -- Amita Misra Graduate Student Researcher Natural Language and Dialogue Systems Lab Baskin School of Engineering University of California Santa Cruz _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -- Amita Misra Graduate Student Researcher Natural Language and Dialogue Systems Lab Baskin School of Engineering University of California Santa Cruz _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -- Amita Misra Graduate Student Researcher Natural Language and Dialogue Systems Lab Baskin School of Engineering University of California Santa Cruz _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn * This is an EXTERNAL EMAIL. Stop and think before clicking a link or opening attachments. -------------- next part -------------- An HTML attachment was scrubbed... URL: From pedropazzini at gmail.com Fri Aug 5 09:32:52 2016 From: pedropazzini at gmail.com (Pedro Pazzini) Date: Fri, 5 Aug 2016 10:32:52 -0300 Subject: [scikit-learn] Supervised anomaly detection in time series In-Reply-To: References: Message-ID: Just to add a few things to the discussion: 1. For unbalanced problems, as far as I know, one of the best scores to evaluate a classifier is the Area Under the ROC curve: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html. For that you will have to use clf.predict_proba(X_test) instead of clf.predict(X_test). I think that using the 'sample_weight' parameter as Smith said is a promising choice. 2. Usually is recommend the normalization of each time series for comparing them. The Z-score normalization is one of the most used [Ref: http://wan.poly.edu/KDD2012/docs/p262.pdf]. 3. There are some interesting dissimilarity measures such as DTW (Dynamic Time Warping), CID (Complex Invariant Distance), and others for comparing time series[Ref: https://www.icmc.usp.br/~gbatista/files/bracis2013_1.pdf]. 
And there are also other approaches for comparing time series in the frequency domain such as FFT and DWT [Ref: http://infolab.usc.edu/csci599/Fall2003/Time%20Series/Efficient%20Similarity%20Search%20In%20Sequence%20Databases.pdf ]. I hope it helps. 2016-08-05 9:26 GMT-03:00 Dale T Smith : > I don?t think you should treat this as an outlier detection problem. Why > not try it as a classification problem? The dataset is highly unbalanced. > Try > > > > http://scikit-learn.org/stable/modules/generated/sklearn.ensemble. > ExtraTreesClassifier.html > > > > Use sample_weight to tell the fit method about the class imbalance. But be > sure to read up about unbalanced classification and the class_weight > parameter to ExtraTreesClassifier. You cannot use the accuracy to find the > best model, so read up on model validation in the sklearn User?s Guide. And > when you do cross-validation to get the best hyperparameters, be sure you > pass the sample weights as well. > > > > Time series data is a bit different to use with cross-validation. You may > want to add features such as minutes since midnight, day of week, > weekday/weekend. And make sure your cross-validation folds respect the time > series nature of the problem. > > > > http://stackoverflow.com/questions/37583263/scikit- > learn-cross-validation-custom-splits-for-time-series-data > > > > > > ____________________________________________________________ > ______________________________ > *Dale Smith* | Macy's Systems and Technology | IFS eCommerce | Data > Science and Capacity Planning > | 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com > > > > *From:* scikit-learn [mailto:scikit-learn-bounces+dale.t.smith= > macys.com at python.org] *On Behalf Of *Nicolas Goix > *Sent:* Thursday, August 4, 2016 9:13 PM > *To:* Scikit-learn user and developer mailing list > *Subject:* Re: [scikit-learn] Supervised anomaly detection in time series > > > > ? EXT MSG: > > There are different ways of aggregating estimators. A possibility can be > to take the majority vote, or averaging decision functions. > > > > On Aug 4, 2016 8:44 PM, "Amita Misra" wrote: > > If I train multiple algorithms on different subsamples, then how do I get > the final classifier that predicts unseen data? > > I have very few positive samples since it is speed bump detection and we > have very few speed bumps in a drive. > However, I think that unseen new data would be quite similar to what I > have in training data hence if I can correctly learn a classifier for these > 5, I hope it should work well for unseen speed bumps. > > Thanks, > Amita > > > > On Thu, Aug 4, 2016 at 5:23 PM, Nicolas Goix > wrote: > > You can evaluate the accuracy of your hyper-parameters on a few samples. > Just don't use the accuracy as your performance measure. > > For supervised classification, training multiple algorithms on small > balanced subsamples usually works well, but 5 anomalies seems indeed to be > very little. > > Nicolas > > > > On Aug 4, 2016 7:51 PM, "Amita Misra" wrote: > > SubSample would remove a lot of information from the negative class. > > I have more than 500 samples of negative class and just 5 samples of > positive class. > > Amita > > > > On Thu, Aug 4, 2016 at 4:43 PM, Nicolas Goix > wrote: > > Hi, > > > > Yes you can use your labeled data (you will need to sub-sample your normal > class to have similar proportion normal-abnormal) to learn your > hyper-parameters through CV. 
> > > > You can also try to use supervised classification algorithms on `not too > highly unbalanced' sub-samples. > > > > Nicolas > > > > On Thu, Aug 4, 2016 at 5:17 PM, Amita Misra wrote: > > Hi, > > > > I am currently exploring the problem of speed bump detection using > accelerometer time series data. > > I have extracted some features based on mean, std deviation etc within a > time window. > > Since the dataset is highly skewed ( I have just 5 positive samples for > every > 300 samples) > > I was looking into > > One ClassSVM > covariance.EllipticEnvelope > sklearn.ensemble.IsolationForest > > but I am not sure how to use them. > > What I get from docs > > separate the positive examples and train using only negative examples > > clf.fit(X_train) > > and then > predict the positive examples using > clf.predict(X_test) > > > I am not sure what is then the role of positive examples in my training > dataset or how can I use them to improve my classifier so that I can > predict better on new samples. > > Can we do something like Cross validation to learn the parameters as in > normal binary SVM classification > > > > Thanks,? > > Amita > > > > Amita Misra > > Graduate Student Researcher > > Natural Language and Dialogue Systems Lab > > Baskin School of Engineering > > University of California Santa Cruz > > > > > > > > > -- > > Amita Misra > > Graduate Student Researcher > > Natural Language and Dialogue Systems Lab > > Baskin School of Engineering > > University of California Santa Cruz > > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > -- > > Amita Misra > > Graduate Student Researcher > > Natural Language and Dialogue Systems Lab > > Baskin School of Engineering > > University of California Santa Cruz > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > -- > > Amita Misra > > Graduate Student Researcher > > Natural Language and Dialogue Systems Lab > > Baskin School of Engineering > > University of California Santa Cruz > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > * This is an EXTERNAL EMAIL. Stop and think before clicking a link or > opening attachments. > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... 
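As a concrete illustration of the scoring advice above, the sketch below combines the earlier ExtraTreesClassifier suggestion with an AUC computed on predicted probabilities and a per-window z-score normalization. It is only a sketch: X_raw_train and X_raw_test are assumed to hold one raw accelerometer window per row (z-scoring applies to the raw series, not to already-extracted summary features), the labels are assumed to be 0/1, and n_estimators and the class_weight setting are illustrative.

    import numpy as np
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.metrics import roc_auc_score

    def zscore_rows(X):
        # Normalize each window so offsets and scales do not dominate.
        mu = X.mean(axis=1, keepdims=True)
        sd = X.std(axis=1, keepdims=True) + 1e-8
        return (X - mu) / sd

    Xtr, Xte = zscore_rows(X_raw_train), zscore_rows(X_raw_test)

    clf = ExtraTreesClassifier(n_estimators=200, class_weight='balanced',
                               random_state=0)
    clf.fit(Xtr, y_train)

    # Rank-based evaluation: score probabilities, not hard 0/1 predictions.
    proba = clf.predict_proba(Xte)[:, 1]
    print(roc_auc_score(y_test, proba))

With only a handful of positives the test AUC will be noisy, so it is worth averaging it over several cross-validation splits rather than reporting a single hold-out number.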
URL: From Dale.T.Smith at macys.com Fri Aug 5 10:09:11 2016 From: Dale.T.Smith at macys.com (Dale T Smith) Date: Fri, 5 Aug 2016 14:09:11 +0000 Subject: [scikit-learn] Supervised anomaly detection in time series In-Reply-To: References: Message-ID: To analyze unbalanced classifiers, use from sklearn.metrics import classification_report __________________________________________________________________________________________ Dale Smith | Macy's Systems and Technology | IFS eCommerce | Data Science and Capacity Planning | 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com From: scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=macys.com at python.org] On Behalf Of Pedro Pazzini Sent: Friday, August 5, 2016 9:33 AM To: Scikit-learn user and developer mailing list Subject: Re: [scikit-learn] Supervised anomaly detection in time series ? EXT MSG: Just to add a few things to the discussion: 1. For unbalanced problems, as far as I know, one of the best scores to evaluate a classifier is the Area Under the ROC curve: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html. For that you will have to use clf.predict_proba(X_test) instead of clf.predict(X_test). I think that using the 'sample_weight' parameter as Smith said is a promising choice. 2. Usually is recommend the normalization of each time series for comparing them. The Z-score normalization is one of the most used [Ref: http://wan.poly.edu/KDD2012/docs/p262.pdf]. 3. There are some interesting dissimilarity measures such as DTW (Dynamic Time Warping), CID (Complex Invariant Distance), and others for comparing time series[Ref: https://www.icmc.usp.br/~gbatista/files/bracis2013_1.pdf]. And there are also other approaches for comparing time series in the frequency domain such as FFT and DWT [Ref: http://infolab.usc.edu/csci599/Fall2003/Time%20Series/Efficient%20Similarity%20Search%20In%20Sequence%20Databases.pdf]. I hope it helps. 2016-08-05 9:26 GMT-03:00 Dale T Smith >: I don?t think you should treat this as an outlier detection problem. Why not try it as a classification problem? The dataset is highly unbalanced. Try http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html Use sample_weight to tell the fit method about the class imbalance. But be sure to read up about unbalanced classification and the class_weight parameter to ExtraTreesClassifier. You cannot use the accuracy to find the best model, so read up on model validation in the sklearn User?s Guide. And when you do cross-validation to get the best hyperparameters, be sure you pass the sample weights as well. Time series data is a bit different to use with cross-validation. You may want to add features such as minutes since midnight, day of week, weekday/weekend. And make sure your cross-validation folds respect the time series nature of the problem. http://stackoverflow.com/questions/37583263/scikit-learn-cross-validation-custom-splits-for-time-series-data __________________________________________________________________________________________ Dale Smith | Macy's Systems and Technology | IFS eCommerce | Data Science and Capacity Planning | 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com From: scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=macys.com at python.org] On Behalf Of Nicolas Goix Sent: Thursday, August 4, 2016 9:13 PM To: Scikit-learn user and developer mailing list Subject: Re: [scikit-learn] Supervised anomaly detection in time series ? 
EXT MSG: There are different ways of aggregating estimators. A possibility can be to take the majority vote, or averaging decision functions. On Aug 4, 2016 8:44 PM, "Amita Misra" > wrote: If I train multiple algorithms on different subsamples, then how do I get the final classifier that predicts unseen data? I have very few positive samples since it is speed bump detection and we have very few speed bumps in a drive. However, I think that unseen new data would be quite similar to what I have in training data hence if I can correctly learn a classifier for these 5, I hope it should work well for unseen speed bumps. Thanks, Amita On Thu, Aug 4, 2016 at 5:23 PM, Nicolas Goix > wrote: You can evaluate the accuracy of your hyper-parameters on a few samples. Just don't use the accuracy as your performance measure. For supervised classification, training multiple algorithms on small balanced subsamples usually works well, but 5 anomalies seems indeed to be very little. Nicolas On Aug 4, 2016 7:51 PM, "Amita Misra" > wrote: SubSample would remove a lot of information from the negative class. I have more than 500 samples of negative class and just 5 samples of positive class. Amita On Thu, Aug 4, 2016 at 4:43 PM, Nicolas Goix > wrote: Hi, Yes you can use your labeled data (you will need to sub-sample your normal class to have similar proportion normal-abnormal) to learn your hyper-parameters through CV. You can also try to use supervised classification algorithms on `not too highly unbalanced' sub-samples. Nicolas On Thu, Aug 4, 2016 at 5:17 PM, Amita Misra > wrote: Hi, I am currently exploring the problem of speed bump detection using accelerometer time series data. I have extracted some features based on mean, std deviation etc within a time window. Since the dataset is highly skewed ( I have just 5 positive samples for every > 300 samples) I was looking into One ClassSVM covariance.EllipticEnvelope sklearn.ensemble.IsolationForest but I am not sure how to use them. What I get from docs separate the positive examples and train using only negative examples clf.fit(X_train) and then predict the positive examples using clf.predict(X_test) I am not sure what is then the role of positive examples in my training dataset or how can I use them to improve my classifier so that I can predict better on new samples. Can we do something like Cross validation to learn the parameters as in normal binary SVM classification Thanks,? 
Amita Amita Misra Graduate Student Researcher Natural Language and Dialogue Systems Lab Baskin School of Engineering University of California Santa Cruz -- Amita Misra Graduate Student Researcher Natural Language and Dialogue Systems Lab Baskin School of Engineering University of California Santa Cruz _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -- Amita Misra Graduate Student Researcher Natural Language and Dialogue Systems Lab Baskin School of Engineering University of California Santa Cruz _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -- Amita Misra Graduate Student Researcher Natural Language and Dialogue Systems Lab Baskin School of Engineering University of California Santa Cruz _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn * This is an EXTERNAL EMAIL. Stop and think before clicking a link or opening attachments. _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn * This is an EXTERNAL EMAIL. Stop and think before clicking a link or opening attachments. -------------- next part -------------- An HTML attachment was scrubbed... URL: From qingkai.kong at gmail.com Fri Aug 5 14:05:27 2016 From: qingkai.kong at gmail.com (Qingkai Kong) Date: Fri, 5 Aug 2016 11:05:27 -0700 Subject: [scikit-learn] Supervised anomaly detection in time series In-Reply-To: References: Message-ID: I also worked on something similar, instead of using some algorithms deal with unbalanced data, you can also try to create a balanced dataset either using oversampling or downsampling. scikit-learn-contrib has already had a project dealing with unbalanced data: https://github.com/scikit-learn-contrib/imbalanced-learn. Either you treat it as a classification problem or anomaly detection problem (I prefer to treat it as a classification problem first) you all need to find a better set of features in time domain or frequency domain. On Fri, Aug 5, 2016 at 7:09 AM, Dale T Smith wrote: > To analyze unbalanced classifiers, use > > > > from sklearn.metrics import classification_report > > > > > > ____________________________________________________________ > ______________________________ > *Dale Smith* | Macy's Systems and Technology | IFS eCommerce | Data > Science and Capacity Planning > | 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com > > > > *From:* scikit-learn [mailto:scikit-learn-bounces+dale.t.smith= > macys.com at python.org] *On Behalf Of *Pedro Pazzini > *Sent:* Friday, August 5, 2016 9:33 AM > > *To:* Scikit-learn user and developer mailing list > *Subject:* Re: [scikit-learn] Supervised anomaly detection in time series > > > > ? EXT MSG: > > Just to add a few things to the discussion: > > 1. 
For unbalanced problems, as far as I know, one of the best scores > to evaluate a classifier is the Area Under the ROC curve: > http://scikit-learn.org/stable/modules/generated/ > sklearn.metrics.roc_auc_score.html > . > For that you will have to use clf.predict_proba(X_test) instead of > clf.predict(X_test). I think that using the 'sample_weight' parameter as > Smith said is a promising choice. > 2. Usually is recommend the normalization of each time series for > comparing them. The Z-score normalization is one of the most used [Ref: > http://wan.poly.edu/KDD2012/docs/p262.pdf > ]. > 3. There are some interesting dissimilarity measures such as DTW > (Dynamic Time Warping), CID (Complex Invariant Distance), and others for > comparing time series[Ref: https://www.icmc.usp.br/~ > gbatista/files/bracis2013_1.pdf > ]. And there > are also other approaches for comparing time series in the frequency domain > such as FFT and DWT [Ref: http://infolab.usc.edu/csci599/Fall2003/Time% > 20Series/Efficient%20Similarity%20Search%20In% > 20Sequence%20Databases.pdf > > ]. > > I hope it helps. > > > > 2016-08-05 9:26 GMT-03:00 Dale T Smith : > > I don?t think you should treat this as an outlier detection problem. Why > not try it as a classification problem? The dataset is highly unbalanced. > Try > > > > http://scikit-learn.org/stable/modules/generated/sklearn.ensemble. > ExtraTreesClassifier.html > > > > Use sample_weight to tell the fit method about the class imbalance. But be > sure to read up about unbalanced classification and the class_weight > parameter to ExtraTreesClassifier. You cannot use the accuracy to find the > best model, so read up on model validation in the sklearn User?s Guide. And > when you do cross-validation to get the best hyperparameters, be sure you > pass the sample weights as well. > > > > Time series data is a bit different to use with cross-validation. You may > want to add features such as minutes since midnight, day of week, > weekday/weekend. And make sure your cross-validation folds respect the time > series nature of the problem. > > > > http://stackoverflow.com/questions/37583263/scikit- > learn-cross-validation-custom-splits-for-time-series-data > > > > > > ____________________________________________________________ > ______________________________ > *Dale Smith* | Macy's Systems and Technology | IFS eCommerce | Data > Science and Capacity Planning > | 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com > > > > *From:* scikit-learn [mailto:scikit-learn-bounces+dale.t.smith= > macys.com at python.org] *On Behalf Of *Nicolas Goix > *Sent:* Thursday, August 4, 2016 9:13 PM > *To:* Scikit-learn user and developer mailing list > *Subject:* Re: [scikit-learn] Supervised anomaly detection in time series > > > > ? EXT MSG: > > There are different ways of aggregating estimators. A possibility can be > to take the majority vote, or averaging decision functions. > > > > On Aug 4, 2016 8:44 PM, "Amita Misra" wrote: > > If I train multiple algorithms on different subsamples, then how do I get > the final classifier that predicts unseen data? > > I have very few positive samples since it is speed bump detection and we > have very few speed bumps in a drive. > However, I think that unseen new data would be quite similar to what I > have in training data hence if I can correctly learn a classifier for these > 5, I hope it should work well for unseen speed bumps. 
> > Thanks, > Amita > > > > On Thu, Aug 4, 2016 at 5:23 PM, Nicolas Goix > wrote: > > You can evaluate the accuracy of your hyper-parameters on a few samples. > Just don't use the accuracy as your performance measure. > > For supervised classification, training multiple algorithms on small > balanced subsamples usually works well, but 5 anomalies seems indeed to be > very little. > > Nicolas > > > > On Aug 4, 2016 7:51 PM, "Amita Misra" wrote: > > SubSample would remove a lot of information from the negative class. > > I have more than 500 samples of negative class and just 5 samples of > positive class. > > Amita > > > > On Thu, Aug 4, 2016 at 4:43 PM, Nicolas Goix > wrote: > > Hi, > > > > Yes you can use your labeled data (you will need to sub-sample your normal > class to have similar proportion normal-abnormal) to learn your > hyper-parameters through CV. > > > > You can also try to use supervised classification algorithms on `not too > highly unbalanced' sub-samples. > > > > Nicolas > > > > On Thu, Aug 4, 2016 at 5:17 PM, Amita Misra wrote: > > Hi, > > > > I am currently exploring the problem of speed bump detection using > accelerometer time series data. > > I have extracted some features based on mean, std deviation etc within a > time window. > > Since the dataset is highly skewed ( I have just 5 positive samples for > every > 300 samples) > > I was looking into > > One ClassSVM > covariance.EllipticEnvelope > sklearn.ensemble.IsolationForest > > but I am not sure how to use them. > > What I get from docs > > separate the positive examples and train using only negative examples > > clf.fit(X_train) > > and then > predict the positive examples using > clf.predict(X_test) > > > I am not sure what is then the role of positive examples in my training > dataset or how can I use them to improve my classifier so that I can > predict better on new samples. > > Can we do something like Cross validation to learn the parameters as in > normal binary SVM classification > > > > Thanks,? 
> > Amita > > > > Amita Misra > > Graduate Student Researcher > > Natural Language and Dialogue Systems Lab > > Baskin School of Engineering > > University of California Santa Cruz > > > > > > > > > -- > > Amita Misra > > Graduate Student Researcher > > Natural Language and Dialogue Systems Lab > > Baskin School of Engineering > > University of California Santa Cruz > > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > -- > > Amita Misra > > Graduate Student Researcher > > Natural Language and Dialogue Systems Lab > > Baskin School of Engineering > > University of California Santa Cruz > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > -- > > Amita Misra > > Graduate Student Researcher > > Natural Language and Dialogue Systems Lab > > Baskin School of Engineering > > University of California Santa Cruz > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > * This is an EXTERNAL EMAIL. Stop and think before clicking a link or > opening attachments. > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > * This is an EXTERNAL EMAIL. Stop and think before clicking a link or > opening attachments. > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Qingkai KONG Ph.D Candidate Seismological Lab 289 McCone Hall University of California, Berkeley http://seismo.berkeley.edu/qingkaikong -------------- next part -------------- An HTML attachment was scrubbed... URL: From jgabor.astro at gmail.com Fri Aug 5 14:55:30 2016 From: jgabor.astro at gmail.com (Jared Gabor) Date: Fri, 5 Aug 2016 11:55:30 -0700 Subject: [scikit-learn] Supervised anomaly detection in time series In-Reply-To: References: Message-ID: Lots of great suggestions on how to model your problem. But this might be the kind of problem where you seriously ask how hard it would be to gather more data. On Thu, Aug 4, 2016 at 2:17 PM, Amita Misra wrote: > Hi, > > I am currently exploring the problem of speed bump detection using > accelerometer time series data. > I have extracted some features based on mean, std deviation etc within a > time window. > > Since the dataset is highly skewed ( I have just 5 positive samples for > every > 300 samples) > I was looking into > > One ClassSVM > covariance.EllipticEnvelope > sklearn.ensemble.IsolationForest > > but I am not sure how to use them. 
> > What I get from docs > separate the positive examples and train using only negative examples > > clf.fit(X_train) > > and then > predict the positive examples using > clf.predict(X_test) > > > I am not sure what is then the role of positive examples in my training > dataset or how can I use them to improve my classifier so that I can > predict better on new samples. > > > Can we do something like Cross validation to learn the parameters as in > normal binary SVM classification > > Thanks,? > Amita > > Amita Misra > Graduate Student Researcher > Natural Language and Dialogue Systems Lab > Baskin School of Engineering > University of California Santa Cruz > > > > > > -- > Amita Misra > Graduate Student Researcher > Natural Language and Dialogue Systems Lab > Baskin School of Engineering > University of California Santa Cruz > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From amisra2 at ucsc.edu Fri Aug 5 15:07:54 2016 From: amisra2 at ucsc.edu (Amita Misra) Date: Fri, 5 Aug 2016 12:07:54 -0700 Subject: [scikit-learn] Supervised anomaly detection in time series In-Reply-To: References: Message-ID: Thanks everyone for the suggestions. Actually we thought of gathering more data but the point is we do not have many speed bumps in our driving area. If we drive over the same speed bump again and again it may not add anything really novel to the data. I think a combination of oversampling and sample_weight along with ROC may be a good start for me. Thanks, Amita On Fri, Aug 5, 2016 at 11:55 AM, Jared Gabor wrote: > Lots of great suggestions on how to model your problem. But this might be > the kind of problem where you seriously ask how hard it would be to gather > more data. > > On Thu, Aug 4, 2016 at 2:17 PM, Amita Misra wrote: > >> Hi, >> >> I am currently exploring the problem of speed bump detection using >> accelerometer time series data. >> I have extracted some features based on mean, std deviation etc within a >> time window. >> >> Since the dataset is highly skewed ( I have just 5 positive samples for >> every > 300 samples) >> I was looking into >> >> One ClassSVM >> covariance.EllipticEnvelope >> sklearn.ensemble.IsolationForest >> >> but I am not sure how to use them. >> >> What I get from docs >> separate the positive examples and train using only negative examples >> >> clf.fit(X_train) >> >> and then >> predict the positive examples using >> clf.predict(X_test) >> >> >> I am not sure what is then the role of positive examples in my training >> dataset or how can I use them to improve my classifier so that I can >> predict better on new samples. >> >> >> Can we do something like Cross validation to learn the parameters as in >> normal binary SVM classification >> >> Thanks,? 
>> Amita >> >> Amita Misra >> Graduate Student Researcher >> Natural Language and Dialogue Systems Lab >> Baskin School of Engineering >> University of California Santa Cruz >> >> >> >> >> >> -- >> Amita Misra >> Graduate Student Researcher >> Natural Language and Dialogue Systems Lab >> Baskin School of Engineering >> University of California Santa Cruz >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Amita Misra Graduate Student Researcher Natural Language and Dialogue Systems Lab Baskin School of Engineering University of California Santa Cruz -------------- next part -------------- An HTML attachment was scrubbed... URL: From mail at sebastianraschka.com Fri Aug 5 15:26:59 2016 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Fri, 5 Aug 2016 15:26:59 -0400 Subject: [scikit-learn] Supervised anomaly detection in time series In-Reply-To: References: Message-ID: <56374299-8CFE-4430-BD1A-CE93F836211D@sebastianraschka.com> > But this might be the kind of problem where you seriously ask how hard it would be to gather more data. Yeah, I agree, but this scenario is then typical in a sense of that it is an anomaly detection problem rather than a classification problem. I.e., you don?t have enough positive labels to fit the model and thus you need to do unsupervised learning to learn from the negative class only. Sure, supervised learning could work well, but I would also explore unsupervised learning here and see how that works for you; maybe one-class SVM as suggested or EM algorithm based mixture models (http://scikit-learn.org/stable/modules/mixture.html) Best, Sebastian > On Aug 5, 2016, at 2:55 PM, Jared Gabor wrote: > > Lots of great suggestions on how to model your problem. But this might be the kind of problem where you seriously ask how hard it would be to gather more data. > > On Thu, Aug 4, 2016 at 2:17 PM, Amita Misra wrote: > Hi, > > I am currently exploring the problem of speed bump detection using accelerometer time series data. > I have extracted some features based on mean, std deviation etc within a time window. > > Since the dataset is highly skewed ( I have just 5 positive samples for every > 300 samples) > I was looking into > > One ClassSVM > covariance.EllipticEnvelope > sklearn.ensemble.IsolationForest > but I am not sure how to use them. > > What I get from docs > > separate the positive examples and train using only negative examples > clf.fit(X_train) > and then > predict the positive examples using > clf.predict(X_test) > > > I am not sure what is then the role of positive examples in my training dataset or how can I use them to improve my classifier so that I can predict better on new samples. > > > Can we do something like Cross validation to learn the parameters as in normal binary SVM classification > > Thanks,? 
> Amita > > Amita Misra > Graduate Student Researcher > Natural Language and Dialogue Systems Lab > Baskin School of Engineering > University of California Santa Cruz > > > > > > -- > Amita Misra > Graduate Student Researcher > Natural Language and Dialogue Systems Lab > Baskin School of Engineering > University of California Santa Cruz > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From albertthomas88 at gmail.com Fri Aug 5 19:40:30 2016 From: albertthomas88 at gmail.com (Albert Thomas) Date: Fri, 05 Aug 2016 23:40:30 +0000 Subject: [scikit-learn] Supervised anomaly detection in time series In-Reply-To: <56374299-8CFE-4430-BD1A-CE93F836211D@sebastianraschka.com> References: <56374299-8CFE-4430-BD1A-CE93F836211D@sebastianraschka.com> Message-ID: Hi, About your question on how to learn the parameters of anomaly detection algorithms using only the negative samples in your case, Nicolas and I worked on this aspect recently. If you are interested you can have look at: - Learning hyperparameters for unsupervised anomaly detection: https://drive.google.com/file/d/0B8Dg3PBX90KNUTg5NGNOVnFPX0hDNmJsSTcybzZMSHNPYkd3/view - How to evaluate the quality of unsupervised anomaly Detection algorithms?: https://drive.google.com/file/d/0B8Dg3PBX90KNenV3WjRkR09Bakx5YlNyMF9BUXVNem1hb0NR/view Best, Albert On Fri, Aug 5, 2016 at 9:34 PM Sebastian Raschka wrote: > > But this might be the kind of problem where you seriously ask how hard > it would be to gather more data. > > > Yeah, I agree, but this scenario is then typical in a sense of that it is > an anomaly detection problem rather than a classification problem. I.e., > you don?t have enough positive labels to fit the model and thus you need to > do unsupervised learning to learn from the negative class only. > > Sure, supervised learning could work well, but I would also explore > unsupervised learning here and see how that works for you; maybe one-class > SVM as suggested or EM algorithm based mixture models ( > http://scikit-learn.org/stable/modules/mixture.html) > > Best, > Sebastian > > > On Aug 5, 2016, at 2:55 PM, Jared Gabor wrote: > > > > Lots of great suggestions on how to model your problem. But this might > be the kind of problem where you seriously ask how hard it would be to > gather more data. > > > > On Thu, Aug 4, 2016 at 2:17 PM, Amita Misra wrote: > > Hi, > > > > I am currently exploring the problem of speed bump detection using > accelerometer time series data. > > I have extracted some features based on mean, std deviation etc within > a time window. > > > > Since the dataset is highly skewed ( I have just 5 positive samples for > every > 300 samples) > > I was looking into > > > > One ClassSVM > > covariance.EllipticEnvelope > > sklearn.ensemble.IsolationForest > > but I am not sure how to use them. > > > > What I get from docs > > > > separate the positive examples and train using only negative examples > > clf.fit(X_train) > > and then > > predict the positive examples using > > clf.predict(X_test) > > > > > > I am not sure what is then the role of positive examples in my training > dataset or how can I use them to improve my classifier so that I can > predict better on new samples. 
> > > > > > Can we do something like Cross validation to learn the parameters as in > normal binary SVM classification > > > > Thanks,? > > Amita > > > > Amita Misra > > Graduate Student Researcher > > Natural Language and Dialogue Systems Lab > > Baskin School of Engineering > > University of California Santa Cruz > > > > > > > > > > > > -- > > Amita Misra > > Graduate Student Researcher > > Natural Language and Dialogue Systems Lab > > Baskin School of Engineering > > University of California Santa Cruz > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ionescu.vlad1 at gmail.com Sun Aug 7 04:42:03 2016 From: ionescu.vlad1 at gmail.com (Vlad Ionescu) Date: Sun, 07 Aug 2016 08:42:03 +0000 Subject: [scikit-learn] Scaling model selection on a cluster Message-ID: Hello, I am interested in scaling grid searches on an HPC LSF cluster with about 60 nodes, each with 20 cores. I thought i could just set n_jobs=1000 then submit a job with bsub -n 1000, but then I dug deeper and understood that the underlying joblib used by scikit-learn will create all of those jobs on a single node, resulting in no performance benefits. So I am stuck using a single node. I've read a lengthy discussion some time ago about adding something like this in scikit-learn: https://sourceforge.net/p/scikit-learn/mailman/scikit-learn-general/thread/4F26C3CB.8070603 at ais.uni-bonn.de/ However, it hasn't materialized in any way, as far as I can tell. Do you know of any way to do this, or any modern cluster computing libraries for python that might help me write something myself (I found a lot, but it's hard to tell what's considered good or even still under development)? Also, are there still plans to implement this in scikit-learn? You seemed to like the idea back then. -------------- next part -------------- An HTML attachment was scrubbed... URL: From vaggi.federico at gmail.com Sun Aug 7 05:05:41 2016 From: vaggi.federico at gmail.com (federico vaggi) Date: Sun, 07 Aug 2016 09:05:41 +0000 Subject: [scikit-learn] Scaling model selection on a cluster In-Reply-To: References: Message-ID: This might be interesting to you: http://blaze.pydata.org/blog/2015/10/19/dask-learn/ On Sun, 7 Aug 2016 at 10:42 Vlad Ionescu wrote: > Hello, > > I am interested in scaling grid searches on an HPC LSF cluster with about > 60 nodes, each with 20 cores. I thought i could just set n_jobs=1000 then > submit a job with bsub -n 1000, but then I dug deeper and understood that > the underlying joblib used by scikit-learn will create all of those jobs on > a single node, resulting in no performance benefits. So I am stuck using a > single node. > > I've read a lengthy discussion some time ago about adding something like > this in scikit-learn: > https://sourceforge.net/p/scikit-learn/mailman/scikit-learn-general/thread/4F26C3CB.8070603 at ais.uni-bonn.de/ > > > However, it hasn't materialized in any way, as far as I can tell. 
> > Do you know of any way to do this, or any modern cluster computing > libraries for python that might help me write something myself (I found a > lot, but it's hard to tell what's considered good or even still under > development)? > > Also, are there still plans to implement this in scikit-learn? You seemed > to like the idea back then. > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ionescu.vlad1 at gmail.com Sun Aug 7 06:51:32 2016 From: ionescu.vlad1 at gmail.com (Vlad Ionescu) Date: Sun, 07 Aug 2016 10:51:32 +0000 Subject: [scikit-learn] Scaling model selection on a cluster In-Reply-To: References: Message-ID: Thanks, that looks interesting. I've looked into dask-learn's grid search ( https://github.com/mrocklin/dask-learn/blob/master/grid_search.py) but it seems not to make use of the n_jobs parameter. Will this work in a distributed fashion? The link you gave seemed to focus more on optimizing the grid search by eliminating duplicate work rather than by distributing it on more machines (I am actually using a random search, so I'm not sure those optimizations apply to my use case anyway). Dask itself seems like it might work, although it seems to require running manually on each node. Will look into it some more. On Sun, Aug 7, 2016 at 12:06 PM federico vaggi wrote: > This might be interesting to you: > > http://blaze.pydata.org/blog/2015/10/19/dask-learn/ > > > On Sun, 7 Aug 2016 at 10:42 Vlad Ionescu wrote: > >> Hello, >> >> I am interested in scaling grid searches on an HPC LSF cluster with about >> 60 nodes, each with 20 cores. I thought i could just set n_jobs=1000 then >> submit a job with bsub -n 1000, but then I dug deeper and understood that >> the underlying joblib used by scikit-learn will create all of those jobs on >> a single node, resulting in no performance benefits. So I am stuck using a >> single node. >> >> I've read a lengthy discussion some time ago about adding something like >> this in scikit-learn: >> https://sourceforge.net/p/scikit-learn/mailman/scikit-learn-general/thread/4F26C3CB.8070603 at ais.uni-bonn.de/ >> >> >> However, it hasn't materialized in any way, as far as I can tell. >> >> Do you know of any way to do this, or any modern cluster computing >> libraries for python that might help me write something myself (I found a >> lot, but it's hard to tell what's considered good or even still under >> development)? >> >> Also, are there still plans to implement this in scikit-learn? You seemed >> to like the idea back then. >> > _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ragvrv at gmail.com Sun Aug 7 08:39:55 2016 From: ragvrv at gmail.com (Raghav R V) Date: Sun, 7 Aug 2016 14:39:55 +0200 Subject: [scikit-learn] Disable Travis Cache Message-ID: Could someone disable the Travis cache once and for all please? I have seen several frustrating incidents where the Travis fails the PR because of this caching of old files. I also don't understand why it is enabled in the first place. 
It would really be super helpful if it is disabled for good. Also refer - https://github.com/scikit-learn/scikit-learn/issues/7094 **cc**: Olivier, Andy -------------- next part -------------- An HTML attachment was scrubbed... URL: From alexandre.gramfort at telecom-paristech.fr Sun Aug 7 09:01:15 2016 From: alexandre.gramfort at telecom-paristech.fr (Alexandre Gramfort) Date: Sun, 7 Aug 2016 15:01:15 +0200 Subject: [scikit-learn] Disable Travis Cache In-Reply-To: References: Message-ID: hi, I just flushed all the caches. HTH Alex On Sun, Aug 7, 2016 at 2:39 PM, Raghav R V wrote: > Could someone disable the Travis cache once and for all please? > > I have seen several frustrating incidents where the Travis fails the PR > because of this caching of old files. > > I also don't understand why it is enabled in the first place. It would > really be super helpful if it is disabled for good. > > Also refer - https://github.com/scikit-learn/scikit-learn/issues/7094 > > **cc**: Olivier, Andy > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > From gael.varoquaux at normalesup.org Sun Aug 7 13:30:43 2016 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Sun, 7 Aug 2016 19:30:43 +0200 Subject: [scikit-learn] Scaling model selection on a cluster In-Reply-To: References: Message-ID: <20160807173043.GI3335822@phare.normalesup.org> Parallel computing in scikit-learn is built upon on joblib. In the development version of scikit-learn, the included joblib can be extended with a distributed backend: http://distributed.readthedocs.io/en/latest/joblib.html that can distribute code on a cluster. This is still bleeding edge, but this is probably a direction that will see more development. From ionescu.vlad1 at gmail.com Sun Aug 7 17:25:47 2016 From: ionescu.vlad1 at gmail.com (Vlad Ionescu) Date: Sun, 07 Aug 2016 21:25:47 +0000 Subject: [scikit-learn] Scaling model selection on a cluster In-Reply-To: <20160807173043.GI3335822@phare.normalesup.org> References: <20160807173043.GI3335822@phare.normalesup.org> Message-ID: I copy pasted the example in the link you gave, only made the search take a longer time. I used dask-ssh to setup worker nodes and a scheduler, then connected to the scheduler in my code. Tweaking the n_jobs parameters for the randomized search does not get any performance benefits. The connection to the scheduler seems to work, but nothing gets assigned to the workers, because the code doesn't scale. I am using scikit-learn 0.18.dev0 Any ideas? Code and results are below. Only the n_jobs value was changed between executions. I printed an Executor assigned to my scheduler, and it reported 240 cores. 
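A sketch of that kind of check (assuming the same scheduler address as in the code below, and that the installed distributed version exposes Executor.ncores(), has_what() and processing()):

from distributed import Executor

e = Executor('my_scheduler:8786')      # connect to the scheduler started by dask-ssh
print(e)                               # repr lists the workers and cores the scheduler sees
print(sum(e.ncores().values()))        # total cores across all workers, e.g. 240
# During or after a fit, these show whether tasks and results ever reach the remote workers:
print(e.processing())                  # what each worker is currently running
print(e.has_what())                    # which results each worker currently holds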
import distributed.joblib from joblib import Parallel, parallel_backend from sklearn.datasets import load_digits from sklearn.grid_search import RandomizedSearchCV from sklearn.svm import SVC import numpy as np digits = load_digits() param_space = { 'C': np.logspace(-6, 6, 100), 'gamma': np.logspace(-8, 8, 100), 'tol': np.logspace(-4, -1, 100), 'class_weight': [None, 'balanced'], } model = SVC(kernel='rbf') search = RandomizedSearchCV(model, param_space, cv=3, n_iter=1000, verbose=1, *n_jobs=200*) with parallel_backend('distributed', scheduler_host='my_scheduler:8786'): search.fit(digits.data, digits.target) Fitting 3 folds for each of 1000 candidates, totalling 3000 fits [Parallel(n_jobs=200)]: Done 4 tasks | elapsed: 0.5s [Parallel(n_jobs=200)]: Done 292 tasks | elapsed: 6.9s [Parallel(n_jobs=200)]: Done 800 tasks | elapsed: 16.1s [Parallel(n_jobs=200)]: Done 1250 tasks | elapsed: 24.8s [Parallel(n_jobs=200)]: Done 1800 tasks | elapsed: 36.0s [Parallel(n_jobs=200)]: Done 2450 tasks | elapsed: 49.0s [Parallel(*n_jobs=200*)]: Done 3000 out of 3000 | *elapsed: 1.0min finished* ------------------------------------- Fitting 3 folds for each of 1000 candidates, totalling 3000 fits [Parallel(n_jobs=20)]: Done 10 tasks | elapsed: 0.5s [Parallel(n_jobs=20)]: Done 160 tasks | elapsed: 3.7s [Parallel(n_jobs=20)]: Done 410 tasks | elapsed: 8.6s [Parallel(n_jobs=20)]: Done 760 tasks | elapsed: 16.2s [Parallel(n_jobs=20)]: Done 1210 tasks | elapsed: 25.0s [Parallel(n_jobs=20)]: Done 1760 tasks | elapsed: 36.2s [Parallel(n_jobs=20)]: Done 2410 tasks | elapsed: 48.8s [Parallel(*n_jobs=20*)]: Done 3000 out of 3000 | *elapsed: 1.0min finished* On Sun, Aug 7, 2016 at 8:31 PM Gael Varoquaux wrote: > Parallel computing in scikit-learn is built upon on joblib. In the > development version of scikit-learn, the included joblib can be extended > with a distributed backend: > http://distributed.readthedocs.io/en/latest/joblib.html > that can distribute code on a cluster. > > This is still bleeding edge, but this is probably a direction that will > see more development. > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Sun Aug 7 17:39:43 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Sun, 7 Aug 2016 17:39:43 -0400 Subject: [scikit-learn] Disable Travis Cache In-Reply-To: References: Message-ID: Why do you think it should be disabled instead of fixed? On 08/07/2016 08:39 AM, Raghav R V wrote: > Could someone disable the Travis cache once and for all please? > > I have seen several frustrating incidents where the Travis fails the > PR because of this caching of old files. > > I also don't understand why it is enabled in the first place. It would > really be super helpful if it is disabled for good. > > Also refer - https://github.com/scikit-learn/scikit-learn/issues/7094 > > **cc**: Olivier, Andy > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From gael.varoquaux at normalesup.org Mon Aug 8 01:24:20 2016 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Mon, 8 Aug 2016 07:24:20 +0200 Subject: [scikit-learn] Scaling model selection on a cluster In-Reply-To: References: <20160807173043.GI3335822@phare.normalesup.org> Message-ID: <20160808052420.GR3335822@phare.normalesup.org> My guess is that your model evaluations are too fast, and that you are not getting the benefits of distributed computing as the overhead is hiding them. Anyhow, I don't think that this is ready for prime-time usage. It probably requires tweeking and understanding the tradeoffs. G On Sun, Aug 07, 2016 at 09:25:47PM +0000, Vlad Ionescu wrote: > I copy pasted the example in the link you gave, only made the search take a > longer time. I used dask-ssh to setup worker nodes and a scheduler, then > connected to the scheduler in my code. > Tweaking the n_jobs parameters for the randomized search does not get any > performance benefits. The connection to the scheduler seems to work, but > nothing gets assigned to the workers, because the code doesn't scale. > I am using scikit-learn 0.18.dev0 > Any ideas? > Code and results are below. Only the n_jobs value was changed between > executions. I printed an Executor assigned to my scheduler, and it reported 240 > cores. > import distributed.joblib > from joblib import Parallel, parallel_backend > from sklearn.datasets import load_digits > from sklearn.grid_search import RandomizedSearchCV > from sklearn.svm import SVC > import numpy as np > digits = load_digits() > param_space = { > ? ? 'C': np.logspace(-6, 6, 100), > ? ? 'gamma': np.logspace(-8, 8, 100), > ? ? 'tol': np.logspace(-4, -1, 100), > ? ? 'class_weight': [None, 'balanced'], > } > model = SVC(kernel='rbf') > search = RandomizedSearchCV(model, param_space, cv=3, n_iter=1000, verbose=1, > n_jobs=200) > with parallel_backend('distributed', scheduler_host='my_scheduler:8786'): > ? ? search.fit(digits.data, digits.target) > Fitting 3 folds for each of 1000 candidates, totalling 3000 fits > [Parallel(n_jobs=200)]: Done ? 4 tasks ? ? ?| elapsed: ? ?0.5s > [Parallel(n_jobs=200)]: Done 292 tasks ? ? ?| elapsed: ? ?6.9s > [Parallel(n_jobs=200)]: Done 800 tasks ? ? ?| elapsed: ? 16.1s > [Parallel(n_jobs=200)]: Done 1250 tasks ? ? ?| elapsed: ? 24.8s > [Parallel(n_jobs=200)]: Done 1800 tasks ? ? ?| elapsed: ? 36.0s > [Parallel(n_jobs=200)]: Done 2450 tasks ? ? ?| elapsed: ? 49.0s > [Parallel(n_jobs=200)]: Done 3000 out of 3000 | elapsed: ?1.0min finished > ------------------------------------- > Fitting 3 folds for each of 1000 candidates, totalling 3000 fits > [Parallel(n_jobs=20)]: Done ?10 tasks ? ? ?| elapsed: ? ?0.5s > [Parallel(n_jobs=20)]: Done 160 tasks ? ? ?| elapsed: ? ?3.7s > [Parallel(n_jobs=20)]: Done 410 tasks ? ? ?| elapsed: ? ?8.6s > [Parallel(n_jobs=20)]: Done 760 tasks ? ? ?| elapsed: ? 16.2s > [Parallel(n_jobs=20)]: Done 1210 tasks ? ? ?| elapsed: ? 25.0s > [Parallel(n_jobs=20)]: Done 1760 tasks ? ? ?| elapsed: ? 36.2s > [Parallel(n_jobs=20)]: Done 2410 tasks ? ? ?| elapsed: ? 48.8s > [Parallel(n_jobs=20)]: Done 3000 out of 3000 | elapsed: ?1.0min finished > ? > On Sun, Aug 7, 2016 at 8:31 PM Gael Varoquaux > wrote: > Parallel computing in scikit-learn is built upon on joblib. In the > development version of scikit-learn, the included joblib can be extended > with a distributed backend: > http://distributed.readthedocs.io/en/latest/joblib.html > that can distribute code on a cluster. 
> This is still bleeding edge, but this is probably a direction that will > see more development. > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From ionescu.vlad1 at gmail.com Mon Aug 8 02:48:34 2016 From: ionescu.vlad1 at gmail.com (Vlad Ionescu) Date: Mon, 08 Aug 2016 06:48:34 +0000 Subject: [scikit-learn] Scaling model selection on a cluster In-Reply-To: <20160808052420.GR3335822@phare.normalesup.org> References: <20160807173043.GI3335822@phare.normalesup.org> <20160808052420.GR3335822@phare.normalesup.org> Message-ID: I don't think they're too fast. I tried with slower models and bigger data sets as well. I get the best results with n_jobs=20, which is the number of cores on a single node. Anything below is considerably slower, anything above is mostly the same, sometimes a little slower. Is there a way to see what each worker is running? Nothing is reported in the scheduler console window about the workers, just that there is a connection to the scheduler. Should something be reported about the work assigned to workers? If I notice speed benefits going from 1 to 20 n_jobs, surely there should be something noticeable above that as well if the distributed part is running correctly, no? This is a very easily parallelizable task, and my nodes are in a cluster on the same network. I highly doubt it's (just) overhead. Is there anything else that I could look into to try fixing this? Fitting 10 folds for each of 10000 candidates, totalling 100000 fits [Parallel(n_jobs=20)]: Done 10 tasks | elapsed: 0.7s [Parallel(n_jobs=20)]: Done 160 tasks | elapsed: 4.8s [Parallel(n_jobs=20)]: Done 410 tasks | elapsed: 12.6s [Parallel(n_jobs=20)]: Done 760 tasks | elapsed: 23.7s [Parallel(n_jobs=20)]: Done 1210 tasks | elapsed: 37.9s [Parallel(n_jobs=20)]: Done 1760 tasks | elapsed: 55.0s *[Parallel(n_jobs=20)]: Done 2410 tasks | elapsed: 1.2min* --- Fitting 10 folds for each of 10000 candidates, totalling 100000 fits [Parallel(n_jobs=4)]: Done 42 tasks | elapsed: 6.2s [Parallel(n_jobs=4)]: Done 192 tasks | elapsed: 27.5s [Parallel(n_jobs=4)]: Done 442 tasks | elapsed: 1.0min *[Parallel(n_jobs=4)]: Done 792 tasks | elapsed: 1.7min* --- Fitting 10 folds for each of 10000 candidates, totalling 100000 fits [Parallel(n_jobs=100)]: Done 250 tasks | elapsed: 9.1s [Parallel(n_jobs=100)]: Done 600 tasks | elapsed: 19.3s [Parallel(n_jobs=100)]: Done 1050 tasks | elapsed: 34.0s [Parallel(n_jobs=100)]: Done 1600 tasks | elapsed: 49.8s *[Parallel(n_jobs=100)]: Done 2250 tasks | elapsed: 1.2min* If 4 workers do 442 tasks in a minute, then 5x=20 workers should ideally do 5x442 = 2210. So double the workers, half the time seems to hold very well until 20 workers. I have a hard time imagining that it would stop holding at exactly the number of cores per node. On Mon, Aug 8, 2016 at 8:25 AM Gael Varoquaux wrote: > My guess is that your model evaluations are too fast, and that you are > not getting the benefits of distributed computing as the overhead is > hiding them. > > Anyhow, I don't think that this is ready for prime-time usage. 
It > probably requires tweeking and understanding the tradeoffs. > > G > > On Sun, Aug 07, 2016 at 09:25:47PM +0000, Vlad Ionescu wrote: > > I copy pasted the example in the link you gave, only made the search > take a > > longer time. I used dask-ssh to setup worker nodes and a scheduler, then > > connected to the scheduler in my code. > > > Tweaking the n_jobs parameters for the randomized search does not get any > > performance benefits. The connection to the scheduler seems to work, but > > nothing gets assigned to the workers, because the code doesn't scale. > > > I am using scikit-learn 0.18.dev0 > > > Any ideas? > > > Code and results are below. Only the n_jobs value was changed between > > executions. I printed an Executor assigned to my scheduler, and it > reported 240 > > cores. > > > import distributed.joblib > > from joblib import Parallel, parallel_backend > > from sklearn.datasets import load_digits > > from sklearn.grid_search import RandomizedSearchCV > > from sklearn.svm import SVC > > import numpy as np > > > digits = load_digits() > > > param_space = { > > 'C': np.logspace(-6, 6, 100), > > 'gamma': np.logspace(-8, 8, 100), > > 'tol': np.logspace(-4, -1, 100), > > 'class_weight': [None, 'balanced'], > > } > > > model = SVC(kernel='rbf') > > search = RandomizedSearchCV(model, param_space, cv=3, n_iter=1000, > verbose=1, > > n_jobs=200) > > > with parallel_backend('distributed', scheduler_host='my_scheduler:8786'): > > search.fit(digits.data, digits.target) > > > Fitting 3 folds for each of 1000 candidates, totalling 3000 fits > > [Parallel(n_jobs=200)]: Done 4 tasks | elapsed: 0.5s > > [Parallel(n_jobs=200)]: Done 292 tasks | elapsed: 6.9s > > [Parallel(n_jobs=200)]: Done 800 tasks | elapsed: 16.1s > > [Parallel(n_jobs=200)]: Done 1250 tasks | elapsed: 24.8s > > [Parallel(n_jobs=200)]: Done 1800 tasks | elapsed: 36.0s > > [Parallel(n_jobs=200)]: Done 2450 tasks | elapsed: 49.0s > > [Parallel(n_jobs=200)]: Done 3000 out of 3000 | elapsed: 1.0min finished > > > ------------------------------------- > > > Fitting 3 folds for each of 1000 candidates, totalling 3000 fits > > [Parallel(n_jobs=20)]: Done 10 tasks | elapsed: 0.5s > > [Parallel(n_jobs=20)]: Done 160 tasks | elapsed: 3.7s > > [Parallel(n_jobs=20)]: Done 410 tasks | elapsed: 8.6s > > [Parallel(n_jobs=20)]: Done 760 tasks | elapsed: 16.2s > > [Parallel(n_jobs=20)]: Done 1210 tasks | elapsed: 25.0s > > [Parallel(n_jobs=20)]: Done 1760 tasks | elapsed: 36.2s > > [Parallel(n_jobs=20)]: Done 2410 tasks | elapsed: 48.8s > > [Parallel(n_jobs=20)]: Done 3000 out of 3000 | elapsed: 1.0min finished > > > > > > > On Sun, Aug 7, 2016 at 8:31 PM Gael Varoquaux < > gael.varoquaux at normalesup.org> > > wrote: > > > Parallel computing in scikit-learn is built upon on joblib. In the > > development version of scikit-learn, the included joblib can be > extended > > with a distributed backend: > > http://distributed.readthedocs.io/en/latest/joblib.html > > that can distribute code on a cluster. > > > This is still bleeding edge, but this is probably a direction that > will > > see more development. 
> > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > -- > Gael Varoquaux > Researcher, INRIA Parietal > NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France > Phone: ++ 33-1-69-08-79-68 > http://gael-varoquaux.info http://twitter.com/GaelVaroquaux > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ionescu.vlad1 at gmail.com Mon Aug 8 03:59:23 2016 From: ionescu.vlad1 at gmail.com (Vlad Ionescu) Date: Mon, 08 Aug 2016 07:59:23 +0000 Subject: [scikit-learn] Scaling model selection on a cluster In-Reply-To: References: <20160807173043.GI3335822@phare.normalesup.org> <20160808052420.GR3335822@phare.normalesup.org> Message-ID: I realize this is in early stages and I'd like to help improve it, even if just by testing on an actual cluster. All of the examples I've seen are very small, and it's impossible for anyone to notice if they're really running in parallel judging just by the execution time. None of them mention how you can ensure or check that each worker is doing work either. If there's anything I can do to help debug this (I realize it could be a problem on my end though), please let me know. On Mon, Aug 8, 2016 at 9:48 AM Vlad Ionescu wrote: > I don't think they're too fast. I tried with slower models and bigger data > sets as well. I get the best results with n_jobs=20, which is the number of > cores on a single node. Anything below is considerably slower, anything > above is mostly the same, sometimes a little slower. > > Is there a way to see what each worker is running? Nothing is reported in > the scheduler console window about the workers, just that there is a > connection to the scheduler. Should something be reported about the work > assigned to workers? > > If I notice speed benefits going from 1 to 20 n_jobs, surely there should > be something noticeable above that as well if the distributed part is > running correctly, no? This is a very easily parallelizable task, and my > nodes are in a cluster on the same network. I highly doubt it's (just) > overhead. > > Is there anything else that I could look into to try fixing this? 
> > Fitting 10 folds for each of 10000 candidates, totalling 100000 fits > [Parallel(n_jobs=20)]: Done 10 tasks | elapsed: 0.7s > [Parallel(n_jobs=20)]: Done 160 tasks | elapsed: 4.8s > [Parallel(n_jobs=20)]: Done 410 tasks | elapsed: 12.6s > [Parallel(n_jobs=20)]: Done 760 tasks | elapsed: 23.7s > [Parallel(n_jobs=20)]: Done 1210 tasks | elapsed: 37.9s > [Parallel(n_jobs=20)]: Done 1760 tasks | elapsed: 55.0s > *[Parallel(n_jobs=20)]: Done 2410 tasks | elapsed: 1.2min* > > --- > > Fitting 10 folds for each of 10000 candidates, totalling 100000 fits > [Parallel(n_jobs=4)]: Done 42 tasks | elapsed: 6.2s > [Parallel(n_jobs=4)]: Done 192 tasks | elapsed: 27.5s > [Parallel(n_jobs=4)]: Done 442 tasks | elapsed: 1.0min > *[Parallel(n_jobs=4)]: Done 792 tasks | elapsed: 1.7min* > > > --- > > Fitting 10 folds for each of 10000 candidates, totalling 100000 fits > [Parallel(n_jobs=100)]: Done 250 tasks | elapsed: 9.1s > [Parallel(n_jobs=100)]: Done 600 tasks | elapsed: 19.3s > [Parallel(n_jobs=100)]: Done 1050 tasks | elapsed: 34.0s > [Parallel(n_jobs=100)]: Done 1600 tasks | elapsed: 49.8s > *[Parallel(n_jobs=100)]: Done 2250 tasks | elapsed: 1.2min* > > If 4 workers do 442 tasks in a minute, then 5x=20 workers should ideally > do 5x442 = 2210. So double the workers, half the time seems to hold very > well until 20 workers. I have a hard time imagining that it would stop > holding at exactly the number of cores per node. > > On Mon, Aug 8, 2016 at 8:25 AM Gael Varoquaux < > gael.varoquaux at normalesup.org> wrote: > >> My guess is that your model evaluations are too fast, and that you are >> not getting the benefits of distributed computing as the overhead is >> hiding them. >> >> Anyhow, I don't think that this is ready for prime-time usage. It >> probably requires tweeking and understanding the tradeoffs. >> >> G >> >> On Sun, Aug 07, 2016 at 09:25:47PM +0000, Vlad Ionescu wrote: >> > I copy pasted the example in the link you gave, only made the search >> take a >> > longer time. I used dask-ssh to setup worker nodes and a scheduler, then >> > connected to the scheduler in my code. >> >> > Tweaking the n_jobs parameters for the randomized search does not get >> any >> > performance benefits. The connection to the scheduler seems to work, but >> > nothing gets assigned to the workers, because the code doesn't scale. >> >> > I am using scikit-learn 0.18.dev0 >> >> > Any ideas? >> >> > Code and results are below. Only the n_jobs value was changed between >> > executions. I printed an Executor assigned to my scheduler, and it >> reported 240 >> > cores. 
>> >> > import distributed.joblib >> > from joblib import Parallel, parallel_backend >> > from sklearn.datasets import load_digits >> > from sklearn.grid_search import RandomizedSearchCV >> > from sklearn.svm import SVC >> > import numpy as np >> >> > digits = load_digits() >> >> > param_space = { >> > 'C': np.logspace(-6, 6, 100), >> > 'gamma': np.logspace(-8, 8, 100), >> > 'tol': np.logspace(-4, -1, 100), >> > 'class_weight': [None, 'balanced'], >> > } >> >> > model = SVC(kernel='rbf') >> > search = RandomizedSearchCV(model, param_space, cv=3, n_iter=1000, >> verbose=1, >> > n_jobs=200) >> >> > with parallel_backend('distributed', >> scheduler_host='my_scheduler:8786'): >> > search.fit(digits.data, digits.target) >> >> > Fitting 3 folds for each of 1000 candidates, totalling 3000 fits >> > [Parallel(n_jobs=200)]: Done 4 tasks | elapsed: 0.5s >> > [Parallel(n_jobs=200)]: Done 292 tasks | elapsed: 6.9s >> > [Parallel(n_jobs=200)]: Done 800 tasks | elapsed: 16.1s >> > [Parallel(n_jobs=200)]: Done 1250 tasks | elapsed: 24.8s >> > [Parallel(n_jobs=200)]: Done 1800 tasks | elapsed: 36.0s >> > [Parallel(n_jobs=200)]: Done 2450 tasks | elapsed: 49.0s >> > [Parallel(n_jobs=200)]: Done 3000 out of 3000 | elapsed: 1.0min >> finished >> >> > ------------------------------------- >> >> > Fitting 3 folds for each of 1000 candidates, totalling 3000 fits >> > [Parallel(n_jobs=20)]: Done 10 tasks | elapsed: 0.5s >> > [Parallel(n_jobs=20)]: Done 160 tasks | elapsed: 3.7s >> > [Parallel(n_jobs=20)]: Done 410 tasks | elapsed: 8.6s >> > [Parallel(n_jobs=20)]: Done 760 tasks | elapsed: 16.2s >> > [Parallel(n_jobs=20)]: Done 1210 tasks | elapsed: 25.0s >> > [Parallel(n_jobs=20)]: Done 1760 tasks | elapsed: 36.2s >> > [Parallel(n_jobs=20)]: Done 2410 tasks | elapsed: 48.8s >> > [Parallel(n_jobs=20)]: Done 3000 out of 3000 | elapsed: 1.0min finished >> >> >> > >> >> > On Sun, Aug 7, 2016 at 8:31 PM Gael Varoquaux < >> gael.varoquaux at normalesup.org> >> > wrote: >> >> > Parallel computing in scikit-learn is built upon on joblib. In the >> > development version of scikit-learn, the included joblib can be >> extended >> > with a distributed backend: >> > http://distributed.readthedocs.io/en/latest/joblib.html >> > that can distribute code on a cluster. >> >> > This is still bleeding edge, but this is probably a direction that >> will >> > see more development. >> >> > _______________________________________________ >> > scikit-learn mailing list >> > scikit-learn at python.org >> > https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> > _______________________________________________ >> > scikit-learn mailing list >> > scikit-learn at python.org >> > https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> -- >> Gael Varoquaux >> Researcher, INRIA Parietal >> NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France >> Phone: ++ 33-1-69-08-79-68 >> http://gael-varoquaux.info >> http://twitter.com/GaelVaroquaux >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ragvrv at gmail.com Mon Aug 8 09:10:26 2016 From: ragvrv at gmail.com (Raghav R V) Date: Mon, 8 Aug 2016 15:10:26 +0200 Subject: [scikit-learn] Disable Travis Cache In-Reply-To: References: Message-ID: I felt we could rather have a clean build and wait for a few more minutes (if that's the disadvantage of disabling caching) than have it pass / fail on old code... On Sun, Aug 7, 2016 at 11:39 PM, Andreas Mueller wrote: > Why do you think it should be disabled instead of fixed? > > > > On 08/07/2016 08:39 AM, Raghav R V wrote: > > Could someone disable the Travis cache once and for all please? > > I have seen several frustrating incidents where the Travis fails the PR > because of this caching of old files. > > I also don't understand why it is enabled in the first place. It would > really be super helpful if it is disabled for good. > > Also refer - https://github.com/scikit-learn/scikit-learn/issues/7094 > > **cc**: Olivier, Andy > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From amisra2 at ucsc.edu Mon Aug 8 12:21:57 2016 From: amisra2 at ucsc.edu (Amita Misra) Date: Mon, 8 Aug 2016 09:21:57 -0700 Subject: [scikit-learn] Supervised anomaly detection in time series In-Reply-To: References: <56374299-8CFE-4430-BD1A-CE93F836211D@sebastianraschka.com> Message-ID: Thanks for the pointers and papers. I'd definitely go through this approach and see if it can be applied to my problem. Thanks, Amita On Fri, Aug 5, 2016 at 4:40 PM, Albert Thomas wrote: > Hi, > > About your question on how to learn the parameters of anomaly detection > algorithms using only the negative samples in your case, Nicolas and I > worked on this aspect recently. If you are interested you can have look at: > > - Learning hyperparameters for unsupervised anomaly detection: > https://drive.google.com/file/d/0B8Dg3PBX90KNUTg5NGNOVnFPX0hDN > mJsSTcybzZMSHNPYkd3/view > - How to evaluate the quality of unsupervised anomaly Detection > algorithms?: > https://drive.google.com/file/d/0B8Dg3PBX90KNenV3WjRkR09Bakx5Y > lNyMF9BUXVNem1hb0NR/view > > Best, > Albert > > On Fri, Aug 5, 2016 at 9:34 PM Sebastian Raschka < > mail at sebastianraschka.com> wrote: > >> > But this might be the kind of problem where you seriously ask how hard >> it would be to gather more data. >> >> >> Yeah, I agree, but this scenario is then typical in a sense of that it is >> an anomaly detection problem rather than a classification problem. I.e., >> you don?t have enough positive labels to fit the model and thus you need to >> do unsupervised learning to learn from the negative class only. >> >> Sure, supervised learning could work well, but I would also explore >> unsupervised learning here and see how that works for you; maybe one-class >> SVM as suggested or EM algorithm based mixture models ( >> http://scikit-learn.org/stable/modules/mixture.html) >> >> Best, >> Sebastian >> >> > On Aug 5, 2016, at 2:55 PM, Jared Gabor wrote: >> > >> > Lots of great suggestions on how to model your problem. But this might >> be the kind of problem where you seriously ask how hard it would be to >> gather more data. 
>> > >> > On Thu, Aug 4, 2016 at 2:17 PM, Amita Misra wrote: >> > Hi, >> > >> > I am currently exploring the problem of speed bump detection using >> accelerometer time series data. >> > I have extracted some features based on mean, std deviation etc within >> a time window. >> > >> > Since the dataset is highly skewed ( I have just 5 positive samples >> for every > 300 samples) >> > I was looking into >> > >> > One ClassSVM >> > covariance.EllipticEnvelope >> > sklearn.ensemble.IsolationForest >> > but I am not sure how to use them. >> > >> > What I get from docs >> > >> > separate the positive examples and train using only negative examples >> > clf.fit(X_train) >> > and then >> > predict the positive examples using >> > clf.predict(X_test) >> > >> > >> > I am not sure what is then the role of positive examples in my training >> dataset or how can I use them to improve my classifier so that I can >> predict better on new samples. >> > >> > >> > Can we do something like Cross validation to learn the parameters as in >> normal binary SVM classification >> > >> > Thanks,? >> > Amita >> > >> > Amita Misra >> > Graduate Student Researcher >> > Natural Language and Dialogue Systems Lab >> > Baskin School of Engineering >> > University of California Santa Cruz >> > >> > >> > >> > >> > >> > -- >> > Amita Misra >> > Graduate Student Researcher >> > Natural Language and Dialogue Systems Lab >> > Baskin School of Engineering >> > University of California Santa Cruz >> > >> > >> > _______________________________________________ >> > scikit-learn mailing list >> > scikit-learn at python.org >> > https://mail.python.org/mailman/listinfo/scikit-learn >> > >> > >> > _______________________________________________ >> > scikit-learn mailing list >> > scikit-learn at python.org >> > https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Amita Misra Graduate Student Researcher Natural Language and Dialogue Systems Lab Baskin School of Engineering University of California Santa Cruz -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Mon Aug 8 12:22:50 2016 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Mon, 8 Aug 2016 18:22:50 +0200 Subject: [scikit-learn] Disable Travis Cache In-Reply-To: References: Message-ID: <20160808162250.GC3335822@phare.normalesup.org> On Mon, Aug 08, 2016 at 03:10:26PM +0200, Raghav R V wrote: > I felt we could rather have a clean build and wait for a few more minutes (if > that's the disadvantage of disabling caching) than have it pass / fail on old > code... Time of CI is a real problem. On our side because it slows down merges of PRs, and for the infrastructure: it's server time and money for them. I'd much rather have a working cache. Ga?l From ragvrv at gmail.com Mon Aug 8 12:56:29 2016 From: ragvrv at gmail.com (Raghav R V) Date: Mon, 8 Aug 2016 18:56:29 +0200 Subject: [scikit-learn] Disable Travis Cache In-Reply-To: <20160808162250.GC3335822@phare.normalesup.org> References: <20160808162250.GC3335822@phare.normalesup.org> Message-ID: Ok. Thanks for the comments! 
On Mon, Aug 8, 2016 at 6:22 PM, Gael Varoquaux < gael.varoquaux at normalesup.org> wrote: > On Mon, Aug 08, 2016 at 03:10:26PM +0200, Raghav R V wrote: > > I felt we could rather have a clean build and wait for a few more > minutes (if > > that's the disadvantage of disabling caching) than have it pass / fail > on old > > code... > > Time of CI is a real problem. On our side because it slows down merges of > PRs, and for the infrastructure: it's server time and money for them. > > I'd much rather have a working cache. > > Ga?l > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From zude07 at yahoo.com Thu Aug 11 07:21:22 2016 From: zude07 at yahoo.com (Ali Zude) Date: Thu, 11 Aug 2016 11:21:22 +0000 (UTC) Subject: [scikit-learn] Speeding up RF regressors References: <2043263797.12551627.1470914482530.JavaMail.yahoo.ref@mail.yahoo.com> Message-ID: <2043263797.12551627.1470914482530.JavaMail.yahoo@mail.yahoo.com> Hi all, I've 6 RF models and I am using them online to predict 6 different variables (using the same features), models quality (error in test data is good). However, the online prediction is very very slow. How can I speed up the prediction? - ??? Can I import models into C++ code? - ??? Is it useful to upgrade to scikit-learn 0.18? and then use multi-output models? - ??? Is sklearn-compiledtreesuseful, they are claiming that it will speed the prediction (5x-8x)times? - I could not use because of array2d error >>PyPi Thank you for your help RegardsAli -------------- next part -------------- An HTML attachment was scrubbed... URL: From maciek at wojcikowski.pl Thu Aug 11 07:26:46 2016 From: maciek at wojcikowski.pl (=?UTF-8?Q?Maciek_W=C3=B3jcikowski?=) Date: Thu, 11 Aug 2016 13:26:46 +0200 Subject: [scikit-learn] Speeding up RF regressors In-Reply-To: <2043263797.12551627.1470914482530.JavaMail.yahoo@mail.yahoo.com> References: <2043263797.12551627.1470914482530.JavaMail.yahoo.ref@mail.yahoo.com> <2043263797.12551627.1470914482530.JavaMail.yahoo@mail.yahoo.com> Message-ID: Hi Ali, I'm using sklearn-compiledtrees [ https://github.com/ajtulloch/sklearn-compiledtrees] on quite large trees (pickle size ~1GB, compiled ~100MB) and the speedup is gigantic (never measured it properly) but I'd say it's over 10x. ---- Pozdrawiam, | Best regards, Maciek W?jcikowski maciek at wojcikowski.pl 2016-08-11 13:21 GMT+02:00 Ali Zude via scikit-learn < scikit-learn at python.org>: > Hi all, > > I've 6 RF models and I am using them online to predict 6 different > variables (using the same features), models quality (error in test data is > good). However, the online prediction is very very slow. > How can I speed up the prediction? > > - Can I import models into C++ code? > - Is it useful to upgrade to scikit-learn 0.18? and then use > multi-output models? > - Is sklearn-compiledtreesuseful, they are claiming that it will > speed the prediction (5x-8x)times? > - I could not use because of array2d error >>PyPi > > Thank you for your help > > Regards > Ali > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From zude07 at yahoo.com Thu Aug 11 07:31:24 2016 From: zude07 at yahoo.com (Ali Zude) Date: Thu, 11 Aug 2016 11:31:24 +0000 (UTC) Subject: [scikit-learn] Speeding up RF regressors In-Reply-To: References: <2043263797.12551627.1470914482530.JavaMail.yahoo.ref@mail.yahoo.com> <2043263797.12551627.1470914482530.JavaMail.yahoo@mail.yahoo.com> Message-ID: <2056525296.12336951.1470915084854.JavaMail.yahoo@mail.yahoo.com> Thnx Maciek, I've tried to use it but I could not sort out the PyPi problem,? see the error below. Thanks in advance. ---> 16 import compiledtrees /home/ali/anaconda2/lib/python2.7/site-packages/compiledtrees/__init__.py in () ----> 1 from compiledtrees.compiled import CompiledRegressionPredictor 2 3 __all__ = ["CompiledRegressionPredictor"] /home/ali/anaconda2/lib/python2.7/site-packages/compiledtrees/compiled.py in () 1 from __future__ import print_function 2 ----> 3 from sklearn.utils import array2d 4 from sklearn.tree.tree import DecisionTreeRegressor, DTYPE 5 from sklearn.ensemble.gradient_boosting import GradientBoostingRegressor ImportError: cannot import name array2d Kind regards Ali Von: Maciek W?jcikowski An: Ali Zude ; Scikit-learn user and developer mailing list Gesendet: 12:26 Donnerstag, 11.August 2016 Betreff: Re: [scikit-learn] Speeding up RF regressors Hi Ali, I'm using sklearn-compiledtrees [https://github.com/ajtulloch/sklearn-compiledtrees] on quite large trees (pickle size ~1GB, compiled ~100MB) and the speedup is gigantic (never measured it properly) but I'd say it's over 10x. ---- Pozdrawiam, ?| ?Best regards, Maciek W?jcikowski maciek at wojcikowski.pl 2016-08-11 13:21 GMT+02:00 Ali Zude via scikit-learn : Hi all, I've 6 RF models and I am using them online to predict 6 different variables (using the same features), models quality (error in test data is good). However, the online prediction is very very slow. How can I speed up the prediction? - ??? Can I import models into C++ code? - ??? Is it useful to upgrade to scikit-learn 0.18? and then use multi-output models? - ??? Is sklearn-compiledtreesuseful, they are claiming that it will speed the prediction (5x-8x)times? - I could not use because of array2d error >>PyPi Thank you for your help RegardsAli ______________________________ _________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/ mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From maciek at wojcikowski.pl Thu Aug 11 09:10:48 2016 From: maciek at wojcikowski.pl (=?UTF-8?Q?Maciek_W=C3=B3jcikowski?=) Date: Thu, 11 Aug 2016 15:10:48 +0200 Subject: [scikit-learn] Speeding up RF regressors In-Reply-To: <2056525296.12336951.1470915084854.JavaMail.yahoo@mail.yahoo.com> References: <2043263797.12551627.1470914482530.JavaMail.yahoo.ref@mail.yahoo.com> <2043263797.12551627.1470914482530.JavaMail.yahoo@mail.yahoo.com> <2056525296.12336951.1470915084854.JavaMail.yahoo@mail.yahoo.com> Message-ID: First of all the pypi version is outdated, please install using > > pip install git+https://github.com/ajtulloch/sklearn-compiledtrees.git Secondly, which scikit-learn version are you using? ---- Pozdrawiam, | Best regards, Maciek W?jcikowski maciek at wojcikowski.pl 2016-08-11 13:31 GMT+02:00 Ali Zude : > Thnx Maciek, > > I've tried to use it but I could not sort out the PyPi problem, see the > error below. Thanks in advance. 
> > ---> 16 import compiledtrees > /home/ali/anaconda2/lib/python2.7/site-packages/compiledtrees/__init__.py in ()----> 1 from compiledtrees.compiled import CompiledRegressionPredictor 2 3 __all__ = ["CompiledRegressionPredictor"] > /home/ali/anaconda2/lib/python2.7/site-packages/compiledtrees/compiled.py in () 1 from __future__ import print_function 2 ----> 3 from sklearn.utils import array2d 4 from sklearn.tree.tree import DecisionTreeRegressor, DTYPE 5 from sklearn.ensemble.gradient_boosting import GradientBoostingRegressor > ImportError: cannot import name array2d > > > Kind regards > Ali > > ------------------------------ > *Von:* Maciek W?jcikowski > *An:* Ali Zude ; Scikit-learn user and developer > mailing list > *Gesendet:* 12:26 Donnerstag, 11.August 2016 > *Betreff:* Re: [scikit-learn] Speeding up RF regressors > > Hi Ali, > > I'm using sklearn-compiledtrees [https://github.com/ajtulloch/ > sklearn-compiledtrees] on quite large trees (pickle size ~1GB, compiled > ~100MB) and the speedup is gigantic (never measured it properly) but I'd > say it's over 10x. > > ---- > Pozdrawiam, | Best regards, > Maciek W?jcikowski > maciek at wojcikowski.pl > > 2016-08-11 13:21 GMT+02:00 Ali Zude via scikit-learn < > scikit-learn at python.org>: > > Hi all, > > I've 6 RF models and I am using them online to predict 6 different > variables (using the same features), models quality (error in test data is > good). However, the online prediction is very very slow. > How can I speed up the prediction? > > - Can I import models into C++ code? > - Is it useful to upgrade to scikit-learn 0.18? and then use > multi-output models? > - Is sklearn-compiledtreesuseful, they are claiming that it will > speed the prediction (5x-8x)times? > - I could not use because of array2d error >>PyPi > > Thank you for your help > > Regards > Ali > > ______________________________ _________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/ mailman/listinfo/scikit-learn > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From odaym2 at gmail.com Thu Aug 11 09:41:52 2016 From: odaym2 at gmail.com (o m) Date: Thu, 11 Aug 2016 09:41:52 -0400 Subject: [scikit-learn] Speeding up RF regressors In-Reply-To: References: <2043263797.12551627.1470914482530.JavaMail.yahoo.ref@mail.yahoo.com> <2043263797.12551627.1470914482530.JavaMail.yahoo@mail.yahoo.com> <2056525296.12336951.1470915084854.JavaMail.yahoo@mail.yahoo.com> Message-ID: <4CC9E9A2-2EC1-4C31-B2ED-F9B86854CC41@gmail.com> Can someone please take me off this list? Thanks Sent from my iPhone > On Aug 11, 2016, at 9:10 AM, Maciek W?jcikowski wrote: > > First of all the pypi version is outdated, please install using >> >> pip install git+https://github.com/ajtulloch/sklearn-compiledtrees.git > > Secondly, which scikit-learn version are you using? > > ---- > Pozdrawiam, | Best regards, > Maciek W?jcikowski > maciek at wojcikowski.pl > > 2016-08-11 13:31 GMT+02:00 Ali Zude : >> Thnx Maciek, >> >> I've tried to use it but I could not sort out the PyPi problem, see the error below. Thanks in advance. 
>> >> ---> 16 import compiledtrees >> >> /home/ali/anaconda2/lib/python2.7/site-packages/compiledtrees/__init__.py in () >> ----> 1 from compiledtrees.compiled import CompiledRegressionPredictor >> 2 >> 3 __all__ = ["CompiledRegressionPredictor"] >> >> /home/ali/anaconda2/lib/python2.7/site-packages/compiledtrees/compiled.py in () >> 1 from __future__ import print_function >> 2 >> ----> 3 from sklearn.utils import array2d >> 4 from sklearn.tree.tree import DecisionTreeRegressor, DTYPE >> 5 from sklearn.ensemble.gradient_boosting import GradientBoostingRegressor >> >> ImportError: cannot import name array2d >> >> >> Kind regards >> Ali >> >> Von: Maciek W?jcikowski >> An: Ali Zude ; Scikit-learn user and developer mailing list >> Gesendet: 12:26 Donnerstag, 11.August 2016 >> Betreff: Re: [scikit-learn] Speeding up RF regressors >> >> Hi Ali, >> >> I'm using sklearn-compiledtrees [https://github.com/ajtulloch/sklearn-compiledtrees] on quite large trees (pickle size ~1GB, compiled ~100MB) and the speedup is gigantic (never measured it properly) but I'd say it's over 10x. >> >> ---- >> Pozdrawiam, | Best regards, >> Maciek W?jcikowski >> maciek at wojcikowski.pl >> >> 2016-08-11 13:21 GMT+02:00 Ali Zude via scikit-learn : >> Hi all, >> >> I've 6 RF models and I am using them online to predict 6 different variables (using the same features), models quality (error in test data is good). However, the online prediction is very very slow. >> How can I speed up the prediction? >> Can I import models into C++ code? >> Is it useful to upgrade to scikit-learn 0.18? and then use multi-output models? >> Is sklearn-compiledtreesuseful, they are claiming that it will speed the prediction (5x-8x)times? >> I could not use because of array2d error >>PyPi >> Thank you for your help >> >> Regards >> Ali >> >> ______________________________ _________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/ mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From vlad32.de at gmail.com Thu Aug 11 09:45:43 2016 From: vlad32.de at gmail.com (Vlad Deshkovich) Date: Thu, 11 Aug 2016 09:45:43 -0400 Subject: [scikit-learn] Speeding up RF regressors In-Reply-To: <4CC9E9A2-2EC1-4C31-B2ED-F9B86854CC41@gmail.com> References: <2043263797.12551627.1470914482530.JavaMail.yahoo.ref@mail.yahoo.com> <2043263797.12551627.1470914482530.JavaMail.yahoo@mail.yahoo.com> <2056525296.12336951.1470915084854.JavaMail.yahoo@mail.yahoo.com> <4CC9E9A2-2EC1-4C31-B2ED-F9B86854CC41@gmail.com> Message-ID: Please remove me as well. On Thursday, August 11, 2016, o m wrote: > Can someone please take me off this list? Thanks > > Sent from my iPhone > > On Aug 11, 2016, at 9:10 AM, Maciek W?jcikowski > wrote: > > First of all the pypi version is outdated, please install using >> >> pip install git+https://github.com/ajtulloch/sklearn-compiledtrees.git > > > Secondly, which scikit-learn version are you using? > > ---- > Pozdrawiam, | Best regards, > Maciek W?jcikowski > maciek at wojcikowski.pl > > > 2016-08-11 13:31 GMT+02:00 Ali Zude >: > >> Thnx Maciek, >> >> I've tried to use it but I could not sort out the PyPi problem, see the >> error below. Thanks in advance. 
>> >> ---> 16 import compiledtrees >> /home/ali/anaconda2/lib/python2.7/site-packages/compiledtrees/__init__.py in ()----> 1 from compiledtrees.compiled import CompiledRegressionPredictor 2 3 __all__ = ["CompiledRegressionPredictor"] >> /home/ali/anaconda2/lib/python2.7/site-packages/compiledtrees/compiled.py in () 1 from __future__ import print_function 2 ----> 3 from sklearn.utils import array2d 4 from sklearn.tree.tree import DecisionTreeRegressor, DTYPE 5 from sklearn.ensemble.gradient_boosting import GradientBoostingRegressor >> ImportError: cannot import name array2d >> >> >> Kind regards >> Ali >> >> ------------------------------ >> *Von:* Maciek W?jcikowski > > >> *An:* Ali Zude > >; Scikit-learn user >> and developer mailing list > > >> *Gesendet:* 12:26 Donnerstag, 11.August 2016 >> *Betreff:* Re: [scikit-learn] Speeding up RF regressors >> >> Hi Ali, >> >> I'm using sklearn-compiledtrees [https://github.com/ajtulloch/ >> sklearn-compiledtrees] on quite large trees (pickle size ~1GB, compiled >> ~100MB) and the speedup is gigantic (never measured it properly) but I'd >> say it's over 10x. >> >> ---- >> Pozdrawiam, | Best regards, >> Maciek W?jcikowski >> maciek at wojcikowski.pl >> >> >> 2016-08-11 13:21 GMT+02:00 Ali Zude via scikit-learn < >> scikit-learn at python.org >> >: >> >> Hi all, >> >> I've 6 RF models and I am using them online to predict 6 different >> variables (using the same features), models quality (error in test data is >> good). However, the online prediction is very very slow. >> How can I speed up the prediction? >> >> - Can I import models into C++ code? >> - Is it useful to upgrade to scikit-learn 0.18? and then use >> multi-output models? >> - Is sklearn-compiledtreesuseful, they are claiming that it will >> speed the prediction (5x-8x)times? >> - I could not use because of array2d error >>PyPi >> >> Thank you for your help >> >> Regards >> Ali >> >> ______________________________ _________________ >> scikit-learn mailing list >> scikit-learn at python.org >> >> https://mail.python.org/ mailman/listinfo/scikit-learn >> >> >> >> >> >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From zude07 at yahoo.com Thu Aug 11 17:39:30 2016 From: zude07 at yahoo.com (Ali Zude) Date: Thu, 11 Aug 2016 21:39:30 +0000 (UTC) Subject: [scikit-learn] Compiled trees References: <1676605323.13315284.1470951570354.JavaMail.yahoo.ref@mail.yahoo.com> Message-ID: <1676605323.13315284.1470951570354.JavaMail.yahoo@mail.yahoo.com> Dear All, I am trying to speed up the prediction of Random Forests. I've used compiledtress, which was useful, but since I have 6 models and once I've loaded all of them I got "Multiprocessing exception:" here is my models in the code: ...model1=joblib.load('/models/model1.pkl'') model2=joblib.load('/models/model2.pkl') model3=joblib.load('/models/model3.pkl') model4=compiledtrees.CompiledRegressionPredictor(joblib.load('/models/model4.pkl')) model5=compiledtrees.CompiledRegressionPredictor(joblib.load('/models/model4.pkl')) model6=compiledtrees.CompiledRegressionPredictor(joblib.load('/models/model4.pkl')) model1=compiledtrees.CompiledRegressionPredictor(model1) model2=compiledtrees.CompiledRegressionPredictor(model2) model3=compiledtrees.CompiledRegressionPredictor(model3).... 
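One way to narrow down which of the six compilations actually raises the "Multiprocessing exception", and to avoid holding every uncompiled forest in memory at once, might be a loop like the following (a sketch reusing the model paths above; the del/gc step only frees each original forest before the next one is loaded):

import gc
from sklearn.externals import joblib
from compiledtrees import CompiledRegressionPredictor

compiled = {}
for name in ['model1', 'model2', 'model3', 'model4', 'model5', 'model6']:
    rf = joblib.load('/models/%s.pkl' % name)
    try:
        compiled[name] = CompiledRegressionPredictor(rf)
    except Exception as exc:
        print(name, 'failed to compile:', exc)   # shows which model triggers the exception
    del rf
    gc.collect()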
Now I'm trying to use MultiOutputRegressor(RandomForestRegressor()), however, I could not find any tool to do model selection, can anyone help me either to solve the first problem or the second one Best regards -------------- next part -------------- An HTML attachment was scrubbed... URL: From maciek at wojcikowski.pl Fri Aug 12 02:30:03 2016 From: maciek at wojcikowski.pl (=?UTF-8?Q?Maciek_W=C3=B3jcikowski?=) Date: Fri, 12 Aug 2016 08:30:03 +0200 Subject: [scikit-learn] Compiled trees In-Reply-To: <1676605323.13315284.1470951570354.JavaMail.yahoo@mail.yahoo.com> References: <1676605323.13315284.1470951570354.JavaMail.yahoo.ref@mail.yahoo.com> <1676605323.13315284.1470951570354.JavaMail.yahoo@mail.yahoo.com> Message-ID: Which version of compiledtrees are you using? ---- Pozdrawiam, | Best regards, Maciek W?jcikowski maciek at wojcikowski.pl 2016-08-11 23:39 GMT+02:00 Ali Zude via scikit-learn < scikit-learn at python.org>: > Dear All, > > I am trying to speed up the prediction of Random Forests. I've used > compiledtress, which was useful, but since I have 6 models and once I've > loaded all of them I got "Multiprocessing exception:" > > here is my models in the code: > ... > model1=joblib.load('/models/model1.pkl'') > model2=joblib.load('/models/model2.pkl') > model3=joblib.load('/models/model3.pkl') > model4=compiledtrees.CompiledRegressionPredictor( > joblib.load('/models/model4.pkl')) > model5=compiledtrees.CompiledRegressionPredictor( > joblib.load('/models/model4.pkl')) > model6=compiledtrees.CompiledRegressionPredictor( > joblib.load('/models/model4.pkl')) > > model1=compiledtrees.CompiledRegressionPredictor(model1) > model2=compiledtrees.CompiledRegressionPredictor(model2) > model3=compiledtrees.CompiledRegressionPredictor(model3) > .... > > Now I'm trying to use MultiOutputRegressor(RandomForestRegressor()), > however, I could not find any tool to do model selection, can anyone help > me either to solve the first problem or the second one > > Best regards > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From zude07 at yahoo.com Fri Aug 12 02:37:37 2016 From: zude07 at yahoo.com (Ali Zude) Date: Fri, 12 Aug 2016 06:37:37 +0000 (UTC) Subject: [scikit-learn] Compiled trees In-Reply-To: References: <1676605323.13315284.1470951570354.JavaMail.yahoo.ref@mail.yahoo.com> <1676605323.13315284.1470951570354.JavaMail.yahoo@mail.yahoo.com> Message-ID: <616174978.13126540.1470983857571.JavaMail.yahoo@mail.yahoo.com> sklearn-compiledtrees==1.3 Von: Maciek W?jcikowski An: Ali Zude ; Scikit-learn user and developer mailing list Gesendet: 7:30 Freitag, 12.August 2016 Betreff: Re: [scikit-learn] Compiled trees Which version of compiledtrees are you using? ---- Pozdrawiam, ?| ?Best regards, Maciek W?jcikowski maciek at wojcikowski.pl 2016-08-11 23:39 GMT+02:00 Ali Zude via scikit-learn : Dear All, I am trying to speed up the prediction of Random Forests. I've used compiledtress, which was useful, but since I have 6 models and once I've loaded all of them I got "Multiprocessing exception:" here is my models in the code: ...model1=joblib.load('/models/ model1.pkl'') model2=joblib.load('/models/ model2.pkl') model3=joblib.load('/models/ model3.pkl') model4=compiledtrees. CompiledRegressionPredictor( joblib.load('/models/model4. pkl')) model5=compiledtrees. 
CompiledRegressionPredictor( joblib.load('/models/model4. pkl')) model6=compiledtrees. CompiledRegressionPredictor( joblib.load('/models/model4. pkl')) model1=compiledtrees. CompiledRegressionPredictor( model1) model2=compiledtrees. CompiledRegressionPredictor( model2) model3=compiledtrees. CompiledRegressionPredictor( model3).... Now I'm trying to use MultiOutputRegressor( RandomForestRegressor()), however, I could not find any tool to do model selection, can anyone help me either to solve the first problem or the second one Best regards ______________________________ _________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/ mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From maciek at wojcikowski.pl Fri Aug 12 03:48:19 2016 From: maciek at wojcikowski.pl (=?UTF-8?Q?Maciek_W=C3=B3jcikowski?=) Date: Fri, 12 Aug 2016 09:48:19 +0200 Subject: [scikit-learn] Compiled trees In-Reply-To: <616174978.13126540.1470983857571.JavaMail.yahoo@mail.yahoo.com> References: <1676605323.13315284.1470951570354.JavaMail.yahoo.ref@mail.yahoo.com> <1676605323.13315284.1470951570354.JavaMail.yahoo@mail.yahoo.com> <616174978.13126540.1470983857571.JavaMail.yahoo@mail.yahoo.com> Message-ID: Can you please copy whole error message? There seams to be a problem with compiling the tree. Do you have gcc or other C compiler under CXX shell variable? ---- Pozdrawiam, | Best regards, Maciek W?jcikowski maciek at wojcikowski.pl 2016-08-12 8:37 GMT+02:00 Ali Zude : > sklearn-compiledtrees==1.3 > > > ------------------------------ > *Von:* Maciek W?jcikowski > *An:* Ali Zude ; Scikit-learn user and developer > mailing list > *Gesendet:* 7:30 Freitag, 12.August 2016 > *Betreff:* Re: [scikit-learn] Compiled trees > > Which version of compiledtrees are you using? > > ---- > Pozdrawiam, | Best regards, > Maciek W?jcikowski > maciek at wojcikowski.pl > > 2016-08-11 23:39 GMT+02:00 Ali Zude via scikit-learn < > scikit-learn at python.org>: > > Dear All, > > I am trying to speed up the prediction of Random Forests. I've used > compiledtress, which was useful, but since I have 6 models and once I've > loaded all of them I got "Multiprocessing exception:" > > here is my models in the code: > ... > model1=joblib.load('/models/ model1.pkl'') > model2=joblib.load('/models/ model2.pkl') > model3=joblib.load('/models/ model3.pkl') > model4=compiledtrees. CompiledRegressionPredictor( > joblib.load('/models/model4. pkl')) > model5=compiledtrees. CompiledRegressionPredictor( > joblib.load('/models/model4. pkl')) > model6=compiledtrees. CompiledRegressionPredictor( > joblib.load('/models/model4. pkl')) > > model1=compiledtrees. CompiledRegressionPredictor( model1) > model2=compiledtrees. CompiledRegressionPredictor( model2) > model3=compiledtrees. CompiledRegressionPredictor( model3) > .... > > Now I'm trying to use MultiOutputRegressor( RandomForestRegressor()), > however, I could not find any tool to do model selection, can anyone help > me either to solve the first problem or the second one > > Best regards > > ______________________________ _________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/ mailman/listinfo/scikit-learn > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From chris at upnix.com Mon Aug 15 17:27:03 2016 From: chris at upnix.com (Chris Cameron) Date: Mon, 15 Aug 2016 15:27:03 -0600 Subject: [scikit-learn] Inconsistent Logistic Regression fit results Message-ID: Hi all, Using the same X and y values sklearn.linear_model.LogisticRegression.fit() is providing me with inconsistent results. The documentation for sklearn.linear_model.LogisticRegression states that "It is thus not uncommon, to have slightly different results for the same input data.? I am experiencing this, however the fix of using a smaller ?tol? parameter isn?t providing me with consistent fit. The code I?m using: def log_run(logreg_x, logreg_y): logreg_x['pass_fail'] = logreg_y df_train, df_test = train_test_split(logreg_x, random_state=0) y_train = df_train.pass_fail.as_matrix() y_test = df_test.pass_fail.as_matrix() del(df_train['pass_fail']) del(df_test['pass_fail']) log_reg_fit = LogisticRegression(class_weight='balanced',tol=0.000000001).fit(df_train, y_train) predicted = log_reg_fit.predict(df_test) accuracy = accuracy_score(y_test, predicted) kappa = cohen_kappa_score(y_test, predicted) return [kappa, accuracy] I?ve gone out of my way to be sure the test and train data is the same for each run, so I don?t think there should be random shuffling going on. Example output: --- log_run(df_save, y) Out[32]: [0.027777777777777728, 0.53333333333333333] log_run(df_save, y) Out[33]: [0.027777777777777728, 0.53333333333333333] log_run(df_save, y) Out[34]: [0.11347517730496456, 0.58333333333333337] log_run(df_save, y) Out[35]: [0.042553191489361743, 0.55000000000000004] log_run(df_save, y) Out[36]: [-0.07407407407407407, 0.51666666666666672] log_run(df_save, y) Out[37]: [0.042553191489361743, 0.55000000000000004] A little information on the problem DataFrame: --- len(df_save) Out[40]: 240 len(df_save.columns) Out[41]: 18 If I omit this particular column the Kappa no longer fluctuates: df_save[?abc'].head() Out[42]: 0 0.026316 1 0.333333 2 0.015152 3 0.010526 4 0.125000 Name: abc, dtype: float64 Does anyone have ideas on how I can figure this out? Is there some randomness/shuffling still going on I missed? Thanks! Chris From mail at sebastianraschka.com Mon Aug 15 17:42:10 2016 From: mail at sebastianraschka.com (mail at sebastianraschka.com) Date: Mon, 15 Aug 2016 17:42:10 -0400 Subject: [scikit-learn] Inconsistent Logistic Regression fit results In-Reply-To: References: Message-ID: <2C047F1F-FC6F-4D74-A293-C2422BEDC3DF@sebastianraschka.com> Hi, Chris, have you set the random seed to a specific, contant integer value? Note that the default in LogisticRegression is random_state=None. Setting it to some arbitrary number like 123 may help if you haven?t done so, yet. Best, Sebastian > On Aug 15, 2016, at 5:27 PM, Chris Cameron wrote: > > Hi all, > > Using the same X and y values sklearn.linear_model.LogisticRegression.fit() is providing me with inconsistent results. > > The documentation for sklearn.linear_model.LogisticRegression states that "It is thus not uncommon, to have slightly different results for the same input data.? I am experiencing this, however the fix of using a smaller ?tol? parameter isn?t providing me with consistent fit. 
> > The code I?m using: > > def log_run(logreg_x, logreg_y): > logreg_x['pass_fail'] = logreg_y > df_train, df_test = train_test_split(logreg_x, random_state=0) > y_train = df_train.pass_fail.as_matrix() > y_test = df_test.pass_fail.as_matrix() > del(df_train['pass_fail']) > del(df_test['pass_fail']) > log_reg_fit = LogisticRegression(class_weight='balanced',tol=0.000000001).fit(df_train, y_train) > predicted = log_reg_fit.predict(df_test) > accuracy = accuracy_score(y_test, predicted) > kappa = cohen_kappa_score(y_test, predicted) > > return [kappa, accuracy] > > > I?ve gone out of my way to be sure the test and train data is the same for each run, so I don?t think there should be random shuffling going on. > > Example output: > --- > log_run(df_save, y) > Out[32]: [0.027777777777777728, 0.53333333333333333] > > log_run(df_save, y) > Out[33]: [0.027777777777777728, 0.53333333333333333] > > log_run(df_save, y) > Out[34]: [0.11347517730496456, 0.58333333333333337] > > log_run(df_save, y) > Out[35]: [0.042553191489361743, 0.55000000000000004] > > log_run(df_save, y) > Out[36]: [-0.07407407407407407, 0.51666666666666672] > > log_run(df_save, y) > Out[37]: [0.042553191489361743, 0.55000000000000004] > > A little information on the problem DataFrame: > --- > len(df_save) > Out[40]: 240 > > len(df_save.columns) > Out[41]: 18 > > > If I omit this particular column the Kappa no longer fluctuates: > > df_save[?abc'].head() > Out[42]: > 0 0.026316 > 1 0.333333 > 2 0.015152 > 3 0.010526 > 4 0.125000 > Name: abc, dtype: float64 > > > Does anyone have ideas on how I can figure this out? Is there some randomness/shuffling still going on I missed? > > > Thanks! > Chris > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From chris at upnix.com Mon Aug 15 18:00:05 2016 From: chris at upnix.com (Chris Cameron) Date: Mon, 15 Aug 2016 16:00:05 -0600 Subject: [scikit-learn] Inconsistent Logistic Regression fit results In-Reply-To: <2C047F1F-FC6F-4D74-A293-C2422BEDC3DF@sebastianraschka.com> References: <2C047F1F-FC6F-4D74-A293-C2422BEDC3DF@sebastianraschka.com> Message-ID: <903082E9-D944-4838-A882-982911520540@upnix.com> Sebastian, That doesn?t do it. With the function: def log_run(logreg_x, logreg_y): logreg_x['pass_fail'] = logreg_y df_train, df_test = train_test_split(logreg_x, random_state=0) y_train = df_train.pass_fail.as_matrix() y_test = df_test.pass_fail.as_matrix() del(df_train['pass_fail']) del(df_test['pass_fail']) log_reg_fit = LogisticRegression(class_weight='balanced', tol=0.000000001, random_state=0).fit(df_train, y_train) predicted = log_reg_fit.predict(df_test) accuracy = accuracy_score(y_test, predicted) kappa = cohen_kappa_score(y_test, predicted) return [kappa, accuracy] I?m still seeing: log_run(df_save, y) Out[7]: [-0.054421768707483005, 0.48333333333333334] log_run(df_save, y) Out[8]: [0.042553191489361743, 0.55000000000000004] log_run(df_save, y) Out[9]: [0.042553191489361743, 0.55000000000000004] log_run(df_save, y) Out[10]: [0.027777777777777728, 0.53333333333333333] Chris > On Aug 15, 2016, at 3:42 PM, mail at sebastianraschka.com wrote: > > Hi, Chris, > have you set the random seed to a specific, contant integer value? Note that the default in LogisticRegression is random_state=None. Setting it to some arbitrary number like 123 may help if you haven?t done so, yet. 
> > Best, > Sebastian > > > >> On Aug 15, 2016, at 5:27 PM, Chris Cameron wrote: >> >> Hi all, >> >> Using the same X and y values sklearn.linear_model.LogisticRegression.fit() is providing me with inconsistent results. >> >> The documentation for sklearn.linear_model.LogisticRegression states that "It is thus not uncommon, to have slightly different results for the same input data.? I am experiencing this, however the fix of using a smaller ?tol? parameter isn?t providing me with consistent fit. >> >> The code I?m using: >> >> def log_run(logreg_x, logreg_y): >> logreg_x['pass_fail'] = logreg_y >> df_train, df_test = train_test_split(logreg_x, random_state=0) >> y_train = df_train.pass_fail.as_matrix() >> y_test = df_test.pass_fail.as_matrix() >> del(df_train['pass_fail']) >> del(df_test['pass_fail']) >> log_reg_fit = LogisticRegression(class_weight='balanced',tol=0.000000001).fit(df_train, y_train) >> predicted = log_reg_fit.predict(df_test) >> accuracy = accuracy_score(y_test, predicted) >> kappa = cohen_kappa_score(y_test, predicted) >> >> return [kappa, accuracy] >> >> >> I?ve gone out of my way to be sure the test and train data is the same for each run, so I don?t think there should be random shuffling going on. >> >> Example output: >> --- >> log_run(df_save, y) >> Out[32]: [0.027777777777777728, 0.53333333333333333] >> >> log_run(df_save, y) >> Out[33]: [0.027777777777777728, 0.53333333333333333] >> >> log_run(df_save, y) >> Out[34]: [0.11347517730496456, 0.58333333333333337] >> >> log_run(df_save, y) >> Out[35]: [0.042553191489361743, 0.55000000000000004] >> >> log_run(df_save, y) >> Out[36]: [-0.07407407407407407, 0.51666666666666672] >> >> log_run(df_save, y) >> Out[37]: [0.042553191489361743, 0.55000000000000004] >> >> A little information on the problem DataFrame: >> --- >> len(df_save) >> Out[40]: 240 >> >> len(df_save.columns) >> Out[41]: 18 >> >> >> If I omit this particular column the Kappa no longer fluctuates: >> >> df_save[?abc'].head() >> Out[42]: >> 0 0.026316 >> 1 0.333333 >> 2 0.015152 >> 3 0.010526 >> 4 0.125000 >> Name: abc, dtype: float64 >> >> >> Does anyone have ideas on how I can figure this out? Is there some randomness/shuffling still going on I missed? >> >> >> Thanks! >> Chris >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From t3kcit at gmail.com Mon Aug 15 18:17:25 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Mon, 15 Aug 2016 18:17:25 -0400 Subject: [scikit-learn] Inconsistent Logistic Regression fit results In-Reply-To: <903082E9-D944-4838-A882-982911520540@upnix.com> References: <2C047F1F-FC6F-4D74-A293-C2422BEDC3DF@sebastianraschka.com> <903082E9-D944-4838-A882-982911520540@upnix.com> Message-ID: <31533467-ef95-6696-23cb-ac4ce5c1d9ff@gmail.com> Hm that looks kinda convoluted. Why don't you just do df_train, df_test, y_train, y_test = train_test_split(logreg_x, logreg_y, random_state=0) ? What version of scikit-learn are you using? Also, you are modifying the inputs. Can you try to do the same but pass a copy of the input dataframe to the method each time? On 08/15/2016 06:00 PM, Chris Cameron wrote: > Sebastian, > > That doesn?t do it. 
With the function: > > def log_run(logreg_x, logreg_y): > logreg_x['pass_fail'] = logreg_y > df_train, df_test = train_test_split(logreg_x, random_state=0) > y_train = df_train.pass_fail.as_matrix() > y_test = df_test.pass_fail.as_matrix() > del(df_train['pass_fail']) > del(df_test['pass_fail']) > log_reg_fit = LogisticRegression(class_weight='balanced', > tol=0.000000001, > random_state=0).fit(df_train, y_train) > predicted = log_reg_fit.predict(df_test) > accuracy = accuracy_score(y_test, predicted) > kappa = cohen_kappa_score(y_test, predicted) > > return [kappa, accuracy] > > I?m still seeing: > log_run(df_save, y) > Out[7]: [-0.054421768707483005, 0.48333333333333334] > > log_run(df_save, y) > Out[8]: [0.042553191489361743, 0.55000000000000004] > > log_run(df_save, y) > Out[9]: [0.042553191489361743, 0.55000000000000004] > > log_run(df_save, y) > Out[10]: [0.027777777777777728, 0.53333333333333333] > > > Chris > >> On Aug 15, 2016, at 3:42 PM, mail at sebastianraschka.com wrote: >> >> Hi, Chris, >> have you set the random seed to a specific, contant integer value? Note that the default in LogisticRegression is random_state=None. Setting it to some arbitrary number like 123 may help if you haven?t done so, yet. >> >> Best, >> Sebastian >> >> >> >>> On Aug 15, 2016, at 5:27 PM, Chris Cameron wrote: >>> >>> Hi all, >>> >>> Using the same X and y values sklearn.linear_model.LogisticRegression.fit() is providing me with inconsistent results. >>> >>> The documentation for sklearn.linear_model.LogisticRegression states that "It is thus not uncommon, to have slightly different results for the same input data.? I am experiencing this, however the fix of using a smaller ?tol? parameter isn?t providing me with consistent fit. >>> >>> The code I?m using: >>> >>> def log_run(logreg_x, logreg_y): >>> logreg_x['pass_fail'] = logreg_y >>> df_train, df_test = train_test_split(logreg_x, random_state=0) >>> y_train = df_train.pass_fail.as_matrix() >>> y_test = df_test.pass_fail.as_matrix() >>> del(df_train['pass_fail']) >>> del(df_test['pass_fail']) >>> log_reg_fit = LogisticRegression(class_weight='balanced',tol=0.000000001).fit(df_train, y_train) >>> predicted = log_reg_fit.predict(df_test) >>> accuracy = accuracy_score(y_test, predicted) >>> kappa = cohen_kappa_score(y_test, predicted) >>> >>> return [kappa, accuracy] >>> >>> >>> I?ve gone out of my way to be sure the test and train data is the same for each run, so I don?t think there should be random shuffling going on. >>> >>> Example output: >>> --- >>> log_run(df_save, y) >>> Out[32]: [0.027777777777777728, 0.53333333333333333] >>> >>> log_run(df_save, y) >>> Out[33]: [0.027777777777777728, 0.53333333333333333] >>> >>> log_run(df_save, y) >>> Out[34]: [0.11347517730496456, 0.58333333333333337] >>> >>> log_run(df_save, y) >>> Out[35]: [0.042553191489361743, 0.55000000000000004] >>> >>> log_run(df_save, y) >>> Out[36]: [-0.07407407407407407, 0.51666666666666672] >>> >>> log_run(df_save, y) >>> Out[37]: [0.042553191489361743, 0.55000000000000004] >>> >>> A little information on the problem DataFrame: >>> --- >>> len(df_save) >>> Out[40]: 240 >>> >>> len(df_save.columns) >>> Out[41]: 18 >>> >>> >>> If I omit this particular column the Kappa no longer fluctuates: >>> >>> df_save[?abc'].head() >>> Out[42]: >>> 0 0.026316 >>> 1 0.333333 >>> 2 0.015152 >>> 3 0.010526 >>> 4 0.125000 >>> Name: abc, dtype: float64 >>> >>> >>> Does anyone have ideas on how I can figure this out? Is there some randomness/shuffling still going on I missed? >>> >>> >>> Thanks! 
>>> Chris >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From mail at sebastianraschka.com Mon Aug 15 18:26:28 2016 From: mail at sebastianraschka.com (mail at sebastianraschka.com) Date: Mon, 15 Aug 2016 18:26:28 -0400 Subject: [scikit-learn] Inconsistent Logistic Regression fit results In-Reply-To: <903082E9-D944-4838-A882-982911520540@upnix.com> References: <2C047F1F-FC6F-4D74-A293-C2422BEDC3DF@sebastianraschka.com> <903082E9-D944-4838-A882-982911520540@upnix.com> Message-ID: hm, was worth a try. What happens if you change the solver to something else than liblinear, does this issue still persist? Btw. scikit-learn works with NumPy arrays, not NumPy matrices. Probably unrelated to your issue, I?d recommend setting > y_train = df_train.pass_fail.values > y_test = df_test.pass_fail.values instead of > y_train = df_train.pass_fail.as_matrix() > y_test = df_test.pass_fail.as_matrix() Also, try passing NumPy arrays to the fit method: > log_reg_fit = LogisticRegression(...).fit(df_train.values, y_train) and > predicted = log_reg_fit.predict(df_test.values) and so forth. > On Aug 15, 2016, at 6:00 PM, Chris Cameron wrote: > > Sebastian, > > That doesn?t do it. With the function: > > def log_run(logreg_x, logreg_y): > logreg_x['pass_fail'] = logreg_y > df_train, df_test = train_test_split(logreg_x, random_state=0) > y_train = df_train.pass_fail.as_matrix() > y_test = df_test.pass_fail.as_matrix() > del(df_train['pass_fail']) > del(df_test['pass_fail']) > log_reg_fit = LogisticRegression(class_weight='balanced', > tol=0.000000001, > random_state=0).fit(df_train, y_train) > predicted = log_reg_fit.predict(df_test) > accuracy = accuracy_score(y_test, predicted) > kappa = cohen_kappa_score(y_test, predicted) > > return [kappa, accuracy] > > I?m still seeing: > log_run(df_save, y) > Out[7]: [-0.054421768707483005, 0.48333333333333334] > > log_run(df_save, y) > Out[8]: [0.042553191489361743, 0.55000000000000004] > > log_run(df_save, y) > Out[9]: [0.042553191489361743, 0.55000000000000004] > > log_run(df_save, y) > Out[10]: [0.027777777777777728, 0.53333333333333333] > > > Chris > >> On Aug 15, 2016, at 3:42 PM, mail at sebastianraschka.com wrote: >> >> Hi, Chris, >> have you set the random seed to a specific, contant integer value? Note that the default in LogisticRegression is random_state=None. Setting it to some arbitrary number like 123 may help if you haven?t done so, yet. >> >> Best, >> Sebastian >> >> >> >>> On Aug 15, 2016, at 5:27 PM, Chris Cameron wrote: >>> >>> Hi all, >>> >>> Using the same X and y values sklearn.linear_model.LogisticRegression.fit() is providing me with inconsistent results. >>> >>> The documentation for sklearn.linear_model.LogisticRegression states that "It is thus not uncommon, to have slightly different results for the same input data.? I am experiencing this, however the fix of using a smaller ?tol? parameter isn?t providing me with consistent fit. 
>>> >>> The code I?m using: >>> >>> def log_run(logreg_x, logreg_y): >>> logreg_x['pass_fail'] = logreg_y >>> df_train, df_test = train_test_split(logreg_x, random_state=0) >>> y_train = df_train.pass_fail.as_matrix() >>> y_test = df_test.pass_fail.as_matrix() >>> del(df_train['pass_fail']) >>> del(df_test['pass_fail']) >>> log_reg_fit = LogisticRegression(class_weight='balanced',tol=0.000000001).fit(df_train, y_train) >>> predicted = log_reg_fit.predict(df_test) >>> accuracy = accuracy_score(y_test, predicted) >>> kappa = cohen_kappa_score(y_test, predicted) >>> >>> return [kappa, accuracy] >>> >>> >>> I?ve gone out of my way to be sure the test and train data is the same for each run, so I don?t think there should be random shuffling going on. >>> >>> Example output: >>> --- >>> log_run(df_save, y) >>> Out[32]: [0.027777777777777728, 0.53333333333333333] >>> >>> log_run(df_save, y) >>> Out[33]: [0.027777777777777728, 0.53333333333333333] >>> >>> log_run(df_save, y) >>> Out[34]: [0.11347517730496456, 0.58333333333333337] >>> >>> log_run(df_save, y) >>> Out[35]: [0.042553191489361743, 0.55000000000000004] >>> >>> log_run(df_save, y) >>> Out[36]: [-0.07407407407407407, 0.51666666666666672] >>> >>> log_run(df_save, y) >>> Out[37]: [0.042553191489361743, 0.55000000000000004] >>> >>> A little information on the problem DataFrame: >>> --- >>> len(df_save) >>> Out[40]: 240 >>> >>> len(df_save.columns) >>> Out[41]: 18 >>> >>> >>> If I omit this particular column the Kappa no longer fluctuates: >>> >>> df_save[?abc'].head() >>> Out[42]: >>> 0 0.026316 >>> 1 0.333333 >>> 2 0.015152 >>> 3 0.010526 >>> 4 0.125000 >>> Name: abc, dtype: float64 >>> >>> >>> Does anyone have ideas on how I can figure this out? Is there some randomness/shuffling still going on I missed? >>> >>> >>> Thanks! >>> Chris >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From chris at upnix.com Tue Aug 16 12:15:38 2016 From: chris at upnix.com (Chris Cameron) Date: Tue, 16 Aug 2016 10:15:38 -0600 Subject: [scikit-learn] Inconsistent Logistic Regression fit results In-Reply-To: References: <2C047F1F-FC6F-4D74-A293-C2422BEDC3DF@sebastianraschka.com> <903082E9-D944-4838-A882-982911520540@upnix.com> Message-ID: <34155333-560B-4A8A-87AE-7D5D76C76807@upnix.com> Thank you everyone for your help. The short version of this email is that changing the solver from ?liblinear? to ?sag? fixed my problem - but only if I upped ?max_iter? to 1000. Longer version - Without max_iter=1000, I would get the warning: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge I have some columns in my data that have a huge range of values. Using ?liblinear?, if I transformed those columns, causing the range to be smaller, the results would be consistent every time. 
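A minimal sketch of that transformation idea, with the scaling folded into a Pipeline so it is learned on the training split only. The function name, the StandardScaler choice and the sklearn.model_selection import path are illustrative assumptions here, not the poster's actual code:

from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, cohen_kappa_score

def log_run_scaled(logreg_x, logreg_y):
    # Same split as in the thread; logreg_x holds only the feature columns.
    df_train, df_test, y_train, y_test = train_test_split(
        logreg_x, logreg_y, random_state=0)
    # Standardizing the wide-range columns conditions the problem better,
    # so the solver converges to the same coefficients on every run.
    pipe = make_pipeline(
        StandardScaler(),
        LogisticRegression(class_weight='balanced', tol=1e-9, random_state=0))
    pipe.fit(df_train.values, y_train)
    predicted = pipe.predict(df_test.values)
    return [cohen_kappa_score(y_test, predicted),
            accuracy_score(y_test, predicted)]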
This is the function I ended up using - def log_run(logreg_x, logreg_y): logreg_x['pass_fail'] = logreg_y df_train, df_test, y_train, y_test = train_test_split(logreg_x, logreg_y, random_state=0) del(df_train['pass_fail']) del(df_test['pass_fail']) log_reg_fit = LogisticRegression(class_weight='balanced', tol=0.00000001, random_state=8, solver='sag', max_iter=1000).fit(df_train.values, y_train) predicted = log_reg_fit.predict(df_test.values) accuracy = accuracy_score(y_test, predicted) kappa = cohen_kappa_score(y_test, predicted) return [kappa, accuracy] Thank you again for the help, Chris > On Aug 15, 2016, at 4:26 PM, mail at sebastianraschka.com wrote: > > hm, was worth a try. What happens if you change the solver to something else than liblinear, does this issue still persist? > > > Btw. scikit-learn works with NumPy arrays, not NumPy matrices. Probably unrelated to your issue, I?d recommend setting > >> y_train = df_train.pass_fail.values >> y_test = df_test.pass_fail.values > > instead of > >> y_train = df_train.pass_fail.as_matrix() >> y_test = df_test.pass_fail.as_matrix() > > > Also, try passing NumPy arrays to the fit method: > >> log_reg_fit = LogisticRegression(...).fit(df_train.values, y_train) > > and > >> predicted = log_reg_fit.predict(df_test.values) > > and so forth. > > > > > >> On Aug 15, 2016, at 6:00 PM, Chris Cameron wrote: >> >> Sebastian, >> >> That doesn?t do it. With the function: >> >> def log_run(logreg_x, logreg_y): >> logreg_x['pass_fail'] = logreg_y >> df_train, df_test = train_test_split(logreg_x, random_state=0) >> y_train = df_train.pass_fail.as_matrix() >> y_test = df_test.pass_fail.as_matrix() >> del(df_train['pass_fail']) >> del(df_test['pass_fail']) >> log_reg_fit = LogisticRegression(class_weight='balanced', >> tol=0.000000001, >> random_state=0).fit(df_train, y_train) >> predicted = log_reg_fit.predict(df_test) >> accuracy = accuracy_score(y_test, predicted) >> kappa = cohen_kappa_score(y_test, predicted) >> >> return [kappa, accuracy] >> >> I?m still seeing: >> log_run(df_save, y) >> Out[7]: [-0.054421768707483005, 0.48333333333333334] >> >> log_run(df_save, y) >> Out[8]: [0.042553191489361743, 0.55000000000000004] >> >> log_run(df_save, y) >> Out[9]: [0.042553191489361743, 0.55000000000000004] >> >> log_run(df_save, y) >> Out[10]: [0.027777777777777728, 0.53333333333333333] >> >> >> Chris >> >>> On Aug 15, 2016, at 3:42 PM, mail at sebastianraschka.com wrote: >>> >>> Hi, Chris, >>> have you set the random seed to a specific, contant integer value? Note that the default in LogisticRegression is random_state=None. Setting it to some arbitrary number like 123 may help if you haven?t done so, yet. >>> >>> Best, >>> Sebastian >>> >>> >>> >>>> On Aug 15, 2016, at 5:27 PM, Chris Cameron wrote: >>>> >>>> Hi all, >>>> >>>> Using the same X and y values sklearn.linear_model.LogisticRegression.fit() is providing me with inconsistent results. >>>> >>>> The documentation for sklearn.linear_model.LogisticRegression states that "It is thus not uncommon, to have slightly different results for the same input data.? I am experiencing this, however the fix of using a smaller ?tol? parameter isn?t providing me with consistent fit. 
>>>> >>>> The code I?m using: >>>> >>>> def log_run(logreg_x, logreg_y): >>>> logreg_x['pass_fail'] = logreg_y >>>> df_train, df_test = train_test_split(logreg_x, random_state=0) >>>> y_train = df_train.pass_fail.as_matrix() >>>> y_test = df_test.pass_fail.as_matrix() >>>> del(df_train['pass_fail']) >>>> del(df_test['pass_fail']) >>>> log_reg_fit = LogisticRegression(class_weight='balanced',tol=0.000000001).fit(df_train, y_train) >>>> predicted = log_reg_fit.predict(df_test) >>>> accuracy = accuracy_score(y_test, predicted) >>>> kappa = cohen_kappa_score(y_test, predicted) >>>> >>>> return [kappa, accuracy] >>>> >>>> >>>> I?ve gone out of my way to be sure the test and train data is the same for each run, so I don?t think there should be random shuffling going on. >>>> >>>> Example output: >>>> --- >>>> log_run(df_save, y) >>>> Out[32]: [0.027777777777777728, 0.53333333333333333] >>>> >>>> log_run(df_save, y) >>>> Out[33]: [0.027777777777777728, 0.53333333333333333] >>>> >>>> log_run(df_save, y) >>>> Out[34]: [0.11347517730496456, 0.58333333333333337] >>>> >>>> log_run(df_save, y) >>>> Out[35]: [0.042553191489361743, 0.55000000000000004] >>>> >>>> log_run(df_save, y) >>>> Out[36]: [-0.07407407407407407, 0.51666666666666672] >>>> >>>> log_run(df_save, y) >>>> Out[37]: [0.042553191489361743, 0.55000000000000004] >>>> >>>> A little information on the problem DataFrame: >>>> --- >>>> len(df_save) >>>> Out[40]: 240 >>>> >>>> len(df_save.columns) >>>> Out[41]: 18 >>>> >>>> >>>> If I omit this particular column the Kappa no longer fluctuates: >>>> >>>> df_save[?abc'].head() >>>> Out[42]: >>>> 0 0.026316 >>>> 1 0.333333 >>>> 2 0.015152 >>>> 3 0.010526 >>>> 4 0.125000 >>>> Name: abc, dtype: float64 >>>> >>>> >>>> Does anyone have ideas on how I can figure this out? Is there some randomness/shuffling still going on I missed? >>>> >>>> >>>> Thanks! >>>> Chris >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From gael.varoquaux at normalesup.org Wed Aug 17 03:23:12 2016 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Wed, 17 Aug 2016 09:23:12 +0200 Subject: [scikit-learn] Inconsistent Logistic Regression fit results In-Reply-To: <34155333-560B-4A8A-87AE-7D5D76C76807@upnix.com> References: <2C047F1F-FC6F-4D74-A293-C2422BEDC3DF@sebastianraschka.com> <903082E9-D944-4838-A882-982911520540@upnix.com> <34155333-560B-4A8A-87AE-7D5D76C76807@upnix.com> Message-ID: <892ed9d1-0aa3-4cfb-9454-889ddfac2e43@typeapp.com> In other words, you have an ill conditioned estimation problem, and what you were seeing were numerical instabilities due to this ill conditionning. Not a bug. An expected behavior. Sent from my phone. Please forgive brevity and mis spelling On Aug 16, 2016, 18:17, at 18:17, Chris Cameron wrote: >Thank you everyone for your help. The short version of this email is >that changing the solver from ?liblinear? to ?sag? 
fixed my problem - >but only if I upped ?max_iter? to 1000. > > >Longer version - >Without max_iter=1000, I would get the warning: >ConvergenceWarning: The max_iter was reached which means the coef_ did >not converge > >I have some columns in my data that have a huge range of values. Using >?liblinear?, if I transformed those columns, causing the range to be >smaller, the results would be consistent every time. > >This is the function I ended up using - >def log_run(logreg_x, logreg_y): > logreg_x['pass_fail'] = logreg_y >df_train, df_test, y_train, y_test = train_test_split(logreg_x, >logreg_y, random_state=0) > del(df_train['pass_fail']) > del(df_test['pass_fail']) > log_reg_fit = LogisticRegression(class_weight='balanced', > tol=0.00000001, > random_state=8, > solver='sag', > max_iter=1000).fit(df_train.values, y_train) > predicted = log_reg_fit.predict(df_test.values) > accuracy = accuracy_score(y_test, predicted) > kappa = cohen_kappa_score(y_test, predicted) > > return [kappa, accuracy] > > >Thank you again for the help, > >Chris > >> On Aug 15, 2016, at 4:26 PM, mail at sebastianraschka.com wrote: >> >> hm, was worth a try. What happens if you change the solver to >something else than liblinear, does this issue still persist? >> >> >> Btw. scikit-learn works with NumPy arrays, not NumPy matrices. >Probably unrelated to your issue, I?d recommend setting >> >>> y_train = df_train.pass_fail.values >>> y_test = df_test.pass_fail.values >> >> instead of >> >>> y_train = df_train.pass_fail.as_matrix() >>> y_test = df_test.pass_fail.as_matrix() >> >> >> Also, try passing NumPy arrays to the fit method: >> >>> log_reg_fit = LogisticRegression(...).fit(df_train.values, >y_train) >> >> and >> >>> predicted = log_reg_fit.predict(df_test.values) >> >> and so forth. >> >> >> >> >> >>> On Aug 15, 2016, at 6:00 PM, Chris Cameron wrote: >>> >>> Sebastian, >>> >>> That doesn?t do it. With the function: >>> >>> def log_run(logreg_x, logreg_y): >>> logreg_x['pass_fail'] = logreg_y >>> df_train, df_test = train_test_split(logreg_x, random_state=0) >>> y_train = df_train.pass_fail.as_matrix() >>> y_test = df_test.pass_fail.as_matrix() >>> del(df_train['pass_fail']) >>> del(df_test['pass_fail']) >>> log_reg_fit = LogisticRegression(class_weight='balanced', >>> tol=0.000000001, >>> random_state=0).fit(df_train, >y_train) >>> predicted = log_reg_fit.predict(df_test) >>> accuracy = accuracy_score(y_test, predicted) >>> kappa = cohen_kappa_score(y_test, predicted) >>> >>> return [kappa, accuracy] >>> >>> I?m still seeing: >>> log_run(df_save, y) >>> Out[7]: [-0.054421768707483005, 0.48333333333333334] >>> >>> log_run(df_save, y) >>> Out[8]: [0.042553191489361743, 0.55000000000000004] >>> >>> log_run(df_save, y) >>> Out[9]: [0.042553191489361743, 0.55000000000000004] >>> >>> log_run(df_save, y) >>> Out[10]: [0.027777777777777728, 0.53333333333333333] >>> >>> >>> Chris >>> >>>> On Aug 15, 2016, at 3:42 PM, mail at sebastianraschka.com wrote: >>>> >>>> Hi, Chris, >>>> have you set the random seed to a specific, contant integer value? >Note that the default in LogisticRegression is random_state=None. >Setting it to some arbitrary number like 123 may help if you haven?t >done so, yet. >>>> >>>> Best, >>>> Sebastian >>>> >>>> >>>> >>>>> On Aug 15, 2016, at 5:27 PM, Chris Cameron >wrote: >>>>> >>>>> Hi all, >>>>> >>>>> Using the same X and y values >sklearn.linear_model.LogisticRegression.fit() is providing me with >inconsistent results. 
>>>>> >>>>> The documentation for sklearn.linear_model.LogisticRegression >states that "It is thus not uncommon, to have slightly different >results for the same input data.? I am experiencing this, however the >fix of using a smaller ?tol? parameter isn?t providing me with >consistent fit. >>>>> >>>>> The code I?m using: >>>>> >>>>> def log_run(logreg_x, logreg_y): >>>>> logreg_x['pass_fail'] = logreg_y >>>>> df_train, df_test = train_test_split(logreg_x, random_state=0) >>>>> y_train = df_train.pass_fail.as_matrix() >>>>> y_test = df_test.pass_fail.as_matrix() >>>>> del(df_train['pass_fail']) >>>>> del(df_test['pass_fail']) >>>>> log_reg_fit = >LogisticRegression(class_weight='balanced',tol=0.000000001).fit(df_train, >y_train) >>>>> predicted = log_reg_fit.predict(df_test) >>>>> accuracy = accuracy_score(y_test, predicted) >>>>> kappa = cohen_kappa_score(y_test, predicted) >>>>> >>>>> return [kappa, accuracy] >>>>> >>>>> >>>>> I?ve gone out of my way to be sure the test and train data is the >same for each run, so I don?t think there should be random shuffling >going on. >>>>> >>>>> Example output: >>>>> --- >>>>> log_run(df_save, y) >>>>> Out[32]: [0.027777777777777728, 0.53333333333333333] >>>>> >>>>> log_run(df_save, y) >>>>> Out[33]: [0.027777777777777728, 0.53333333333333333] >>>>> >>>>> log_run(df_save, y) >>>>> Out[34]: [0.11347517730496456, 0.58333333333333337] >>>>> >>>>> log_run(df_save, y) >>>>> Out[35]: [0.042553191489361743, 0.55000000000000004] >>>>> >>>>> log_run(df_save, y) >>>>> Out[36]: [-0.07407407407407407, 0.51666666666666672] >>>>> >>>>> log_run(df_save, y) >>>>> Out[37]: [0.042553191489361743, 0.55000000000000004] >>>>> >>>>> A little information on the problem DataFrame: >>>>> --- >>>>> len(df_save) >>>>> Out[40]: 240 >>>>> >>>>> len(df_save.columns) >>>>> Out[41]: 18 >>>>> >>>>> >>>>> If I omit this particular column the Kappa no longer fluctuates: >>>>> >>>>> df_save[?abc'].head() >>>>> Out[42]: >>>>> 0 0.026316 >>>>> 1 0.333333 >>>>> 2 0.015152 >>>>> 3 0.010526 >>>>> 4 0.125000 >>>>> Name: abc, dtype: float64 >>>>> >>>>> >>>>> Does anyone have ideas on how I can figure this out? Is there some >randomness/shuffling still going on I missed? >>>>> >>>>> >>>>> Thanks! >>>>> Chris >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > >_______________________________________________ >scikit-learn mailing list >scikit-learn at python.org >https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From olivier.grisel at ensta.org Wed Aug 17 10:43:25 2016 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Wed, 17 Aug 2016 16:43:25 +0200 Subject: [scikit-learn] 0.18? 
In-Reply-To: References: <577D4D41.60102@gmail.com> Message-ID: Ok I fixed all the 32 bit Linux & OSX build issues in scikit-learn master and all wheels builds are green with the multibuild setup: https://travis-ci.org/MacPython/scikit-learn-wheels Matthew: would you be interested in having the multibuild repo extended to also include appveyor configration files or do you think it's better to let projects owner do their own appveyor config by themselves? -- Olivier From matthew.brett at gmail.com Wed Aug 17 12:44:20 2016 From: matthew.brett at gmail.com (Matthew Brett) Date: Wed, 17 Aug 2016 09:44:20 -0700 Subject: [scikit-learn] 0.18? In-Reply-To: References: <577D4D41.60102@gmail.com> Message-ID: Definitely interested ! On 17 Aug 2016 07:44, "Olivier Grisel" wrote: > Ok I fixed all the 32 bit Linux & OSX build issues in scikit-learn > master and all wheels builds are green with the multibuild setup: > > https://travis-ci.org/MacPython/scikit-learn-wheels > > Matthew: would you be interested in having the multibuild repo > extended to also include appveyor configration files or do you think > it's better to let projects owner do their own appveyor config by > themselves? > > -- > Olivier > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From adamt at nih.gov Wed Aug 17 14:53:29 2016 From: adamt at nih.gov (Thomas, Adam (NIH/NIMH) [E]) Date: Wed, 17 Aug 2016 18:53:29 +0000 Subject: [scikit-learn] Hiring: Cloud and HPC Engineer at NIMH in Bethesda, MD Message-ID: HIRING: CLOUD AND HPC ENGINEER The National Institute of Mental Health (NIMH) is the lead federal agency for research on mental disorders. NIMH is one of the 27 Institutes and Centers that make up the National Institutes of Health (NIH), which is responsible for all federally funded biomedical research in US. NIH is part of the U.S. Department of Health and Human Services (HHS). The NIH is a highly rated employer at glassdoor.com with very competitive salary and benefits packages. The Data Science and Sharing Team (DSST) is a new group created to develop and support data sharing and other data-intensive scientific projects within the NIMH Intramural Research Program (IRP). Working closely with the Office of Data Science the goal of the DSST is to make the NIMH IRP a leader in the open science and data sharing practices mandated by the Open Data Policy released by the White House on 9 May, 2013. We are building a team to make that happen. What you?ll do? --------------- BUILD You will work with a team of researchers and developers to build and deploy neuroimaging data processing pipelines for investigators within the NIMH IRP. You will collaborate with and contribute to other projects throughout the world that are building standards and tools for open and reproducible neuroscience (e.g., NiPy, BIDS, Binder, Rstudio). You'll have the resources of the NIH HPC Cluster at your disposal as well as additional help from the AWS cloud. All tools and code will be open source and freely distributed. TEACH You will work to bolster data science skills within the NIMH IRP by teaching courses to scientists on best data practices (e.g. Software & Data Carpentry) as well as accessing and using specific neuroimaging repositories (e.g. The Human Connectome Project, OpenfMRI, UK Biobank). 
QUANTIFY There is no use building tools for open science if no one uses them. Part of the job of the DSST is to measure data sharing and open science practices within the NIMH IRP and progress toward their adoption. This will include bibliometrics for scientific publications from the NIMH IRP and other measures of data sharing and secondary data utilization. You will provide crucial systems level support to the team in gauging this progress. Who you are? ------------ EXPERIENCED You should be very comfortable on the command line and have a rock-solid handle on one or more Unix-based operating systems. You should have some experience with distributed, high-performance computing tools such as Spark, OpenStack, Docker/Singularity, and batch processing systems such as SLURM and SGE. You should also have experience coding in modern languages currently used in data-intensive, scientific computing such as Python, R, and Javascript, as well as interfacing with a variety of APIs. PROVEN Ideally we would like to see a recent degree (BS, MS, or PhD) in a STEM field, but if you can prove you have an equivalent amount of expertise with your publications, projects, or github/kaggle ranking, we?re all ears. We are also interviewing students and part-time staff if you?re still working on your degree. DRIVEN Data science is moving fast ? we?re looking for someone who can move faster. You should be a self-learner and a self-starter. Provide some examples of things you have worked on independently. How to apply? ------------- Email your resume, a cover letter, and a code sample that demonstrates you are all three of the above to: DATASCI-JOBSEARCH at mail.nih.gov The National Institutes of Health is an equal opportunity employer. From t3kcit at gmail.com Wed Aug 17 16:37:17 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 17 Aug 2016 16:37:17 -0400 Subject: [scikit-learn] 0.18? In-Reply-To: <69c5d973-f798-f5a1-a4a3-1ee43e2c1a36@gmail.com> References: <577D4D41.60102@gmail.com> <69c5d973-f798-f5a1-a4a3-1ee43e2c1a36@gmail.com> Message-ID: I vote we push back the release a bit (two weeks?). I had unexpected things come up that ate a bunch of my time. On 07/25/2016 03:53 PM, Andreas Mueller wrote: > Hi Olivier / all > Let me know if I can help with the builds. > I'm gonna start reviews and triaging and tagging this week. > Mid August sounds good for a beta / RC. > > It would be great if we could release in September, as that is when > The Book (aka my past year) > is scheduled to come out (I finished it last week). The Book uses > model_selection, so having > the release out before the book would be good. > > Andy > > On 07/25/2016 07:54 AM, Olivier Grisel wrote: >> Sorry for the late reply, >> >> Before working on this release I would like to automate the wheel >> generation process (for the release wheels) in a single repo that will >> generate wheels for linux, osx and windows based on >> https://github.com/matthew-brett/multibuild >> >> I plan to put that repo under >> https://github.com/scikit-learn/scikit-learn-wheels and deprecate >> https://github.com/MacPython/scikit-learn-wheels that we used for the >> OSX wheels. >> >> There is also some issue triaging to do, it would be great to identify >> blocker bugs that we would like to get fixed before releasing 0.18. >> >> We can aim to do a beta mid-August and the final release after >> euroscipy (first week of September). 
>> > From bertrand.thirion at inria.fr Wed Aug 17 16:57:00 2016 From: bertrand.thirion at inria.fr (bthirion) Date: Wed, 17 Aug 2016 22:57:00 +0200 Subject: [scikit-learn] 0.18? In-Reply-To: References: <577D4D41.60102@gmail.com> Message-ID: Many thanks ! Bertrand On 17/08/2016 16:43, Olivier Grisel wrote: > Ok I fixed all the 32 bit Linux & OSX build issues in scikit-learn > master and all wheels builds are green with the multibuild setup: > > https://travis-ci.org/MacPython/scikit-learn-wheels > > Matthew: would you be interested in having the multibuild repo > extended to also include appveyor configration files or do you think > it's better to let projects owner do their own appveyor config by > themselves? > From mail at sebastianraschka.com Thu Aug 18 11:44:42 2016 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Thu, 18 Aug 2016 11:44:42 -0400 Subject: [scikit-learn] update pydata schedule Message-ID: <4EC4AB2F-EA4B-4794-AECE-97FAB72C79AA@sebastianraschka.com> From mail at sebastianraschka.com Thu Aug 18 12:51:54 2016 From: mail at sebastianraschka.com (mail at sebastianraschka.com) Date: Thu, 18 Aug 2016 12:51:54 -0400 Subject: [scikit-learn] update pydata schedule In-Reply-To: <4EC4AB2F-EA4B-4794-AECE-97FAB72C79AA@sebastianraschka.com> References: <4EC4AB2F-EA4B-4794-AECE-97FAB72C79AA@sebastianraschka.com> Message-ID: Sorry for this previous Email, please disregard. This was a reminder to myself and I somehow sent it to the wrong recipient. Sent from my iPhone > On Aug 18, 2016, at 11:44 AM, Sebastian Raschka wrote: > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From shanglunwang at gmail.com Fri Aug 19 19:59:06 2016 From: shanglunwang at gmail.com (Shanglun Wang) Date: Fri, 19 Aug 2016 19:59:06 -0400 Subject: [scikit-learn] Help with improving t-sne Message-ID: Hello, I am currently working on a ticket on github involving improving the data structures powering t-sne. I am running into some trouble trying to conceptually link up what the code is doing and the underlying mathematical theory. Normally I would just grapple with it, but I feel like I would need some help to get this ticket done in a reasonable time frame. Would someone be willing to help me understand the theory underpinning t-sne, and how that links up with the implementation? Thank you, Sean -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From brookm291 at gmail.com Sun Aug 21 05:33:35 2016 From: brookm291 at gmail.com (KevNo) Date: Sun, 21 Aug 2016 18:33:35 +0900 Subject: [scikit-learn] Building Scikit Learn in Win 7 64bits Message-ID: <57B9756F.40100@gmail.com> Hi, I follow the instructions here: to compile Scikit-learn for development https://github.com/scikit-learn/scikit-learn/issues/5709 1) Download from Git into a repository scikit_learn 2) Add the path to Python Path 3) In IPython !!python D:\_devs\Python01\scikit_learn\sklearn\setup.py build_ext --inplace and I got this message: |Assuming default configuration (scikit_learn\\sklearn\\svm\\tests/{setup_tests,setup}.py was not found)D:\\\\_devs\\\\Python01\\\\scikit_learn\\\\sklearn\\\\setup.py:71: UserWarning: ', ' Blas (http://www.netlib.org/blas/) libraries not found.', ' Directories to search for the libraries can be specified in the', ' numpy/distutils/site.cfg file (section [blas]) or by setting', ' the BLAS environment variable.', ' warnings.warn(BlasNotFoundError.__doc__)', 'Warning: Assuming default configuration (scikit_learn\\sklearn\\linear_model\\tests/{setup_tests,setup}.py was not found)Warning: Assuming default configuration (scikit_learn\\sklearn\\utils\\sparsetools\\tests/{setup_tests,setup}.py was not found)Warning: Assuming default configuration (scikit_learn\\sklearn\\utils\\tests/{setup_tests,setup}.py was not found)Warning: Assuming default configuration (scikit_learn\\sklearn\\tests/{setup_tests,setup}.py was not found)gcc.exe: error: _check_build.c: No such file or directory', 'gcc.exe: fatal error: no input files', 'compilation terminated.', 'error: Command "gcc -m64 -g -DNDEBUG -DMS_WIN64 -O2 -Wall -Wstrict-prototypes -D__MSVCRT_VERSION__=0x0900 -ID:\\_devs\\Python01\\WinPython-64-2710\\python-2.7.10.amd64\\lib\\site-packages\\numpy\\core\\include -ID:\\_devs\\Python01\\WinPython-64-2710\\python-2.7.10.amd64\\lib\\site-packages\\numpy\\core\\include -ID:\\_devs\\Python01\\WinPython-64-2710\\python-2.7.10.amd64\\include -ID:\\_devs\\Python01\\WinPython-64-2710\\python-2.7.10.amd64\\PC -c _check_build.c -o build\\temp.win-amd64-2.7\\Release\\_check_build.o" failed with exit status 1']| Environnment WinPython 2.7 64bits Windows-7-6.1.7601-SP1 ('Python', '2.7.10 (default, May 23 2015, 09:44:00) [MSC v.1500 64 bit (AMD64)]') ('NumPy', '1.9.2') ('SciPy', '0.16.0' ) Just wondering if possible to get a place where we can compile without spending hours and days on the web to find the issues ? Thanks Brook -------------- next part -------------- An HTML attachment was scrubbed... URL: From olivier.grisel at ensta.org Mon Aug 22 05:43:21 2016 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Mon, 22 Aug 2016 11:43:21 +0200 Subject: [scikit-learn] Building Scikit Learn in Win 7 64bits In-Reply-To: <57B9756F.40100@gmail.com> References: <57B9756F.40100@gmail.com> Message-ID: The error message mentions gcc. Have you installed some mingw version? As of now our windows build is only properly tested with the Visual Studio C++ compiler from appveyor: https://ci.appveyor.com/project/sklearn-ci/scikit-learn I have not tested the build with mingwpy in a while (I am not a windows user my-self). 
The file not found error makes me think that you might need to cd into the scikit-learn source folder: !!cd D:\_devs\Python01\scikit_learn\sklearn !!python setup.py build_ext --inplace -- Olivier From joel.nothman at gmail.com Mon Aug 22 06:22:23 2016 From: joel.nothman at gmail.com (Joel Nothman) Date: Mon, 22 Aug 2016 20:22:23 +1000 Subject: [scikit-learn] Building Scikit Learn in Win 7 64bits In-Reply-To: References: <57B9756F.40100@gmail.com> Message-ID: You could also use ! pip install D:\_devs\Python01\scikit_learn\sklearn or indeed ! pip install git+https://github.com/scikit-learn/scikit-learn/ if you don't actually want to use the directory with the source code in it. On 22 August 2016 at 19:43, Olivier Grisel wrote: > The error message mentions gcc. Have you installed some mingw version? > > As of now our windows build is only properly tested with the Visual > Studio C++ compiler from appveyor: > > https://ci.appveyor.com/project/sklearn-ci/scikit-learn > > I have not tested the build with mingwpy in a while (I am not a > windows user my-self). > > The file not found error makes me think that you might need to cd into > the scikit-learn source folder: > > !!cd D:\_devs\Python01\scikit_learn\sklearn > !!python setup.py build_ext --inplace > > -- > Olivier > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From olivier.grisel at ensta.org Mon Aug 22 09:33:07 2016 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Mon, 22 Aug 2016 15:33:07 +0200 Subject: [scikit-learn] 0.18? In-Reply-To: References: <577D4D41.60102@gmail.com> <69c5d973-f798-f5a1-a4a3-1ee43e2c1a36@gmail.com> Message-ID: Ok for pushing back. Let's try to work on the beta on the week after euroscipy if we can. At least all the annoying binary packaging issues are fixed (test failures for the linux and OSX 32 bit platforms) so the release process itself should hopefully be painless. -- Olivier From t3kcit at gmail.com Mon Aug 22 14:30:51 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Mon, 22 Aug 2016 14:30:51 -0400 Subject: [scikit-learn] Help with improving t-sne In-Reply-To: References: Message-ID: <9cfda285-e073-851f-f4dd-963a76b62a9b@gmail.com> Hi Sean. Thanks for working on this. Do you have any more specific questions? Have you looked at the barnes-hut paper? Cheers, Andy On 08/19/2016 07:59 PM, Shanglun Wang wrote: > > Hello, > > I am currently working on a ticket on github involving improving the > data structures powering t-sne. I am running into some trouble trying > to conceptually link up what the code is doing and the underlying > mathematical theory. Normally I would just grapple with it, but I feel > like I would need some help to get this ticket done in a reasonable > time frame. > > Would someone be willing to help me understand the theory underpinning > t-sne, and how that links up with the implementation? > > Thank you, > > Sean > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From brookm291 at gmail.com Tue Aug 23 13:23:34 2016 From: brookm291 at gmail.com (KevNo) Date: Wed, 24 Aug 2016 02:23:34 +0900 Subject: [scikit-learn] Building Scikit Learn in Win 7 64bits In-Reply-To: References: Message-ID: <57BC8696.3080206@gmail.com> Hello, Thanks for yoru advice/reply. I tried to build from VC after following Instructions of VS Python 2.7 compiler. Steps ares: |1) Git download 2) VC++ for Python: https://blogs.msdn.microsoft.com/pythonengineering/2016/04/11/unable-to-find-vcvarsall-bat/#comment-515 3) Change thePath for compiler VStudio 4) in|!!cd D:\_devs\Python01\scikit_learn\sklearn (folder of sklearn) |5) python setup.py build | I have this message: |building'sklearn.__check_build._check_build' extension compiling C sources cl.exe/c/nologo/Ox /MD/W3/GS- /DNDEBUG-ID:\_devs\Python01\Anaconda2\lib\si te-packages\numpy\core\include-ID:\_devs\Python01\Anaconda2\lib\site-packages\n umpy\core\include-ID:\_devs\Python01\Anaconda2\include-ID:\_devs\Python01\Anac onda2\PC/Tc_check_build.c/Fobuild\temp.win-amd64-2.7\Release\_check_build.obj Found executable C:\Users\asus1\AppData\Local\Programs\Common\Microsoft\Visual C ++ for Python\9.0\VC\Bin\amd64\cl.exe _check_build.c c1: fatal error C1083: Cannot open source file: '_check_build.c': No such file or directory| I dont have any idea where it could come from Thanks Brook > scikit-learn-request at python.org > Tuesday, August 23, 2016 1:00 AM > Send scikit-learn mailing list submissions to > scikit-learn at python.org > > To subscribe or unsubscribe via the World Wide Web, visit > https://mail.python.org/mailman/listinfo/scikit-learn > or, via email, send a message with subject or body 'help' to > scikit-learn-request at python.org > > You can reach the person managing the list at > scikit-learn-owner at python.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of scikit-learn digest..." > > > Today's Topics: > > 1. Re: Building Scikit Learn in Win 7 64bits (Olivier Grisel) > 2. Re: Building Scikit Learn in Win 7 64bits (Joel Nothman) > 3. Re: 0.18? (Olivier Grisel) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Mon, 22 Aug 2016 11:43:21 +0200 > From: Olivier Grisel > To: Scikit-learn user and developer mailing list > > Subject: Re: [scikit-learn] Building Scikit Learn in Win 7 64bits > Message-ID: > > Content-Type: text/plain; charset=UTF-8 > > The error message mentions gcc. Have you installed some mingw version? > > As of now our windows build is only properly tested with the Visual > Studio C++ compiler from appveyor: > > https://ci.appveyor.com/project/sklearn-ci/scikit-learn > > I have not tested the build with mingwpy in a while (I am not a > windows user my-self). > > The file not found error makes me think that you might need to cd into > the scikit-learn source folder: > > !!cd D:\_devs\Python01\scikit_learn\sklearn > !!python setup.py build_ext --inplace > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: compose-unknown-contact.jpg Type: image/jpeg Size: 770 bytes Desc: not available URL: From aadral at gmail.com Tue Aug 23 16:24:30 2016 From: aadral at gmail.com (=?UTF-8?B?0JDQu9C10LrRgdC10Lkg0JTRgNCw0LvRjA==?=) Date: Tue, 23 Aug 2016 21:24:30 +0100 Subject: [scikit-learn] GradientBoostingRegressor, question about initialisation with MeanEstimator Message-ID: Hi there, I recently found out that GradientBoostingRegressor uses MeanEstimator for the initial estimator in ensemble. Could you please point out (or explain) to the research showing superiority of this approach compared to the usage of DecisionTreeRegressor? -- Yours sincerely, Alexey A. Dral -------------- next part -------------- An HTML attachment was scrubbed... URL: From siddhantloya2008 at gmail.com Fri Aug 26 03:08:55 2016 From: siddhantloya2008 at gmail.com (Siddhant Loya) Date: Fri, 26 Aug 2016 12:38:55 +0530 Subject: [scikit-learn] Fitting a plane to a 3D points Cloud Message-ID: I have been trying to use Ransac to fit a plane to a 3D point cloud. I am not able to understand on how to do this on 3D data. I have already posted a question on SO. Link :- http://stackoverflow.com/questions/39159102/fit-a-plane-to-3d-point-cloud-using-ransac?noredirect=1#comment65663410_39159102 I am not able to understand how to solve this for a plane instead of 2-D line. Regards, Siddhant -------------- next part -------------- An HTML attachment was scrubbed... URL: From rth.yurchak at gmail.com Fri Aug 26 10:09:15 2016 From: rth.yurchak at gmail.com (Roman Yurchak) Date: Fri, 26 Aug 2016 16:09:15 +0200 Subject: [scikit-learn] Latent Semantic Analysis (LSA) and TrucatedSVD Message-ID: <57C04D8B.40001@gmail.com> Hi all, I have a question about using the TruncatedSVD method for performing Latent Semantic Analysis/Indexing (LSA/LSI). The docs imply that simply applying TruncatedSVD to a tf-idf matrice is sufficient (cf. http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html), but I'm wondering about that. As far as I understood for LSA one computes a truncated SVD decomposition of the tf-idf matrix X (n_features x n_samples), X ? U @ Sigma @ V.T and then for a document vector d, the projection is computed as, d_proj = d.T @ U @ Sigma?? (source: http://nlp.stanford.edu/IR-book/pdf/18lsi.pdf) However, TruncatedSVD.fit_transform only computes, d_proj = d.T @ U and what's more does not store the singular values (Sigma) internally, so it cannot be easily applied afterwards. (the above notation are transposed with respect to those in the scikit learn docs). For instance, I have tried reproducing LSA decomposition from literature and I'm not getting the expected results unless I perform an additional normalization by the Sigma matrix: https://gist.github.com/rth/3af30c60bece7db4207821a6dddc5e8d I was wondering if I am missing something here? Thank you, -- Roman From t3kcit at gmail.com Fri Aug 26 10:55:41 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Fri, 26 Aug 2016 10:55:41 -0400 Subject: [scikit-learn] Latent Semantic Analysis (LSA) and TrucatedSVD In-Reply-To: <57C04D8B.40001@gmail.com> References: <57C04D8B.40001@gmail.com> Message-ID: <532083f1-0647-989d-6f35-2a83176199ea@gmail.com> Looks like they apply whitening, which is not implemented in TruncatedSVD. I guess we could add that option. It's equivalent to using a StandardScaler after the TruncatedSVD. Can you try and see if that reproduces the results? 
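A minimal sketch of the Sigma normalization being discussed, on a toy corpus (the corpus and variable names are illustrative only). Since fit_transform returns approximately U @ Sigma and the columns of U are unit-norm, the singular values can be recovered as column norms and divided out, which gives the Sigma^-1-scaled projection from the IR-book formulation:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["human machine interface", "graph of trees",
        "user interface system", "trees graph minors"]

X = TfidfVectorizer().fit_transform(docs)   # n_samples x n_features
svd = TruncatedSVD(n_components=2)
X_lsa = svd.fit_transform(X)                # approximately U @ Sigma

# Columns of U are (approximately) unit-norm, so the singular values can
# be read off as the column norms of the transformed matrix ...
sigma = np.linalg.norm(X_lsa, axis=0)
# ... and dividing them out gives the Sigma^-1-normalized projection.
X_lsa_normalized = X_lsa / sigma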
On 08/26/2016 10:09 AM, Roman Yurchak wrote: > Hi all, > > I have a question about using the TruncatedSVD method for performing > Latent Semantic Analysis/Indexing (LSA/LSI). The docs imply that simply > applying TruncatedSVD to a tf-idf matrice is sufficient (cf. > http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html), > but I'm wondering about that. > > As far as I understood for LSA one computes a truncated SVD > decomposition of the tf-idf matrix X (n_features x n_samples), > X ? U @ Sigma @ V.T > and then for a document vector d, the projection is computed as, > d_proj = d.T @ U @ Sigma?? > (source: http://nlp.stanford.edu/IR-book/pdf/18lsi.pdf) > However, TruncatedSVD.fit_transform only computes, > d_proj = d.T @ U > and what's more does not store the singular values (Sigma) internally, > so it cannot be easily applied afterwards. > (the above notation are transposed with respect to those in the scikit > learn docs). > > For instance, I have tried reproducing LSA decomposition from literature > and I'm not getting the expected results unless I perform an additional > normalization by the Sigma matrix: > https://gist.github.com/rth/3af30c60bece7db4207821a6dddc5e8d > > I was wondering if I am missing something here? > Thank you, From elgesto at gmail.com Sat Aug 27 05:33:19 2016 From: elgesto at gmail.com (elgesto at gmail.com) Date: Sat, 27 Aug 2016 12:33:19 +0300 Subject: [scikit-learn] Fwd: inconsistency between libsvm and scikit-learn.svc results In-Reply-To: References: Message-ID: I have a project that is based on the SVM algorithm implemented by libsvm . Recently I decided to try several other classification algorithms, which is where scikit-learn comes into the picture. The connection to scikit-learn was pretty straightforward: it supports the libsvm format via the load_svmlight_file routine, and its SVM implementation is based on the same libsvm. When everything was done, I decided to check the consistency of the results by directly running libsvm and via scikit-learn, and the results were different. Among 18 measures in the learning curves, 7 were different, and the differences are located at the small steps of the learning curve. The libsvm results seem much more stable, but the scikit-learn results show some drastic fluctuation. The classifiers have exactly the same parameters, of course. I tried to check the version of libsvm used in the scikit-learn implementation, but I didn't find it; the only thing I found was the libsvm.so file. Currently I am using libsvm version 3.21 and scikit-learn version 0.17.1. I would appreciate any help in addressing this issue.
size  libsvm               scikit-learn
1     0.1336239435355727   0.1336239435355727
2     0.08699516468193455  0.08699516468193455
3     0.32928301642777424  0.2117238289550198   #different
4     0.2835688734876902   0.2835688734876902
5     0.27846766962743097  0.26651875338163966  #different
6     0.2853854654662907   0.18898048915599963  #different
7     0.28196058132165136  0.28196058132165136
8     0.31473956032575623  0.1958710201604552   #different
9     0.33588303670653136  0.2101641630182972   #different
10    0.4075242509025311   0.2997807499800962   #different
15    0.4391771087975972   0.4391771087975972
20    0.3837789445609818   0.2713167833345173   #different
25    0.4252154334940311   0.4252154334940311
30    0.4256407777477492   0.4256407777477492
35    0.45314944605858387  0.45314944605858387
40    0.4278633233755064   0.4278633233755064
45    0.46174762022239796  0.46174762022239796
50    0.45370452524846866  0.45370452524846866
-------------- next part -------------- An HTML attachment was scrubbed...
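One way to rule out a silent mismatch in hyper-parameters between the two runs is to pin the SVC arguments explicitly to the defaults used by the svm-train command line; a minimal sketch under that assumption (the file name is a placeholder, and this does not address differences between the libsvm copy bundled with scikit-learn and a newer standalone build):

from sklearn.datasets import load_svmlight_file
from sklearn.svm import SVC

X, y = load_svmlight_file("train.libsvm")   # placeholder path

# Mirror the svm-train defaults: C-SVC, RBF kernel, gamma = 1/n_features,
# stopping tolerance 1e-3, shrinking enabled, 100 MB kernel cache.
clf = SVC(C=1.0, kernel="rbf", gamma=1.0 / X.shape[1],
          tol=1e-3, shrinking=True, cache_size=100)
clf.fit(X, y)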
URL: From olologin at gmail.com Sat Aug 27 05:49:26 2016 From: olologin at gmail.com (olologin) Date: Sat, 27 Aug 2016 12:49:26 +0300 Subject: [scikit-learn] Fwd: inconsistency between libsvm and scikit-learn.svc results In-Reply-To: References: Message-ID: On 08/27/2016 12:33 PM, elgesto at gmail.com wrote: > > I have a project that is based on SVM algorithm implemented by libsvm > . Recently I decided to > try several other classification algorithm, this is where scikit-learn > comes to the picture. > > The connection to the scikit was pretty straightforward, it supports > libsvm format by |load_svmlight_file| routine. Ans it's svm > implementation is based on the same libsvm. > > When everything was done, I decided to the check the consistence of > the results by directly running libsvm and via scikit-learn, and the > results were different. Among 18 measures in learning curves, 7 were > different, and the difference is located at the small steps of the > learning curve. The libsvm results seems much more stable, but > scikit-learn results have some drastic fluctuation. > > The classifiers have exactly the same parameters of course. I tried to > check the version of libsvm in scikit-learn implementation, but I > din't find it, the only thing I found was libsvm.so file. > > Currently I am using libsvm 3.21 version, and scikit-learn 0.17.1 version. > > I wound appreciate any help in addressing this issue. > > > |size libsvm scikit-learn 1 0.1336239435355727 0.1336239435355727 2 > 0.08699516468193455 0.08699516468193455 3 0.32928301642777424 > 0.2117238289550198 #different 4 0.2835688734876902 0.2835688734876902 > 5 0.27846766962743097 0.26651875338163966 #different 6 > 0.2853854654662907 0.18898048915599963 #different 7 > 0.28196058132165136 0.28196058132165136 8 0.31473956032575623 > 0.1958710201604552 #different 9 0.33588303670653136 0.2101641630182972 > #different 10 0.4075242509025311 0.2997807499800962 #different 15 > 0.4391771087975972 0.4391771087975972 20 0.3837789445609818 > 0.2713167833345173 #different 25 0.4252154334940311 0.4252154334940311 > 30 0.4256407777477492 0.4256407777477492 35 0.45314944605858387 > 0.45314944605858387 40 0.4278633233755064 0.4278633233755064 45 > 0.46174762022239796 0.46174762022239796 50 0.45370452524846866 > 0.45370452524846866| > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn This might be because current version of libsvm used in scikit is 3.10 from 2011. With some patch imported from upstream. -------------- next part -------------- An HTML attachment was scrubbed... URL: From elgesto at gmail.com Sat Aug 27 07:19:28 2016 From: elgesto at gmail.com (elgesto at gmail.com) Date: Sat, 27 Aug 2016 14:19:28 +0300 Subject: [scikit-learn] Fwd: inconsistency between libsvm and scikit-learn.svc results In-Reply-To: References: Message-ID: Can I update the libsvm version by myself? 2016-08-27 12:49 GMT+03:00 olologin : > On 08/27/2016 12:33 PM, elgesto at gmail.com wrote: > > I have a project that is based on SVM algorithm implemented by libsvm > . Recently I decided to try > several other classification algorithm, this is where scikit-learn > comes to the picture. > > The connection to the scikit was pretty straightforward, it supports > libsvm format by load_svmlight_file routine. Ans it's svm implementation > is based on the same libsvm. 
> > When everything was done, I decided to the check the consistence of the > results by directly running libsvm and via scikit-learn, and the results > were different. Among 18 measures in learning curves, 7 were different, and > the difference is located at the small steps of the learning curve. The > libsvm results seems much more stable, but scikit-learn results have some > drastic fluctuation. > > The classifiers have exactly the same parameters of course. I tried to > check the version of libsvm in scikit-learn implementation, but I din't > find it, the only thing I found was libsvm.so file. > > Currently I am using libsvm 3.21 version, and scikit-learn 0.17.1 version. > > I wound appreciate any help in addressing this issue. > > > size libsvm scikit-learn > 1 0.1336239435355727 0.1336239435355727 > 2 0.08699516468193455 0.08699516468193455 > 3 0.32928301642777424 0.2117238289550198 #different > 4 0.2835688734876902 0.2835688734876902 > 5 0.27846766962743097 0.26651875338163966 #different > 6 0.2853854654662907 0.18898048915599963 #different > 7 0.28196058132165136 0.28196058132165136 > 8 0.31473956032575623 0.1958710201604552 #different > 9 0.33588303670653136 0.2101641630182972 #different > 10 0.4075242509025311 0.2997807499800962 #different > 15 0.4391771087975972 0.4391771087975972 > 20 0.3837789445609818 0.2713167833345173 #different > 25 0.4252154334940311 0.4252154334940311 > 30 0.4256407777477492 0.4256407777477492 > 35 0.45314944605858387 0.45314944605858387 > 40 0.4278633233755064 0.4278633233755064 > 45 0.46174762022239796 0.46174762022239796 > 50 0.45370452524846866 0.45370452524846866 > > > > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > This might be because current version of libsvm used in scikit is 3.10 > from 2011. With some patch imported from upstream. > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From olologin at gmail.com Sat Aug 27 08:36:56 2016 From: olologin at gmail.com (olologin) Date: Sat, 27 Aug 2016 15:36:56 +0300 Subject: [scikit-learn] Fwd: inconsistency between libsvm and scikit-learn.svc results In-Reply-To: References: Message-ID: <9c1c30b5-a11f-5a01-2b68-8368d7ed486c@gmail.com> On 08/27/2016 02:19 PM, elgesto at gmail.com wrote: > Can I update the libsvm version by myself? > > 2016-08-27 12:49 GMT+03:00 olologin >: > > On 08/27/2016 12:33 PM, elgesto at gmail.com > wrote: >> >> I have a project that is based on SVM algorithm implemented by >> libsvm . Recently I >> decided to try several other classification algorithm, this is >> where scikit-learn comes to the picture. >> >> The connection to the scikit was pretty straightforward, it >> supports libsvm format by |load_svmlight_file| routine. Ans it's >> svm implementation is based on the same libsvm. >> >> When everything was done, I decided to the check the consistence >> of the results by directly running libsvm and via scikit-learn, >> and the results were different. Among 18 measures in learning >> curves, 7 were different, and the difference is located at the >> small steps of the learning curve. The libsvm results seems much >> more stable, but scikit-learn results have some drastic fluctuation. >> >> The classifiers have exactly the same parameters of course. 
I >> tried to check the version of libsvm in scikit-learn >> implementation, but I din't find it, the only thing I found was >> libsvm.so file. >> >> Currently I am using libsvm 3.21 version, and scikit-learn 0.17.1 >> version. >> >> I wound appreciate any help in addressing this issue. >> >> >> |size libsvm scikit-learn 1 0.1336239435355727 0.1336239435355727 >> 2 0.08699516468193455 0.08699516468193455 3 0.32928301642777424 >> 0.2117238289550198 #different 4 0.2835688734876902 >> 0.2835688734876902 5 0.27846766962743097 0.26651875338163966 >> #different 6 0.2853854654662907 0.18898048915599963 #different 7 >> 0.28196058132165136 0.28196058132165136 8 0.31473956032575623 >> 0.1958710201604552 #different 9 0.33588303670653136 >> 0.2101641630182972 #different 10 0.4075242509025311 >> 0.2997807499800962 #different 15 0.4391771087975972 >> 0.4391771087975972 20 0.3837789445609818 0.2713167833345173 >> #different 25 0.4252154334940311 0.4252154334940311 30 >> 0.4256407777477492 0.4256407777477492 35 0.45314944605858387 >> 0.45314944605858387 40 0.4278633233755064 0.4278633233755064 45 >> 0.46174762022239796 0.46174762022239796 50 0.45370452524846866 >> 0.45370452524846866| >> >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > This might be because current version of libsvm used in scikit is > 3.10 from 2011. With some patch imported from upstream. > > _______________________________________________ scikit-learn > mailing list scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn I don't think it is so easy, version which is used in scikit-learn has many additional modifications. from header of svm.cpp: /* Modified 2010: - Support for dense data by Ming-Fang Weng - Return indices for support vectors, Fabian Pedregosa - Fixes to avoid name collision, Fabian Pedregosa - Add support for instance weights, Fabian Pedregosa based on work by Ming-Wei Chang, Hsuan-Tien Lin, Ming-Hen Tsai, Chia-Hua Ho and Hsiang-Fu Yu, . - Make labels sorted in svm_group_classes, Fabian Pedregosa. */ -------------- next part -------------- An HTML attachment was scrubbed... URL: From elgesto at gmail.com Sat Aug 27 09:42:20 2016 From: elgesto at gmail.com (elgesto at gmail.com) Date: Sat, 27 Aug 2016 16:42:20 +0300 Subject: [scikit-learn] Fwd: inconsistency between libsvm and scikit-learn.svc results In-Reply-To: <9c1c30b5-a11f-5a01-2b68-8368d7ed486c@gmail.com> References: <9c1c30b5-a11f-5a01-2b68-8368d7ed486c@gmail.com> Message-ID: So there is no possibility to reach a consistency? 2016-08-27 15:36 GMT+03:00 olologin : > On 08/27/2016 02:19 PM, elgesto at gmail.com wrote: > > Can I update the libsvm version by myself? > > 2016-08-27 12:49 GMT+03:00 olologin : > >> On 08/27/2016 12:33 PM, elgesto at gmail.com wrote: >> >> I have a project that is based on SVM algorithm implemented by libsvm >> . Recently I decided to >> try several other classification algorithm, this is where scikit-learn >> comes to the picture. >> >> The connection to the scikit was pretty straightforward, it supports >> libsvm format by load_svmlight_file routine. Ans it's svm implementation >> is based on the same libsvm. 
>> >> When everything was done, I decided to the check the consistence of the >> results by directly running libsvm and via scikit-learn, and the results >> were different. Among 18 measures in learning curves, 7 were different, and >> the difference is located at the small steps of the learning curve. The >> libsvm results seems much more stable, but scikit-learn results have some >> drastic fluctuation. >> >> The classifiers have exactly the same parameters of course. I tried to >> check the version of libsvm in scikit-learn implementation, but I din't >> find it, the only thing I found was libsvm.so file. >> >> Currently I am using libsvm 3.21 version, and scikit-learn 0.17.1 version. >> >> I wound appreciate any help in addressing this issue. >> >> >> size libsvm scikit-learn >> 1 0.1336239435355727 0.1336239435355727 >> 2 0.08699516468193455 0.08699516468193455 >> 3 0.32928301642777424 0.2117238289550198 #different >> 4 0.2835688734876902 0.2835688734876902 >> 5 0.27846766962743097 0.26651875338163966 #different >> 6 0.2853854654662907 0.18898048915599963 #different >> 7 0.28196058132165136 0.28196058132165136 >> 8 0.31473956032575623 0.1958710201604552 #different >> 9 0.33588303670653136 0.2101641630182972 #different >> 10 0.4075242509025311 0.2997807499800962 #different >> 15 0.4391771087975972 0.4391771087975972 >> 20 0.3837789445609818 0.2713167833345173 #different >> 25 0.4252154334940311 0.4252154334940311 >> 30 0.4256407777477492 0.4256407777477492 >> 35 0.45314944605858387 0.45314944605858387 >> 40 0.4278633233755064 0.4278633233755064 >> 45 0.46174762022239796 0.46174762022239796 >> 50 0.45370452524846866 0.45370452524846866 >> >> >> >> >> _______________________________________________ >> scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn >> >> This might be because current version of libsvm used in scikit is 3.10 >> from 2011. With some patch imported from upstream. >> _______________________________________________ scikit-learn mailing >> list scikit-learn at python.org https://mail.python.org/mailma >> n/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > I don't think it is so easy, version which is used in scikit-learn has > many additional modifications. > > from header of svm.cpp: /* Modified 2010: - Support for dense data > by Ming-Fang Weng - Return indices for support vectors, Fabian Pedregosa > - Fixes > to avoid name collision, Fabian Pedregosa - Add support for instance > weights, Fabian Pedregosa based on work by Ming-Wei Chang, Hsuan-Tien > Lin, Ming-Hen Tsai, Chia-Hua Ho and Hsiang-Fu Yu, > for_data_instances> > . > - Make labels sorted in svm_group_classes, Fabian Pedregosa. */ > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Sat Aug 27 09:48:17 2016 From: joel.nothman at gmail.com (Joel Nothman) Date: Sat, 27 Aug 2016 23:48:17 +1000 Subject: [scikit-learn] Fwd: inconsistency between libsvm and scikit-learn.svc results In-Reply-To: References: <9c1c30b5-a11f-5a01-2b68-8368d7ed486c@gmail.com> Message-ID: I don't think we should assume that this is the only possible reason for inconsistency. 
Could you give us a small snippet of data and code on which you find this inconsistency? On 27 August 2016 at 23:42, elgesto at gmail.com wrote: > So there is no possibility to reach a consistency? > > 2016-08-27 15:36 GMT+03:00 olologin : > >> On 08/27/2016 02:19 PM, elgesto at gmail.com wrote: >> >> Can I update the libsvm version by myself? >> >> 2016-08-27 12:49 GMT+03:00 olologin : >> >>> On 08/27/2016 12:33 PM, elgesto at gmail.com wrote: >>> >>> I have a project that is based on SVM algorithm implemented by libsvm >>> . Recently I decided to >>> try several other classification algorithm, this is where scikit-learn >>> comes to the picture. >>> >>> The connection to the scikit was pretty straightforward, it supports >>> libsvm format by load_svmlight_file routine. Ans it's svm >>> implementation is based on the same libsvm. >>> >>> When everything was done, I decided to the check the consistence of the >>> results by directly running libsvm and via scikit-learn, and the results >>> were different. Among 18 measures in learning curves, 7 were different, and >>> the difference is located at the small steps of the learning curve. The >>> libsvm results seems much more stable, but scikit-learn results have some >>> drastic fluctuation. >>> >>> The classifiers have exactly the same parameters of course. I tried to >>> check the version of libsvm in scikit-learn implementation, but I din't >>> find it, the only thing I found was libsvm.so file. >>> >>> Currently I am using libsvm 3.21 version, and scikit-learn 0.17.1 >>> version. >>> >>> I wound appreciate any help in addressing this issue. >>> >>> >>> size libsvm scikit-learn >>> 1 0.1336239435355727 0.1336239435355727 >>> 2 0.08699516468193455 0.08699516468193455 >>> 3 0.32928301642777424 0.2117238289550198 #different >>> 4 0.2835688734876902 0.2835688734876902 >>> 5 0.27846766962743097 0.26651875338163966 #different >>> 6 0.2853854654662907 0.18898048915599963 #different >>> 7 0.28196058132165136 0.28196058132165136 >>> 8 0.31473956032575623 0.1958710201604552 #different >>> 9 0.33588303670653136 0.2101641630182972 #different >>> 10 0.4075242509025311 0.2997807499800962 #different >>> 15 0.4391771087975972 0.4391771087975972 >>> 20 0.3837789445609818 0.2713167833345173 #different >>> 25 0.4252154334940311 0.4252154334940311 >>> 30 0.4256407777477492 0.4256407777477492 >>> 35 0.45314944605858387 0.45314944605858387 >>> 40 0.4278633233755064 0.4278633233755064 >>> 45 0.46174762022239796 0.46174762022239796 >>> 50 0.45370452524846866 0.45370452524846866 >>> >>> >>> >>> >>> _______________________________________________ >>> scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn >>> >>> This might be because current version of libsvm used in scikit is 3.10 >>> from 2011. With some patch imported from upstream. >>> _______________________________________________ scikit-learn mailing >>> list scikit-learn at python.org https://mail.python.org/mailma >>> n/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn >> >> I don't think it is so easy, version which is used in scikit-learn has >> many additional modifications. 
>> >> from header of svm.cpp: /* Modified 2010: - Support for dense data >> by Ming-Fang Weng - Return indices for support vectors, Fabian Pedregosa >> - Fixes >> to avoid name collision, Fabian Pedregosa - Add support for instance >> weights, Fabian Pedregosa based on work by Ming-Wei Chang, Hsuan-Tien >> Lin, Ming-Hen Tsai, Chia-Hua Ho and Hsiang-Fu Yu, >> > data_instances> >> . >> - Make labels sorted in svm_group_classes, Fabian Pedregosa. */ >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ross at cgl.ucsf.edu Sat Aug 27 09:55:10 2016 From: ross at cgl.ucsf.edu (Bill Ross) Date: Sat, 27 Aug 2016 06:55:10 -0700 Subject: [scikit-learn] Fwd: inconsistency between libsvm and scikit-learn.svc results In-Reply-To: References: <9c1c30b5-a11f-5a01-2b68-8368d7ed486c@gmail.com> Message-ID: <239c735a-fb69-f2a4-6dc8-235dca562bf2@cgl.ucsf.edu> One logical possibility is if svm would accept the scikit-learn changes. On 8/27/16 6:42 AM, elgesto at gmail.com wrote: > So there is no possibility to reach a consistency? > > 2016-08-27 15:36 GMT+03:00 olologin >: > > On 08/27/2016 02:19 PM, elgesto at gmail.com > wrote: >> Can I update the libsvm version by myself? >> >> 2016-08-27 12:49 GMT+03:00 olologin > >: >> >> On 08/27/2016 12:33 PM, elgesto at gmail.com >> wrote: >>> >>> I have a project that is based on SVM algorithm implemented >>> by libsvm . >>> Recently I decided to try several other classification >>> algorithm, this is where scikit-learn >>> comes to the picture. >>> >>> The connection to the scikit was pretty straightforward, it >>> supports libsvm format by |load_svmlight_file| routine. Ans >>> it's svm implementation is based on the same libsvm. >>> >>> When everything was done, I decided to the check the >>> consistence of the results by directly running libsvm and >>> via scikit-learn, and the results were different. Among 18 >>> measures in learning curves, 7 were different, and the >>> difference is located at the small steps of the learning >>> curve. The libsvm results seems much more stable, but >>> scikit-learn results have some drastic fluctuation. >>> >>> The classifiers have exactly the same parameters of course. >>> I tried to check the version of libsvm in scikit-learn >>> implementation, but I din't find it, the only thing I found >>> was libsvm.so file. >>> >>> Currently I am using libsvm 3.21 version, and scikit-learn >>> 0.17.1 version. >>> >>> I wound appreciate any help in addressing this issue. 
>>> >>> >>> |size libsvm scikit-learn 1 0.1336239435355727 >>> 0.1336239435355727 2 0.08699516468193455 0.08699516468193455 >>> 3 0.32928301642777424 0.2117238289550198 #different 4 >>> 0.2835688734876902 0.2835688734876902 5 0.27846766962743097 >>> 0.26651875338163966 #different 6 0.2853854654662907 >>> 0.18898048915599963 #different 7 0.28196058132165136 >>> 0.28196058132165136 8 0.31473956032575623 0.1958710201604552 >>> #different 9 0.33588303670653136 0.2101641630182972 >>> #different 10 0.4075242509025311 0.2997807499800962 >>> #different 15 0.4391771087975972 0.4391771087975972 20 >>> 0.3837789445609818 0.2713167833345173 #different 25 >>> 0.4252154334940311 0.4252154334940311 30 0.4256407777477492 >>> 0.4256407777477492 35 0.45314944605858387 >>> 0.45314944605858387 40 0.4278633233755064 0.4278633233755064 >>> 45 0.46174762022239796 0.46174762022239796 50 >>> 0.45370452524846866 0.45370452524846866| >>> >>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> >> This might be because current version of libsvm used in >> scikit is 3.10 from 2011. With some patch imported from >> upstream. >> >> _______________________________________________ scikit-learn >> mailing list scikit-learn at python.org >> >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > I don't think it is so easy, version which is used in scikit-learn > has many additional modifications. > > from header of svm.cpp: /* Modified 2010: - Support for > dense data by Ming-Fang Weng - Return indices for support > vectors, Fabian Pedregosa > - Fixes to avoid name > collision, Fabian Pedregosa - Add support for instance weights, > Fabian Pedregosa based on work by Ming-Wei Chang, Hsuan-Tien > Lin, Ming-Hen Tsai, Chia-Hua Ho and Hsiang-Fu Yu, > > . > - Make labels sorted in svm_group_classes, Fabian Pedregosa. */ > > _______________________________________________ scikit-learn > mailing list scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From olivier.grisel at ensta.org Sat Aug 27 12:20:31 2016 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Sat, 27 Aug 2016 18:20:31 +0200 Subject: [scikit-learn] Latent Semantic Analysis (LSA) and TrucatedSVD In-Reply-To: <532083f1-0647-989d-6f35-2a83176199ea@gmail.com> References: <57C04D8B.40001@gmail.com> <532083f1-0647-989d-6f35-2a83176199ea@gmail.com> Message-ID: I am not sure this is exactly the same because we do not center the data in the TruncatedSVD case (as opposed to the real PCA case where whitening is the same as calling StandardScaler). Having an option to normalize the transformed data by sigma seems like a good idea but we should probably not call that whitening. 
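In the meantime the normalization can be emulated from the transformed data itself, since the column norms of fit_transform's output are (approximately) the singular values. A rough sketch, again with X_tfidf standing in for the tf-idf matrix:

import numpy as np
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=100, random_state=0)
X_lsa = svd.fit_transform(X_tfidf)       # rows correspond to d.T @ U (scaled by Sigma)
sigma = np.linalg.norm(X_lsa, axis=0)    # approximate singular values, recovered from the column norms
X_lsa_norm = X_lsa / sigma               # rows now correspond to d.T @ U @ Sigma^-1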
-- Olivier From olivier.grisel at ensta.org Sat Aug 27 12:37:37 2016 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Sat, 27 Aug 2016 18:37:37 +0200 Subject: [scikit-learn] Latent Semantic Analysis (LSA) and TrucatedSVD In-Reply-To: References: <57C04D8B.40001@gmail.com> <532083f1-0647-989d-6f35-2a83176199ea@gmail.com> Message-ID: BTW Roman, the examples in your gist would make a great non-regression test for this new feature. Please feel free to submit a PR. -- Olivier From mathieu at mblondel.org Sun Aug 28 00:30:05 2016 From: mathieu at mblondel.org (Mathieu Blondel) Date: Sun, 28 Aug 2016 13:30:05 +0900 Subject: [scikit-learn] GradientBoostingRegressor, question about initialisation with MeanEstimator In-Reply-To: References: Message-ID: This comes from Algorithm 1, line 1, in "Greedy Function Approximation: a Gradient Boosting Machine" by J. Friedman. Intuitively, this has the same effect as fitting a bias (intercept) term in a linear model. This allows the subsequent iterations (decision trees) to work with centered targets. Mathieu On Wed, Aug 24, 2016 at 5:24 AM, ??????? ????? wrote: > Hi there, > > I recently found out that GradientBoostingRegressor uses MeanEstimator for > the initial estimator in ensemble. Could you please point out (or > explain) to the research showing superiority of this approach compared to > the usage of DecisionTreeRegressor? > > -- > Yours sincerely, > Alexey A. Dral > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aadral at gmail.com Sun Aug 28 04:57:28 2016 From: aadral at gmail.com (=?UTF-8?B?0JDQu9C10LrRgdC10Lkg0JTRgNCw0LvRjA==?=) Date: Sun, 28 Aug 2016 09:57:28 +0100 Subject: [scikit-learn] GradientBoostingRegressor, question about initialisation with MeanEstimator In-Reply-To: References: Message-ID: Hi Mathieu, I was looking exactly for this article. Thank you very much. 2016-08-28 5:30 GMT+01:00 Mathieu Blondel : > This comes from Algorithm 1, line 1, in "Greedy Function Approximation: a > Gradient Boosting Machine" by J. Friedman. > > Intuitively, this has the same effect as fitting a bias (intercept) term > in a linear model. This allows the subsequent iterations (decision trees) > to work with centered targets. > > Mathieu > > On Wed, Aug 24, 2016 at 5:24 AM, ??????? ????? wrote: > >> Hi there, >> >> I recently found out that GradientBoostingRegressor uses MeanEstimator >> for the initial estimator in ensemble. Could you please point out (or >> explain) to the research showing superiority of this approach compared to >> the usage of DecisionTreeRegressor? >> >> -- >> Yours sincerely, >> Alexey A. Dral >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Yours sincerely, Alexey A. Dral -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From drraph at gmail.com Sun Aug 28 10:35:33 2016 From: drraph at gmail.com (Raphael C) Date: Sun, 28 Aug 2016 16:35:33 +0200 Subject: [scikit-learn] Does NMF optimise over observed values Message-ID: Reading the docs for http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html it says The objective function is: 0.5 * ||X - WH||_Fro^2 + alpha * l1_ratio * ||vec(W)||_1 + alpha * l1_ratio * ||vec(H)||_1 + 0.5 * alpha * (1 - l1_ratio) * ||W||_Fro^2 + 0.5 * alpha * (1 - l1_ratio) * ||H||_Fro^2 Where: ||A||_Fro^2 = \sum_{i,j} A_{ij}^2 (Frobenius norm) ||vec(A)||_1 = \sum_{i,j} abs(A_{ij}) (Elementwise L1 norm) This seems to suggest that it is optimising over all values in X even if X is sparse. When using NMF for collaborative filtering we need the objective function to be defined over only the defined elements of X. The remaining elements should effectively be regarded as missing. What is the true objective function NMF is using? Raphael -------------- next part -------------- An HTML attachment was scrubbed... URL: From drraph at gmail.com Sun Aug 28 10:57:44 2016 From: drraph at gmail.com (Raphael C) Date: Sun, 28 Aug 2016 16:57:44 +0200 Subject: [scikit-learn] Does NMF optimise over observed values In-Reply-To: References: Message-ID: What I meant was, how is the objective function defined when X is sparse? Raphael On Sunday, August 28, 2016, Raphael C wrote: > Reading the docs for http://scikit-learn.org/stable/modules/generated/ > sklearn.decomposition.NMF.html it says > > The objective function is: > > 0.5 * ||X - WH||_Fro^2 > + alpha * l1_ratio * ||vec(W)||_1 > + alpha * l1_ratio * ||vec(H)||_1 > + 0.5 * alpha * (1 - l1_ratio) * ||W||_Fro^2 > + 0.5 * alpha * (1 - l1_ratio) * ||H||_Fro^2 > > Where: > > ||A||_Fro^2 = \sum_{i,j} A_{ij}^2 (Frobenius norm) > ||vec(A)||_1 = \sum_{i,j} abs(A_{ij}) (Elementwise L1 norm) > > This seems to suggest that it is optimising over all values in X even if X is sparse. When using NMF for collaborative filtering we need the objective function to be defined over only the defined elements of X. The remaining elements should effectively be regarded as missing. > > > What is the true objective function NMF is using? > > > Raphael > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From arthur.mensch at inria.fr Sun Aug 28 11:44:43 2016 From: arthur.mensch at inria.fr (Arthur Mensch) Date: Sun, 28 Aug 2016 17:44:43 +0200 Subject: [scikit-learn] Does NMF optimise over observed values In-Reply-To: References: Message-ID: Zeros are considered as zeros in the objective function, not as missing values - - i.e. no mask in the loss function. Le 28 ao?t 2016 16:58, "Raphael C" a ?crit : What I meant was, how is the objective function defined when X is sparse? Raphael On Sunday, August 28, 2016, Raphael C wrote: > Reading the docs for http://scikit-learn.org/st > able/modules/generated/sklearn.decomposition.NMF.html it says > > The objective function is: > > 0.5 * ||X - WH||_Fro^2 > + alpha * l1_ratio * ||vec(W)||_1 > + alpha * l1_ratio * ||vec(H)||_1 > + 0.5 * alpha * (1 - l1_ratio) * ||W||_Fro^2 > + 0.5 * alpha * (1 - l1_ratio) * ||H||_Fro^2 > > Where: > > ||A||_Fro^2 = \sum_{i,j} A_{ij}^2 (Frobenius norm) > ||vec(A)||_1 = \sum_{i,j} abs(A_{ij}) (Elementwise L1 norm) > > This seems to suggest that it is optimising over all values in X even if X is sparse. When using NMF for collaborative filtering we need the objective function to be defined over only the defined elements of X. 
The remaining elements should effectively be regarded as missing. > > > What is the true objective function NMF is using? > > > Raphael > > _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From drraph at gmail.com Sun Aug 28 12:15:55 2016 From: drraph at gmail.com (Raphael C) Date: Sun, 28 Aug 2016 18:15:55 +0200 Subject: [scikit-learn] Does NMF optimise over observed values In-Reply-To: References: Message-ID: Thank you for the quick reply. Just to make sure I understand, if X is sparse and n by n with X[0,0] = 1, X_[n-1, n-1]=0 explicitly set (that is only two values are set in X) then this is treated the same for the purposes of the objective function as the all zeros n by n matrix with X[0,0] set to 1? That is all elements of X that are not specified explicitly are assumed to be 0? It would be really useful if it were possible to have a version of NMF where contributions to the objective function are only counted where the value is explicitly set in X. This is AFAIK the standard formulation for collaborative filtering. Would there be any interest in doing this? In theory it should be a simple modification of the optimisation code. Raphael On Sunday, August 28, 2016, Arthur Mensch wrote: > Zeros are considered as zeros in the objective function, not as missing > values - - i.e. no mask in the loss function. > Le 28 ao?t 2016 16:58, "Raphael C" > a ?crit : > > What I meant was, how is the objective function defined when X is sparse? > > Raphael > > > On Sunday, August 28, 2016, Raphael C > wrote: > >> Reading the docs for http://scikit-learn.org/st >> able/modules/generated/sklearn.decomposition.NMF.html it says >> >> The objective function is: >> >> 0.5 * ||X - WH||_Fro^2 >> + alpha * l1_ratio * ||vec(W)||_1 >> + alpha * l1_ratio * ||vec(H)||_1 >> + 0.5 * alpha * (1 - l1_ratio) * ||W||_Fro^2 >> + 0.5 * alpha * (1 - l1_ratio) * ||H||_Fro^2 >> >> Where: >> >> ||A||_Fro^2 = \sum_{i,j} A_{ij}^2 (Frobenius norm) >> ||vec(A)||_1 = \sum_{i,j} abs(A_{ij}) (Elementwise L1 norm) >> >> This seems to suggest that it is optimising over all values in X even if X is sparse. When using NMF for collaborative filtering we need the objective function to be defined over only the defined elements of X. The remaining elements should effectively be regarded as missing. >> >> >> What is the true objective function NMF is using? >> >> >> Raphael >> >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Sun Aug 28 12:20:03 2016 From: t3kcit at gmail.com (Andy) Date: Sun, 28 Aug 2016 12:20:03 -0400 Subject: [scikit-learn] Fwd: inconsistency between libsvm and scikit-learn.svc results In-Reply-To: References: <9c1c30b5-a11f-5a01-2b68-8368d7ed486c@gmail.com> Message-ID: <350ce275-2c9e-4d44-7155-edd4293acb2c@gmail.com> On 08/27/2016 09:48 AM, Joel Nothman wrote: > I don't think we should assume that this is the only possible reason > for inconsistency. Could you give us a small snippet of data and code > on which you find this inconsistency? > I would also expect different settings or random states or data preparation to be more likely culprits. 
From t3kcit at gmail.com Sun Aug 28 12:20:45 2016 From: t3kcit at gmail.com (Andy) Date: Sun, 28 Aug 2016 12:20:45 -0400 Subject: [scikit-learn] Latent Semantic Analysis (LSA) and TrucatedSVD In-Reply-To: References: <57C04D8B.40001@gmail.com> <532083f1-0647-989d-6f35-2a83176199ea@gmail.com> Message-ID: If you do "with_mean=False" it should be the same, right? On 08/27/2016 12:20 PM, Olivier Grisel wrote: > I am not sure this is exactly the same because we do not center the > data in the TruncatedSVD case (as opposed to the real PCA case where > whitening is the same as calling StandardScaler). > > Having an option to normalize the transformed data by sigma seems like > a good idea but we should probably not call that whitening. > From michael at bommaritollc.com Sun Aug 28 12:22:44 2016 From: michael at bommaritollc.com (Michael Bommarito) Date: Sun, 28 Aug 2016 12:22:44 -0400 Subject: [scikit-learn] Fwd: inconsistency between libsvm and scikit-learn.svc results In-Reply-To: <350ce275-2c9e-4d44-7155-edd4293acb2c@gmail.com> References: <9c1c30b5-a11f-5a01-2b68-8368d7ed486c@gmail.com> <350ce275-2c9e-4d44-7155-edd4293acb2c@gmail.com> Message-ID: Any chance it's related to the seed issue in the "Decoding Differences Between SKL SVM and Matlab Libsvm Even When Parameters the Same" thread? Thanks, Michael J. Bommarito II, CEO Bommarito Consulting, LLC *Web:* http://www.bommaritollc.com *Mobile:* +1 (646) 450-3387 On Sun, Aug 28, 2016 at 12:20 PM, Andy wrote: > > > On 08/27/2016 09:48 AM, Joel Nothman wrote: > >> I don't think we should assume that this is the only possible reason for >> inconsistency. Could you give us a small snippet of data and code on which >> you find this inconsistency? >> >> I would also expect different settings or random states or data > preparation to be more likely culprits. > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From drraph at gmail.com Sun Aug 28 12:29:59 2016 From: drraph at gmail.com (Raphael C) Date: Sun, 28 Aug 2016 18:29:59 +0200 Subject: [scikit-learn] Does NMF optimise over observed values In-Reply-To: References: Message-ID: To give a little context from the web, see e.g. http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/ where it explains: " A question might have come to your mind by now: if we find two matrices [image: \mathbf{P}] and [image: \mathbf{Q}] such that [image: \mathbf{P} \times \mathbf{Q}] approximates [image: \mathbf{R}], isn?t that our predictions of all the unseen ratings will all be zeros? In fact, we are not really trying to come up with [image: \mathbf{P}] and [image: \mathbf{Q}] such that we can reproduce [image: \mathbf{R}] exactly. Instead, we will only try to minimise the errors of the observed user-item pairs. " Raphael On Sunday, August 28, 2016, Raphael C wrote: > Thank you for the quick reply. Just to make sure I understand, if X is > sparse and n by n with X[0,0] = 1, X_[n-1, n-1]=0 explicitly set (that is > only two values are set in X) then this is treated the same for the > purposes of the objective function as the all zeros n by n matrix with > X[0,0] set to 1? That is all elements of X that are not specified > explicitly are assumed to be 0? 
> > It would be really useful if it were possible to have a version of NMF > where contributions to the objective function are only counted where the > value is explicitly set in X. This is AFAIK the standard formulation for > collaborative filtering. Would there be any interest in doing this? In > theory it should be a simple modification of the optimisation code. > > Raphael > > > > On Sunday, August 28, 2016, Arthur Mensch > wrote: > >> Zeros are considered as zeros in the objective function, not as missing >> values - - i.e. no mask in the loss function. >> Le 28 ao?t 2016 16:58, "Raphael C" a ?crit : >> >> What I meant was, how is the objective function defined when X is sparse? >> >> Raphael >> >> >> On Sunday, August 28, 2016, Raphael C wrote: >> >>> Reading the docs for http://scikit-learn.org/st >>> able/modules/generated/sklearn.decomposition.NMF.html it says >>> >>> The objective function is: >>> >>> 0.5 * ||X - WH||_Fro^2 >>> + alpha * l1_ratio * ||vec(W)||_1 >>> + alpha * l1_ratio * ||vec(H)||_1 >>> + 0.5 * alpha * (1 - l1_ratio) * ||W||_Fro^2 >>> + 0.5 * alpha * (1 - l1_ratio) * ||H||_Fro^2 >>> >>> Where: >>> >>> ||A||_Fro^2 = \sum_{i,j} A_{ij}^2 (Frobenius norm) >>> ||vec(A)||_1 = \sum_{i,j} abs(A_{ij}) (Elementwise L1 norm) >>> >>> This seems to suggest that it is optimising over all values in X even if X is sparse. When using NMF for collaborative filtering we need the objective function to be defined over only the defined elements of X. The remaining elements should effectively be regarded as missing. >>> >>> >>> What is the true objective function NMF is using? >>> >>> >>> Raphael >>> >>> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Sun Aug 28 12:37:05 2016 From: t3kcit at gmail.com (Andy) Date: Sun, 28 Aug 2016 12:37:05 -0400 Subject: [scikit-learn] Does NMF optimise over observed values In-Reply-To: References: Message-ID: <96dc908d-c437-88d2-d986-b867c34635b4@gmail.com> On 08/28/2016 12:29 PM, Raphael C wrote: > To give a little context from the web, see e.g. > http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/ where > it explains: > > " > A question might have come to your mind by now: if we find two > matrices \mathbf{P} and \mathbf{Q} such that \mathbf{P} \times > \mathbf{Q} approximates \mathbf{R}, isn?t that our predictions of all > the unseen ratings will all be zeros? In fact, we are not really > trying to come up with \mathbf{P} and \mathbf{Q} such that we can > reproduce \mathbf{R} exactly. Instead, we will only try to minimise > the errors of the observed user-item pairs. > " Yes, the sklearn interface is not meant for matrix completion but matrix-factorization. There was a PR for some matrix completion for missing value imputation at some point. In general, scikit-learn doesn't really implement anything for recommendation algorithms as that requires a different interface. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From drraph at gmail.com Sun Aug 28 13:16:14 2016 From: drraph at gmail.com (Raphael C) Date: Sun, 28 Aug 2016 19:16:14 +0200 Subject: [scikit-learn] Does NMF optimise over observed values In-Reply-To: <96dc908d-c437-88d2-d986-b867c34635b4@gmail.com> References: <96dc908d-c437-88d2-d986-b867c34635b4@gmail.com> Message-ID: On Sunday, August 28, 2016, Andy wrote: > > > On 08/28/2016 12:29 PM, Raphael C wrote: > > To give a little context from the web, see e.g. http://www.quuxlabs.com/ > blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation- > in-python/ where it explains: > > " > A question might have come to your mind by now: if we find two matrices [image: > \mathbf{P}] and [image: \mathbf{Q}] such that [image: \mathbf{P} \times > \mathbf{Q}] approximates [image: \mathbf{R}], isn?t that our predictions > of all the unseen ratings will all be zeros? In fact, we are not really > trying to come up with [image: \mathbf{P}] and [image: \mathbf{Q}] such > that we can reproduce [image: \mathbf{R}] exactly. Instead, we will only > try to minimise the errors of the observed user-item pairs. > " > > Yes, the sklearn interface is not meant for matrix completion but > matrix-factorization. > There was a PR for some matrix completion for missing value imputation at > some point. > > In general, scikit-learn doesn't really implement anything for > recommendation algorithms as that requires a different interface. > Thanks Andy. I just looked up that PR. I was thinking simply producing a different factorisation optimised only over the observed values wouldn't need a new interface. That in itself would be hugely useful. I can see that providing a full drop in recommender system would involve more work. Raphael -------------- next part -------------- An HTML attachment was scrubbed... URL: From cs14btech11041 at iith.ac.in Sun Aug 28 15:09:58 2016 From: cs14btech11041 at iith.ac.in (Ibrahim Dalal) Date: Mon, 29 Aug 2016 00:39:58 +0530 Subject: [scikit-learn] Issue with DecisionTreeClassifier Message-ID: Dear Developers, DecisionTreeClassifier.decision_path() as used here http://scikit-learn.org/dev/auto_examples/tree/unveil_tree_structure.html is giving the following error: AttributeError: 'DecisionTreeClassifier' object has no attribute 'decision_path' Kindly help. Thanks -------------- next part -------------- An HTML attachment was scrubbed... URL: From nfliu at uw.edu Sun Aug 28 15:23:59 2016 From: nfliu at uw.edu (Nelson Liu) Date: Sun, 28 Aug 2016 19:23:59 +0000 Subject: [scikit-learn] Issue with DecisionTreeClassifier In-Reply-To: References: Message-ID: That should be: node indicator = estimator.tree_.decision_path(X_test) PR welcome :) On Sun, Aug 28, 2016, 13:12 Ibrahim Dalal via scikit-learn < scikit-learn at python.org> wrote: > Dear Developers, > > DecisionTreeClassifier.decision_path() as used here > http://scikit-learn.org/dev/auto_examples/tree/unveil_tree_structure.html > is giving the following error: > > AttributeError: 'DecisionTreeClassifier' object has no attribute > 'decision_path' > > Kindly help. > > Thanks > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... 
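A slightly fuller sketch of that workaround on the iris data (untested; it assumes a recent scikit-learn where Tree.decision_path exists, and note that the low-level Tree methods expect float32 input):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
estimator = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

X_test = np.asarray(iris.data, dtype=np.float32)           # Tree methods want float32 arrays
node_indicator = estimator.tree_.decision_path(X_test)     # CSR matrix, one row per sample
print(node_indicator[0].indices)                           # ids of the nodes the first sample passes through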
URL: From nfliu at uw.edu Sun Aug 28 15:25:33 2016 From: nfliu at uw.edu (Nelson Liu) Date: Sun, 28 Aug 2016 19:25:33 +0000 Subject: [scikit-learn] Issue with DecisionTreeClassifier In-Reply-To: References: Message-ID: Oops, phone removed the underscore between the two words of the variable name but I think you get the point. Nelson On Sun, Aug 28, 2016, 13:12 Ibrahim Dalal via scikit-learn < scikit-learn at python.org> wrote: > Dear Developers, > > DecisionTreeClassifier.decision_path() as used here > http://scikit-learn.org/dev/auto_examples/tree/unveil_tree_structure.html > is giving the following error: > > AttributeError: 'DecisionTreeClassifier' object has no attribute > 'decision_path' > > Kindly help. > > Thanks > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rth.yurchak at gmail.com Mon Aug 29 06:39:46 2016 From: rth.yurchak at gmail.com (Roman Yurchak) Date: Mon, 29 Aug 2016 12:39:46 +0200 Subject: [scikit-learn] Latent Semantic Analysis (LSA) and TrucatedSVD In-Reply-To: References: <57C04D8B.40001@gmail.com> <532083f1-0647-989d-6f35-2a83176199ea@gmail.com> Message-ID: <57C410F2.8090905@gmail.com> Thank you for all your responses! In the LSA what is equivalent, I think, is - to apply a L2 normalization (not the StandardScaler) after the LSA and then compute the cosine similarity between document vectors simply as a dot product. - not apply the L2 normalization and call the `cosine_similarity` function instead. I have applied this normalization to the previous example, and it produces indeed equivalent results (i.e. does not solve the problem). Opening an issue on this for further discussion https://github.com/scikit-learn/scikit-learn/issues/7283 Thanks for your feedback! -- Roman On 28/08/16 18:20, Andy wrote: > If you do "with_mean=False" it should be the same, right? > > On 08/27/2016 12:20 PM, Olivier Grisel wrote: >> I am not sure this is exactly the same because we do not center the >> data in the TruncatedSVD case (as opposed to the real PCA case where >> whitening is the same as calling StandardScaler). >> >> Having an option to normalize the transformed data by sigma seems like >> a good idea but we should probably not call that whitening. >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From cs14btech11041 at iith.ac.in Mon Aug 29 11:15:52 2016 From: cs14btech11041 at iith.ac.in (Ibrahim Dalal) Date: Mon, 29 Aug 2016 20:45:52 +0530 Subject: [scikit-learn] Issue with DecisionTreeClassifier In-Reply-To: References: Message-ID: Hi, Is there a way to extract impurity value of a node in DecisionTreeClassifier? I am able to get this value in graph (using export_grapgviz), but can't figure out how to get this value in my code. Is there any attribute similar to estimator.tree_.children_left? 
Thanks On Mon, Aug 29, 2016 at 12:53 AM, Nelson Liu wrote: > That should be: > node indicator = estimator.tree_.decision_path(X_test) > > PR welcome :) > > On Sun, Aug 28, 2016, 13:12 Ibrahim Dalal via scikit-learn < > scikit-learn at python.org> wrote: > >> Dear Developers, >> >> DecisionTreeClassifier.decision_path() as used here >> http://scikit-learn.org/dev/auto_examples/tree/unveil_tree_structure.html >> is giving the following error: >> >> AttributeError: 'DecisionTreeClassifier' object has no attribute >> 'decision_path' >> >> Kindly help. >> >> Thanks >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From nfliu at uw.edu Mon Aug 29 11:23:30 2016 From: nfliu at uw.edu (Nelson Liu) Date: Mon, 29 Aug 2016 15:23:30 +0000 Subject: [scikit-learn] Issue with DecisionTreeClassifier In-Reply-To: References: Message-ID: Hi, Yes, it's estimator.tree_.impurity Nelson On Mon, Aug 29, 2016, 09:18 Ibrahim Dalal via scikit-learn < scikit-learn at python.org> wrote: > Hi, > > Is there a way to extract impurity value of a node in > DecisionTreeClassifier? I am able to get this value in graph (using > export_grapgviz), but can't figure out how to get this value in my code. Is > there any attribute similar to estimator.tree_.children_left? > > Thanks > > On Mon, Aug 29, 2016 at 12:53 AM, Nelson Liu wrote: > >> That should be: >> node indicator = estimator.tree_.decision_path(X_test) >> >> PR welcome :) >> >> On Sun, Aug 28, 2016, 13:12 Ibrahim Dalal via scikit-learn < >> scikit-learn at python.org> wrote: >> >>> Dear Developers, >>> >>> DecisionTreeClassifier.decision_path() as used here >>> http://scikit-learn.org/dev/auto_examples/tree/unveil_tree_structure.html >>> is giving the following error: >>> >>> AttributeError: 'DecisionTreeClassifier' object has no attribute >>> 'decision_path' >>> >>> Kindly help. >>> >>> Thanks >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From cs14btech11041 at iith.ac.in Mon Aug 29 11:44:21 2016 From: cs14btech11041 at iith.ac.in (Ibrahim Dalal) Date: Mon, 29 Aug 2016 21:14:21 +0530 Subject: [scikit-learn] Issue with DecisionTreeClassifier In-Reply-To: References: Message-ID: Thanks Nelson. Is there any way to access number of training samples in a node? Thanks On Mon, Aug 29, 2016 at 8:53 PM, Nelson Liu wrote: > Hi, > Yes, it's estimator.tree_.impurity > > Nelson > > On Mon, Aug 29, 2016, 09:18 Ibrahim Dalal via scikit-learn < > scikit-learn at python.org> wrote: > >> Hi, >> >> Is there a way to extract impurity value of a node in >> DecisionTreeClassifier? 
I am able to get this value in graph (using >> export_grapgviz), but can't figure out how to get this value in my code. Is >> there any attribute similar to estimator.tree_.children_left? >> >> Thanks >> >> On Mon, Aug 29, 2016 at 12:53 AM, Nelson Liu wrote: >> >>> That should be: >>> node indicator = estimator.tree_.decision_path(X_test) >>> >>> PR welcome :) >>> >>> On Sun, Aug 28, 2016, 13:12 Ibrahim Dalal via scikit-learn < >>> scikit-learn at python.org> wrote: >>> >>>> Dear Developers, >>>> >>>> DecisionTreeClassifier.decision_path() as used here >>>> http://scikit-learn.org/dev/auto_examples/tree/unveil_ >>>> tree_structure.html is giving the following error: >>>> >>>> AttributeError: 'DecisionTreeClassifier' object has no attribute >>>> 'decision_path' >>>> >>>> Kindly help. >>>> >>>> Thanks >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Mon Aug 29 11:50:24 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Mon, 29 Aug 2016 11:50:24 -0400 Subject: [scikit-learn] Does NMF optimise over observed values In-Reply-To: References: <96dc908d-c437-88d2-d986-b867c34635b4@gmail.com> Message-ID: On 08/28/2016 01:16 PM, Raphael C wrote: > > > On Sunday, August 28, 2016, Andy > wrote: > > > > On 08/28/2016 12:29 PM, Raphael C wrote: >> To give a little context from the web, see e.g. >> http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/ >> where >> it explains: >> >> " >> A question might have come to your mind by now: if we find two >> matrices \mathbf{P} and \mathbf{Q} such that \mathbf{P} \times >> \mathbf{Q} approximates \mathbf{R}, isn?t that our predictions of >> all the unseen ratings will all be zeros? In fact, we are not >> really trying to come up with \mathbf{P} and \mathbf{Q} such that >> we can reproduce \mathbf{R} exactly. Instead, we will only try to >> minimise the errors of the observed user-item pairs. >> " > Yes, the sklearn interface is not meant for matrix completion but > matrix-factorization. > There was a PR for some matrix completion for missing value > imputation at some point. > > In general, scikit-learn doesn't really implement anything for > recommendation algorithms as that requires a different interface. > > > Thanks Andy. I just looked up that PR. > > I was thinking simply producing a different factorisation optimised > only over the observed values wouldn't need a new interface. That in > itself would be hugely useful. Depends. Usually you don't want to complete all values, but only compute a factorization. What do you return? Only the factors? The PR implements completing everything, and that you can do with the transformer interface. I'm not sure what the status of the PR is, but doing that with NMF instead of SVD would certainly also be interesting. 
-------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Mon Aug 29 11:52:16 2016 From: t3kcit at gmail.com (Andreas Mueller) Date: Mon, 29 Aug 2016 11:52:16 -0400 Subject: [scikit-learn] Issue with DecisionTreeClassifier In-Reply-To: References: Message-ID: <1acae211-7d06-cd5e-e9a2-6cb21600b381@gmail.com> On 08/28/2016 03:23 PM, Nelson Liu wrote: > That should be: > node indicator = estimator.tree_.decision_path(X_test) > > PR welcome :) Was there a reason not to make this a "plot" example? Would it take too long? Not having run examples by CI is a pretty big maintenance burden. -------------- next part -------------- An HTML attachment was scrubbed... URL: From tom.duprelatour at orange.fr Mon Aug 29 13:01:57 2016 From: tom.duprelatour at orange.fr (Tom DLT) Date: Mon, 29 Aug 2016 19:01:57 +0200 Subject: [scikit-learn] Does NMF optimise over observed values In-Reply-To: References: <96dc908d-c437-88d2-d986-b867c34635b4@gmail.com> Message-ID: If X is sparse, explicit zeros and missing-value zeros are **both** considered as zeros in the objective functions. Changing the objective function wouldn't need a new interface, yet I am not sure the code change would be completely trivial. The question is: do we want this new objective function in scikit-learn, since we have no other recommendation-like algorithm? If we agree that it would useful, feel free to send a PR. Tom 2016-08-29 17:50 GMT+02:00 Andreas Mueller : > > > On 08/28/2016 01:16 PM, Raphael C wrote: > > > > On Sunday, August 28, 2016, Andy wrote: > >> >> >> On 08/28/2016 12:29 PM, Raphael C wrote: >> >> To give a little context from the web, see e.g. http://www.quuxlabs.com/b >> log/2010/09/matrix-factorization-a-simple-tutorial-and- >> implementation-in-python/ where it explains: >> >> " >> A question might have come to your mind by now: if we find two matrices [image: >> \mathbf{P}] and [image: \mathbf{Q}] such that [image: \mathbf{P} \times >> \mathbf{Q}] approximates [image: \mathbf{R}], isn?t that our predictions >> of all the unseen ratings will all be zeros? In fact, we are not really >> trying to come up with [image: \mathbf{P}] and [image: \mathbf{Q}] such >> that we can reproduce [image: \mathbf{R}] exactly. Instead, we will only >> try to minimise the errors of the observed user-item pairs. >> " >> >> Yes, the sklearn interface is not meant for matrix completion but >> matrix-factorization. >> There was a PR for some matrix completion for missing value imputation at >> some point. >> >> In general, scikit-learn doesn't really implement anything for >> recommendation algorithms as that requires a different interface. >> > > Thanks Andy. I just looked up that PR. > > I was thinking simply producing a different factorisation optimised only > over the observed values wouldn't need a new interface. That in itself > would be hugely useful. > > Depends. Usually you don't want to complete all values, but only compute a > factorization. What do you return? Only the factors? > The PR implements completing everything, and that you can do with the > transformer interface. I'm not sure what the status of the PR is, > but doing that with NMF instead of SVD would certainly also be interesting. > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... 
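For anyone who wants to experiment with that masked objective before such a PR exists, here is a self-contained toy sketch in plain NumPy (multiplicative updates restricted to the observed entries; this is not scikit-learn's NMF and carries no regularization):

import numpy as np

def masked_nmf(X, mask, n_components=2, n_iter=200, eps=1e-9, seed=0):
    # minimise ||mask * (X - W @ H)||_Fro^2 over non-negative W, H
    rng = np.random.RandomState(seed)
    W = rng.rand(X.shape[0], n_components)
    H = rng.rand(n_components, X.shape[1])
    MX = mask * X
    for _ in range(n_iter):
        W *= (MX @ H.T) / ((mask * (W @ H)) @ H.T + eps)
        H *= (W.T @ MX) / (W.T @ (mask * (W @ H)) + eps)
    return W, H

X = np.array([[5., 3., 0.],
              [4., 0., 1.]])
mask = np.array([[1., 1., 0.],     # 1 = observed, 0 = missing
                 [1., 0., 1.]])
W, H = masked_nmf(X, mask)
print(np.round(W @ H, 2))          # only the observed positions are fitted; the rest are predictions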
URL:

From drraph at gmail.com Mon Aug 29 14:11:42 2016
From: drraph at gmail.com (Raphael C)
Date: Mon, 29 Aug 2016 20:11:42 +0200
Subject: [scikit-learn] Does NMF optimise over observed values
In-Reply-To:
References: <96dc908d-c437-88d2-d986-b867c34635b4@gmail.com>
Message-ID:

On Monday, August 29, 2016, Andreas Mueller wrote: > > > On 08/28/2016 01:16 PM, Raphael C wrote: > > > > On Sunday, August 28, 2016, Andy > wrote: > >> >> >> On 08/28/2016 12:29 PM, Raphael C wrote: >> >> To give a little context from the web, see e.g. http://www.quuxlabs.com/b >> log/2010/09/matrix-factorization-a-simple-tutorial-and- >> implementation-in-python/ where it explains: >> >> " >> A question might have come to your mind by now: if we find two matrices >> \mathbf{P} and \mathbf{Q} such that \mathbf{P} \times >> \mathbf{Q} approximates \mathbf{R}, isn't that our predictions >> of all the unseen ratings will all be zeros? In fact, we are not really >> trying to come up with \mathbf{P} and \mathbf{Q} such >> that we can reproduce \mathbf{R} exactly. Instead, we will only >> try to minimise the errors of the observed user-item pairs. >> " >> >> Yes, the sklearn interface is not meant for matrix completion but >> matrix-factorization. >> There was a PR for some matrix completion for missing value imputation at >> some point. >> >> In general, scikit-learn doesn't really implement anything for >> recommendation algorithms as that requires a different interface. >> > > Thanks Andy. I just looked up that PR. > > I was thinking simply producing a different factorisation optimised only > over the observed values wouldn't need a new interface. That in itself > would be hugely useful. > > Depends. Usually you don't want to complete all values, but only compute a > factorization. What do you return? Only the factors? > > The PR implements completing everything, and that you can do with the > transformer interface. I'm not sure what the status of the PR is, > but doing that with NMF instead of SVD would certainly also be interesting. >

I was thinking you would literally return W and H so that WH approx X. The user can then decide what to do with the factorisation just like when doing SVD.

Raphael

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From cs14btech11041 at iith.ac.in Mon Aug 29 23:03:59 2016
From: cs14btech11041 at iith.ac.in (Ibrahim Dalal)
Date: Tue, 30 Aug 2016 08:33:59 +0530
Subject: [scikit-learn] Issue with DecisionTreeClassifier
In-Reply-To: <1acae211-7d06-cd5e-e9a2-6cb21600b381@gmail.com>
References: <1acae211-7d06-cd5e-e9a2-6cb21600b381@gmail.com>
Message-ID:

Hi,

What does the estimator.tree_.value array represent? I looked up the source code but not able to get what it is. I am interested in the number of training samples of each class in a given tree node.

Thanks

On Mon, Aug 29, 2016 at 9:22 PM, Andreas Mueller wrote: > > > On 08/28/2016 03:23 PM, Nelson Liu wrote: > > That should be: > node indicator = estimator.tree_.decision_path(X_test) > > PR welcome :) > > Was there a reason not to make this a "plot" example? > Would it take too long? Not having run examples by CI is a pretty big > maintenance burden. > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > >

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From nfliu at uw.edu Mon Aug 29 23:22:41 2016
From: nfliu at uw.edu (Nelson Liu)
Date: Tue, 30 Aug 2016 03:22:41 +0000
Subject: [scikit-learn] Issue with DecisionTreeClassifier
In-Reply-To:
References: <1acae211-7d06-cd5e-e9a2-6cb21600b381@gmail.com>
Message-ID:

estimator.tree_.value gives the constant prediction of the tree at each node. Think of it as what the tree would output if that node was a leaf.

I don't think we have a readily available way of checking the number of training samples of each class in a given tree node. The closest thing easily accessible is estimator.tree_.n_node_samples. Getting finer-grained counts of the number of samples in each class would require modifying the source code, I think.

On Mon, Aug 29, 2016 at 8:06 PM Ibrahim Dalal via scikit-learn < scikit-learn at python.org> wrote: > Hi, > > What does the estimator.tree_.value array represent? I looked up the > source code but not able to get what it is. I am interested in the number > of training samples of each class in a given tree node. > > Thanks > > On Mon, Aug 29, 2016 at 9:22 PM, Andreas Mueller wrote: > >> >> >> On 08/28/2016 03:23 PM, Nelson Liu wrote: >> >> That should be: >> node indicator = estimator.tree_.decision_path(X_test) >> >> PR welcome :) >> >> Was there a reason not to make this a "plot" example? >> Would it take too long? Not having run examples by CI is a pretty big >> maintenance burden. >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn >

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From joel.nothman at gmail.com Mon Aug 29 23:31:19 2016
From: joel.nothman at gmail.com (Joel Nothman)
Date: Tue, 30 Aug 2016 13:31:19 +1000
Subject: [scikit-learn] Issue with DecisionTreeClassifier
In-Reply-To:
References: <1acae211-7d06-cd5e-e9a2-6cb21600b381@gmail.com>
Message-ID:

Or just running estimator.tree_.apply(X_train) and inferring from there.

On 30 August 2016 at 13:22, Nelson Liu wrote: > estimator.tree_.value gives the constant prediction of the tree at each > node. Think of it as what the tree would output if that node was a leaf. > > I don't think we have a readily available way of checking the number of > training samples of each class in a given tree node. The closest thing > easily accessible is estimator.tree_.n_node_samples. Getting > finer-grained counts of the number of samples in each class would require > modifying the source code, I think. > > On Mon, Aug 29, 2016 at 8:06 PM Ibrahim Dalal via scikit-learn < > scikit-learn at python.org> wrote: > >> Hi, >> >> What does the estimator.tree_.value array represent? I looked up the >> source code but not able to get what it is. I am interested in the number >> of training samples of each class in a given tree node. >> >> Thanks >> >> On Mon, Aug 29, 2016 at 9:22 PM, Andreas Mueller >> wrote: >> >>> >>> >>> On 08/28/2016 03:23 PM, Nelson Liu wrote: >>> >>> That should be: >>> node indicator = estimator.tree_.decision_path(X_test) >>> >>> PR welcome :) >>> >>> Was there a reason not to make this a "plot" example? >>> Would it take too long? Not having run examples by CI is a pretty big >>> maintenance burden.
>>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > >

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From t3kcit at gmail.com Tue Aug 30 09:56:22 2016
From: t3kcit at gmail.com (Andy)
Date: Tue, 30 Aug 2016 09:56:22 -0400
Subject: [scikit-learn] Issue with DecisionTreeClassifier
In-Reply-To:
References: <1acae211-7d06-cd5e-e9a2-6cb21600b381@gmail.com>
Message-ID:

On 08/29/2016 11:22 PM, Nelson Liu wrote: > estimator.tree_.value gives the constant prediction of the tree at > each node. Think of it as what the tree would output if that node was > a leaf.

well it's also the weighted number of samples of each class, right?

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From t3kcit at gmail.com Tue Aug 30 10:31:51 2016
From: t3kcit at gmail.com (Andy)
Date: Tue, 30 Aug 2016 10:31:51 -0400
Subject: [scikit-learn] [Scikit-learn-general] Dropping Python 2.6 compatibility
In-Reply-To: <20160104132846.GA1242267@phare.normalesup.org>
References: <20160104132846.GA1242267@phare.normalesup.org>
Message-ID:

Hi all.
Picking up this old thread, I propose we announce that 0.18 is the last release that will support Python 2.6. That will give people some time to think about it between releases.
Wdyt?

Andy

On 01/04/2016 08:28 AM, Gael Varoquaux wrote: > Happy new year everybody, > > As a new year resolution, I suggest that we drop Python 2.6 > compatibility. > > For an argumentation in this favor, see > http://www.snarky.ca/stop-using-python-2-6 (I don't buy everything there, > but the core idea is there). > > For us, this will mean more usage of context managers, which is good. > > The down side is that many clusters run RedHat variant that are still > under 2.6 (Duh!). The question is: are people using the stock Python on > the clusters, or something else. > > Opinions please? > > Gaël > > ------------------------------------------------------------------------------ > _______________________________________________ > Scikit-learn-general mailing list > Scikit-learn-general at lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

From gael.varoquaux at normalesup.org Tue Aug 30 10:37:47 2016
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Tue, 30 Aug 2016 16:37:47 +0200
Subject: [scikit-learn] [Scikit-learn-general] Dropping Python 2.6 compatibility
In-Reply-To:
References: <20160104132846.GA1242267@phare.normalesup.org>
Message-ID: <20160830143747.GH1642932@phare.normalesup.org>

> Picking up this old thread, I propose we announce that 0.18 is the last > release that > will support Python 2.6. > That will give people some time to think about it between releases.

As you know: +1 from my side.

One of my arguments for this is that it becomes harder and harder to set up continuous integration environments to test related projects with 2.6. Hence related projects are likely to coderot under 2.6.

Thanks for raising this issue.
Gaël

From ilya.persky at gmail.com Tue Aug 30 15:19:40 2016
From: ilya.persky at gmail.com (Ilya Persky)
Date: Tue, 30 Aug 2016 22:19:40 +0300
Subject: [scikit-learn] How to deal with minor inconsistencies in scikit source code?
Message-ID:

Hi All!

I'm now reading scikit-learn source code and sometimes meet minor inconsistencies here and there like unnecessary copying of some array or some very unimportant race condition. Nothing like serious bug really.

What should I do about it? Creating an issue for each case would be overkill. Create an issue for all of them and add pull request with fixes? Or first send a letter with them here?..

Again I'm new to this code and can be easily missing something (something looking like minor bug could appear to be a feature :) ).

--
Thank you,
Ilya.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From t3kcit at gmail.com Tue Aug 30 15:24:48 2016
From: t3kcit at gmail.com (Andreas Mueller)
Date: Tue, 30 Aug 2016 15:24:48 -0400
Subject: [scikit-learn] How to deal with minor inconsistencies in scikit source code?
In-Reply-To:
References:
Message-ID: <65c6d2b1-35ff-2769-d627-7e61f2392df1@gmail.com>

Hi Ilya.
You can raise an issue with multiple minor problems, or you can just send a PR. We don't really like to do many cosmetic fixes, because they tend to create merge conflicts. But for semantic changes, like avoiding an array copy, we're very happy about any improvements. You can totally pack multiple one-line changes into a single PR if they are all simple to review.
One thing to keep in mind: the shorter the PR, the faster the review and merge ;)

Andy

On 08/30/2016 03:19 PM, Ilya Persky wrote: > Hi All! > > I'm now reading scikit-learn source code and sometimes meet minor > inconsistencies here and there like unnecessary copying of some array > or some very unimportant race condition. Nothing like serious bug really. > > What should I do about it? Creating an issue for each case would be > overkill. Create an issue for all of them and add pull request with > fixes? Or first send a letter with them here?.. > > Again I'm new to this code and can be easily missing something > (something looking like minor bug could appear to be a feature :) ). > > -- > Thank you, > Ilya. > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From douglas.chan at ieee.org Wed Aug 31 01:26:11 2016
From: douglas.chan at ieee.org (Douglas Chan)
Date: Tue, 30 Aug 2016 22:26:11 -0700
Subject: [scikit-learn] Gradient Boosting: Feature Importances do not sum to 1
Message-ID:

Hello everyone,

I notice conditions when Feature Importance values do not add up to 1 in ensemble tree methods, like Gradient Boosting Trees or AdaBoost Trees. I wonder if there's a bug in the code.

This error occurs when the ensemble has a large number of estimators. The exact conditions depend variously. For example, the error shows up sooner with a smaller amount of training samples. Or, if the depth of the tree is large.

When this error appears, the predicted value seems to have converged. But it's unclear if the error is causing the predicted value not to change with more estimators. In fact, the feature importance sum goes lower and lower with more estimators thereafter.

I wonder if we're hitting some floating point calculation error.

Looking forward to hear your thoughts on this.
Thank you!
-Doug

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From drraph at gmail.com Wed Aug 31 02:28:29 2016
From: drraph at gmail.com (Raphael C)
Date: Wed, 31 Aug 2016 08:28:29 +0200
Subject: [scikit-learn] Gradient Boosting: Feature Importances do not sum to 1
In-Reply-To:
References:
Message-ID:

Can you provide a reproducible example?
Raphael

On Wednesday, August 31, 2016, Douglas Chan wrote: > Hello everyone, > > I notice conditions when Feature Importance values do not add up to 1 in > ensemble tree methods, like Gradient Boosting Trees or AdaBoost Trees. I > wonder if there's a bug in the code. > > This error occurs when the ensemble has a large number of estimators. The > exact conditions depend variously. For example, the error shows up sooner > with a smaller amount of training samples. Or, if the depth of the tree is > large. > > When this error appears, the predicted value seems to have converged. But > it's unclear if the error is causing the predicted value not to change with > more estimators. In fact, the feature importance sum goes lower and lower > with more estimators thereafter. > > I wonder if we're hitting some floating point calculation error. > > Looking forward to hear your thoughts on this. > > Thank you! > -Doug > >

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From Myles.Gartland at Rockhurst.edu Tue Aug 30 22:20:38 2016
From: Myles.Gartland at Rockhurst.edu (Gartland, Myles)
Date: Tue, 30 Aug 2016 21:20:38 -0500
Subject: [scikit-learn] MLPClassifier release
Message-ID:

Curious when .18 will be released. I am specifically interested in the MLPClassifier to show my students in class. Not sure I want them on the dev fork just yet.

From t3kcit at gmail.com Wed Aug 31 13:31:43 2016
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 31 Aug 2016 13:31:43 -0400
Subject: [scikit-learn] MLPClassifier release
In-Reply-To:
References:
Message-ID:

On 08/30/2016 10:20 PM, Gartland, Myles wrote: > Curious when .18 will be released. I am specifically interested in the MLPClassifier to show my students in class. Not sure I want them on the dev fork just yet.

Release candidate probably next week.

Andy

From t3kcit at gmail.com Wed Aug 31 17:15:01 2016
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 31 Aug 2016 17:15:01 -0400
Subject: [scikit-learn] Declaring numpy and scipy dependencies?
In-Reply-To:
References: <195faf56-d8c6-49e0-7fd7-5bb4f1b22931@gmail.com> <98971054-939E-416C-BA47-AE5AD515E170@sebastianraschka.com> <705a27d4-3643-bc9b-11a8-80ba0f6752bf@gmail.com>
Message-ID: <36f5d0ef-397d-f5bc-c312-19793482fb06@gmail.com>

On 07/28/2016 03:16 PM, Matthew Brett wrote: > On Thu, Jul 28, 2016 at 8:10 PM, Andreas Mueller wrote: >> >> On 07/28/2016 03:04 PM, Matthew Brett wrote: >>> On Thu, Jul 28, 2016 at 7:55 PM, Sebastian Raschka >>> wrote: >>>> I think that should work fine for the `pip install scikit-learn`, >>>> however, I think the problem was with upgrading, right? >>>> E.g., if you run >>>> >>>> pip install scikit-learn --upgrade >>>> >>>> it would try to upgrade numpy and scipy as well, which may not be >>>> desired. I think the only workaround would be to run >>>> >>>> pip install scikit-learn --upgrade --no-deps >>>> >>>> unless they changed the behavior recently. I mean, it's not really a >>>> problem, but many users may not know about the --no-deps flag.
>>>> >>> Also - the install will work fine for platforms with wheels, but is >>> still bad for platforms without - like the Raspberry Pi. >> Hm... so these would be ARM wheels? Or Raspberry Pi specific ones? > No, they'd have to be Raspberry Pi specific ones because no-one has > worked out a general ARM-wide specification, as we have for Intel > Linux = manylinux1. >

Following up on this thread, I'm trying to write better installation instructions. https://github.com/scikit-learn/scikit-learn/pull/7313

What's the best-practice for cases when there are no wheels? I imagine there's also no conda channel for Raspberry Pi. So is it the package manager?

Andy

From douglas.chan at ieee.org Wed Aug 31 19:52:17 2016
From: douglas.chan at ieee.org (Douglas Chan)
Date: Wed, 31 Aug 2016 16:52:17 -0700
Subject: [scikit-learn] Gradient Boosting: Feature Importances do not sum to 1
In-Reply-To:
References:
Message-ID:

Thanks for your reply, Raphael. Here's some code using the Boston dataset to reproduce this.

=== START CODE ===
import numpy as np
from sklearn import datasets
from sklearn.ensemble import GradientBoostingRegressor

boston = datasets.load_boston()
X, Y = (boston.data, boston.target)

n_estimators = 712  # Note: From 712 onwards, the feature importance sum is less than 1

params = {'n_estimators': n_estimators, 'max_depth': 6, 'learning_rate': 0.1}
clf = GradientBoostingRegressor(**params)
clf.fit(X, Y)

feature_importance_sum = np.sum(clf.feature_importances_)
print "At n_estimators = %i, feature importance sum = %f" % (n_estimators, feature_importance_sum)
=== END CODE ===

If we deem this to be an error, I can open a bug to track it. Please share your thoughts on it.

Thank you,
-Doug

From: Raphael C
Sent: Tuesday, August 30, 2016 11:28 PM
To: Scikit-learn user and developer mailing list
Subject: Re: [scikit-learn] Gradient Boosting: Feature Importances do not sum to 1

Can you provide a reproducible example?
Raphael

On Wednesday, August 31, 2016, Douglas Chan wrote:

Hello everyone,

I notice conditions when Feature Importance values do not add up to 1 in ensemble tree methods, like Gradient Boosting Trees or AdaBoost Trees. I wonder if there's a bug in the code.

This error occurs when the ensemble has a large number of estimators. The exact conditions depend variously. For example, the error shows up sooner with a smaller amount of training samples. Or, if the depth of the tree is large.

When this error appears, the predicted value seems to have converged. But it's unclear if the error is causing the predicted value not to change with more estimators. In fact, the feature importance sum goes lower and lower with more estimators thereafter.

I wonder if we're hitting some floating point calculation error.

Looking forward to hear your thoughts on this.

Thank you!
-Doug

--------------------------------------------------------------------------------

_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
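Not a confirmed diagnosis, but one cheap thing to check against the snippet above (this is only an illustration and reuses `clf` from that code): whether some of the later trees in the ensemble stop splitting altogether. A tree with no splits has an all-zero normalised importance vector, and if the ensemble value is an average over the per-tree importances, every such tree pulls the overall sum below 1.

=== START CODE ===
import numpy as np

# clf is the fitted GradientBoostingRegressor from the example above.
# estimators_ is a 2-D array of fitted regression trees (one column for regression).
per_tree_sums = np.array([tree.feature_importances_.sum()
                          for stage in clf.estimators_
                          for tree in stage])
print("trees with an all-zero importance vector: %d" % np.sum(per_tree_sums == 0))
print("ensemble feature importance sum: %f" % clf.feature_importances_.sum())

# If a vector summing to 1 is needed downstream, renormalising is a workaround:
importances = clf.feature_importances_ / clf.feature_importances_.sum()
=== END CODE ===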