From gael.varoquaux at normalesup.org Wed Aug 1 01:38:37 2018 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Wed, 1 Aug 2018 07:38:37 +0200 Subject: [scikit-learn] Query about an algorithm In-Reply-To: <246651338.109874.1533080955512@mail.yahoo.com> References: <246651338.109874.1533080955512.ref@mail.yahoo.com> <246651338.109874.1533080955512@mail.yahoo.com> Message-ID: <20180801053837.fld4f7eropekxxfy@phare.normalesup.org> You'll find generic optimization algorithms in scipy.optimize, and not in scikit-learn. Best, Ga?l On Tue, Jul 31, 2018 at 11:49:15PM +0000, Shantanu Bhattacharya via scikit-learn wrote: > Hello, > I am new to this mailing list. I would like to understand the algorithms > provided. > Is second order gradient descent with hessian error matrix supported by this > library? > I went through the documentation, but did not find it. Are you able to confirm > or direct me to some place that might have it? > Look forward to your thoughts > Kind regards > Shantanu > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Senior Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From sarah.zaranek at gmail.com Wed Aug 1 15:11:35 2018 From: sarah.zaranek at gmail.com (Sarah Wait Zaranek) Date: Wed, 1 Aug 2018 15:11:35 -0400 Subject: [scikit-learn] One-hot encoding In-Reply-To: References: Message-ID: Hello, I have installed the dev version (0.20.dev0), should I just use Categorical Encoder or is the functionality already rolled up into OneHotEncoder. I get the following message: File "", line 1, in File "/scikit-learn/sklearn/preprocessing/data.py", line 2839, in *init* "CategoricalEncoder briefly existed in 0.20dev. Its functionality " RuntimeError: CategoricalEncoder briefly existed in 0.20dev. Its functionality has been rolled into the OneHotEncoder and OrdinalEncoder. This stub will be removed in version 0.21. Cheers, Sarah On Mon, Feb 5, 2018 at 10:46 PM, Sarah Wait Zaranek wrote: > Thanks, this makes sense. I will try using the CategoricalEncoder to see > the difference. It wouldn't be such a big deal if my input matrix wasn't so > large. Thanks again for all your help. > > Cheers, > Sarah > > On Mon, Feb 5, 2018 at 10:33 PM, Joel Nothman > wrote: > >> Yes, the output CSR representation requires: >> 1 (dtype) value per entry >> 1 int32 per entry >> 1 int32 per row >> >> The intermediate COO representation requires: >> 1 (dtype) value per entry >> 2 int32 per entry >> >> So as long as the transformation from COO to CSR is done over the whole >> data, it will occupy roughly 5x the input size, which is exactly what you >> are experienciong. >> >> The CategoricalEncoder currently available in the development version of >> scikit-learn does not have this problem, but might be slower due to >> handling non-integer categories. It will also possibly disappear and be >> merged into OneHotEncoder soon (see PR #10523). >> >> >> >> On 6 February 2018 at 13:53, Sarah Wait Zaranek >> wrote: >> >>> Yes, of course. What I mean is the I start out with 19 Gigs (initial >>> matrix size) or so, it balloons to 100 Gigs *within the encoder function* >>> and returns 28 Gigs (sparse one-hot matrix size). These numbers aren't >>> exact, but you can see my point. 
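A minimal sketch of the column-chunking workaround described above (the shapes, the number of chunks, and the integer coding are placeholder assumptions, not the actual data):

    import numpy as np
    from scipy import sparse
    from sklearn.preprocessing import OneHotEncoder

    X = np.random.randint(0, 10, size=(1000, 200))  # stand-in for the real matrix

    # encode a slice of columns at a time, then stitch the sparse pieces together
    chunks = []
    for cols in np.array_split(np.arange(X.shape[1]), 5):
        enc = OneHotEncoder()  # sparse output is the default
        chunks.append(enc.fit_transform(X[:, cols]))

    X_onehot = sparse.hstack(chunks, format='csr')

This keeps the intermediate COO-to-CSR overhead proportional to one chunk rather than to the whole matrix.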
>>> >>> Cheers, >>> Sarah >>> >>> On Mon, Feb 5, 2018 at 9:50 PM, Joel Nothman >>> wrote: >>> >>>> OneHotEncoder will not magically reduce the size of your input. It will >>>> necessarily increase the memory of the input data as long as we are storing >>>> the results in scipy.sparse matrices. The sparse representation will be >>>> less expensive than the dense representation, but it won't be less >>>> expensive than the input. >>>> >>>> On 6 February 2018 at 13:24, Sarah Wait Zaranek < >>>> sarah.zaranek at gmail.com> wrote: >>>> >>>>> Hi Joel - >>>>> >>>>> I am also seeing a huge overhead in memory for calling the >>>>> onehot-encoder. I have hacked it by running it splitting by matrix into >>>>> 4-5 smaller matrices (by columns) and then concatenating the results. But, >>>>> I am seeing upwards of 100 Gigs overhead. Should I file a bug report? Or >>>>> is this to be expected. >>>>> >>>>> Cheers, >>>>> Sarah >>>>> >>>>> On Mon, Feb 5, 2018 at 1:05 AM, Sarah Wait Zaranek < >>>>> sarah.zaranek at gmail.com> wrote: >>>>> >>>>>> Great. Thank you for all your help. >>>>>> >>>>>> Cheers, >>>>>> Sarah >>>>>> >>>>>> On Mon, Feb 5, 2018 at 12:56 AM, Joel Nothman >>>>> > wrote: >>>>>> >>>>>>> If you specify n_values=[list_of_vals_for_column1, >>>>>>> list_of_vals_for_column2], you should be able to engineer it to how you >>>>>>> want. >>>>>>> >>>>>>> On 5 February 2018 at 16:31, Sarah Wait Zaranek < >>>>>>> sarah.zaranek at gmail.com> wrote: >>>>>>> >>>>>>>> If I use the n+1 approach, then I get the correct matrix, except >>>>>>>> with the columns of zeros: >>>>>>>> >>>>>>>> >>> test >>>>>>>> array([[0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1.], >>>>>>>> [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0.], >>>>>>>> [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0.], >>>>>>>> [0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., >>>>>>>> 0.]]) >>>>>>>> >>>>>>>> >>>>>>>> On Mon, Feb 5, 2018 at 12:25 AM, Sarah Wait Zaranek < >>>>>>>> sarah.zaranek at gmail.com> wrote: >>>>>>>> >>>>>>>>> Hi Joel - >>>>>>>>> >>>>>>>>> Conceptually, that makes sense. But when I assign n_values, I >>>>>>>>> can't make it match the result when you don't specify them. See below. I >>>>>>>>> used the number of unique levels per column. >>>>>>>>> >>>>>>>>> >>> enc = OneHotEncoder(sparse=False) >>>>>>>>> >>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, >>>>>>>>> 0, 2]]) >>>>>>>>> >>> test >>>>>>>>> array([[0., 0., 1., 1., 0., 0., 0., 0., 1.], >>>>>>>>> [0., 1., 0., 0., 1., 1., 0., 0., 0.], >>>>>>>>> [1., 0., 0., 0., 1., 0., 1., 0., 0.], >>>>>>>>> [0., 1., 0., 1., 0., 0., 0., 1., 0.]]) >>>>>>>>> >>> enc = OneHotEncoder(sparse=False,n_values=[3,2,4]) >>>>>>>>> >>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, >>>>>>>>> 0, 2]]) >>>>>>>>> >>> test >>>>>>>>> array([[0., 0., 0., 1., 0., 0., 0., 1., 1.], >>>>>>>>> [0., 1., 0., 0., 0., 2., 0., 0., 0.], >>>>>>>>> [1., 0., 0., 0., 0., 1., 1., 0., 0.], >>>>>>>>> [0., 1., 0., 1., 0., 0., 0., 1., 0.]]) >>>>>>>>> >>>>>>>>> Cheers, >>>>>>>>> Sarah >>>>>>>>> >>>>>>>>> Cheers, >>>>>>>>> Sarah >>>>>>>>> >>>>>>>>> On Mon, Feb 5, 2018 at 12:02 AM, Joel Nothman < >>>>>>>>> joel.nothman at gmail.com> wrote: >>>>>>>>> >>>>>>>>>> If each input column is encoded as a value from 0 to the (number >>>>>>>>>> of possible values for that column - 1) then n_values for that column >>>>>>>>>> should be the highest value + 1, which is also the number of levels per >>>>>>>>>> column. Does that make sense? 
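With the data above, a quick check of that rule (the highest value per column is 7, 2, 3, so n_values = [8, 3, 4]):

    from sklearn.preprocessing import OneHotEncoder

    X = [[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]]

    # n_values = highest value per column + 1
    enc = OneHotEncoder(sparse=False, n_values=[8, 3, 4])
    print(enc.fit_transform(X))
    # 8 + 3 + 4 = 15 columns: the same indicators as the auto-fitted
    # encoder, plus all-zero columns for the levels (e.g. 2..6 in the
    # first feature) that never occur in the data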
>>>>>>>>>> >>>>>>>>>> Actually, I've realised there's a somewhat slow and unnecessary >>>>>>>>>> bit of code in the one-hot encoder: where the COO matrix is converted to >>>>>>>>>> CSR. I suspect this was done because most of our ML algorithms perform >>>>>>>>>> better on CSR, or else to maintain backwards compatibility with an earlier >>>>>>>>>> implementation. >>>>>>>>>> >>>>>>>>>> _______________________________________________ >>>>>>>>>> scikit-learn mailing list >>>>>>>>>> scikit-learn at python.org >>>>>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> scikit-learn mailing list >>>>>>>> scikit-learn at python.org >>>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> scikit-learn mailing list >>>>>>> scikit-learn at python.org >>>>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>>>> >>>>>>> >>>>>> >>>>> >>>>> _______________________________________________ >>>>> scikit-learn mailing list >>>>> scikit-learn at python.org >>>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>>> >>>>> >>>> >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Wed Aug 1 17:29:34 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Thu, 2 Aug 2018 07:29:34 +1000 Subject: [scikit-learn] One-hot encoding In-Reply-To: References: Message-ID: Use OneHotEncoder -------------- next part -------------- An HTML attachment was scrubbed... URL: From sarah.zaranek at gmail.com Wed Aug 1 19:19:26 2018 From: sarah.zaranek at gmail.com (Sarah Wait Zaranek) Date: Wed, 1 Aug 2018 19:19:26 -0400 Subject: [scikit-learn] One-hot encoding In-Reply-To: References: Message-ID: In the developer version, yes? Looking for the new memory savings :) On Wed, Aug 1, 2018, 17:29 Joel Nothman wrote: > Use OneHotEncoder > -------------- next part -------------- An HTML attachment was scrubbed... URL: From david.mo.burns at gmail.com Thu Aug 2 01:25:54 2018 From: david.mo.burns at gmail.com (David Burns) Date: Thu, 2 Aug 2018 01:25:54 -0400 Subject: [scikit-learn] pipeline for modifying target and number of samples Message-ID: Hi, I posted a while back about this, and am reposting now since I have made progress on this topic. As you are probably aware, the sklearn Pipeline only supports transformers for X, and the number of samples must stay the same. I work with time series where the learning pipeline relies on transformations like resampling, segmentation, etc that change the target and number of samples in the data set. In order to address this, I created an sklearn compatible pipeline that handles transformers that alter X, y, and sample_weight together. It can undergo model selection using the sklearn tools, and integrates with all the sklearn transformers and estimators. 
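An illustration of the kind of transformer involved (a sketch only, not seglearn's actual API): a sliding-window segmenter has to change X, y, and the number of samples together, which is exactly what a stock sklearn Pipeline cannot express.

    import numpy as np

    def segment(X, y, width=100, step=50):
        # each window becomes one new sample, and a new target is derived
        # per window, so X and y must be transformed jointly
        starts = range(0, len(X) - width + 1, step)
        X_seg = np.stack([X[s:s + width] for s in starts])
        y_seg = np.array([y[s + width - 1] for s in starts])  # label of last point
        return X_seg, y_seg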
It also has some new options for setting hyper-parameters with callables and in reference to other parameters. The implementation is in my time series package seglearn: https://github.com/dmbee/seglearn - Best David Burns From joel.nothman at gmail.com Thu Aug 2 01:48:54 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Thu, 2 Aug 2018 15:48:54 +1000 Subject: [scikit-learn] pipeline for modifying target and number of samples In-Reply-To: References: Message-ID: But you can't use cross_validate(seglearn.Pype(...), X, y) in general, can you, if the Pype changes the samples and their correspondence to the input y arbitrarily at both train and predict time? ? -------------- next part -------------- An HTML attachment was scrubbed... URL: From sarah.zaranek at gmail.com Thu Aug 2 06:53:28 2018 From: sarah.zaranek at gmail.com (Sarah Wait Zaranek) Date: Thu, 2 Aug 2018 06:53:28 -0400 Subject: [scikit-learn] One-hot encoding In-Reply-To: References: Message-ID: Hi Joel - Are you sure? I ran it and it actually uses bit more memory instead of less, same code just run with a different docker container. Max memory used by a single task: 50.41GB vs Max memory used by a single task: 51.15GB Cheers, Sarah On Wed, Aug 1, 2018 at 7:19 PM, Sarah Wait Zaranek wrote: > In the developer version, yes? Looking for the new memory savings :) > > On Wed, Aug 1, 2018, 17:29 Joel Nothman wrote: > >> Use OneHotEncoder >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From fernando.wittmann at gmail.com Fri Aug 3 07:52:53 2018 From: fernando.wittmann at gmail.com (Fernando Marcos Wittmann) Date: Fri, 3 Aug 2018 13:52:53 +0200 Subject: [scikit-learn] One-hot encoding In-Reply-To: References: Message-ID: Hi Sarah, I have some reflection questions. You don't need to answer all of them :) how many categories (approximately) do you have in each of those 20M categorical variables? How many samples do you have? Maybe you should consider different encoding strategies such as binary encoding. Also, this looks like a big data problem. Have you considered using distributed computing? Also, do you really need to use all of those 20M variables in your first approach? Consider using feature selection techniques. I would suggest that you start with something simpler with less features and that run more easily in your machine. Then later you can starting adding more complexity if necessary. Keep in mind that if the number of samples is lower than the number of columns after one hot encoding, you might face overfitting. Try to always have less columns than the number of samples. On Aug 2, 2018 12:53, "Sarah Wait Zaranek" wrote: Hi Joel - Are you sure? I ran it and it actually uses bit more memory instead of less, same code just run with a different docker container. Max memory used by a single task: 50.41GB vs Max memory used by a single task: 51.15GB Cheers, Sarah On Wed, Aug 1, 2018 at 7:19 PM, Sarah Wait Zaranek wrote: > In the developer version, yes? Looking for the new memory savings :) > > On Wed, Aug 1, 2018, 17:29 Joel Nothman wrote: > >> Use OneHotEncoder >> > _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From sarah.zaranek at gmail.com Fri Aug 3 08:20:10 2018 From: sarah.zaranek at gmail.com (Sarah Wait Zaranek) Date: Fri, 3 Aug 2018 08:20:10 -0400 Subject: [scikit-learn] One-hot encoding In-Reply-To: References: Message-ID: Hi all - I can't do binary encoding because I need to trace back to the exact categorical variable and that is difficult in binary encoding, I believe. Each categorical variable has a range, but on average it is about 10 categories. I return a sparse matrix from the encoder. Regardless of the encoding strategy, the issue is the overhead of the encoding itself not the resulting encoded matrix so using an encoding which is slightly smaller isn't going to solve my issue as far as I am aware. We have done the tests just with the integer representation of the categorical variable and the results are unsatisfying. If there is an encoder that isn't lossy - that I can get my original category back and not see as large a memory requirement as with the one-hot creation -- I am happy to try it out. Yes, I need this many variables, and there are all categorical. In my world, we have short and wide matrices-- it is very common. Unfortunately, I need to do feature selection techniques on the encoded version of the data - which I could possible do in parts for so of the feature selection techniques but for the ones I really want to use I need the entire matrix (think very large ReliefF). I already have a working version on my machine with less data (and btw my "machine" is one of the biggest instances available in my region with 400GB+ of RAM). I am eventually moving to using a distributed computing solution (mLlib+Spark), but I wanted to see what I could do in scikit-learn before I went there. Of course, I am aware of overfitting issues-- we do regularization and cross validation, etc. I just thought it was unfortunately that the thing holding my analysis back from using scikit-learn wasn't the machine learning but the encoding algorithm, memory requirements. Cheers, Sarah On Fri, Aug 3, 2018 at 7:52 AM, Fernando Marcos Wittmann < fernando.wittmann at gmail.com> wrote: > Hi Sarah, I have some reflection questions. You don't need to answer all > of them :) how many categories (approximately) do you have in each of those > 20M categorical variables? How many samples do you have? Maybe you should > consider different encoding strategies such as binary encoding. Also, this > looks like a big data problem. Have you considered using distributed > computing? Also, do you really need to use all of those 20M variables in > your first approach? Consider using feature selection techniques. I would > suggest that you start with something simpler with less features and that > run more easily in your machine. Then later you can starting adding more > complexity if necessary. Keep in mind that if the number of samples is > lower than the number of columns after one hot encoding, you might face > overfitting. Try to always have less columns than the number of samples. > > On Aug 2, 2018 12:53, "Sarah Wait Zaranek" > wrote: > > Hi Joel - > > Are you sure? I ran it and it actually uses bit more memory instead of > less, same code just run with a different docker container. > > Max memory used by a single task: 50.41GB > vs > Max memory used by a single task: 51.15GB > > Cheers, > Sarah > > On Wed, Aug 1, 2018 at 7:19 PM, Sarah Wait Zaranek < > sarah.zaranek at gmail.com> wrote: > >> In the developer version, yes? 
Looking for the new memory savings :) >> >> On Wed, Aug 1, 2018, 17:29 Joel Nothman wrote: >> >>> Use OneHotEncoder >>> >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Sat Aug 4 09:45:44 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Sat, 4 Aug 2018 15:45:44 +0200 Subject: [scikit-learn] pipeline for modifying target and number of samples In-Reply-To: References: Message-ID: <7ddecae9-aeb3-43ce-02b2-46b205d2bfc9@gmail.com> The naming seems a bit unfortunate with seqlearn ;) On 08/02/2018 07:25 AM, David Burns wrote: > Hi, > > I posted a while back about this, and am reposting now since I have > made progress on this topic. As you are probably aware, the sklearn > Pipeline only supports transformers for X, and the number of samples > must stay the same. > > I work with time series where the learning pipeline relies on > transformations like resampling, segmentation, etc that change the > target and number of samples in the data set. In order to address > this, I created an sklearn compatible pipeline that handles > transformers that alter X, y, and sample_weight together. It can > undergo model selection using the sklearn tools, and integrates with > all the sklearn transformers and estimators. It also has some new > options for setting hyper-parameters with callables and in reference > to other parameters. > > The implementation is in my time series package seglearn: > > https://github.com/dmbee/seglearn > > - Best > > David Burns > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From thouraya87 at gmail.com Sun Aug 5 12:00:51 2018 From: thouraya87 at gmail.com (Thouraya TH) Date: Sun, 5 Aug 2018 17:00:51 +0100 Subject: [scikit-learn] My first Program using sklearn Message-ID: Hi, Please i'd like to use prediction techniques in tenserflow. I have this file: x | y 1| 1 2| 4 4|16 --> prediction techniques must give me this model y=x*x using this model i can predict the value y of x=3 This is a sample example. In my experiment i use a file with 1000 lines. Please, how can transform these lines on codes ? Code: from sklearn.preprocessing import PolynomialFeatures from sklearn import linear_model X = [[1, 2, 4], [1, 4, 16]] poly = PolynomialFeatures(degree=2) X_ = poly.fit_transform(X) Thank you so much for help. Kind regards. -------------- next part -------------- An HTML attachment was scrubbed... URL: From thouraya87 at gmail.com Sun Aug 5 12:01:54 2018 From: thouraya87 at gmail.com (Thouraya TH) Date: Sun, 5 Aug 2018 17:01:54 +0100 Subject: [scikit-learn] My first Program using sklearn In-Reply-To: References: Message-ID: 2018-08-05 17:00 GMT+01:00 Thouraya TH : > Hi, > Please i'd like to use prediction techniques in sklearn . > I have this file: > > > x | y > > 1| 1 > 2| 4 > 4|16 > > > --> prediction techniques must give me this model y=x*x > > > using this model i can predict the value y of x=3 > This is a sample example. In my experiment i use a file with 1000 lines. > Please, how can transform these lines on codes ? 
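For reference, a minimal sketch of the usual shape for this task, with x as a single-feature column and y as the matching targets (one row per line of the file):

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline

    x = np.array([[1], [2], [4]])   # inputs, one row per data line
    y = np.array([1, 4, 16])        # targets

    model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
    model.fit(x, y)
    print(model.predict([[3]]))     # ~9.0, i.e. the fit recovers y = x*x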
> > > Code: > > from sklearn.preprocessing import PolynomialFeatures > > from sklearn import linear_model > > > > X = [[1, 2, 4], [1, 4, 16]] > > > > poly = PolynomialFeatures(degree=2) > > > > X_ = poly.fit_transform(X) > > > Thank you so much for help. > Kind regards. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tahoangtrung at gmail.com Mon Aug 6 22:55:08 2018 From: tahoangtrung at gmail.com (hoang trung Ta) Date: Tue, 7 Aug 2018 11:55:08 +0900 Subject: [scikit-learn] Random forest classification source code Message-ID: Dear all members, I am running Random forest for my research using Scikit learn. I want to see the source code of the random forest in Scikit learn to understand how it works, thus *Can you tell me where I can find this source code, please ?* Thank you very much -- *Ta Hoang Trung (Mr)* *Master student* Graduate School of Life and Environmental Sciences University of Tsukuba, Japan Mobile: +81 70 3846 2993 Email : ta.hoang-trung.xm at alumni.tsukuba.ac.jp tahoangtrung at gmail.com s1626066 at u.tsukuba.ac.jp *----* *Mapping Technician* Department of Surveying and Mapping Vietnam No 2, Dang Thuy Tram street, Hanoi, Viet Nam Mobile: +84 1255151344 Email : tahoangtrung at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbc.develop at gmail.com Mon Aug 6 23:00:20 2018 From: jbc.develop at gmail.com (Juan BC) Date: Tue, 7 Aug 2018 00:00:20 -0300 Subject: [scikit-learn] Random forest classification source code In-Reply-To: References: Message-ID: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/forest.py On Mon, 6 Aug 2018 at 23:56 hoang trung Ta wrote: > Dear all members, > > I am running Random forest for my research using Scikit learn. > > I want to see the source code of the random forest in Scikit learn to > understand how it works, thus > *Can you tell me where I can find this source code, please ?* > > Thank you very much > > -- > *Ta Hoang Trung (Mr)* > > *Master student* > Graduate School of Life and Environmental Sciences > University of Tsukuba, Japan > > Mobile: +81 70 3846 2993 <+81%2070-3846-2993> > Email : ta.hoang-trung.xm at alumni.tsukuba.ac.jp > tahoangtrung at gmail.com > s1626066 at u.tsukuba.ac.jp > > *----* > *Mapping Technician* > Department of Surveying and Mapping Vietnam > No 2, Dang Thuy Tram street, Hanoi, Viet Nam > > Mobile: +84 1255151344 <+84%20125%20515%201344> > Email : tahoangtrung at gmail.com > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Juan BC (from phone) -------------- next part -------------- An HTML attachment was scrubbed... URL: From christophe at pallier.org Tue Aug 7 06:43:27 2018 From: christophe at pallier.org (Christophe Pallier) Date: Tue, 7 Aug 2018 12:43:27 +0200 Subject: [scikit-learn] RidgeCV with multiple targets returns a single alpha. Is it possible to get one alpha per target? Message-ID: Hello, I'd like to use RidgeCV to find the optimal alpha for each colunm (ntargets) of the DV variable. It lloks like itthe fit() computes a single alpha. Is there a way to compute one alpha per column? -- -- Christophe Pallier INSERM-CEA Cognitive Neuroimaging Lab, Neurospin, bat 145, 91191 Gif-sur-Yvette Cedex, France Tel: 00 33 1 69 08 79 34 Personal web site: http://www.pallier.org Lab web site: http://www.unicog.org -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From chrismancuso1984 at gmail.com Tue Aug 7 08:53:43 2018 From: chrismancuso1984 at gmail.com (Christopher Mancuso) Date: Tue, 7 Aug 2018 08:53:43 -0400 Subject: [scikit-learn] RidgeCV with multiple targets returns a single alpha. Is it possible to get one alpha per target? In-Reply-To: References: Message-ID: > On Aug 7, 2018, at 6:43 AM, Christophe Pallier wrote: > > Hello, > > I'd like to use RidgeCV to find the optimal alpha for each colunm (ntargets) of the DV variable. > > It lloks like itthe fit() computes a single alpha. Is there a way to compute one alpha per column? > > > > > -- > -- > Christophe Pallier > INSERM-CEA Cognitive Neuroimaging Lab, Neurospin, bat 145, > 91191 Gif-sur-Yvette Cedex, France > Tel: 00 33 1 69 08 79 34 > Personal web site: http://www.pallier.org > Lab web site: http://www.unicog.org > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From alexandre.gramfort at inria.fr Tue Aug 7 09:05:41 2018 From: alexandre.gramfort at inria.fr (Alexandre Gramfort) Date: Tue, 7 Aug 2018 15:05:41 +0200 Subject: [scikit-learn] RidgeCV with multiple targets returns a single alpha. Is it possible to get one alpha per target? In-Reply-To: References: Message-ID: you should call RidgeCV on all targets separately. HTH Alex On Tue, Aug 7, 2018 at 12:46 PM Christophe Pallier wrote: > > Hello, > > I'd like to use RidgeCV to find the optimal alpha for each colunm (ntargets) of the DV variable. > > It lloks like itthe fit() computes a single alpha. Is there a way to compute one alpha per column? > > > > > -- > -- > Christophe Pallier > INSERM-CEA Cognitive Neuroimaging Lab, Neurospin, bat 145, > 91191 Gif-sur-Yvette Cedex, France > Tel: 00 33 1 69 08 79 34 > Personal web site: http://www.pallier.org > Lab web site: http://www.unicog.org > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From prat2 at umbc.edu Tue Aug 7 11:38:10 2018 From: prat2 at umbc.edu (Prathusha Jonnagaddla Subramanyam Naidu) Date: Tue, 7 Aug 2018 11:38:10 -0400 Subject: [scikit-learn] Normalize an image with 3 channels Message-ID: Hi everyone, I'm trying to extract features from images using VGG and I have to normalize each image array before doing that. Each array is of the size (224, 224 ,3) and I'm unable to use the MinMaxScalar in this case. Would appreciate any help with this . Thank you ! -- Regards, Prathusha JS Naidu -------------- next part -------------- An HTML attachment was scrubbed... URL: From michael.eickenberg at gmail.com Tue Aug 7 13:24:18 2018 From: michael.eickenberg at gmail.com (Michael Eickenberg) Date: Tue, 7 Aug 2018 10:24:18 -0700 Subject: [scikit-learn] RidgeCV with multiple targets returns a single alpha. Is it possible to get one alpha per target? In-Reply-To: References: Message-ID: You can get one alpha per target in the Ridge estimator (without CV). Then you would have to code the cv loop yourself. Depending on how many target you have this can be more efficient than looping over targets as Alex suggests. Either way there is some coding to do unfortunately. Michael On Tue, Aug 7, 2018 at 6:05 AM, Alexandre Gramfort < alexandre.gramfort at inria.fr> wrote: > you should call RidgeCV on all targets separately. 
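A minimal sketch of that per-target loop (the data here is a random placeholder):

    import numpy as np
    from sklearn.linear_model import Ridge, RidgeCV

    rng = np.random.RandomState(0)
    X, Y = rng.randn(100, 20), rng.randn(100, 5)   # Y: one column per target

    alphas = np.logspace(-3, 3, 13)
    best_alphas = [RidgeCV(alphas=alphas).fit(X, Y[:, j]).alpha_
                   for j in range(Y.shape[1])]

    # Ridge (without CV) accepts one alpha per target, so the selected
    # values can then be used in a single joint refit:
    model = Ridge(alpha=np.array(best_alphas)).fit(X, Y)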
> > HTH > Alex > > On Tue, Aug 7, 2018 at 12:46 PM Christophe Pallier > wrote: > > > > Hello, > > > > I'd like to use RidgeCV to find the optimal alpha for each colunm > (ntargets) of the DV variable. > > > > It lloks like itthe fit() computes a single alpha. Is there a way to > compute one alpha per column? > > > > > > > > > > -- > > -- > > Christophe Pallier > > INSERM-CEA Cognitive Neuroimaging Lab, Neurospin, bat 145, > > 91191 Gif-sur-Yvette Cedex, France > > Tel: 00 33 1 69 08 79 34 > > Personal web site: http://www.pallier.org > > Lab web site: http://www.unicog.org > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From fellypao at yahoo.com.br Tue Aug 7 13:35:43 2018 From: fellypao at yahoo.com.br (Fellype) Date: Tue, 7 Aug 2018 17:35:43 +0000 (UTC) Subject: [scikit-learn] Small suggestion for documentation References: <1426595031.3378809.1533663343561.ref@mail.yahoo.com> Message-ID: <1426595031.3378809.1533663343561@mail.yahoo.com> Dear maintainers,I've just known scikit-learn and found it very useful. Congratulations for this library. I found some confuse terms to describe r2_score parameters in documentation [1]. For me, the meanings of y_true and y_pred are not clear. From [1]:- y_true: ... Ground truth (correct) target values- y_pred: ... Estimated target values Since the R^2 value is usually used to compare the behavior of experimental data (observed) with a theoretical model or standard data (expected), I guess that it would be better to change the description of y_true and y_pred to something like:- y_true: ... Observed (or measured) target values- y_pred: ... Expected (or theoretical) target values I also think that?the same should be done in documentation of other scikit-learn functions that use the y_true and y_pred terms with the same meaning. Thanks for your attention and best wishes. Fellype [1] http://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Tue Aug 7 14:23:48 2018 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Tue, 07 Aug 2018 11:23:48 -0700 Subject: [scikit-learn] Small suggestion for documentation In-Reply-To: <1426595031.3378809.1533663343561@mail.yahoo.com> References: <1426595031.3378809.1533663343561.ref@mail.yahoo.com> <1426595031.3378809.1533663343561@mail.yahoo.com> Message-ID: I think that the vocabulary mismatch comes from the fact that you are looking at these terms thinking about in sample statistics, while they are used here in the context of prediction. I think that in the context of prediction, these are the right terms. Cheers, Ga?l ?Sent from my phone. Please forgive typos and briefness.? On Aug 7, 2018, 10:40, at 10:40, Fellype via scikit-learn wrote: >Dear maintainers,I've just known scikit-learn and found it very useful. >Congratulations for this library. >I found some confuse terms to describe r2_score parameters in >documentation [1]. For me, the meanings of y_true and y_pred are not >clear. From [1]:- y_true: ... Ground truth (correct) target values- >y_pred: ... 
Estimated target values >Since the R^2 value is usually used to compare the behavior of >experimental data (observed) with a theoretical model or standard data >(expected), I guess that it would be better to change the description >of y_true and y_pred to something like:- y_true: ... Observed (or >measured) target values- y_pred: ... Expected (or theoretical) target >values >I also think that the same should be done in documentation of other >scikit-learn functions that use the y_true and y_pred terms with the >same meaning. > >Thanks for your attention and best wishes. >Fellype > >[1] >http://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html > >------------------------------------------------------------------------ > >_______________________________________________ >scikit-learn mailing list >scikit-learn at python.org >https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From fellypao at yahoo.com.br Tue Aug 7 15:46:49 2018 From: fellypao at yahoo.com.br (Fellype) Date: Tue, 7 Aug 2018 19:46:49 +0000 (UTC) Subject: [scikit-learn] Small suggestion for documentation In-Reply-To: References: <1426595031.3378809.1533663343561.ref@mail.yahoo.com> <1426595031.3378809.1533663343561@mail.yahoo.com> Message-ID: <755204610.3442567.1533671209582@mail.yahoo.com> Hi Gaël, > On Tuesday, 7 August 2018 at 15:24:00 BRT, Gael Varoquaux wrote: > > > I think that the vocabulary mismatch comes from the fact that you are looking at these terms thinking about in sample statistics, while they are used here in the context of prediction. I think that in the context of prediction, these are the right terms. Actually, I was looking at the terms in the context of curve fitting, which is based on statistics, of course... Fellype -------------- next part -------------- An HTML attachment was scrubbed... URL: From christophe at pallier.org Wed Aug 8 04:39:47 2018 From: christophe at pallier.org (Christophe Pallier) Date: Wed, 8 Aug 2018 10:39:47 +0200 Subject: [scikit-learn] RidgeCV and cv=LeaveOneGroupOut Message-ID: Hello, I do not manage to find out how to use a LeaveOneGroupOut cross-validation strategy with RidgeCV. The problem is that I do not see how to pass the groups mapping. Any hint? The approach by GridSearch is way too slow to be practical. Thanks in advance! PS: Thanks Alex for the response to my previous question. By the way, the context is fMRI data, so I am running the Ridge regression on about 200,000 voxels. -- -- Christophe Pallier INSERM-CEA Cognitive Neuroimaging Lab, Neurospin, bat 145, 91191 Gif-sur-Yvette Cedex, France Tel: 00 33 1 69 08 79 34 Personal web site: http://www.pallier.org Lab web site: http://www.unicog.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From pranavashok at gmail.com Wed Aug 8 05:41:27 2018 From: pranavashok at gmail.com (Pranav Ashok) Date: Wed, 8 Aug 2018 11:41:27 +0200 Subject: [scikit-learn] Decision trees with only pure leaves Message-ID: I am trying to use scikit-learn for building decision trees representing fully defined many-to-one functions, i.e. f(x1, x2, ..., xn) = f(x1', x2', ..., xn') whenever x1 = x1', x2 = x2', and so on. In such a scenario, it seems clear that it is possible to construct trees which have pure leaves as long as min_samples_split = 2 and I am not setting any other parameters which might stop the splitting.
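This can be checked directly on a fitted tree; a small sketch with a made-up deterministic labeling:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.RandomState(0)
    X = rng.randint(0, 2, size=(200, 5))
    y = X[:, 0] ^ X[:, 1]            # labels are a deterministic function of X

    tree = DecisionTreeClassifier(min_samples_split=2).fit(X, y)
    is_leaf = tree.tree_.children_left == -1
    print(tree.tree_.impurity[is_leaf].max())   # 0.0: every leaf is pure
    print(tree.score(X, y))                     # 1.0: predict(x) == f(x) on X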
My question is whether the decision tree builder in scikit-learn can indeed give me a perfect representation, i.e where all leaf nodes are pure. This would imply that tree.predict(x) = f(x). Thanks and best regards. -------------- next part -------------- An HTML attachment was scrubbed... URL: From alexandre.gramfort at inria.fr Wed Aug 8 09:06:24 2018 From: alexandre.gramfort at inria.fr (Alexandre Gramfort) Date: Wed, 8 Aug 2018 15:06:24 +0200 Subject: [scikit-learn] RidgeCV and cv=LeaveOneGroupOut In-Reply-To: References: Message-ID: you cannot do this indeed. The groups cannot be passed to the cv.split method as they are not exposed as samples_props in the fit. This is still a hole in our API... From tahoangtrung at gmail.com Wed Aug 8 20:50:57 2018 From: tahoangtrung at gmail.com (hoang trung Ta) Date: Thu, 9 Aug 2018 09:50:57 +0900 Subject: [scikit-learn] Using GPU in scikit learn Message-ID: Dear all members, I am using Random forest for classification satellite images. I have a bunch of images, thus the processing is quite slow. I searched on the Internet and they said that GPU can accelerate the process. I have GPU NDVIA Geforce GTX 1080 Ti installed in the computer Do you know how to use GPU in Scikit learn, I mean the packages to use and sample code that used GPU in random forest classification? Thank you very much -- *Ta Hoang Trung (Mr)* *Master student* Graduate School of Life and Environmental Sciences University of Tsukuba, Japan Mobile: +81 70 3846 2993 Email : ta.hoang-trung.xm at alumni.tsukuba.ac.jp tahoangtrung at gmail.com s1626066 at u.tsukuba.ac.jp *----* *Mapping Technician* Department of Surveying and Mapping Vietnam No 2, Dang Thuy Tram street, Hanoi, Viet Nam Mobile: +84 1255151344 Email : tahoangtrung at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From g.lemaitre58 at gmail.com Wed Aug 8 21:30:11 2018 From: g.lemaitre58 at gmail.com (=?ISO-8859-1?Q?Guillaume_Lema=EEtre?=) Date: Thu, 09 Aug 2018 09:30:11 +0800 Subject: [scikit-learn] Using GPU in scikit learn In-Reply-To: Message-ID: An HTML attachment was scrubbed... URL: From mail at sebastianraschka.com Wed Aug 8 21:00:44 2018 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Wed, 8 Aug 2018 20:00:44 -0500 Subject: [scikit-learn] Using GPU in scikit learn In-Reply-To: References: Message-ID: <7A4B573D-43BD-4404-94EC-F6B46C0C2284@sebastianraschka.com> Hi, scikit-learn doesn't support computations on the GPU, unfortunately. Specifically for random forests, there's CudaTree, which implements a GPU version of scikit-learn's random forests. It doesn't look like the library is actively developed (hard to tell whether that's a good thing or a bad thing -- whether it's stable enough that it didn't need any updates). Anyway, maybe worth a try: https://github.com/EasonLiao/CudaTree Otherwise, I can imagine there are probably alternative implementations out there? Best, Sebastian > On Aug 8, 2018, at 7:50 PM, hoang trung Ta wrote: > > Dear all members, > > I am using Random forest for classification satellite images. I have a bunch of images, thus the processing is quite slow. I searched on the Internet and they said that GPU can accelerate the process. > > I have GPU NDVIA Geforce GTX 1080 Ti installed in the computer > > Do you know how to use GPU in Scikit learn, I mean the packages to use and sample code that used GPU in random forest classification? 
> > Thank you very much > > -- > Ta Hoang Trung (Mr) > > Master student > Graduate School of Life and Environmental Sciences > University of Tsukuba, Japan > > Mobile: +81 70 3846 2993 > Email : ta.hoang-trung.xm at alumni.tsukuba.ac.jp > tahoangtrung at gmail.com > s1626066 at u.tsukuba.ac.jp > ---- > Mapping Technician > Department of Surveying and Mapping Vietnam > No 2, Dang Thuy Tram street, Hanoi, Viet Nam > > Mobile: +84 1255151344 > Email : tahoangtrung at gmail.com > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn

From jbbrown at kuhp.kyoto-u.ac.jp Wed Aug 8 21:33:18 2018 From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.) Date: Thu, 9 Aug 2018 10:33:18 +0900 Subject: [scikit-learn] Using GPU in scikit learn In-Reply-To: References: Message-ID:

Dear Ta Hoang, GPU processing can be done with Python libraries such as TensorFlow, Keras, or Theano. However, sklearn's implementation of RandomForestClassifier is outstandingly fast, and a previous effort to develop GPU RandomForest abandoned their efforts as a result: https://github.com/EasonLiao/CudaTree If you need to speed up predictions because of a large dataset, you can combine joblib with sklearn to parallelize the predictions of the individual trees:

####
from joblib import Parallel, delayed
...
predictions = Parallel(n_jobs=n, backend=backend)(
    delayed(your_forest_prediction_func)(func_arguments)
    for tree_group in tree_groups)
####

where n is how many parallel computations you want to execute, and backend is either "threading" or "multiprocessing". Typically, your_forest_prediction_func() would iterate over the collection of trees and prediction objects given in func_arguments using a single thread/process. Hope this helps you parallelize and speed up. Sincerely, J.B. Brown Kyoto University Graduate School of Medicine

2018-08-09 9:50 GMT+09:00 hoang trung Ta : > Dear all members, > > I am using Random forest for classification satellite images. I have a > bunch of images, thus the processing is quite slow. I searched on the > Internet and they said that GPU can accelerate the process. > > I have GPU NDVIA Geforce GTX 1080 Ti installed in the computer > > Do you know how to use GPU in Scikit learn, I mean the packages to use and > sample code that used GPU in random forest classification? > > Thank you very much > > -- > *Ta Hoang Trung (Mr)* > > *Master student* > Graduate School of Life and Environmental Sciences > University of Tsukuba, Japan > > Mobile: +81 70 3846 2993 > Email : ta.hoang-trung.xm at alumni.tsukuba.ac.jp > tahoangtrung at gmail.com > s1626066 at u.tsukuba.ac.jp > > *----* > *Mapping Technician* > Department of Surveying and Mapping Vietnam > No 2, Dang Thuy Tram street, Hanoi, Viet Nam > > Mobile: +84 1255151344 > Email : tahoangtrung at gmail.com > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From tjt7a at virginia.edu Wed Aug 8 22:01:57 2018 From: tjt7a at virginia.edu (Tommy Tracy) Date: Wed, 8 Aug 2018 22:01:57 -0400 Subject: [scikit-learn] Using GPU in scikit learn In-Reply-To: References: Message-ID:

Dear Ta Hoang, Accelerating decision tree ensembles (including Random Forest) is actually a current area of computer architecture research; in fact it is a principal component of my dissertation. Like Sebastian Raschka said, the GPU is not an ideal architecture for decision tree inference because at its core it is a pointer-chasing algorithm (low computation per memory access) that shows low memory locality. Scikit-Learn has done an excellent job with their von Neumann implementation utilizing things like predication and vectorization. If you're looking to go beyond what the CPU can give you, I would point you to FPGAs. If you're interested in discussing this further, let me know. -- -- Sincerely, Tommy James Tracy II Ph.D Candidate Computer Engineering University of Virginia

On Wed, Aug 8, 2018 at 8:50 PM, hoang trung Ta wrote: > Dear all members, > > I am using Random forest for classification satellite images. I have a > bunch of images, thus the processing is quite slow. I searched on the > Internet and they said that GPU can accelerate the process. > > I have GPU NDVIA Geforce GTX 1080 Ti installed in the computer > > Do you know how to use GPU in Scikit learn, I mean the packages to use and > sample code that used GPU in random forest classification? > > Thank you very much > > -- > *Ta Hoang Trung (Mr)* > > *Master student* > Graduate School of Life and Environmental Sciences > University of Tsukuba, Japan > > Mobile: +81 70 3846 2993 > Email : ta.hoang-trung.xm at alumni.tsukuba.ac.jp > tahoangtrung at gmail.com > s1626066 at u.tsukuba.ac.jp > > *----* > *Mapping Technician* > Department of Surveying and Mapping Vietnam > No 2, Dang Thuy Tram street, Hanoi, Viet Nam > > Mobile: +84 1255151344 > Email : tahoangtrung at gmail.com > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed...

URL: From blacklabel29 at web.de Thu Aug 9 03:35:06 2018 From: blacklabel29 at web.de (blacklabel29) Date: Thu, 09 Aug 2018 09:35:06 +0200 Subject: [scikit-learn] Re: Using GPU in scikit learn Message-ID:

Hi, the scikit-learn random forest does not support GPUs. If you want to do image classification using GPU processing, the standard way in this day and age is to use a neural network library like TensorFlow/keras or pytorch. GPUs can be faster than CPUs when the task is SIMD (single instruction multiple data), meaning the same calculation is done many times just on different datapoints. Neural networks are well-suited for such an architecture, decision trees not so much (even though there have been attempts to speed up decision trees using GPUs). So my advice to you depends on how much time you have: If you are willing to invest time to learn about neural networks and the aforementioned libraries, then that is certainly a very valuable skill, especially when looking for a job later on. But if you just need to get your paper done as soon as possible, stick with random forest. Greetings, Patrick

Sent from my Samsung Galaxy smartphone. -------- Original message -------- From: hoang trung Ta Date: 2018/8/9 02:50 (GMT+01:00) To: scikit-learn at python.org Subject: [scikit-learn] Using GPU in scikit learn

Dear all members, I am using Random forest for classification satellite images. I have a bunch of images, thus the processing is quite slow. I searched on the Internet and they said that GPU can accelerate the process. I have GPU NDVIA Geforce GTX 1080 Ti installed in the computer Do you know how to use GPU in Scikit learn, I mean the packages to use and sample code that used GPU in random forest classification? Thank you very much -- Ta Hoang Trung (Mr) Master student Graduate School of Life and Environmental Sciences University of Tsukuba, Japan Mobile: +81 70 3846 2993 Email : ta.hoang-trung.xm at alumni.tsukuba.ac.jp tahoangtrung at gmail.com s1626066 at u.tsukuba.ac.jp ---- Mapping Technician Department of Surveying and Mapping Vietnam No 2, Dang Thuy Tram street, Hanoi, Viet Nam Mobile: +84 1255151344 Email : tahoangtrung at gmail.com -------------- next part -------------- An HTML attachment was scrubbed...

URL: From dixeenalopez at gmail.com Thu Aug 9 05:00:34 2018 From: dixeenalopez at gmail.com (Dixeena Lopez) Date: Thu, 9 Aug 2018 14:30:34 +0530 Subject: [scikit-learn] Fwd: BIC using GMM.fit and GaussianMixture.fit() In-Reply-To: References: Message-ID: ---------- Forwarded message ---------- From: Dixeena Lopez Date: 2 August 2018 at 23:53 Subject: BIC using GMM.fit and GaussianMixture.fit() To: scikit-learn at python.org

Dear Sir/Madam, I have tried to fit the data using GaussianMixture.fit() and GMM.fit() and calculated the BIC score. The BIC score value and the number of clusters I got using each method are different. Do you have any idea? Sincerely, Dixeena Sent with Mailtrack -------------- next part -------------- An HTML attachment was scrubbed...

URL: From tahoangtrung at gmail.com Thu Aug 9 06:46:42 2018 From: tahoangtrung at gmail.com (hoang trung Ta) Date: Thu, 9 Aug 2018 19:46:42 +0900 Subject: [scikit-learn] Re: Using GPU in scikit learn In-Reply-To: References: Message-ID:

Thank you very much for all of your information. Now I understand more about Scikit learn. On Thu, Aug 9, 2018 at 4:35 PM, blacklabel29 wrote: > Hi, > > > the scikit-learn random forest does not support GPUs. > > If you want to do image classification using GPU processing, the standard > way in this day and age is to use a neural network library like > TensorFlow/keras or pytorch. > > GPUs can be faster than CPUs when the task is SIMD (single instruction > multiple data), meaning the same calculation is done many times just on > different datapoints. Neural networks are well-suited for such an > architecture, decision trees not so much (even though there have been > attempts to speed up decision trees using GPUs). > > So my advice to you depends on how much time you have: If you are willing > to invest time to learn about neural networks and the aforementioned > libraries, then that is certainly a very valuable skill, especially when > looking for a job later on. But if you just need to get your paper done as > soon as possible, stick with random forest. > > > Greetings, > Patrick > > > > Sent from my Samsung Galaxy smartphone. > -------- Original message -------- > From: hoang trung Ta > Date: 2018/8/9 02:50 (GMT+01:00) > To: scikit-learn at python.org > Subject: [scikit-learn] Using GPU in scikit learn > > Dear all members, > > I am using Random forest for classification satellite images. I have a > bunch of images, thus the processing is quite slow.
I searched on the > Internet and they said that GPU can accelerate the process. > > I have GPU NDVIA Geforce GTX 1080 Ti installed in the computer > > Do you know how to use GPU in Scikit learn, I mean the packages to use and > sample code that used GPU in random forest classification? > > Thank you very much > > -- > *Ta Hoang Trung (Mr)* > > *Master student* > Graduate School of Life and Environmental Sciences > University of Tsukuba, Japan > > Mobile: +81 70 3846 2993 > Email : ta.hoang-trung.xm at alumni.tsukuba.ac.jp > tahoangtrung at gmail.com > s1626066 at u.tsukuba.ac.jp > > *----* > *Mapping Technician* > Department of Surveying and Mapping Vietnam > No 2, Dang Thuy Tram street, Hanoi, Viet Nam > > Mobile: +84 1255151344 > Email : tahoangtrung at gmail.com > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- *Ta Hoang Trung (Mr)* *Master student* Graduate School of Life and Environmental Sciences University of Tsukuba, Japan Mobile: +81 70 3846 2993 Email : ta.hoang-trung.xm at alumni.tsukuba.ac.jp tahoangtrung at gmail.com s1626066 at u.tsukuba.ac.jp *----* *Mapping Technician* Department of Surveying and Mapping Vietnam No 2, Dang Thuy Tram street, Hanoi, Viet Nam Mobile: +84 1255151344 Email : tahoangtrung at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From dixeenalopez at gmail.com Thu Aug 9 21:35:25 2018 From: dixeenalopez at gmail.com (Dixeena Lopez) Date: Fri, 10 Aug 2018 07:05:25 +0530 Subject: [scikit-learn] GMM.fit and GaussianMixture.fit() Message-ID: Dear Sir/Madam, I have used GMM.fit() instead of GaussianMixture.fit() and got different answers. Please gives the advantage and disadvantage of these two. Please reply fast ?Diixeena -------------- next part -------------- An HTML attachment was scrubbed... URL: From prat2 at umbc.edu Thu Aug 9 22:15:19 2018 From: prat2 at umbc.edu (Prathusha Jonnagaddla Subramanyam Naidu) Date: Thu, 9 Aug 2018 22:15:19 -0400 Subject: [scikit-learn] DBSCAN Message-ID: Hi everyone, I'm trying to cluster 14000 samples using DBSCAN and want to know if there is a way to display the index of each data point along with it's label. I'm only able to access labels in the form of a list . When I look at the graph and see outliers (black points) , I'm not able to pinpoint as to which image/data that particular point belongs to. Thank you -- Regards, Prathusha JS Naidu -------------- next part -------------- An HTML attachment was scrubbed... URL: From mailfordebu at gmail.com Sun Aug 12 08:16:10 2018 From: mailfordebu at gmail.com (Debabrata Ghosh) Date: Sun, 12 Aug 2018 17:46:10 +0530 Subject: [scikit-learn] Unable to connect HDInsight hive to python Message-ID: Hi All, Greetings ! Wish you are doing good ! 
I am just reaching out to you in case if you have any answer or help me direct to the right forum please: We are facing with hive connectivity from the python on Azure HDinsights, We have installed required SASL,thrift_sasl(0.2.1) and Thirft (0.9.3) packages on Ubuntu , but some how when we are trying to connect Hive using following packages we are getting errors , It would be really great help if you could provide some pointers based on your experience Example 1: from impala.dbapi import connect conn=connect(host="localhost", port=10001 , auth_mechanism="PLAIN", user="admin", password="PWD") (tried both 127.0.0.1:10000/10001) Example 2: import pyhs2 conn = pyhs2.connect(host='localhost ', port=10000,authMechanism="PLAIN", user='admin', password=,database='default') Example 3: from pyhive import hive conn = hive.Connection(host="localhost", port=10001, username="admin", password=None, auth='NONE') Across all of the above examples we are getting the error message: thrift.transport.TTransport.TTransportException: Tsocket read 0 bytes Thanks, Debu -------------- next part -------------- An HTML attachment was scrubbed... URL: From mail at sebastianraschka.com Sun Aug 12 19:23:15 2018 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Sun, 12 Aug 2018 18:23:15 -0500 Subject: [scikit-learn] Unable to connect HDInsight hive to python In-Reply-To: References: Message-ID: Hi Debu, since Azure HDInsights is a commercial service, their customer support should handle questions like this > On Aug 12, 2018, at 7:16 AM, Debabrata Ghosh wrote: > > Hi All, > Greetings ! Wish you are doing good ! I am just reaching out to you in case if you have any answer or help me direct to the right forum please: > > We are facing with hive connectivity from the python on Azure HDinsights, We have installed required SASL,thrift_sasl(0.2.1) and Thirft (0.9.3) packages on Ubuntu , but some how when we are trying to connect Hive using following packages we are getting errors , It would be really great help if you could provide some pointers based on your experience > > Example 1: from impala.dbapi import connect conn=connect(host="localhost", port=10001 , auth_mechanism="PLAIN", user="admin", password="PWD") (tried both 127.0.0.1:10000/10001) > > Example 2: > > import pyhs2 conn = pyhs2.connect(host='localhost ', port=10000,authMechanism="PLAIN", user='admin', password=,database='default') > > Example 3: > > from pyhive import hive conn = hive.Connection(host="localhost", port=10001, username="admin", password=None, auth='NONE') > > Across all of the above examples we are getting the error message: thrift.transport.TTransport.TTransportException: Tsocket read 0 bytes > > Thanks, > Debu > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From joel.nothman at gmail.com Tue Aug 14 18:11:57 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Wed, 15 Aug 2018 08:11:57 +1000 Subject: [scikit-learn] GMM.fit and GaussianMixture.fit() In-Reply-To: References: Message-ID: gmm is deprecated because it did inappropriate things. Use GaussianMixture On Fri, 10 Aug 2018 11:37 am Dixeena Lopez, wrote: > Dear Sir/Madam, > > I have used GMM.fit() instead of GaussianMixture.fit() and got different > answers. Please gives the advantage and disadvantage of these two. 
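For reference, a sketch of the replacement call (X here is placeholder data; the two implementations use different initialization and convergence defaults, so scores can legitimately differ):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    X = np.random.RandomState(0).randn(500, 2)   # stand-in data

    gm = GaussianMixture(n_components=3, covariance_type='full',
                         random_state=0).fit(X)
    print(gm.bic(X))   # BIC, as in the earlier comparison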
Please > reply fast > > > ?Diixeena > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Tue Aug 21 09:05:09 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Tue, 21 Aug 2018 09:05:09 -0400 Subject: [scikit-learn] Using GPU in scikit learn In-Reply-To: References: Message-ID: <8b52f003-ee47-5856-bcaa-e7135158e021@gmail.com> My lab had a GPU library for Random Forests on Images. It was pretty fast but probably needs some updates: https://github.com/deeplearningais/curfil On 8/8/18 10:01 PM, Tommy Tracy wrote: > Dear Ta Hoang, > > Accelerating decision tree ensembles (including Random Forest) is > actually a current area of computer architecture research; in fact it > is a principle component of my dissertation. Like Sebastian Raschka > said, the GPU is not an ideal architecture for decision tree inference > because at its core it is a pointer-chasing algorithm (low computation > per memory access) that shows low memory locality. Scikit-Learn has > done an excellent job with their von Neumann implementation utilizing > things like predication and vectorization. If you're looking to go > beyond what the CPU can give you, I would point you to FPGAs. If > you're interested in discussing this further, let me know. > > -- > -- > ? ?????? Sincerely, > Tommy James Tracy II > ???? Ph.D Candidate > Computer Engineering > UniversityofVirginia > > > On Wed, Aug 8, 2018 at 8:50 PM, hoang trung Ta > wrote: > > Dear all members, > > I am using?Random forest for classification?satellite images. I > have a bunch of images, thus the processing is quite slow. I > searched on the Internet and they said that GPU can accelerate the > process. > > I have GPU NDVIA Geforce GTX 1080 Ti installed in the computer > > Do you know how to use GPU in Scikit learn, I mean the packages to > use and sample code that used GPU in random forest classification? > > Thank you very much > > -- > *Ta Hoang Trung (Mr)* > * > * > /Master student/ > Graduate School of Life and Environmental Sciences > University of Tsukuba, Japan > > Mobile: ?+81 70 3846 2993 > Email : ta.hoang-trung.xm at alumni.tsukuba.ac.jp > > tahoangtrung at gmail.com > s1626066 at u.tsukuba.ac.jp > *---- > * > /Mapping Technician/ > Department of Surveying and Mapping Vietnam > No 2, Dang Thuy Tram street,?Hanoi, Viet Nam > > Mobile: +84 1255151344 > Email : tahoangtrung at gmail.com > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From herbold at cs.uni-goettingen.de Wed Aug 22 07:12:51 2018 From: herbold at cs.uni-goettingen.de (Steffen Herbold) Date: Wed, 22 Aug 2018 13:12:51 +0200 Subject: [scikit-learn] Smoke and Metamorphic Testing of scikit-learn Message-ID: <58887735-dbfa-b26f-fee7-c5cb0290b24e@cs.uni-goettingen.de> Dear developers, I am writing you because I applied an approach for the automated testing of classification algorithms to scikit-learn and would like to forward the results to you. The approach is a combination of smoke testing and metamorphic testing. 
From herbold at cs.uni-goettingen.de Wed Aug 22 07:12:51 2018
From: herbold at cs.uni-goettingen.de (Steffen Herbold)
Date: Wed, 22 Aug 2018 13:12:51 +0200
Subject: [scikit-learn] Smoke and Metamorphic Testing of scikit-learn
Message-ID: <58887735-dbfa-b26f-fee7-c5cb0290b24e@cs.uni-goettingen.de>

Dear developers,

I am writing to you because I applied an approach for the automated testing of classification algorithms to scikit-learn and would like to forward the results to you.

The approach is a combination of smoke testing and metamorphic testing. The smoke tests try to find problems by executing the training and prediction functions of classifiers with different data. These smoke tests should ensure the basic functioning of classifiers. I defined 20 different data sets, some very simple (uniform features in [0,1]), some with extreme distributions, e.g., data close to machine precision. The metamorphic tests determine whether classification results change as expected when the training data is modified, e.g., by reordering features, flipping class labels, or reordering instances.

I generated 70 different Python unittest tests for eleven different scikit-learn classifiers. In summary, I found the following potential problems:
- Two errors due to possibly infinite loops in LogisticRegression for data that approaches MAXDOUBLE.
- The classifications of LogisticRegression, MLPClassifier, QuadraticDiscriminantAnalysis, and SVM with a polynomial kernel changed if one was added to each feature value.
- The classifications of DecisionTreeClassifier, LogisticRegression, MLPClassifier, QuadraticDiscriminantAnalysis, RandomForestClassifier, and SVM with a linear and a polynomial kernel were not inverted when all binary class labels were flipped.
- The classifications of LogisticRegression, MLPClassifier, QuadraticDiscriminantAnalysis, and RandomForestClassifier sometimes changed when the features were reordered.
- The classifications of KNeighborsClassifier, MLPClassifier, QuadraticDiscriminantAnalysis, RandomForestClassifier, and SVM with a linear kernel sometimes changed when the instances were reordered.

You can find details of our results online [1]. The provided resources include the current draft of the paper that describes the tests, as well as the detailed results. Moreover, we provide an executable test suite with all tests we executed, as well as an export of our test results as an XML file that contains all details of the test execution, including stack traces in case of exceptions. The preprint and online materials also contain the results for two other machine learning libraries, i.e., Weka and Spark MLlib. Additionally, you can find the atoml tool used to generate the tests on GitHub [2].

I hope that these tests may help with the future development of scikit-learn. You could help me a lot by answering the following questions:
- Do you consider the tests helpful?
- Are you considering any source code or documentation changes due to our findings?
- Would you be interested in a pull request or any other type of integration of (a subset of) the tests into your project?
- Would you be interested in more such tests, e.g., for the consideration of hyperparameters, other algorithm types like clustering, or more complex algorithm-specific metamorphic tests?

I am looking forward to your feedback.

Best regards,
Steffen Herbold

[1] http://user.informatik.uni-goettingen.de/~sherbold/atoml-results/
[2] https://github.com/sherbold/atoml

--
Dr. Steffen Herbold
Institute of Computer Science, University of Goettingen
Goldschmidtstraße 7, 37077 Göttingen, Germany
mailto. herbold at cs.uni-goettingen.de
tel. +49 551 39-172037
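Steffen's gist, linked later in the thread, contains the generated tests; as a flavor of the approach, here is a minimal hand-written sketch of one metamorphic relation, feature reordering. The toy uniform data and the exact-equality check are illustrative assumptions, not the paper's actual harness:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.RandomState(42)
    X = rng.uniform(size=(100, 5))
    y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

    # Metamorphic relation: permuting the feature columns (consistently
    # in training and prediction) should not change the predicted classes.
    perm = rng.permutation(X.shape[1])
    clf_a = LogisticRegression(random_state=0).fit(X, y)
    clf_b = LogisticRegression(random_state=0).fit(X[:, perm], y)
    pred_a = clf_a.predict(X)
    pred_b = clf_b.predict(X[:, perm])
    print("feature-order invariant:", np.array_equal(pred_a, pred_b))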
From t3kcit at gmail.com Wed Aug 22 11:49:02 2018
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 22 Aug 2018 11:49:02 -0400
Subject: [scikit-learn] Smoke and Metamorphic Testing of scikit-learn
In-Reply-To: <58887735-dbfa-b26f-fee7-c5cb0290b24e@cs.uni-goettingen.de>
References: <58887735-dbfa-b26f-fee7-c5cb0290b24e@cs.uni-goettingen.de>
Message-ID:

Hi Steffen.

Thanks for sharing your analysis. We really need more work in this direction.
I assume you fixed the random states everywhere?

I consider these tests helpful, but not all of your expectations are warranted; it depends on the model.

If you add one to each feature, there is no expectation that results will be the same, except for the tree models. For tree-based models with fixed random states, however, it is expected that reordering features will change the result. For non-convex optimization it is expected that results are not symmetric (i.e., the MLPClassifier will not flip the decision function, because the optimization is initialized in an asymmetric way), and reordering features will also change the result. If using mini-batches (the default), the results will also change when instances are reordered. I assume you didn't test SGDClassifier or any of its derivatives, because it doesn't show up here. Did you test LinearDiscriminantAnalysis?

For the invariance tests it would be interesting to know whether the failures are due to tie-breaking or to numerical issues. There are some numerical issues that are very hard to control, and I'm pretty sure we have asymmetric tie-breaking (multiclass libsvm is "always predict the first class": https://github.com/scikit-learn/scikit-learn/issues/8276).

I would look at QuadraticDiscriminantAnalysis a bit more closely as a consequence of your tests. Maybe check whether the SVM, RF and KNN issues are due to tie-breaking.

We could try to document all the cases where the result will not fulfill these invariances, but I think that might be too much. At some point we need the users to understand what's going on. If you look at the random forest algorithm and you fix the random state, it's obvious that feature order matters.

A big question here is how big the differences are. Some algorithms are randomized (I think the coordinate descent in some of the linear models uses random orders), but the results are expected to be near-identical, independent of the ordering.

Cheers,

Andy

On 8/22/18 7:12 AM, Steffen Herbold wrote:
> Dear developers,
> [snip]
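Andreas's point about fixed random states is easy to see concretely. A minimal sketch under illustrative assumptions (toy data; how large the measured disagreement is will vary with the data set and the scikit-learn version):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.RandomState(0)
    X = rng.uniform(size=(200, 10))
    y = (X.sum(axis=1) > 5).astype(int)

    perm = rng.permutation(X.shape[1])
    rf_a = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
    rf_b = RandomForestClassifier(n_estimators=50, random_state=0).fit(X[:, perm], y)

    # With a fixed random_state, the same random draws are consumed in
    # feature order, so the two forests need not be identical; this just
    # measures how far their predictions drift apart.
    disagree = np.mean(rf_a.predict(X) != rf_b.predict(X[:, perm]))
    print("fraction of differing predictions:", disagree)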
From herbold at cs.uni-goettingen.de Thu Aug 23 07:39:01 2018
From: herbold at cs.uni-goettingen.de (Steffen Herbold)
Date: Thu, 23 Aug 2018 13:39:01 +0200
Subject: [scikit-learn] Smoke and Metamorphic Testing of scikit-learn
In-Reply-To: References: <58887735-dbfa-b26f-fee7-c5cb0290b24e@cs.uni-goettingen.de>
Message-ID: <47fe57e8-d139-844e-b303-79e20adfb584@cs.uni-goettingen.de>

Hi Andy,

thanks for your detailed feedback.

The random states are fixed, and set immediately before calling the fit function. Here is a gist with the code for one smoke test and a metamorphic test [1].

I will run the tests for LinearDiscriminantAnalysis and the SGDClassifier. I somehow missed them when I scanned the documentation.

I know that these problems should sometimes be expected. However, I was actually not sure what to expect, especially after I started to compare the results for different ML libraries. The random forests you brought up are a good example. I also expected them to be dependent on feature/instance order. However, they are not in Weka, only in scikit-learn and Spark MLlib. There are more such examples, like logistic regression, which exhibits different behavior in all three libraries.

I already have a comparison regarding expected differences between machine learning frameworks planned as a topic for future work.

Best,
Steffen

[1] https://gist.github.com/sherbold/570c9399e9bc39dd980d6c2bdbf3b64a

On 22.08.2018 at 17:49, Andreas Mueller wrote:
> Hi Steffen.
>
> Thanks for sharing your analysis. We really need more work in this
> direction.
> [snip]

--
Dr. Steffen Herbold
Institute of Computer Science, University of Goettingen
Goldschmidtstraße 7, 37077 Göttingen, Germany
mailto. herbold at cs.uni-goettingen.de
tel. +49 551 39-172037
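The gist shows the generated form of the tests; as a rough illustration of what a smoke test probes, here is a hand-written sketch. The extreme-value data is an illustrative assumption; max_iter is capped so a misbehaving solver cannot loop forever, and any exception is caught and reported rather than asserted:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Smoke-test style probe: feature values near the float64 maximum.
    big = np.finfo(np.float64).max * 0.5
    X = np.array([[big, -big], [-big, big], [big, big], [-big, -big]])
    y = np.array([0, 1, 0, 1])

    try:
        clf = LogisticRegression(max_iter=50).fit(X, y)
        print("predictions:", clf.predict(X))
    except Exception as exc:
        # Crashes on extreme data are exactly what smoke tests look for.
        print("fit failed:", exc)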
From adwait96bauskar at gmail.com Fri Aug 24 04:45:06 2018
From: adwait96bauskar at gmail.com (Adwait Bauskar)
Date: Fri, 24 Aug 2018 14:15:06 +0530
Subject: [scikit-learn] Regarding Contribution to scikit-learn
Message-ID:

Hello everybody,

I am a university student with a major in Mathematics and Computer Science, interested in contributing to scikit-learn. I do not have any previous experience in open source development. I am willing to contribute to scikit-learn on a regular basis.

I am looking for a mentor who would be ready to guide me in contributing. I do not need any help in setting up the project or any GitHub-related things. I need help mainly in understanding the current code, and a guide to help me along with my code. I have prior experience with Java and C++.

If any developer is interested and has the time to guide me, please feel free to message me at this email. Also, I am sorry if this is not the right channel on which to contact developers.

Thank you,
Adwait
From herbold at cs.uni-goettingen.de Mon Aug 27 09:07:03 2018
From: herbold at cs.uni-goettingen.de (Steffen Herbold)
Date: Mon, 27 Aug 2018 15:07:03 +0200
Subject: [scikit-learn] Smoke and Metamorphic Testing of scikit-learn
In-Reply-To: <47fe57e8-d139-844e-b303-79e20adfb584@cs.uni-goettingen.de>
References: <58887735-dbfa-b26f-fee7-c5cb0290b24e@cs.uni-goettingen.de> <47fe57e8-d139-844e-b303-79e20adfb584@cs.uni-goettingen.de>
Message-ID: <7a94690a-7243-b26e-03d4-8c370cba1f8a@cs.uni-goettingen.de>

Hi Andy,

I now have results for LinearDiscriminantAnalysis and the SGDClassifier. I updated the results online [1].

The LinearDiscriminantAnalysis had
* infinity or NaN values for data that approaches MAXDOUBLE, and
* problems with an internal array size computation in several tests, i.e., for data that is very close to zero and cannot be expressed as 32-bit floats, as well as for data that is all zero.

The SGDClassifier had
* an over/underflow for data that approaches MAXDOUBLE,
* differences in the classifications if we added one to the numeric features, and
* differences in the classifications if we reordered the instances.

Best,
Steffen

[1] http://user.informatik.uni-goettingen.de/~sherbold/atoml-results/test-export-scikit.xml

On 23.08.2018 at 13:39, Steffen Herbold wrote:
> Hi Andy,
>
> thanks for your detailed feedback.
> [snip]

--
Dr. Steffen Herbold
Institute of Computer Science, University of Goettingen
Goldschmidtstraße 7, 37077 Göttingen, Germany
mailto. herbold at cs.uni-goettingen.de
tel. +49 551 39-172037
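The instance-reordering sensitivity of SGDClassifier follows from its sequential updates, as Andreas noted earlier for mini-batch methods. A minimal sketch of that metamorphic relation under illustrative assumptions (toy data; whether the predictions actually differ depends on the data and the version):

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    rng = np.random.RandomState(0)
    X = rng.uniform(size=(300, 5))
    y = (X[:, 0] > 0.5).astype(int)

    order = rng.permutation(len(X))  # reorder the training instances

    # Same seed, same data, different instance order: SGD consumes the
    # samples sequentially, so the fitted weights can legitimately differ.
    clf_a = SGDClassifier(max_iter=5, tol=None, shuffle=False,
                          random_state=0).fit(X, y)
    clf_b = SGDClassifier(max_iter=5, tol=None, shuffle=False,
                          random_state=0).fit(X[order], y[order])
    print("identical predictions:",
          np.array_equal(clf_a.predict(X), clf_b.predict(X)))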
From t3kcit at gmail.com Fri Aug 31 21:26:39 2018
From: t3kcit at gmail.com (Andreas Mueller)
Date: Fri, 31 Aug 2018 21:26:39 -0400
Subject: [scikit-learn] ANN Scikit-learn 0.20rc1 release candidate available
Message-ID: <5a532836-7f9f-b548-2ad3-d0b8a40b3011@gmail.com>

Hey Folks!

I'm happy to announce that the scikit-learn 0.20 release candidate 1 is now available via conda-forge and pip. Please help us by testing this release candidate so we can make sure the final release will go seamlessly!

You can install the release candidate from conda-forge using

    conda install scikit-learn=0.20rc1 -c conda-forge/label/rc -c conda-forge

(please take into account that if you otherwise use the default conda channel, this will pull in some other dependencies from conda-forge).

You can install the release candidate via pip using

    pip install --pre scikit-learn

The documentation for 0.20 is available at http://scikit-learn.org/0.20/ and will move to http://scikit-learn.org/ upon final release.

You can find the release notes with all new features and changes here:
http://scikit-learn.org/0.20/whats_new.html#version-0-20

Thank you for your help in testing the RC, and thank you to everybody who made the release possible!

All the best,
Andy
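A quick way to confirm which build an environment actually picked up after installing the release candidate (standard Python and scikit-learn introspection, nothing release-specific assumed):

    import sklearn
    print(sklearn.__version__)  # should report '0.20rc1' once the RC is installed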