From pengyu.ut at gmail.com Sat Feb 1 13:53:55 2020
From: pengyu.ut at gmail.com (Peng Yu)
Date: Sat, 1 Feb 2020 12:53:55 -0600
Subject: [scikit-learn] The exact formula used to compute the tf-idf
Message-ID:

Hi,

I am trying to understand the exact formula for tf-idf.

    vectorizer = TfidfVectorizer(ngram_range = (1, 1), norm = None)
    wordtfidf = vectorizer.fit_transform(texts)

Given the following 3 documents (id1, id2, id3 are the IDs of the three documents):

    id1 AA BB BB CC CC CC
    id2 AA AA AA AA BB BB BB BB BB DD DD DD DD DD DD
    id3 AA AA AA AA AA AA AA DD DD DD DD DD DD DD DD FF FF FF FF FF FF FF FF FF

The results are the following:

    id1  cc  5.079441541679836
    id1  bb  2.5753641449035616
    id1  aa  1.0
    id2  dd  7.726092434710685
    id2  bb  6.438410362258904
    id2  aa  4.0
    id3  ff  15.238324625039509
    id3  dd  10.301456579614246
    id3  aa  7.0

According to "6.2.3.4. Tf-idf term weighting" on the following page:

https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

For aa, as n = 3 and df = 3, idf(aa) = log((1+n)/(1+df)) + 1 = 1.

But I don't understand why tf-idf(id1, aa) is 1. This implies that tf(id1, aa) is 1, which is just the raw count of aa. Shouldn't the count be divided by the number of terms in the document id1, which would give 1/6 instead of 1?

Thanks.

--
Regards,
Peng

From mail at sebastianraschka.com Sat Feb 1 14:06:38 2020
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Sat, 1 Feb 2020 13:06:38 -0600
Subject: [scikit-learn] The exact formula used to compute the tf-idf
In-Reply-To:
References:
Message-ID:

Hi there,

unfortunately I currently don't have time to walk through your example, but I wrote down how the tf-idf in sklearn works, using some examples, here:

https://github.com/rasbt/pattern_classification/blob/90710922e4f4d7e3f432221b8a4d2ec1dd2d9dc9/machine_learning/scikit-learn/tfidf_scikit-learn.ipynb

(I remember that we used it to write portions of the documentation in sklearn later.)

Best,
Sebastian

> On Feb 1, 2020, at 12:53 PM, Peng Yu wrote:
>
> Hi,
>
> I am trying to understand the exact formula for tf-idf.
> [...]
>
> --
> Regards,
> Peng
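A minimal sketch that reproduces the numbers above (the variable names texts and X are illustrative, not from the thread). The key point: with norm=None, TfidfVectorizer uses the raw in-document count as tf — it never divides by the document length — and with the default smooth_idf=True, idf(t) = ln((1 + n) / (1 + df(t))) + 1, so tf-idf(id1, aa) = 1 * 1 = 1.

    from sklearn.feature_extraction.text import TfidfVectorizer

    texts = [
        "AA BB BB CC CC CC",
        "AA AA AA AA BB BB BB BB BB DD DD DD DD DD DD",
        "AA AA AA AA AA AA AA DD DD DD DD DD DD DD DD FF FF FF FF FF FF FF FF FF",
    ]
    vectorizer = TfidfVectorizer(ngram_range=(1, 1), norm=None)
    X = vectorizer.fit_transform(texts)

    # tf is the raw count: "cc" appears 3 times in id1 and in 1 of the 3 docs,
    # so idf = ln((1 + 3) / (1 + 1)) + 1 = ln(2) + 1 ~ 1.693
    # and tf-idf(id1, cc) = 3 * 1.693 ~ 5.079, matching the output above.
    print(dict(zip(vectorizer.get_feature_names(), X.toarray()[0])))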
From tchyk2001 at yahoo.com Sun Feb 9 15:21:47 2020
From: tchyk2001 at yahoo.com (Paul Chike Ofoche)
Date: Sun, 9 Feb 2020 20:21:47 +0000 (UTC)
Subject: [scikit-learn] Need for multioutput multivariate algorithm for Random Forest in Python (using Mahalanobis distance)
References: <2075352869.868953.1581279707668.ref@mail.yahoo.com>
Message-ID: <2075352869.868953.1581279707668@mail.yahoo.com>

Hello all,

My name is Paul and I am enthused about data science. I have been using Python and other programming languages for close to two years. There is an issue that I have been facing since I began applying Python to the analysis of my research work.

My question has remained unanswered for months. Has nobody else run into the need to work with data where the regression result is a multiple output whose output parameters are correlated with each other? This is called a multi-output multivariate problem. A version of random forest that handles multiple outputs is referred to as the multivariate random forest. It is implemented in the programming language R (see the attached reference documentation).

To date, no such package exists in Python. My question is whether anybody knows how to go about implementing this. The random forest univariate regression case uses the Euclidean distance as the measurement criterion, whereas the multivariate regression case uses the Mahalanobis distance, which takes into account the inter-relationships between the multiple outputs. I have inquired about an equivalent capability in Python for many years, but it has still not been addressed. Such a multivariate random forest mode is very applicable to the type of research and analysis that I do. Could someone help, please?

Thank you,

Paul Ofoche

PS: This is an important need for multivariate output analysis as a technique for solving practical research problems. Here are some posted questions by various other Python users concerning this same issue:

https://datascience.stackexchange.com/questions/21637/code-for-multivariate-random-forest-in-python-r

-------------- next part --------------
A non-text attachment was scrubbed...
Name: MultivariateRandomForest.pdf
Type: application/pdf
Size: 89595 bytes

From ross at cgl.ucsf.edu Sun Feb 9 18:43:40 2020
From: ross at cgl.ucsf.edu (Bill Ross)
Date: Sun, 9 Feb 2020 15:43:40 -0800
Subject: [scikit-learn] Need for multioutput multivariate algorithm for Random Forest in Python (using Mahalanobis distance)
In-Reply-To: <2075352869.868953.1581279707668@mail.yahoo.com>
References: <2075352869.868953.1581279707668.ref@mail.yahoo.com> <2075352869.868953.1581279707668@mail.yahoo.com>
Message-ID: <69f5a802-e987-a0aa-b219-be8191edb800@cgl.ucsf.edu>

Speaking as an ignorant lurker/nonuser of sklearn, the way I see this being handled in neural nets is https://keras.io/examples/cifar10_cnn/

    model.add(Dense(num_classes))
    model.add(Activation('softmax'))

Not sure if that will map to sklearn.
Bill

On 2/9/20 12:21 PM, Paul Chike Ofoche via scikit-learn wrote:
> Hello all,
>
> My name is Paul and I am enthused about data science. [...]
>
> To date, no such package exists in Python. My question is whether anybody
> knows how to go about implementing this. [...] Could someone help, please?
>
> Thank you,
>
> Paul Ofoche
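The snippet Bill quotes is a classification head; for the correlated multi-output regression Paul describes, the analogous Keras model would end in a linear Dense layer with one unit per target and one shared loss. A rough sketch, assuming TensorFlow's bundled Keras, with made-up sizes (n_features and n_targets are illustrative, not from the thread):

    from tensorflow import keras

    n_features, n_targets = 10, 3
    model = keras.Sequential([
        keras.layers.Dense(64, activation="relu", input_shape=(n_features,)),
        keras.layers.Dense(n_targets),  # linear output: one unit per correlated target
    ])
    # a single MSE loss over all targets ties the outputs to shared hidden features
    model.compile(optimizer="adam", loss="mse")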
From ian at ianozsvald.com Wed Feb 12 11:46:20 2020
From: ian at ianozsvald.com (Ian Ozsvald)
Date: Wed, 12 Feb 2020 16:46:20 +0000
Subject: [scikit-learn] ANN: PyDataLondon 2020 Call for Proposals (closing soon on February 21)
Message-ID:

PyDataLondon 2020 runs on May 15-17; we'll have the same three-day, multi-track format as last year [1], with circa 700 attendees. We're keen to have new talks and tutorials on machine learning, data science, data engineering and associated areas. We welcome and encourage first-time speakers; mentorship is available for first-time submitters to help refine your proposal:
https://pydata.org/london2020/cfp/

Topics around scikit-learn, related scikit packages and related packages (e.g. PyMC3, Keras) are all very welcome. Thanks to NumFOCUS there are some travel and diversity grants.

Our Call for Proposals is open until February 21; all proposals will be released to the double-blind review committee simultaneously at that point. This is our second year with a double-blind committee.

In addition to the double-blind review process, the conference offers a creche, prayer and quiet rooms, and an "unconference" track scheduled by participants to increase the diversity of talks and activities, alongside the usual pre-scheduled tracks:
https://pydata.org/london2020/

If you have questions feel free to email me directly and I'll try to help. I'm not on the review committee; I'm one of the founders of the conference series and a continuing volunteer. All of the core PyDataLondon organisers are volunteers.

Thanks, Ian.

[1] https://pydata.org/london2019/schedule/

--
Ian Ozsvald (Data Scientist, PyDataLondon co-chair)
ian at IanOzsvald.com
https://IanOzsvald.com
https://MorConsulting.com
https://twitter.com/IanOzsvald

From t3kcit at gmail.com Thu Feb 13 23:13:37 2020
From: t3kcit at gmail.com (Andreas Mueller)
Date: Thu, 13 Feb 2020 20:13:37 -0800
Subject: [scikit-learn] Need for multioutput multivariate algorithm for Random Forest in Python (using Mahalanobis distance)
In-Reply-To: <2075352869.868953.1581279707668@mail.yahoo.com>
References: <2075352869.868953.1581279707668.ref@mail.yahoo.com> <2075352869.868953.1581279707668@mail.yahoo.com>
Message-ID:

On 2/9/20 12:21 PM, Paul Chike Ofoche via scikit-learn wrote:
> My question has remained unanswered for months. [...] A version of random
> forest that handles multiple outputs is referred to as the multivariate
> random forest. It is implemented in the programming language R (see the
> attached reference documentation).

The scikit-learn random forest actually handles this. It doesn't use the Mahalanobis distance, but that seems like a simple preprocessing step.

> To date, no such package exists in Python. [...] Could someone help, please?

From tchyk2001 at yahoo.com Fri Feb 14 07:37:44 2020
From: tchyk2001 at yahoo.com (Paul Chike Ofoche)
Date: Fri, 14 Feb 2020 12:37:44 +0000 (UTC)
Subject: [scikit-learn] Need for multioutput multivariate algorithm for Random Forest in Python (using Mahalanobis distance)
In-Reply-To:
References: <2075352869.868953.1581279707668.ref@mail.yahoo.com> <2075352869.868953.1581279707668@mail.yahoo.com>
Message-ID: <1915664731.5083146.1581683864251@mail.yahoo.com>

Scikit-learn random forest does *not* handle the multi-output case, but only maps to each output one at a time, thereby not accounting for the correlation between multi-outputs, which is what the Mahalanobis distance does. I, as well as other researchers, have observed this issue for as much as two years. Could there be a solution to implement it in RandomForest, since Python already has a function that computes Mahalanobis distances?

On Thursday, February 13, 2020, 10:15:11 PM CST, Andreas Mueller wrote:
> The scikit-learn random forest actually handles this. It doesn't use the
> Mahalanobis distance, but that seems like a simple preprocessing step.
> [...]

From niourf at gmail.com Fri Feb 14 07:58:18 2020
From: niourf at gmail.com (Nicolas Hug)
Date: Fri, 14 Feb 2020 07:58:18 -0500
Subject: [scikit-learn] Need for multioutput multivariate algorithm for Random Forest in Python (using Mahalanobis distance)
In-Reply-To: <1915664731.5083146.1581683864251@mail.yahoo.com>
References: <2075352869.868953.1581279707668.ref@mail.yahoo.com> <2075352869.868953.1581279707668@mail.yahoo.com> <1915664731.5083146.1581683864251@mail.yahoo.com>
Message-ID: <25c468f2-4b9e-a1a5-4d2a-53398960ec10@gmail.com>

Hi Paul,

The way multioutput is handled in decision trees (and thus in the forests) is described in https://scikit-learn.org/stable/modules/tree.html#multi-output-problems. As you can see, the correlation between the output values *is* taken into account.

Can you explain what you would like to modify there?

Nicolas

On 2/14/20 7:37 AM, Paul Chike Ofoche via scikit-learn wrote:
> Scikit-learn random forest does *not* handle the multi-output case, but
> only maps to each output one at a time [...] Could there be a solution to
> implement it in RandomForest, since Python already has a function that
> computes Mahalanobis distances?
From tchyk2001 at yahoo.com Fri Feb 14 20:47:06 2020
From: tchyk2001 at yahoo.com (Paul Chike Ofoche)
Date: Sat, 15 Feb 2020 01:47:06 +0000 (UTC)
Subject: [scikit-learn] Need for multioutput multivariate algorithm for Random Forest in Python (using Mahalanobis distance)
In-Reply-To: <25c468f2-4b9e-a1a5-4d2a-53398960ec10@gmail.com>
References: <2075352869.868953.1581279707668.ref@mail.yahoo.com> <2075352869.868953.1581279707668@mail.yahoo.com> <1915664731.5083146.1581683864251@mail.yahoo.com> <25c468f2-4b9e-a1a5-4d2a-53398960ec10@gmail.com>
Message-ID: <1478783340.5521193.1581731226647@mail.yahoo.com>

Many thanks, Nicolas and Andreas.

I appreciate your taking the time and effort to look into the issue that I raised and for pointing me to the documentation. It is quite pleasant to know that scikit-learn's RandomForestRegressor handles multioutput cases. This issue has been very important to me and was the sole reason that I switched from Python to R for my research in the Fall of 2018; I have seldom used Python since then.

I got convinced of my earlier stance when reading documentation such as https://scikit-learn.org/stable/modules/multiclass.html#multioutput-regression, which explains that "MultiOutputRegressor fits one regressor per target and cannot take advantage of correlations between targets", although I am aware that this is different from the RandomForestRegressor.

Inline image

I was wondering whether this multioutput handling capability of the RandomForestRegressor has been added recently. To verify, I went on a fact-finding mission by re-running the exact same code I had in 2018 and noticed quite a number of changes. I guess many moons have passed since then!

For instance, sklearn.cross_validation has been deprecated since I last used it in 2018 (and replaced by sklearn.model_selection). Also, such errors as:

i. ValueError: Expected 2D array, got scalar array instead: array=6.5. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

and

ii. DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().

when passing a *scalar* and a *column-vector y* respectively are entirely new since I last made use of Python's RandomForestRegressor. Previously, these worked just fine without throwing any errors. I know that the multioutputs were handled back in 2018 (I actually tested this capability back then), but I assumed that the regressors were fit per target, i.e. that there was no correlation between targets.

Today, for comparison, I generated some random target outputs (three columns) and, using the same *random_state*, ran the all-inclusive multioutput prediction (all three output targets simultaneously) vs. re-running each output prediction one at a time. The results are different, implying that some form of correlation takes place amongst the multioutput targets when they are predicted together. (For completeness, I display the first 28 predicted output values from the multioutput prediction as well as from the single-output predictions.)

Results from the multioutput prediction of the targets (capturing their correlations):

Inline image

Results from the individual prediction of each single output target:

Inline image

For my knowledge's sake, could you please inform me about the technique being employed to take advantage of the correlations between targets? Is it the Mahalanobis distance or some other metric? In other words, could you please give me a hint as to the underlying reason why the single-output predictions differ from the multioutput predictions? I am curious to know, as this would finally quench my appetite after nearly two years. I will have to retrace my steps and get back to the good old Python ways (again). Thank you.

Highest regards,
Paul

On Friday, February 14, 2020, 07:00:35 a.m. CST, Nicolas Hug wrote:
> Hi Paul,
>
> The way multioutput is handled in decision trees (and thus in the forests)
> is described in
> https://scikit-learn.org/stable/modules/tree.html#multi-output-problems.
> As you can see, the correlation between the output values *is* taken into
> account.
>
> Can you explain what you would like to modify there?
> [...]

From niourf at gmail.com Sat Feb 15 08:54:29 2020
From: niourf at gmail.com (Nicolas Hug)
Date: Sat, 15 Feb 2020 08:54:29 -0500
Subject: [scikit-learn] Need for multioutput multivariate algorithm for Random Forest in Python (using Mahalanobis distance)
In-Reply-To: <1478783340.5521193.1581731226647@mail.yahoo.com>
References: <2075352869.868953.1581279707668.ref@mail.yahoo.com> <2075352869.868953.1581279707668@mail.yahoo.com> <1915664731.5083146.1581683864251@mail.yahoo.com> <25c468f2-4b9e-a1a5-4d2a-53398960ec10@gmail.com> <1478783340.5521193.1581731226647@mail.yahoo.com>
Message-ID:

> For my knowledge's sake, could you please inform me about the technique
> being employed to take advantage of the correlations between targets? Is
> it the Mahalanobis distance or some other metric? In other words, could
> you please give me a hint as to the underlying reason why the
> single-output predictions differ from the multioutput predictions?

I don't know much more than what's already in the doc that I linked to. Namely, the best split is chosen to minimize the *average* criterion across all outputs, instead of just using a single output. You'll find more details in the code.
About the docs: we generally try to write all the useful info about the estimators in the "User Guide" section (https://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees). In this case you can find a link there to the multi-output handling. Sometimes the info is instead in the docstrings. That's not always perfect, though, and the link might not have been there when you first looked. We're working hard to keep improving the docs, but there's so much info that it's easy to miss some...

Welcome back to Python!

On 2/14/20 8:47 PM, Paul Chike Ofoche via scikit-learn wrote:
> Many thanks, Nicolas and Andreas.
> [...]
> For my knowledge's sake, could you please inform me about the technique
> being employed to take advantage of the correlations between targets?
> [...]

From t3kcit at gmail.com Mon Feb 17 18:34:00 2020
From: t3kcit at gmail.com (Andreas Mueller)
Date: Mon, 17 Feb 2020 18:34:00 -0500
Subject: [scikit-learn] Need for multioutput multivariate algorithm for Random Forest in Python (using Mahalanobis distance)
In-Reply-To: <1478783340.5521193.1581731226647@mail.yahoo.com>
References: <2075352869.868953.1581279707668.ref@mail.yahoo.com> <2075352869.868953.1581279707668@mail.yahoo.com> <1915664731.5083146.1581683864251@mail.yahoo.com> <25c468f2-4b9e-a1a5-4d2a-53398960ec10@gmail.com> <1478783340.5521193.1581731226647@mail.yahoo.com>
Message-ID: <611d1291-ba01-8dd0-7557-907225091506@gmail.com>

On 2/14/20 5:47 PM, Paul Chike Ofoche via scikit-learn wrote:
> Many thanks, Nicolas and Andreas.
> [...]
> For instance, sklearn.cross_validation has been deprecated since I last
> used it in 2018 (and replaced by sklearn.model_selection). Also, such
> errors as [the ValueError for a scalar input and the
> DataConversionWarning for a column-vector y] are entirely new since I
> last made use of Python's RandomForestRegressor.

All of these were errors in 2018 already; you might not have had the most up-to-date version then ;) cross_validation was deprecated in 2016: https://scikit-learn.org/dev/whats_new/v0.18.html#version-0-18

> I know that the multioutputs were handled back in 2018 (I actually tested
> this capability back then), but I assumed that the regressors were fit
> per target, i.e. that there was no correlation between targets.

I can't find a changelog entry, but I'm pretty sure this goes back to 2014 or so. It was definitely present in 2018.

> For my knowledge's sake, could you please inform me about the technique
> being employed to take advantage of the correlations between targets? Is
> it the Mahalanobis distance or some other metric?

It doesn't explicitly use the correlation. The splitting criterion is the sum over the splitting criteria over the outputs. That means there's an implicit regularization, as the tree is shared between the targets.
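A small synthetic check of the behaviour described above (the data here is invented for illustration, not from the thread): one forest fit on a 2-D target shares its trees across outputs, so its predictions generally differ from three forests fit one target at a time.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.RandomState(0)
    X = rng.rand(200, 5)
    Y = np.column_stack([X[:, 0], 2 * X[:, 0], X[:, 1]]) + 0.1 * rng.randn(200, 3)

    # One forest on the 2-D target: each split minimizes the criterion summed
    # over all three outputs, so the trees are shared between the targets.
    joint = RandomForestRegressor(random_state=0).fit(X, Y).predict(X[:5])

    # Three forests, one per target: each tree only ever sees its own output.
    single = np.column_stack([
        RandomForestRegressor(random_state=0).fit(X, Y[:, j]).predict(X[:5])
        for j in range(3)
    ])

    print(np.allclose(joint, single))  # typically False: the trees differ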
> I can't find a changelog entry but pretty sure this goes back to 2014 or so. Definitely it was present in 2018. > > Today, for comparison, I generated some random target outputs (three > columns) and using the same *random_state*, I ran the all-inclusive > multioutput prediction (with all three output targets simultaneously > vs. re-running each output prediction one at a time). The results are > different, implying that some form of correlation takes place amongst > the multioutput targets, when predicted together. (For completeness, I > display the first 28 predicted output values, from the multioutput > prediction as well as the single output predictions. > > > > > For my knowledge?s sake, could you please inform me about the > technique being employed now to take advantage of the correlations > between targets? Is it the Mahalanobis distance or some other metric? > In other words, could you please give me a hint as to the underlying > reason why the single output predictions differ from the multioutput > predictions? I am curious to know as this would finally fully quench > my appetite after nearly two years. I will have to retrace my steps > and get back to the good old Python ways (again). Thank you. > It doesn't explicitly use the correlation. The splitting criterion is is the sum over the splitting criteria over the outputs. That means there's an implicit regularization as the tree is shared between the targets. -------------- next part -------------- An HTML attachment was scrubbed... URL: From adrin.jalali at gmail.com Wed Feb 19 11:52:04 2020 From: adrin.jalali at gmail.com (Adrin) Date: Wed, 19 Feb 2020 17:52:04 +0100 Subject: [scikit-learn] SLEP013 VOTE: n_features_out_ Message-ID: Hi, SLEP013 proposes to add an "n_features_out_" attribute to track the number of output features to all estimators. Please cast your vote on this PR which proposes to accept the SLEP. Final comments are more than welcome. Following is the short version of the SLEP. Regards, Adrin. Abstract ######## This SLEP proposes the introduction of a public ``n_features_out_`` attribute for most transformers (where relevant). Motivation ########## Knowing the number of features that a transformer outputs is useful for inspection purposes. This is in conjunction with `*SLEP010: ``n_features_in_``* < https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep010/proposal.html >`_. Solution ######## The proposed solution is for the ``n_features_out_`` attribute to be set once a call to ``fit`` is done. In many cases the value of ``n_features_out_`` is the same as some other attribute stored in the transformer, *e.g.* ``n_components_``, and in these cases a ``Mixin`` such as a ``ComponentsMixin`` can delegate ``n_features_out_`` to those attributes. -------------- next part -------------- An HTML attachment was scrubbed... URL: From stephenwoodbridge37 at gmail.com Wed Feb 19 14:31:38 2020 From: stephenwoodbridge37 at gmail.com (Stephen Woodbridge) Date: Wed, 19 Feb 2020 14:31:38 -0500 Subject: [scikit-learn] Need help getting started with raster pattern learning Message-ID: Hi all, I think I need some help with getting started to use scikit for a project. Having a basic strategy would be helpful. I've been reading the docs and examples and can see how various tools might apply, but more questions than answers at the moment. So my goal is to take various observations at locations in the ocean and then using satellite data at those locations, use that as training data. 
From johan.mazel at ssi.gouv.fr Thu Feb 20 08:30:07 2020
From: johan.mazel at ssi.gouv.fr (Johan Mazel)
Date: Thu, 20 Feb 2020 14:30:07 +0100
Subject: [scikit-learn] Domain adaptation and cross-validation
Message-ID:

Hello,

I am working on a binary classification task using machine learning. One class, C0, is built upon public data called D0. The other class, C1, is made of data named D1 that I generated. I actually generated two datasets, D1a and D1b, with slightly different parameters.

I would like to evaluate the domain adaptation of a model in the context of D0, D1a and D1b. That is, I would like to train a model on data from D0 and D1a, and then test its performance on D0 and D1b.

I plan to perform k-fold cross-validation on D0 and bootstrapping on D1a and D1b. For example, at each iteration, I would build the training data from k-1 folds of D0 plus bootstrapped data from D1a. The testing data would be built from the single remaining fold of D0 plus bootstrapped data from D1b.

I was thinking of using a class similar to those in sklearn.model_selection (e.g. KFold or StratifiedKFold) to perform the method described above. The init function of the class would take the KFold/StratifiedKFold and bootstrapping parameters. The split function inside this class would be generic enough to handle many datasets for either D0, D1a or D1b. This function's parameters would be the usual X and y for data and targets, along with information about the structure of X. This additional information would be propagated from the fit function in a similar way as the optional groups parameter is passed to group-related split functions in classes such as GroupKFold. Here X would be the concatenation of train-test datasets (i.e. D0-like), train-only datasets (i.e. D1a-like) and test-only datasets (i.e. D1b-like); y would be built in a similar way. The additional parameters could thus be two tuples giving the start and end indexes of the train-only datasets (D1a-like) and the test-only datasets (D1b-like). These values would allow the split function to operate properly on X and y by taking the boundaries between dataset types into account when building folds and performing bootstrapping.

As far as I know, I cannot perform such a procedure with scikit-learn using the functions in sklearn.model_selection (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/model_selection/_split.py). Did I miss something? Maybe somewhere else in the code? If there is no implementation in scikit-learn, would you be interested in a pull request with such a function?

Best regards,
Johan Mazel
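scikit-learn's cross-validation utilities (cross_validate, GridSearchCV, ...) accept any object exposing split and get_n_splits, so a custom splitter along the lines described above can be passed as cv= without changing sklearn.model_selection itself. A rough sketch under the stated layout assumption that X is the concatenation [D0, D1a, D1b]; every name here is hypothetical, not an existing scikit-learn API:

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    class DomainAdaptationSplit:
        def __init__(self, n_splits, d1a, d1b, n_boot, random_state=None):
            self.n_splits = n_splits        # k for the D0 folds
            self.d1a = d1a                  # (start, end) of the train-only block
            self.d1b = d1b                  # (start, end) of the test-only block
            self.n_boot = n_boot            # bootstrap sample size
            self.random_state = random_state

        def get_n_splits(self, X=None, y=None, groups=None):
            return self.n_splits

        def split(self, X, y, groups=None):
            rng = np.random.RandomState(self.random_state)
            d0 = np.arange(self.d1a[0])     # D0 is the leading block of X
            a = np.arange(*self.d1a)        # D1a indices (train only)
            b = np.arange(*self.d1b)        # D1b indices (test only)
            skf = StratifiedKFold(n_splits=self.n_splits)
            for tr, te in skf.split(d0, y[d0]):
                # sampling with replacement = bootstrap
                train = np.concatenate([d0[tr], rng.choice(a, self.n_boot)])
                test = np.concatenate([d0[te], rng.choice(b, self.n_boot)])
                yield train, test

It would then be used as, e.g., cross_validate(clf, X, y, cv=DomainAdaptationSplit(5, (n0, n0 + n1a), (n0 + n1a, n0 + n1a + n1b), n_boot=500)).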
From niourf at gmail.com Thu Feb 20 18:06:15 2020
From: niourf at gmail.com (Nicolas Hug)
Date: Thu, 20 Feb 2020 18:06:15 -0500
Subject: [scikit-learn] Monthly meetings
Message-ID:

Hi all,

The next scikit-learn monthly meeting will take place on Monday at the usual time (https://www.timeanddate.com/worldclock/meetingdetails.html?year=2020&month=2&day=24&hour=12&min=0&sec=0&p1=240&p2=33&p3=37&p4=179&p5=195).

While these meetings are mainly for core devs to discuss the current topics, we're also happy to welcome non-core devs and other projects' maintainers! Feel free to join.

*Location has changed:*

Join Zoom Meeting
https://anaconda.zoom.us/j/947129165?pwd=dEFZNHM0ZFBiQWlDYlJlRW1EaHg2QT09

Meeting ID: 947 129 165
Password: 586745

*@core devs, please make sure to update your notes on Friday*

Thanks,

Nicolas

From g.lemaitre58 at gmail.com Fri Feb 21 11:15:31 2020
From: g.lemaitre58 at gmail.com (Guillaume Lemaître)
Date: Fri, 21 Feb 2020 17:15:31 +0100
Subject: [scikit-learn] Monthly meetings
In-Reply-To:
References:
Message-ID:

Thanks, Nicolas, for the reminder. I will prepare a sort of agenda for the meeting, which I will post before the meeting.

On Fri, 21 Feb 2020 at 00:08, Nicolas Hug wrote:
> Hi all,
>
> The next scikit-learn monthly meeting will take place on Monday at the
> usual time. [...]
>
> Thanks,
>
> Nicolas

--
Guillaume Lemaitre
Scikit-learn @ Inria Foundation
https://glemaitre.github.io/
From g.lemaitre58 at gmail.com Fri Feb 21 18:12:43 2020
From: g.lemaitre58 at gmail.com (Guillaume Lemaître)
Date: Sat, 22 Feb 2020 00:12:43 +0100
Subject: [scikit-learn] Monthly meetings
In-Reply-To:
References:
Message-ID:

Hi all,

I attached the notes that I prepared: notes

We might have to prioritize if we want to fit the meeting into an hour.

Cheers,

On Fri, 21 Feb 2020 at 17:15, Guillaume Lemaître wrote:
> Thanks, Nicolas, for the reminder. I will prepare a sort of agenda for
> the meeting, which I will post before the meeting.
> [...]

--
Guillaume Lemaitre
Scikit-learn @ Inria Foundation
https://glemaitre.github.io/

From andrea.moglia at endocas.org Sun Feb 23 04:11:01 2020
From: andrea.moglia at endocas.org (Andrea Moglia)
Date: Sun, 23 Feb 2020 10:11:01 +0100
Subject: [scikit-learn] Nonlinear partial least square (PLS)
Message-ID: <619801D0-D32B-4F69-9B04-F6D26FDD76E6@endocas.org>

Hi,
I need to use nonlinear partial least squares (PLS), such as kernel PLS. Do you know how to do it in scikit-learn?
Thank you in advance.
Andrea
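scikit-learn's cross_decomposition module only implements linear PLS, so there is no built-in kernel PLS. One hedged workaround sketch — an approximation, not the kernel-PLS algorithm from the literature, and all data here is synthetic — is to map the inputs through an approximate RBF kernel feature map with Nystroem, then run ordinary linear PLS in that feature space:

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.kernel_approximation import Nystroem
    from sklearn.pipeline import make_pipeline

    rng = np.random.RandomState(0)
    X = rng.rand(100, 10)
    y = np.sin(3 * X[:, 0]) + 0.1 * rng.randn(100)

    # Nystroem approximates the RBF kernel feature map; PLS then extracts
    # linear latent components in that (nonlinear) feature space.
    model = make_pipeline(
        Nystroem(kernel="rbf", n_components=50, random_state=0),
        PLSRegression(n_components=5),
    )
    model.fit(X, y)
    print(model.predict(X[:3]).ravel())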
From joel.nothman at gmail.com Sun Feb 23 16:23:32 2020
From: joel.nothman at gmail.com (Joel Nothman)
Date: Mon, 24 Feb 2020 08:23:32 +1100
Subject: [scikit-learn] Monthly meetings
In-Reply-To:
References:
Message-ID:

Helpful, thank you Guillaume

On Sat., 22 Feb. 2020, 10:14 am Guillaume Lemaître, wrote:
> Hi all,
>
> I attached the notes that I prepared: notes
>
> We might have to prioritize if we want to fit the meeting into an hour.
> [...]

From adrin.jalali at gmail.com Mon Feb 24 09:08:28 2020
From: adrin.jalali at gmail.com (Adrin)
Date: Mon, 24 Feb 2020 15:08:28 +0100
Subject: [scikit-learn] Monthly meetings
In-Reply-To:
References:
Message-ID:

The meeting notes are now available at:
https://github.com/scikit-learn/administrative/blob/master/meeting_notes/2020-02-24.md

On Fri, Feb 21, 2020 at 12:07 AM Nicolas Hug wrote:
> Hi all,
>
> The next scikit-learn monthly meeting will take place on Monday at the
> usual time. [...]
>
> Thanks,
>
> Nicolas

From adrin.jalali at gmail.com Mon Feb 24 09:33:58 2020
From: adrin.jalali at gmail.com (Adrin)
Date: Mon, 24 Feb 2020 15:33:58 +0100
Subject: [scikit-learn] Conference call regarding feature names
Message-ID:

Hi,

We agreed to have a conference call on March 9th to talk about feature-names issues, pandas in, pandas out, xarray, etc.

Time/Date: https://www.timeanddate.com/worldclock/meetingdetails.html?year=2020&month=3&day=9&hour=12&min=0&sec=0&p1=240&p2=33&p3=37&p4=179&p5=195

Here's the info for the call:

Join Zoom Meeting
https://anaconda.zoom.us/j/158272910?pwd=NURzbWhTSXNYNFhRMVYxRFVybjUrdz09

Meeting ID: 158 272 910
Password: 5867453
Find your local number: https://anaconda.zoom.us/u/adk50yIacN

Best,
Adrin.