[scikit-learn] Need for multioutput multivariate algorithm for Random Forest in Python (using Mahalanobis distance)

Fri Feb 14 20:47:06 EST 2020

Many thanks Nicolas and Andreas.

I appreciate your taking the time and effort to look into the issue that I raised and for pointing me to the documentation. It is quite pleasant to know that scikit-learn’s RandomForestRegressor handles multioutput cases. This issue has been very important to me and was the sole reason that I switched from Python to R for my research in the Fall of 2018 and have seldom used Python since then.

I became convinced of my earlier stance when reading documentation such as https://scikit-learn.org/stable/modules/multiclass.html#multioutput-regression, which explains that the “MultiOutputRegressor fits one regressor per target and cannot take advantage of correlations between targets”, although I am aware that this is different from the RandomForestRegressor.

I was wondering whether this multioutput handling capability of the RandomForestRegressor has been added recently. In order to verify, I went on a fact-finding mission by re-running the exact same code I had in 2018 and noticed quite a number of changes. I guess that many moons have passed since then!

For instance, sklearn.cross_validation has been deprecated since I last used it in 2018 (and replaced by sklearn.model_selection). Also, such errors as:

i. ValueError: Expected 2D array, got scalar array instead:

array=6.5.

Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

and

ii. DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().

when passing a scalar and a column-vector y, respectively, are entirely new since I last made use of Python’s RandomForestRegressor. Previously, these inputs worked just fine without throwing any errors. I know that the “multioutputs” were handled back in 2018 (I actually tested this capability back then), but I assumed that the regressors were fit per target, i.e. that the correlation between targets was not taken into account.
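For reference, the two messages above can be avoided as follows. This is only a minimal sketch on made-up data (the arrays, sizes and variable names are assumptions for illustration, not the code from 2018): the target is flattened with ravel() before fitting, and the scalar 6.5 is reshaped into a 2D array before predicting.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
# sklearn.model_selection replaces the old sklearn.cross_validation module
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(100, 1))            # one feature, already 2D
y_col = 2.0 * X + rng.normal(size=(100, 1))      # column vector, shape (100, 1)

X_train, X_test, y_train, y_test = train_test_split(X, y_col, random_state=0)

model = RandomForestRegressor(n_estimators=10, random_state=0)
# ravel() turns the (n_samples, 1) column into shape (n_samples,),
# which is what the DataConversionWarning asks for
model.fit(X_train, y_train.ravel())

# A bare scalar is rejected; reshape(-1, 1) makes it one sample with one feature
x_new = np.array(6.5).reshape(-1, 1)
prediction = model.predict(x_new)
print(x_new.shape, prediction.shape)
```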

Today, for comparison, I generated some random target outputs (three columns) and, using the same random_state, I ran the all-inclusive multioutput prediction (with all three output targets simultaneously) and compared it against re-running each output prediction one at a time. The results are different, implying that some form of correlation takes place amongst the multioutput targets when predicted together. (For completeness, I display the first 28 predicted output values from the multioutput prediction as well as the single output predictions.)

Results from the multioutput prediction of the targets (capturing their correlations).

Results from the individual prediction of each single output target.
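The comparison can be reproduced along these lines. This is a sketch with synthetic data rather than the original script: the feature matrix, the sample sizes and the number of trees are all assumptions. One forest is fit on all three targets at once, and three forests with the same random_state are fit on one target each.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 4))        # hypothetical features
Y = rng.normal(size=(200, 3))        # three random target columns
X_new = rng.normal(size=(28, 4))     # 28 points to predict, as in the displayed results

# All three targets predicted together by a single forest
joint = RandomForestRegressor(n_estimators=20, random_state=42).fit(X, Y)
pred_joint = joint.predict(X_new)    # shape (28, 3)

# Each target predicted by its own forest, same random_state every time
pred_single = np.column_stack([
    RandomForestRegressor(n_estimators=20, random_state=42)
    .fit(X, Y[:, j])
    .predict(X_new)
    for j in range(Y.shape[1])
])

# Largest absolute disagreement between the two approaches
print(np.abs(pred_joint - pred_single).max())
```

According to the scikit-learn tree documentation, multi-output trees choose each split with a criterion averaged over all outputs, so a jointly fitted forest generally partitions the feature space differently from separately fitted ones. That alone is enough to make the two sets of predictions differ.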

For my knowledge’s sake, could you please inform me about the technique being employed now to take advantage of the correlations between targets? Is it the Mahalanobis distance or some other metric? In other words, could you please give me a hint as to the underlying reason why the single output predictions differ from the multioutput predictions? I am curious to know as this would finally fully quench my appetite after nearly two years. I will have to retrace my steps and get back to the good old Python ways (again). Thank you.

Highest regards,
Paul

    On Friday, February 14, 2020, 07:00:35 a.m. CST, Nicolas Hug <niourf at gmail.com> wrote:  

Hi Paul,

The way multioutput is handled in decision trees (and thus in the forests) is described in https://scikit-learn.org/stable/modules/tree.html#multi-output-problems. As you can see, the correlation between the output values *is* taken into account.
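As a small illustration of what that page describes, a single tree can be fit on a 2D target array. This is only a sketch on made-up data (the sine and cosine targets are an assumption chosen so that both outputs depend on the same feature): one tree structure is built, and every node stores one value per output.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(150, 2))
# Two targets driven by the same feature, plus a little noise
Y = np.column_stack([np.sin(3 * X[:, 0]), np.cos(3 * X[:, 0])])
Y = Y + 0.05 * rng.normal(size=Y.shape)

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, Y)

print(tree.n_outputs_)             # number of outputs the single tree was fit on
print(tree.predict(X[:5]).shape)   # one row per sample, one column per output
print(tree.tree_.value.shape)      # per-node stored values, second axis indexes the outputs
```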

Can you explain what you would like to modify there?

Nicolas

 On 2/14/20 7:37 AM, Paul Chike Ofoche via scikit-learn wrote:

 Scikit-learn random forest does not handle the multi-output case, but only maps to each output one at a time, thereby not accounting for the correlation between multi-outputs, which is what the Mahalanobis distance does. I, as well as other researchers, have observed this issue for as long as two years. Could there be a solution to implement it in RandomForest, since Python already has a function that computes Mahalanobis distances? 

      On Thursday, February 13, 2020, 10:15:11 PM CST, Andreas Mueller <t3kcit at gmail.com> wrote:  

 On 2/9/20 12:21 PM, Paul Chike Ofoche via scikit-learn wrote:

Hello all,

 My name is Paul and I am enthused about data science. I have been using Python and other programming languages for close to two years. There is an issue that I have been facing since I began applying Python to the analysis of my research work.

 My question has remained unanswered for months. Has anybody not run into the need to work with data whereby the regression results are a multiple output, in which the output parameters are correlated with each other? This is called a multi-output multivariate problem. A version of random forest that handles multiple outputs is referred to as the multivariate random forest. It is implemented in the programming language, R (see attached reference documentation below).
    The scikit-learn random forest actually handles this. It doesn't use the Mahalanobis distance, but that seems like a simple preprocessing step. 

 To date, there exists no such package in Python. My question is whether anybody knows how to go about implementing this. The random forest univariate regression case utilizes the Euclidean distance as the measurement criterion, whereas the multivariate regression case uses the Mahalanobis distance, which takes into account the inter-relationships between the multiple outputs. I have inquired about an equivalent capability in Python for many years, but it has still not been addressed. Such a multivariate random forest mode is very applicable to the type of research and analysis that I do. Could someone help, please? 
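One possible reading of the "preprocessing step" mentioned in the inline reply above is to whiten the targets before fitting, so that squared Euclidean distance between the transformed targets equals squared Mahalanobis distance between the original ones. The following is a sketch under that assumption, not a scikit-learn feature; the synthetic data, the mixing matrix and all variable names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 5))
# Correlated targets: three noisy signals mixed through a fixed matrix
mixing = np.array([[1.0, 0.8, 0.0],
                   [0.0, 1.0, 0.5],
                   [0.0, 0.0, 1.0]])
Y = (X[:, :3] + 0.3 * rng.normal(size=(300, 3))) @ mixing

# Whitening: with cov = L @ L.T, each row becomes z = L^{-1} (y - mu),
# so ||z1 - z2||^2 equals (y1 - y2)^T cov^{-1} (y1 - y2)
mu = Y.mean(axis=0)
cov = np.cov(Y, rowvar=False)
L = np.linalg.cholesky(cov)
Z = np.linalg.solve(L, (Y - mu).T).T

# Fit the usual multi-output forest on the whitened targets
forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, Z)

# Map predictions back to the original target space: y = L z + mu
Y_pred = forest.predict(X[:10]) @ L.T + mu
print(Y_pred.shape)
```

Because the forest averages training targets and the whitening map is affine, transforming the predictions back is consistent with averaging in the original space. Whether this matches the split rule used by the R multivariate random forest package is not something this sketch establishes.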

 Thank you,

 Paul Ofoche

 PS: This is an important need for multivariate output analysis as a technique for solving practical research problems. Here are some posted questions by various other Python users concerning this same issue.

 https://datascience.stackexchange.com/questions/21637/code-for-multivariate-random-forest-in-python-r

 Multi-output regression


  _______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn


-------------- next part --------------
A non-text attachment was scrubbed...
Name: 1581729776861blob.jpg
Type: image/png
Size: 22716 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20200215/8ce2cd0d/attachment-0003.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 1581730391881blob.jpg
Type: image/png
Size: 102106 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20200215/8ce2cd0d/attachment-0004.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 1581730490558blob.jpg
Type: image/png
Size: 317679 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20200215/8ce2cd0d/attachment-0005.png>