[scikit-learn] Confidence Estimation for Regressor Predictions

Dale T Smith Dale.T.Smith at macys.com
Fri Sep 2 08:34:03 EDT 2016


I do not know of any research on confidence intervals for estimators other than those in linear_model and forests of trees. Confidence intervals require knowledge of the underlying distributions. The jackknife and the bootstrap are the most common methods for estimating that information from the data.
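
A minimal sketch of the bootstrap idea for an arbitrary regressor: refit a clone of the estimator on resampled training rows and take percentiles of the resulting predictions. The helper name bootstrap_prediction_band, the choice of SVR, and the synthetic data are assumptions made for illustration, not part of any scikit-learn API. The band reflects how much the fitted model varies with the training sample; it is not a prediction interval for new observations.

```python
import numpy as np
from sklearn.base import clone
from sklearn.svm import SVR
from sklearn.utils import resample


def bootstrap_prediction_band(estimator, X, y, X_new, n_boot=200, alpha=0.05, seed=0):
    """Percentile band of predictions across bootstrap refits (hypothetical helper)."""
    rng = np.random.RandomState(seed)
    preds = np.empty((n_boot, X_new.shape[0]))
    for b in range(n_boot):
        # Draw the training rows with replacement, then refit a fresh clone.
        X_b, y_b = resample(X, y, random_state=rng)
        preds[b] = clone(estimator).fit(X_b, y_b).predict(X_new)
    lower = np.percentile(preds, 100 * alpha / 2, axis=0)
    upper = np.percentile(preds, 100 * (1 - alpha / 2), axis=0)
    return lower, upper


# Synthetic data, for illustration only.
data_rng = np.random.RandomState(42)
X = data_rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.2 * data_rng.normal(size=200)
X_new = np.linspace(-3, 3, 5).reshape(-1, 1)

lower, upper = bootstrap_prediction_band(SVR(C=1.0), X, y, X_new, n_boot=50)
for x, lo, hi in zip(X_new[:, 0], lower, upper):
    print("x=%+.1f  band=[%.3f, %.3f]" % (x, lo, hi))
```

Because only fit and predict are used, the same helper should work with any scikit-learn regressor, at the cost of n_boot refits.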

If anyone knows of these techniques being applied more widely in machine learning to measure confidence intervals, please post the references. I think providing these measures in scikit-learn-contrib would give the entire project features that other packages don't have.

Here's an example of the work done on the StatML side, "Distribution-Free Predictive Inference for Regression":

http://www.stat.cmu.edu/~ryantibs/papers/conformal.pdf

Note the use of leave-one-covariate-out to estimate variable importance.
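
A rough sketch of split conformal prediction, one of the distribution-free methods discussed in that paper: fit on one half of the data, compute absolute residuals on the other half, and use a quantile of those residuals as a constant half-width around each prediction. The function name split_conformal_interval, the GradientBoostingRegressor choice, and the synthetic data are assumptions for illustration. The import uses sklearn.model_selection, the newer home of train_test_split.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split


def split_conformal_interval(estimator, X, y, X_new, alpha=0.1, seed=0):
    """Split conformal interval with constant half-width (hypothetical helper)."""
    X_fit, X_cal, y_fit, y_cal = train_test_split(
        X, y, test_size=0.5, random_state=seed
    )
    estimator.fit(X_fit, y_fit)
    # Absolute residuals on the held-out calibration half.
    residuals = np.sort(np.abs(y_cal - estimator.predict(X_cal)))
    n_cal = len(residuals)
    # Rank of the calibration residual used as the half-width.
    k = int(np.ceil((1 - alpha) * (n_cal + 1)))
    half_width = residuals[k - 1] if k <= n_cal else np.inf
    pred = estimator.predict(X_new)
    return pred - half_width, pred + half_width


# Synthetic data, for illustration only.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(600, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.normal(size=600)
X_train, y_train = X[:400], y[:400]
X_test, y_test = X[400:], y[400:]

lo, hi = split_conformal_interval(
    GradientBoostingRegressor(random_state=0), X_train, y_train, X_test, alpha=0.1
)
coverage = np.mean((y_test >= lo) & (y_test <= hi))
print("empirical coverage on held-out points: %.2f" % coverage)
```

The guarantee is marginal (averaged over all points), so the interval has the same width everywhere; the paper also describes locally weighted variants that are not shown here.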

__________________________________________________________________________________________
Dale Smith | Macy's Systems and Technology | IFS eCommerce | Data Science and Capacity Planning
 | 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.smith at macys.com


-----Original Message-----
From: scikit-learn [mailto:scikit-learn-bounces+dale.t.smith=macys.com at python.org] On Behalf Of Jeffrey Levesque via scikit-learn
Sent: Friday, September 2, 2016 12:19 AM
To: Scikit-learn user and developer mailing list
Cc: Jeffrey Levesque
Subject: Re: [scikit-learn] Confidence Estimation for Regressor Predictions

Hi All,

I am also interested in determining a confidence level associated with an SVM or SVR prediction. Is there a nice way to generalize this confidence, regardless of the kernel chosen, for a given SVM or SVR implementation?

Last year I tried the 'predict_proba' method in a generic implementation, and the results were not very good:

- https://github.com/jeff1evesque/machine-learning/issues/1924#issuecomment-159491052

The 'decision_function' performed a little better. But are my examples poor because the sample data is too small for accurate confidence measurements? Would both 'decision_function' and 'predict_proba' improve if my dataset were much larger, or should I customize these methods?
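
A small sketch comparing the two outputs on the classifier SVC, under assumed synthetic data. With probability=True, SVC fits an additional Platt-scaling step using internal cross-validation, which is one reason predict_proba can be unreliable on small datasets, while predict follows the sign of decision_function.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Small synthetic binary problem, for illustration only.
X, y = make_classification(n_samples=60, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# probability=True adds a Platt-scaling step fitted by internal cross-validation.
clf = SVC(kernel="rbf", probability=True, random_state=0).fit(X_train, y_train)

scores = clf.decision_function(X_test)  # signed score relative to the boundary
proba = clf.predict_proba(X_test)       # one column per class, rows sum to 1

# In the binary case, predict() follows the sign of decision_function,
# not the argmax of predict_proba, so the two can occasionally disagree.
pred_from_scores = clf.classes_[(scores > 0).astype(int)]
pred_from_proba = clf.classes_[np.argmax(proba, axis=1)]
print("disagreements between the two:", int(np.sum(pred_from_scores != pred_from_proba)))
```

Both outputs depend on the kernel and its parameters, so neither is a kernel-independent confidence measure.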

Feel free to make any comments on the above GitHub issue. I've spent more time on the web tools of that repository than on understanding the fundamentals of predictions. Forgive me ahead of time.


Thank you,

Jeff Levesque
https://github.com/jeff1evesque

> On Sep 1, 2016, at 5:13 PM, Roman Yurchak <rth.yurchak at gmail.com> wrote:
> 
> Dale, I meant for all the methods in sklearn.linear_model. Linear 
> regression is well known, but for, say, ridge regression it does not 
> look that simple: http://stats.stackexchange.com/a/15417
> Thanks for mentioning the bootstrap method!
> 
> --
> Roman
> 
>> On 01/09/16 21:55, Dale T Smith wrote:
>> Confidence intervals for linear models are well known - see any statistics book or look it up on Wikipedia. You should be able to calculate everything you need for a linear model just from the information the estimator provides. Note that the R-squared returned by the score method in linear_model is the ordinary coefficient of determination, not what statisticians call the adjusted R-squared.
>> 
>> 
>> -----Original Message-----
>> From: scikit-learn 
>> [mailto:scikit-learn-bounces+dale.t.smith=macys.com at python.org] On 
>> Behalf Of Roman Yurchak
>> Sent: Thursday, September 1, 2016 3:45 PM
>> To: Scikit-learn user and developer mailing list
>> Subject: Re: [scikit-learn] Confidence Estimation for Regressor 
>> Predictions
>> 
>> I'm also interested to know if there are any projects similar to scikit-learn-contrib/forest-confidence-interval for linear_model or SVM regressors.
>> 
>> In the general case, I think you could get a quick first-order approximation of the confidence interval for your regressor by taking the standard deviation of the results obtained by fitting on different subsets of your data with a fixed set of estimator parameters, e.g.
>>     cross_validation.cross_val_score( ).std()
>> or some multiple of it (e.g. 2*std). Note that cross_val_score returns one score per fold, so this measures the spread of the score rather than of individual predictions, and it will probably not match the mathematical definition of a confidence interval exactly.
>> --
>> Roman
>> 
>> 
>>> On 01/09/16 20:32, Dale T Smith wrote:
>>> There is a scikit-learn-contrib project with confidence intervals for random forests.
>>> 
>>> https://github.com/scikit-learn-contrib/forest-confidence-interval
>>> 
>>> 
>>> -----Original Message-----
>>> From: scikit-learn 
>>> [mailto:scikit-learn-bounces+dale.t.smith=macys.com at python.org] On 
>>> Behalf Of Daniel Seeliger via scikit-learn
>>> Sent: Thursday, September 1, 2016 2:28 PM
>>> To: scikit-learn at python.org
>>> Cc: Daniel Seeliger
>>> Subject: [scikit-learn] Confidence Estimation for Regressor 
>>> Predictions
>>> 
>>> Dear all,
>>> 
>>> For classifiers I make use of the predict_proba method to compute a Gini coefficient or entropy to get an estimate of how "sure" the model is about an individual prediction.
>>> 
>>> Is there anything similar I could use for regression models? I guess for a RandomForest I could simply use the individual predictions of each tree in clf.estimators_ and compute a standard deviation, but I guess this is not a generic approach I can use for other regressors like the GradientBoostingRegressor or an SVR.
>>> 
>>> Thanks a lot for your help,
>>> Daniel
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>> 