[scikit-learn] Can fit a model with a target array of probabilities?

Wed Oct 4 16:09:56 EDT 2017

Hi Stuart.
There is no interface to do this in scikit-learn (and maybe we should add
this to the FAQ).
Yes, in principle this would be possible with several of the models.
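
To make the "in principle" concrete: the logistic log loss is well
defined for any target in [0, 1], so a model can be fit to
probabilities directly. Below is a minimal NumPy sketch, not
scikit-learn code. The name fit_logistic_soft, the step size and the
iteration count are illustrative assumptions, and the data are
synthetic.

```python
import numpy as np


def fit_logistic_soft(X, p, n_iter=5000, lr=0.5):
    """Fit weights w and intercept b by gradient descent on the mean
    cross-entropy between sigmoid(X @ w + b) and targets p in [0, 1]."""
    X = np.asarray(X, dtype=float)
    p = np.asarray(p, dtype=float)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        q = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        grad = q - p  # derivative of the loss w.r.t. the linear predictor
        w -= lr * (X.T @ grad) / len(p)
        b -= lr * grad.mean()
    return w, b


# Synthetic, noiseless fractional targets from a known logistic model.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_w, true_b = np.array([1.0, -2.0]), 0.5
p = 1.0 / (1.0 + np.exp(-(X @ true_w + true_b)))

w, b = fit_logistic_soft(X, p)
```

Nothing in the loop requires p to be exactly 0 or 1; the restriction
in scikit-learn comes from the classifier interface, not from the loss.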

I think statsmodels can do that, and I think I saw another glm package
for Python that does that?

It's certainly a legitimate use-case but would require substantial
changes to the code. I think so far we decided not to support
this in scikit-learn. Basically we don't have a concept of a link
function, and it's a concept that only applies to a subset of models.
We try to have a consistent interface for all our estimators, and
this doesn't really fit well within that interface.
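
A workaround that stays inside the existing classifier interface is
to duplicate every sample, once with label 1 and weight p and once
with label 0 and weight 1 - p. The weighted log loss is then the
cross-entropy against p. This is only a sketch: fit_on_probabilities
is a hypothetical helper, not a scikit-learn function, and the large C
is an assumption made here to keep the default regularization from
dominating.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def fit_on_probabilities(X, p, **kwargs):
    """Hypothetical helper: fit LogisticRegression to targets p in [0, 1]
    by stacking each sample twice with labels 1 and 0, weighted p and 1 - p."""
    X = np.asarray(X, dtype=float)
    p = np.asarray(p, dtype=float)
    n = len(p)
    X_rep = np.vstack([X, X])
    y_rep = np.concatenate([np.ones(n, dtype=int), np.zeros(n, dtype=int)])
    w_rep = np.concatenate([p, 1.0 - p])
    clf = LogisticRegression(**kwargs)
    clf.fit(X_rep, y_rep, sample_weight=w_rep)
    return clf


# Synthetic, noiseless probability targets from a known logistic model.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
p = 1.0 / (1.0 + np.exp(-(X @ np.array([1.0, -2.0]) + 0.5)))

clf = fit_on_probabilities(X, p, C=1e6, max_iter=1000)
p_hat = clf.predict_proba(X)[:, 1]  # column 1 is P(y = 1)
```

The cost is a training set twice as large, and predict() still
returns hard labels, so predict_proba() is the output to use.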

Hth,
Andy

On 10/04/2017 03:58 PM, Stuart Reynolds wrote:
> I'd like to fit a model that maps a matrix of continuous inputs to a
> target that's between 0 and 1 (a probability).
>
> In principle, I'd expect logistic regression to work out of the
> box with no modification (although it's often posed as being strictly
> for classification, its loss function allows for fitting targets
> anywhere in the range 0 to 1, not only exactly zero or one).
>
> However, scikit's LogisticRegression and LogisticRegressionCV reject
> target arrays that are continuous. Other LR implementations allow a
> matrix of probability estimates. Looking at:
> http://scikit-learn-general.narkive.com/4dSCktaM/using-logistic-regression-on-a-continuous-target-variable
> and the fix here:
> https://github.com/scikit-learn/scikit-learn/pull/5084, which disables
> continuous inputs, it looks like there was some reason for this. So
> ... I'm looking for alternatives.
>
> SGDClassifier allows log loss and (if I understood the docs correctly)
> adds a logistic link function, but also rejects continuous targets.
> Oddly, SGDRegressor only allows ‘squared_loss’, ‘huber’,
> ‘epsilon_insensitive’, or ‘squared_epsilon_insensitive’, and doesn't
> seem to offer a logistic loss.
>
> In principle, GLMs allow this, but scikit's docs say its GLM models
> only allow strictly linear functions of their input and don't allow
> a logistic link function. The docs direct people to the
> LogisticRegression class for this case.
>
> In R, there is:
>
> glm(Total_Service_Points_Won/Total_Service_Points_Played ~ ... ,
>      family = binomial(link=logit), weights = Total_Service_Points_Played)
> which would be ideal.
>
> Is something similar available in scikit? (Or any continuous model
> that takes a 0 to 1 target and outputs a 0 to 1 prediction?)
>
> I was surprised to see that the implementation of
> CalibratedClassifierCV(method="sigmoid") uses an internal
> implementation of logistic regression to do its logistic regressing --
> which I can use, although I'd prefer to use a user-facing library.
>
> Thanks,
> - Stuart