[scikit-learn] Can fit a model with a target array of probabilities?

Sean Violante sean.violante at gmail.com
Thu Oct 5 04:24:38 EDT 2017


Hi Stuart

the underlying logistic regression code in scikit-learn (at least for the
non-liblinear solvers) supports sample weights, which would let you do what
you want: for each 'instance' you pass in two rows, one with target 1 and
sample weight Total_Service_Points_Won, and one with target 0 and sample
weight (Total_Service_Points_Played - Total_Service_Points_Won).
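A sketch of that two-row encoding, with made-up data (it assumes a
scikit-learn version whose LogisticRegression.fit accepts sample_weight,
which recent releases do):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: one original instance per player.
X = np.array([[0.2], [1.5], [-0.7]])    # features
won = np.array([40.0, 55.0, 20.0])      # Total_Service_Points_Won
played = np.array([60.0, 80.0, 50.0])   # Total_Service_Points_Played

# Each instance becomes two rows: target 1 weighted by the points won,
# and target 0 weighted by the points lost.
X2 = np.vstack([X, X])
y2 = np.concatenate([np.ones(len(X)), np.zeros(len(X))])
w2 = np.concatenate([won, played - won])

clf = LogisticRegression(solver="lbfgs").fit(X2, y2, sample_weight=w2)
probs = clf.predict_proba(X)[:, 1]  # fitted probabilities, strictly in (0, 1)
```

The fitted probabilities track won/played only up to the L2 regularization
that LogisticRegression applies by default (C=1.0), so increase C if you want
them closer to the raw ratios.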

Unfortunately this support has never been fully exposed; see
https://github.com/scikit-learn/scikit-learn/pull/2784#issuecomment-84734590

I have given it a go myself and ran into problems because, as I recall, the
code is shared with the linear SVC model: logistic regression would work,
but some of the test cases would fail with linear SVC.

[note that there is also a version of the original liblinear code that
supports sample weights]



[I would point out that having a single row per instance rather than two
would be easier - e.g. cross-validation with duplicated rows is a pain]


if you really want to fit a continuous target directly then you probably
want beta regression - an example would be predicting concentrations; the
sample weights then give you the number of times you observed each
concentration
[and you could replace concentration with probability too, e.g. if you
literally had an 'oracle' that gave you the true probability of an instance]




sean

On Wed, Oct 4, 2017 at 10:26 PM, Stuart Reynolds <stuart at stuartreynolds.net>
wrote:

> Hi Andy,
> Thanks -- I'll give statsmodels another go.
> I remember I had some fitting speed issues with it in the past, and
> also some issues related to their models keeping references to the data
> (=disaster for serialization and multiprocessing) -- although that was
> a long time ago.
> - Stuart
>
> On Wed, Oct 4, 2017 at 1:09 PM, Andreas Mueller <t3kcit at gmail.com> wrote:
> > Hi Stuart.
> > There is no interface to do this in scikit-learn (and maybe we should add
> > this to the FAQ).
> > Yes, in principle this would be possible with several of the models.
> >
> > I think statsmodels can do that, and I think I saw another glm package
> > for Python that does that?
> >
> > It's certainly a legitimate use-case but would require substantial
> > changes to the code. I think so far we decided not to support
> > this in scikit-learn. Basically we don't have a concept of a link
> > function, and it's a concept that only applies to a subset of models.
> > We try to have a consistent interface for all our estimators, and
> > this doesn't really fit well within that interface.
> >
> > Hth,
> > Andy
> >
> >
> > On 10/04/2017 03:58 PM, Stuart Reynolds wrote:
> >>
> >> I'd like to fit a model that maps a matrix of continuous inputs to a
> >> target that's between 0 and 1 (a probability).
> >>
> >> In principle, I'd expect logistic regression should work out of the
> >> box with no modification (although it's often posed as being strictly
> >> for classification, its loss function allows for fitting targets in
> >> the range 0 to 1, and not strictly zero or one.)
> >>
> >> However, scikit's LogisticRegression and LogisticRegressionCV reject
> >> target arrays that are continuous. Other LR implementations allow a
> >> matrix of probability estimates. Looking at:
> >>
> >> http://scikit-learn-general.narkive.com/4dSCktaM/using-logistic-regression-on-a-continuous-target-variable
> >> and the fix here:
> >> https://github.com/scikit-learn/scikit-learn/pull/5084, which disables
> >> continuous targets, it looks like there was some reason for this. So
> >> ... I'm looking for alternatives.
> >>
> >> SGDClassifier allows log loss and (if I understood the docs correctly)
> >> adds a logistic link function, but also rejects continuous targets.
> >> Oddly, SGDRegressor only allows ‘squared_loss’, ‘huber’,
> >> ‘epsilon_insensitive’, or ‘squared_epsilon_insensitive’, and doesn't
> >> seem to offer a logistic link function.
> >>
> >> In principle, GLMs allow this, but scikit's docs say the GLM models
> >> only allow strictly linear functions of their inputs, and don't allow
> >> a logistic link function. The docs direct people to the
> >> LogisticRegression class for this case.
> >>
> >> In R, there is:
> >>
> >> glm(Total_Service_Points_Won/Total_Service_Points_Played ~ ... ,
> >>      family = binomial(link = logit),
> >>      weights = Total_Service_Points_Played)
> >> which would be ideal.
> >>
> >> Is something similar available in scikit? (Or any continuous model
> >> that takes a 0 to 1 target and outputs a 0 to 1 prediction?)
> >>
> >> I was surprised to see that the implementation of
> >> CalibratedClassifierCV(method="sigmoid") uses an internal
> >> implementation of logistic regression to do its logistic regression --
> >> which I can use, although I'd prefer to use a user-facing library.
> >>
> >> Thanks,
> >> - Stuart
> >> _______________________________________________
> >> scikit-learn mailing list
> >> scikit-learn at python.org
> >> https://mail.python.org/mailman/listinfo/scikit-learn
> >
> >
>