[scikit-learn] Can fit a model with a target array of probabilities?

Sean Violante sean.violante at gmail.com
Thu Oct 5 13:32:23 EDT 2017


Stuart
have you tried glmnet (in R)? There is a Python port:
https://web.stanford.edu/~hastie/glmnet_python/ ....
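
A rough, untested sketch of what that looks like with glmnet_python
(the function names mirror the R API; whether the Python port accepts
fractional 0-1 targets like R's glmnet, and the exact signatures,
should be verified against its docs -- Xtrain, Xtest and y here are
assumed float64 arrays):

    import scipy
    import glmnet_python  # puts the package's modules on the path
    from glmnet import glmnet
    from glmnetPredict import glmnetPredict

    # elastic-net-penalized binomial GLM, as in R's glmnet
    fit = glmnet(x=Xtrain.copy(), y=y.copy(), family='binomial')
    # predicted probabilities at a chosen lambda
    probs = glmnetPredict(fit, newx=Xtest, s=scipy.float64([0.01]),
                          ptype='response')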




On Thu, Oct 5, 2017 at 6:34 PM, Stuart Reynolds <stuart at stuartreynolds.net>
wrote:

> Thanks Josef -- that was very useful.
>
> result.remove_data() reduces a 5-parameter Logit result object from
> megabytes to 5Kb (compared to a minimum uncompressed size of ~320
> bytes for the parameters themselves). That's a big improvement. I'll
> experiment with what you suggest -- since this is still >10x larger
> than possible. I think the difference is mostly attribute names.
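>
> Measured roughly like this (a sketch; `result` is any fitted
> statsmodels result object):
>
>         import pickle
>         print(len(pickle.dumps(result)))  # megabytes: holds data refs
>         result.remove_data()              # drops data-sized arrays
>         print(len(pickle.dumps(result)))  # a few Kb: params + metadata
>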
> I don't mind the lack of a multinomial support. I've often had better
> results mixing independent models for each class.
>
> I'll experiment with the different solvers. I tried the Logit model
> in the past -- its fit function only exposed a maxiter, and not a
> tolerance -- meaning I had to set maxiter very high. The newer
> statsmodels GLM module looks great and seems to solve this.
>
> For others who come this way, I think the magic for ridge-penalized
> logistic regression is:
>
>         from statsmodels.genmod.generalized_linear_model import GLM
>         from statsmodels.genmod import families
>         from statsmodels.genmod.families import links
>
>         model = GLM(y, Xtrain,
>                     family=families.Binomial(link=links.Logit()))
>         # L1_wt=0.0 makes the elastic net penalty pure ridge (L2);
>         # the tolerance keyword may vary by statsmodels version
>         result = model.fit_regularized(method='elastic_net',
>                                        alpha=l2weight, L1_wt=0.0, tol=...)
>         result.remove_data()
>         result.predict(Xtest)
>
> One last thing -- it's clear that it should be possible to do something
> like scikit's LogisticRegressionCV in order to quickly optimize a
> regularization parameter by re-using past coefficients as warm starts.
> Are there any wrappers in statsmodels for doing this or should I roll my
> own?
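>
> In case I roll my own: the loop I have in mind (a sketch; assumes
> fit_regularized's start_params acts as a warm start):
>
>         import numpy as np
>
>         alphas = np.logspace(2, -4, 20)
>         start = None
>         for a in alphas:
>             result = model.fit_regularized(method='elastic_net',
>                                            alpha=a, L1_wt=0.0,
>                                            start_params=start)
>             # warm-start the next fit from this solution
>             start = result.params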
>
>
> - Stu
>
>
> On Wed, Oct 4, 2017 at 3:43 PM,  <josef.pktd at gmail.com> wrote:
> >
> >
> > On Wed, Oct 4, 2017 at 4:26 PM, Stuart Reynolds
> > <stuart at stuartreynolds.net> wrote:
> >>
> >> Hi Andy,
> >> Thanks -- I'll give statsmodels another go.
> >> I remember I had some fitting speed issues with it in the past, and
> >> also some issues related to their models keeping references to the
> >> data (= a disaster for serialization and multiprocessing) -- although
> >> that was a long time ago.
> >
> >
> > The second has not changed and will not change, but there is a
> > remove_data method that deletes all references to full, data-sized
> > arrays. However, once the data is removed, it is no longer possible
> > to compute any new result statistics, which are almost all computed
> > lazily.
> > The fitting speed depends a lot on the optimizer, convergence
> > criteria, difficulty of the problem, and availability of good starting
> > parameters. Almost all nonlinear estimation problems use the scipy
> > optimizers; all unconstrained optimizers can be used. There are no
> > optimized special methods for cases with a very large number of
> > features.
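> >
> > For example (a sketch; `method` and solver-specific keywords are
> > passed through to the scipy wrappers):
> >
> >     import statsmodels.api as sm
> >     result = sm.Logit(y, X).fit(method="lbfgs", maxiter=500)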
> >
> > Multinomial/multiclass models don't support a continuous response
> > (yet); all other GLM and discrete models allow continuous data in
> > the interval extension of the domain.
> >
> > Josef
> >
> >
> >>
> >> - Stuart
> >>
> >> On Wed, Oct 4, 2017 at 1:09 PM, Andreas Mueller <t3kcit at gmail.com>
> >> wrote:
> >> > Hi Stuart.
> >> > There is no interface to do this in scikit-learn (and maybe we
> >> > should add this to the FAQ).
> >> > Yes, in principle this would be possible with several of the models.
> >> >
> >> > I think statsmodels can do that, and I think I saw another glm package
> >> > for Python that does that?
> >> >
> >> > It's certainly a legitimate use-case but would require substantial
> >> > changes to the code. I think so far we decided not to support
> >> > this in scikit-learn. Basically we don't have a concept of a link
> >> > function, and it's a concept that only applies to a subset of models.
> >> > We try to have a consistent interface for all our estimators, and
> >> > this doesn't really fit well within that interface.
> >> >
> >> > Hth,
> >> > Andy
> >> >
> >> >
> >> > On 10/04/2017 03:58 PM, Stuart Reynolds wrote:
> >> >>
> >> >> I'd like to fit a model that maps a matrix of continuous inputs to a
> >> >> target that's between 0 and 1 (a probability).
> >> >>
> >> >> In principle, I'd expect logistic regression to work out of the
> >> >> box with no modification (although it's often posed as being
> >> >> strictly for classification, its loss function allows fitting
> >> >> targets in the range 0 to 1, not strictly zero or one.)
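> >> >>
> >> >> To make that concrete, a bare-bones sketch (not scikit-learn API)
> >> >> of minimizing cross-entropy directly with fractional targets:
> >> >>
> >> >>     import numpy as np
> >> >>     from scipy.optimize import minimize
> >> >>
> >> >>     def nll(w, X, y):
> >> >>         # cross-entropy is well-defined for any y in [0, 1]
> >> >>         p = 1.0 / (1.0 + np.exp(-X.dot(w)))
> >> >>         eps = 1e-12
> >> >>         return -np.sum(y * np.log(p + eps)
> >> >>                        + (1 - y) * np.log(1 - p + eps))
> >> >>
> >> >>     w = minimize(nll, np.zeros(X.shape[1]), args=(X, y),
> >> >>                  method="L-BFGS-B").x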
> >> >>
> >> >> However, scikit's LogisticRegression and LogisticRegressionCV reject
> >> >> target arrays that are continuous. Other LR implementations allow a
> >> >> matrix of probability estimates. Looking at:
> >> >>
> >> >>
> >> >> http://scikit-learn-general.narkive.com/4dSCktaM/using-logistic-regression-on-a-continuous-target-variable
> >> >> and the fix here:
> >> >> https://github.com/scikit-learn/scikit-learn/pull/5084, which
> >> >> disables continuous targets, it looks like there was some reason
> >> >> for this. So ... I'm looking for alternatives.
> >> >>
> >> >> SGDClassifier allows log loss and (if I understood the docs
> >> >> correctly) adds a logistic link function, but also rejects
> >> >> continuous targets. Oddly, SGDRegressor only allows ‘squared_loss’,
> >> >> ‘huber’, ‘epsilon_insensitive’, or ‘squared_epsilon_insensitive’,
> >> >> and doesn't seem to offer a logistic loss.
> >> >>
> >> >> In principle, GLMs allow this, but scikit's docs say its GLM models
> >> >> only allow strictly linear functions of their inputs, and don't
> >> >> offer a logistic link function. The docs direct people to the
> >> >> LogisticRegression class for this case.
> >> >>
> >> >> In R, there is:
> >> >>
> >> >> glm(Total_Service_Points_Won/Total_Service_Points_Played ~ ...,
> >> >>     family = binomial(link = logit),
> >> >>     weights = Total_Service_Points_Played)
> >> >>
> >> >> which would be ideal.
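> >> >>
> >> >> For comparison, a statsmodels sketch of the same fit (assuming, as
> >> >> its docs describe, that the Binomial family accepts a two-column
> >> >> successes/failures response; `won` and `played` here stand in for
> >> >> the count arrays above):
> >> >>
> >> >>     import numpy as np
> >> >>     import statsmodels.api as sm
> >> >>
> >> >>     endog = np.column_stack([won, played - won])
> >> >>     result = sm.GLM(endog, sm.add_constant(X),
> >> >>                     family=sm.families.Binomial()).fit()
> >> >>     probs = result.predict(sm.add_constant(Xtest))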
> >> >>
> >> >> Is something similar available in scikit? (Or any continuous model
> >> >> that takes a 0 to 1 target and outputs a 0 to 1 prediction?)
> >> >>
> >> >> I was surprised to see that the implementation of
> >> >> CalibratedClassifierCV(method="sigmoid") uses an internal
> >> >> implementation of logistic regression to do its logistic
> >> >> regression -- which I can use, although I'd prefer a user-facing
> >> >> library.
> >> >>
> >> >> Thanks,
> >> >> - Stuart