[scikit-learn] Can fit a model with a target array of probabilities?

Stuart Reynolds stuart at stuartreynolds.net
Thu Oct 5 14:52:38 EDT 2017


Turns out sm.Logit does allow setting the tolerance.
Some quick-and-dirty time profiling of different methods on a 100k-sample
* 30-feature dataset, with different solvers and losses:

sklearn.LogisticRegression: l1 1.13864398003 (seconds)
sklearn.LogisticRegression: l2 0.0538778305054
sm.Logit l1 0.0922629833221  # although it didn't converge
sm.Logit l1_cvxopt_cp 0.958268165588
sm.Logit newton 0.133476018906
sm.Logit nm 0.369864940643
sm.Logit bfgs 0.105798006058
sm.Logit lbfgs 0.06241106987
sm.Logit powell 1.64219808578
sm.Logit cg 0.2184278965
sm.Logit ncg 0.216138124466
sm.Logit basinhopping 8.82164621353
sm.GLM.fit IRLS 0.544688940048
sm.GLM L2: 1.29778695107

I've been getting good results from sm.Logit.fit (although it's unregularized).
statsmodels' GLM seems a little slow; I'm not sure why.

My benchmark may be a little apples-to-oranges, since the stopping
criteria probably aren't comparable.
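
(As a sanity check -- a rough sketch only, assuming sklearn's n_iter_ and
statsmodels' mle_retvals behave the way I think they do, and reusing X, y,
tol and maxiter from the script below -- one could at least confirm which
fits actually converged rather than just hit maxiter:)

from sklearn.linear_model import LogisticRegression
from statsmodels.discrete.discrete_model import Logit

# sklearn reports the iteration count; hitting maxiter suggests the
# tolerance was never reached.
skl = LogisticRegression(C=1, tol=tol, max_iter=maxiter, penalty="l2").fit(X, y)
print "sklearn n_iter:", skl.n_iter_[0], "of", maxiter

# statsmodels keeps the optimizer's convergence flag in mle_retvals.
res = Logit(y, X).fit(method="bfgs", maxiter=maxiter, gtol=tol, disp=0)
print "sm.Logit converged:", res.mle_retvals.get("converged")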


For tiny models, which I'm also building: 100 samples, 5 features

sklearn.LogisticRegression: l1 0.00137376785278
sklearn.LogisticRegression: l2 0.00167894363403
sm.Logit l1 0.0198900699615
sm.Logit l1_cvxopt_cp 0.162448167801
sm.Logit newton 0.00689911842346
sm.Logit nm 0.0754928588867
sm.Logit bfgs 0.0210938453674
sm.Logit lbfgs 0.0156588554382
sm.Logit powell 0.0161390304565
sm.Logit cg 0.00759506225586
sm.Logit ncg 0.00541186332703
sm.Logit basinhopping 0.3076171875
sm.GLM.fit IRLS 0.00902199745178
sm.GLM L2: 0.0208361148834

I couldn't get sm.GLM.fit to work with the non-"IRLS" solvers (it hits a
division by zero).
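
(A guess I haven't verified: since GLM.fit accepts start_params, warm-starting
the scipy-based solvers from the IRLS solution might avoid the blow-up.
Sketch only:)

from statsmodels.genmod.generalized_linear_model import GLM, families

# Untested guess: fit once with IRLS, then hand its params to a scipy-based
# solver as the starting point.
model = GLM(y, X, family=families.Binomial(link=families.links.logit))
irls_result = model.fit(method="IRLS", maxiter=maxiter)
result = model.fit(method="bfgs", start_params=irls_result.params,
                   maxiter=maxiter, disp=0)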

----

import sklearn.datasets
from sklearn.preprocessing import StandardScaler
X, y = sklearn.datasets.make_classification(n_samples=10000, n_features=30,
                                            random_state=123)
X = StandardScaler(copy=True, with_mean=True, with_std=True).fit_transform(X)

import time
tol = 0.0001
maxiter = 100
DISP = 0


if 1: # sk.LogisticRegression
    import sklearn
    from sklearn.linear_model import LogisticRegression

    for method in ["l1", "l2"]: # TODO, add solvers:
        t = time.time()
        model = LogisticRegression(C=1, tol=tol, max_iter=maxiter,
                                    penalty=method)
        model.fit(X,y)
        print "sklearn.LogisticRegression:", method, time.time() - t




if 1: # sm.Logit.fit_regularized
    from statsmodels.discrete.discrete_model import Logit
    for method in ["l1", "l1_cvxopt_cp"]:
        t = time.time()
        model = Logit(y,X)
        result = model.fit_regularized(method=method, maxiter=maxiter,
                                       alpha=1.,
                                       abstol=tol,
                                       acc=tol,
                                       tol=tol, gtol=tol, pgtol=tol,
                                       disp=DISP)
        print "sm.Logit", method, time.time() - t

if 1: # sm.Logit.fit
    from statsmodels.discrete.discrete_model import Logit

    SOLVERS = ["newton", "nm",
"bfgs","lbfgs","powell","cg","ncg","basinhopping",]
    for method in SOLVERS:
        t = time.time()
        model = Logit(y,X)
        result = model.fit(method=method, maxiter=maxiter,
                           niter=maxiter,
                           ftol=tol,
                           # Hmmm.. which of these tolerances each solver
                           # actually uses needs to be reviewed.
                           tol=tol, gtol=tol, pgtol=tol,
                           disp=DISP)
        print "sm.Logit", method, time.time() - t

if 1: # sm.GLM.fit
    from statsmodels.genmod.generalized_linear_model import GLM
    from statsmodels.genmod.generalized_linear_model import families
    for method in ["IRLS"]:
        t = time.time()
        model = GLM(y, X, family=families.Binomial(link=families.links.logit))
        result = model.fit(method=method, cnvrg_tol=tol, maxiter=maxiter,
                           full_output=False, disp=DISP)
        print "sm.GLM.fit", method, time.time() - t


if 1: # GLM.fit_regularized
    from statsmodels.genmod.generalized_linear_model import GLM
    from statsmodels.genmod.generalized_linear_model import families
    t = time.time()
    model = GLM(y, X, family=families.Binomial(link=families.links.logit))
    result = model.fit_regularized(method='elastic_net', alpha=1.0,
                                   L1_wt=0.0, cnvrg_tol=tol, maxiter=maxiter)
    print "sm.GLM L2:", time.time() - t



if 0: # GLM.fit
    # Hits division by zero.
    SOLVERS = ["bfgs","lbfgs", "netwon", "nm",
"powell","cg","ncg","basinhopping",]
    from statsmodels.genmod.generalized_linear_model import GLM
    from statsmodels.genmod.generalized_linear_model import families
    for method in SOLVERS:
        t = time.time()
        model = GLM(y, X, family=families.Binomial(link=families.links.logit))
        result = model.fit(method=method,
                           # scale="X2",
                           # alpha=1.,
                           # abstol=tol,
                           # acc=tol,
                           # tol=tol, gtol=tol, pgtol=tol,
                           # maxiter=maxiter,
                           # full_output=False,
                           disp=DISP)
        print "sm.GLM.fit", method, time.time() - t


On Thu, Oct 5, 2017 at 10:32 AM, Sean Violante <sean.violante at gmail.com> wrote:
> Stuart
> have you tried glmnet (in R)? There is a python version:
> https://web.stanford.edu/~hastie/glmnet_python/ ....
>
>
>
>
> On Thu, Oct 5, 2017 at 6:34 PM, Stuart Reynolds <stuart at stuartreynolds.net>
> wrote:
>>
>> Thanks Josef. Was very useful.
>>
>> result.remove_data() reduces a 5-parameter Logit result object from
>> megabytes to 5Kb (as compared to a minimum uncompressed size of the
>> parameters of ~320 bytes). That's a big improvement. I'll experiment with
>> what you suggest -- since this is still >10x larger than possible. I
>> think the difference is mostly attribute names.
>> I don't mind the lack of a multinomial support. I've often had better
>> results mixing independent models for each class.
>>
>> I'll experiment with the different solvers.  I tried the Logit model
>> in the past -- its fit function only exposed a maxiter, and not a
>> tolerance -- meaning I had to set maxiter very high. The newer
>> statsmodels GLM module looks great and seems to solve this.
>>
>> For others who come this way, I think the magic for ridge regression is:
>>
>>         from statsmodels.genmod.generalized_linear_model import GLM
>>         from statsmodels.genmod.generalized_linear_model import families
>>         from statsmodels.genmod.families import links
>>
>>         model = GLM(y, Xtrain, family=families.Binomial(link=links.Logit))
>>         result = model.fit_regularized(method='elastic_net',
>>                                        alpha=l2weight, L1_wt=0.0, tol=...)
>>         result.remove_data()
>>         result.predict(Xtest)
>>
>> One last thing -- it's clear that it should be possible to do something
>> like scikit's LogisticRegressionCV in order to quickly optimize a
>> single parameter by re-using past coefficients.
>> Are there any wrappers in statsmodels for doing this, or should I roll my
>> own?
>>
>>
>> - Stu
>>
>>
>> On Wed, Oct 4, 2017 at 3:43 PM,  <josef.pktd at gmail.com> wrote:
>> >
>> >
>> > On Wed, Oct 4, 2017 at 4:26 PM, Stuart Reynolds
>> > <stuart at stuartreynolds.net>
>> > wrote:
>> >>
>> >> Hi Andy,
>> >> Thanks -- I'll give statsmodels another go.
>> >> I remember I had some fitting speed issues with it in the past, and
>> >> also some issues related to their models keeping references to the data
>> >> (=disaster for serialization and multiprocessing) -- although that was
>> >> a long time ago.
>> >
>> >
>> > The second has not changed and will not change, but there is a
>> > remove_data method that deletes all references to full, data-sized
>> > arrays. However, once the data is removed, it is no longer possible to
>> > compute any new result statistics, which are almost all lazily computed.
>> > The fitting speed depends a lot on the optimizer, the convergence
>> > criteria, the difficulty of the problem, and the availability of good
>> > starting parameters. Almost all nonlinear estimation problems use the
>> > scipy optimizers; all unconstrained optimizers can be used. There are no
>> > optimized special methods for cases with a very large number of
>> > features.
>> >
>> > Multinomial/multiclass models don't support continuous response (yet);
>> > all other GLM and discrete models allow for continuous data in the
>> > interval extension of the domain.
>> >
>> > Josef
>> >
>> >
>> >>
>> >> - Stuart
>> >>
>> >> On Wed, Oct 4, 2017 at 1:09 PM, Andreas Mueller <t3kcit at gmail.com>
>> >> wrote:
>> >> > Hi Stuart.
>> >> > There is no interface to do this in scikit-learn (and maybe we should
>> >> > add this to the FAQ).
>> >> > Yes, in principle this would be possible with several of the models.
>> >> >
>> >> > I think statsmodels can do that, and I think I saw another glm
>> >> > package
>> >> > for Python that does that?
>> >> >
>> >> > It's certainly a legitimate use-case but would require substantial
>> >> > changes to the code. I think so far we decided not to support
>> >> > this in scikit-learn. Basically we don't have a concept of a link
>> >> > function, and it's a concept that only applies to a subset of models.
>> >> > We try to have a consistent interface for all our estimators, and
>> >> > this doesn't really fit well within that interface.
>> >> >
>> >> > Hth,
>> >> > Andy
>> >> >
>> >> >
>> >> > On 10/04/2017 03:58 PM, Stuart Reynolds wrote:
>> >> >>
>> >> >> I'd like to fit a model that maps a matrix of continuous inputs to a
>> >> >> target that's between 0 and 1 (a probability).
>> >> >>
>> >> >> In principle, I'd expect logistic regression to work out of the
>> >> >> box with no modification (although it's often posed as being strictly
>> >> >> for classification, its loss function allows for fitting targets in
>> >> >> the range 0 to 1, and not strictly zero or one).
>> >> >>
>> >> >> However, scikit's LogisticRegression and LogisticRegressionCV reject
>> >> >> target arrays that are continuous. Other LR implementations allow a
>> >> >> matrix of probability estimates. Looking at:
>> >> >>
>> >> >>
>> >> >>
>> >> >> http://scikit-learn-general.narkive.com/4dSCktaM/using-logistic-regression-on-a-continuous-target-variable
>> >> >> and the fix here:
>> >> >> https://github.com/scikit-learn/scikit-learn/pull/5084, which
>> >> >> disables continuous inputs, it looks like there was some reason for
>> >> >> this. So ... I'm looking for alternatives.
>> >> >>
>> >> >> SGDClassifier allows log loss and (if I understood the docs
>> >> >> correctly) adds a logistic link function, but also rejects continuous
>> >> >> targets. Oddly, SGDRegressor only allows ‘squared_loss’, ‘huber’,
>> >> >> ‘epsilon_insensitive’, or ‘squared_epsilon_insensitive’, and doesn't
>> >> >> seem to provide a logistic link function.
>> >> >>
>> >> >> In principle, GLMs allow this, but scikit's docs say its GLM models
>> >> >> only allow strictly linear functions of their inputs and don't allow
>> >> >> a logistic link function. The docs direct people to the
>> >> >> LogisticRegression class for this case.
>> >> >>
>> >> >> In R, there is:
>> >> >>
>> >> >> glm(Total_Service_Points_Won/Total_Service_Points_Played ~ ... ,
>> >> >>     family = binomial(link=logit),
>> >> >>     weights = Total_Service_Points_Played)
>> >> >> which would be ideal.
>> >> >>
>> >> >> Is something similar available in scikit? (Or any continuous model
>> >> >> that takes a 0 to 1 target and outputs a 0 to 1 target?)
>> >> >>
>> >> >> I was surprised to see that the implementation of
>> >> >> CalibratedClassifierCV(method="sigmoid") uses an internal
>> >> >> implementation of logistic regression to do its logistic regressing
>> >> >> --
>> >> >> which I can use, although I'd prefer to use a user-facing library.
>> >> >>
>> >> >> Thanks,
>> >> >> - Stuart
>> >> >> _______________________________________________
>> >> >> scikit-learn mailing list
>> >> >> scikit-learn at python.org
>> >> >> https://mail.python.org/mailman/listinfo/scikit-learn
>> >> >
>> >> >
>> >> > _______________________________________________
>> >> > scikit-learn mailing list
>> >> > scikit-learn at python.org
>> >> > https://mail.python.org/mailman/listinfo/scikit-learn
>> >> _______________________________________________
>> >> scikit-learn mailing list
>> >> scikit-learn at python.org
>> >> https://mail.python.org/mailman/listinfo/scikit-learn
>> >
>> >
>> >
>> > _______________________________________________
>> > scikit-learn mailing list
>> > scikit-learn at python.org
>> > https://mail.python.org/mailman/listinfo/scikit-learn
>> >
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>

