[scikit-learn] biased predictions in logistic regression

Sean Violante sean.violante at gmail.com
Thu Dec 15 17:05:51 EST 2016


Sorry, I just saw that you are not using the liblinear solver. I agree with
Sebastian: you should subtract the mean, not the median.
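
A minimal sketch of that centering step, assuming design is the
dummy-encoded covariate matrix (the name is illustrative):

    import numpy as np

    # Center each dummy column at zero mean; subtracting a constant such
    # as the median only shifts the values and does not yield zero-mean
    # features.
    design_centered = design - design.mean(axis=0)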

On 15 Dec 2016 11:02 pm, "Sean Violante" <sean.violante at gmail.com> wrote:

> The problem is the (stupid!) liblinear solver, which also penalises the
> intercept in the regularisation. Use a different solver, or change the
> intercept_scaling parameter.
>
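For reference, a minimal sketch of the two workarounds mentioned above
(the parameter values are illustrative, not recommendations):

    from sklearn.linear_model import LogisticRegression

    # Option 1: a solver such as lbfgs, which does not penalize the
    # intercept.
    clf = LogisticRegression(C=1, solver='lbfgs')

    # Option 2: keep liblinear but raise intercept_scaling so that the
    # penalty on the intercept term becomes negligible.
    clf = LogisticRegression(C=1, solver='liblinear', intercept_scaling=100)
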
> On 15 Dec 2016 10:44 pm, "Sebastian Raschka" <se.raschka at gmail.com> wrote:
>
>> Subtracting the median wouldn’t normalize in the usual sense, since
>> subtracting a constant just shifts all the values by that constant.
>> Instead, for logistic regression and most optimizers, I would recommend
>> subtracting the mean to center the features at zero and dividing by the
>> standard deviation to get “z” scores (e.g., this can be done with
>> StandardScaler()).
>>
>> Best,
>> Sebastian
>>
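A minimal sketch of the StandardScaler() step Sebastian describes
(X_train and X_test are illustrative names):

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()                    # per-feature mean and std
    X_train_std = scaler.fit_transform(X_train)  # center to mean 0, unit variance
    X_test_std = scaler.transform(X_test)        # reuse the training statistics
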
>> > On Dec 15, 2016, at 4:02 PM, Rachel Melamed <melamed at uchicago.edu> wrote:
>> >
>> > I just tried it, and it did not appear to change the results at all?
>> > I ran it as follows:
>> >
>> > 1) Normalize the dummy variables (by subtracting the median) to make a
>> > matrix of about 10000 x 5.
>> >
>> > 2) For each of the 1000 output variables:
>> >
>> > a. Each output variable uses the same dummy variables, but not all
>> > settings of covariates are observed for all output variables. So I
>> > create the design matrix with patsy per output variable, including
>> > pairwise interactions. I then have a design matrix of around
>> > 10000 x 350, and a matrix I call “success_fail” that holds, for each
>> > setting, the number of successes and the number of failures, so it is
>> > of size 10000 x 2.
>> >
>> > b. Run regression using:
>> >
>> > import numpy as np
>> > from sklearn import linear_model
>> >
>> > # Stack the design matrix twice: one copy for successes, one for failures.
>> > skdesign = np.vstack((design, design))
>> >
>> > # Labels: 1 for the success copy, 0 for the failure copy.
>> > sklabel = np.hstack((np.ones(success_fail.shape[0]),
>> >                      np.zeros(success_fail.shape[0])))
>> >
>> > # Weight each row by its observed count of successes or failures.
>> > skweight = np.hstack((success_fail['success'], success_fail['fail']))
>> >
>> > logregN = linear_model.LogisticRegression(C=1, solver='lbfgs',
>> >                                           fit_intercept=False)
>> > logregN.fit(skdesign, sklabel, sample_weight=skweight)
>> >
>> >> On Dec 15, 2016, at 2:16 PM, Alexey Dral <aadral at gmail.com> wrote:
>> >>
>> >> Could you try normalizing the dataset after the dummy encoding of the
>> >> features and see whether the behavior is still reproducible?
>> >>
>> >> 2016-12-15 22:03 GMT+03:00 Rachel Melamed <melamed at uchicago.edu>:
>> >> Thanks for the reply.  The covariates ("X") are all dummy/categorical
>> >> variables.  So I guess no, nothing is normalized.
>> >>
>> >>> On Dec 15, 2016, at 1:54 PM, Alexey Dral <aadral at gmail.com> wrote:
>> >>>
>> >>> Hi Rachel,
>> >>>
>> >>> Do you have your data normalized?
>> >>>
>> >>> 2016-12-15 20:21 GMT+03:00 Rachel Melamed <melamed at uchicago.edu>:
>> >>> Hi all,
>> >>> Does anyone have any suggestions for this problem:
>> >>> http://stackoverflow.com/questions/41125342/sklearn-logistic-regression-gives-biased-results
>> >>>
>> >>> I am running around 1000 similar logistic regressions, with the same
>> >>> covariates but slightly different data and response variables. All of
>> >>> my response variables have sparse successes (p(success) < .05,
>> >>> usually).
>> >>>
>> >>> I noticed that with the regularized regression, the results are
>> >>> consistently biased toward predicting more "successes" than are
>> >>> observed in the training data. When I relax the regularization, this
>> >>> bias goes away. The observed bias is unacceptable for my use case,
>> >>> but the more-regularized model does otherwise seem a bit better.
>> >>>
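A minimal sketch of the kind of check being described, assuming the
skdesign, sklabel and skweight arrays from the setup earlier in the
thread (the C values are illustrative):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    for C in (1.0, 1e6):  # strong regularization vs. effectively none
        clf = LogisticRegression(C=C, solver='lbfgs', fit_intercept=False)
        clf.fit(skdesign, sklabel, sample_weight=skweight)
        # Weighted mean predicted probability of success vs. the observed
        # rate; a biased fit predicts more successes than were observed.
        pred_rate = np.average(clf.predict_proba(skdesign)[:, 1], weights=skweight)
        obs_rate = np.average(sklabel, weights=skweight)
        print(C, pred_rate, obs_rate)
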
>> >>> Below, I plot the results for the 1000 different regressions for 2
>> >>> different values of C:
>> >>>
>> >>> I looked at the parameter estimates for one of these regressions
>> >>> (below; each point is one parameter). It seems like the intercept
>> >>> (the point on the bottom left) is too high for the C=1 model.