[scikit-learn] biased predictions in logistic regression

Rachel Melamed melamed at uchicago.edu
Thu Dec 15 22:04:03 EST 2016


Stuart,
Yes, the data is quite imbalanced (this is what I meant by p(success) < .05).

To be clear, I calculate
\sum_i \hat{y}_i = (logregN.predict_proba(design)[:,1] * success_fail.sum(axis=1)).sum()
and compare that number to the observed number of successes. I find the predicted number is always higher (I think because of the intercept).
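Concretely, the check looks like this (a toy sketch with made-up numbers; `p_hat` stands in for `logregN.predict_proba(design)[:, 1]` and `n_trials` for `success_fail.sum(axis=1)`):

```python
import numpy as np

# Toy stand-ins for the real quantities (made-up numbers, not my data).
p_hat = np.array([0.02, 0.05, 0.01])        # predict_proba(design)[:, 1]
n_trials = np.array([100.0, 200.0, 300.0])  # success_fail.sum(axis=1)
observed = np.array([1, 8, 2])              # success_fail['success']

# Predicted number of successes vs. observed number of successes.
predicted = (p_hat * n_trials).sum()
print(predicted, observed.sum())
```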

I was not aware of a bias for imbalanced data. Can you tell me more? Why does it not appear when the regularization is relaxed? Also, with the same data, statsmodels' logistic regression, which has no regularization, doesn't show this problem. Any suggestions for how I could fix this are welcome.

Thank you

On Dec 15, 2016, at 4:41 PM, Stuart Reynolds <stuart at stuartreynolds.net> wrote:

LR is biased with imbalanced datasets. Is your dataset imbalanced (e.g., is there one class with a much smaller prevalence in the data than the other)?

On Thu, Dec 15, 2016 at 1:02 PM, Rachel Melamed <melamed at uchicago.edu> wrote:
I just tried it, and it did not appear to change the results at all.
I ran it as follows:
1) Normalize the dummy variables (by subtracting the median of each column) to make a matrix of about 10000 x 5
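Step 1, as a sketch (toy dummy matrix, not my data; for a 0/1 dummy column the median is whichever level is more common, so centering mainly shifts the majority-1 columns):

```python
import numpy as np

# Toy dummy-coded matrix (rows = settings, columns = dummy variables).
dummies = np.array([[1, 0],
                    [1, 0],
                    [1, 1],
                    [0, 1]], dtype=float)

# Center each column on its median.
centered = dummies - np.median(dummies, axis=0)
print(centered)
```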

2) For each of the 1000 output variables:
a. Each output variable uses the same dummy variables, but not all settings of the covariates are observed for every output variable. So I create the design matrix per output variable using patsy, including pairwise interactions. That gives an around 10000 x 350 design matrix, and a matrix I call "success_fail" that holds, for each setting, the number of successes and the number of failures, so it is of size 10000 x 2.
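Step 2a, as a sketch (toy data; `a` and `b` are hypothetical stand-ins for my dummy variables):

```python
import numpy as np
import pandas as pd
from patsy import dmatrix

# Toy covariate settings ('a' and 'b' are hypothetical dummy variables).
df = pd.DataFrame({'a': [0, 1, 0, 1], 'b': [0, 0, 1, 1]})

# "(a + b)**2" expands to all main effects plus pairwise interactions,
# giving columns: Intercept, a, b, a:b.
design = np.asarray(dmatrix('(a + b)**2', df))

# One (success, fail) count pair per covariate setting.
success_fail = pd.DataFrame({'success': [1, 3, 2, 5],
                             'fail': [99, 97, 98, 95]})
print(design.shape, success_fail.shape)
```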

b. Run regression using:

    import numpy as np
    from sklearn import linear_model

    # Stack the design twice: once for the "success" rows (label 1)
    # and once for the "fail" rows (label 0).
    skdesign = np.vstack((design, design))
    sklabel = np.hstack((np.ones(success_fail.shape[0]),
                         np.zeros(success_fail.shape[0])))
    # Weight each row by its observed success/failure count.
    skweight = np.hstack((success_fail['success'], success_fail['fail']))

    logregN = linear_model.LogisticRegression(C=1, solver='lbfgs',
                                              fit_intercept=False)
    logregN.fit(skdesign, sklabel, sample_weight=skweight)
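The stacking-plus-weights construction above is intended to be equivalent to literally repeating each row once per observed success or failure; a quick sanity check of that equivalence on toy data (my own sketch, not from the code above):

```python
import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(0)
design = rng.normal(size=(20, 3))
success = rng.integers(1, 10, size=20)
fail = rng.integers(1, 10, size=20)

# Weighted formulation: each setting appears twice (labels 1 and 0),
# weighted by its success and failure counts.
X_w = np.vstack((design, design))
y_w = np.hstack((np.ones(20), np.zeros(20)))
w = np.hstack((success, fail))
clf_w = linear_model.LogisticRegression(C=1, solver='lbfgs',
                                        fit_intercept=False,
                                        tol=1e-10, max_iter=10000)
clf_w.fit(X_w, y_w, sample_weight=w)

# Expanded formulation: repeat each row once per counted outcome.
X_e = np.vstack((np.repeat(design, success, axis=0),
                 np.repeat(design, fail, axis=0)))
y_e = np.hstack((np.ones(success.sum()), np.zeros(fail.sum())))
clf_e = linear_model.LogisticRegression(C=1, solver='lbfgs',
                                        fit_intercept=False,
                                        tol=1e-10, max_iter=10000)
clf_e.fit(X_e, y_e)

# The two fits optimize the same penalized objective, so the
# coefficients should agree up to solver tolerance.
print(np.allclose(clf_w.coef_, clf_e.coef_, atol=1e-5))
```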


On Dec 15, 2016, at 2:16 PM, Alexey Dral <aadral at gmail.com> wrote:

Could you try to normalize dataset after feature dummy encoding and see if it is reproducible behavior?

2016-12-15 22:03 GMT+03:00 Rachel Melamed <melamed at uchicago.edu>:
Thanks for the reply. The covariates ("X") are all dummy/categorical variables, so no, nothing is normalized.

On Dec 15, 2016, at 1:54 PM, Alexey Dral <aadral at gmail.com> wrote:

Hi Rachel,

Do you have your data normalized?

2016-12-15 20:21 GMT+03:00 Rachel Melamed <melamed at uchicago.edu>:
Hi all,
Does anyone have any suggestions for this problem:
http://stackoverflow.com/questions/41125342/sklearn-logistic-regression-gives-biased-results


I am running around 1000 similar logistic regressions, with the same covariates but slightly different data and response variables. All of my response variables have sparse successes (usually p(success) < .05).

I noticed that with the regularized regression, the results are consistently biased toward predicting more "successes" than are observed in the training data. When I relax the regularization, this bias goes away. The bias is unacceptable for my use case, but the more-regularized model does otherwise seem a bit better.
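A sketch that reproduces the effect on simulated rare-event data (my own toy example, not my actual data): as in my setup, the intercept is an explicit column of the design matrix and `fit_intercept=False`, so it is penalized along with everything else; `C=1e6` stands in for relaxed regularization:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 4))
p_true = 1.0 / (1.0 + np.exp(-(X @ np.array([0.5, -0.5, 0.3, 0.1]) - 3.0)))
y = (rng.random(n) < p_true).astype(int)  # rare successes

# Intercept as a penalized design column, as in my fits.
design = np.hstack([np.ones((n, 1)), X])

predicted = {}
for C in (0.01, 1e6):
    clf = LogisticRegression(C=C, solver='lbfgs', fit_intercept=False,
                             max_iter=2000).fit(design, y)
    predicted[C] = clf.predict_proba(design)[:, 1].sum()

# Strong regularization shrinks the (negative) intercept toward zero,
# inflating the predicted success count; weak regularization does not.
print(predicted, y.sum())
```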

Below, I plot the results for the 1000 different regressions for 2 different values of C: [results for the different regressions for 2 different values of C] <https://i.stack.imgur.com/1cbrC.png>

I looked at the parameter estimates for one of these regressions; in the plot below, each point is one parameter. It seems like the intercept (the point on the bottom left) is too high in the C=1 model. [plot: parameter estimates for one regression] <https://i.stack.imgur.com/NTFOY.png>


_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn




--
Yours sincerely,
Alexey A. Dral

