[scikit-learn] biased predictions in logistic regression

josef.pktd at gmail.com josef.pktd at gmail.com
Fri Dec 16 00:11:00 EST 2016


Just some generic comments; I don't have any experience with penalized
estimation, nor did I go through the math.

In unregularized logistic regression (Logit) and in several other models,
the estimator satisfies certain aggregation properties, so that in-sample
(training-set) predicted proportions match the observed proportions.
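
For illustration, a minimal sketch of that aggregation property on
synthetic data (everything here is illustrative, not from the thread
below): with an effectively unregularized fit, the in-sample mean
predicted probability matches the observed success rate.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5000, 3))
    p_true = 1.0 / (1.0 + np.exp(-(X @ np.array([0.5, -1.0, 0.8]) - 3.0)))
    y = (rng.random(5000) < p_true).astype(int)  # rare events

    # Effectively unregularized fit (huge C): the score equation for the
    # intercept forces mean(p_hat) == mean(y) on the training sample.
    clf = LogisticRegression(C=1e10, max_iter=1000).fit(X, y)
    print(clf.predict_proba(X)[:, 1].mean(), y.mean())  # essentially equal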

Regularized estimation does not give unbiased estimates of the parameters,
because it optimizes a different objective function, analogous to penalized
mean squared error in the linear model: we are trading off bias against
variance. I think this bias will propagate to the prediction, but I'm not
sure whether an unpenalized intercept can be made to compensate for the
bias in the average prediction.

For Logit this would mean that although we have a bias, we have less
variance in the predictions, so under the chosen penalization criterion we
are overall doing better than with unregularized estimation. I assume that
because the regularization shrinks the coefficients toward zero, it also
biases the prediction toward 0.5, unless that is compensated for by the
intercept.
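
Here is a sketch of that intuition on synthetic data (an assumption on my
part: with the lbfgs solver, scikit-learn penalizes a constant column
passed as part of X, but not the separate intercept fitted with
fit_intercept=True). Shrinking a penalized constant toward zero pulls the
average prediction up toward 0.5:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(1)
    X = rng.normal(size=(5000, 3))
    p_true = 1.0 / (1.0 + np.exp(-(X @ np.array([0.5, -1.0, 0.8]) - 3.0)))
    y = (rng.random(5000) < p_true).astype(int)
    Xc = np.hstack([np.ones((X.shape[0], 1)), X])  # explicit constant column

    # Constant column penalized along with everything else:
    pen = LogisticRegression(C=0.1, solver='lbfgs',
                             fit_intercept=False).fit(Xc, y)
    # Separate, unpenalized intercept:
    unpen = LogisticRegression(C=0.1, solver='lbfgs',
                               fit_intercept=True).fit(X, y)

    print(y.mean())                             # observed success rate
    print(pen.predict_proba(Xc)[:, 1].mean())   # shifted up, toward 0.5
    print(unpen.predict_proba(X)[:, 1].mean())  # ~ observed rate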

I didn't read the King and Zeng (2001) article, but based on a brief
search it doesn't mention penalization or regularization, so it doesn't
seem to address the regularization bias. (As an aside, from the literature
I think many people use a different model than logistic regression for
rare-events data, either Poisson with an exponential link or
Binomial/Bernoulli with an asymmetric link function.)

I think demeaning could help, because it reduces the dependence between
the intercept and the other, penalized variables; but because the model is
nonlinear it will not make them orthogonal.
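
For what it's worth, a one-line sketch of that demeaning step (illustrative
only; the thread below subtracts the median of each dummy column):

    import numpy as np

    # Hypothetical X of 0/1 dummy columns; centering each column lets the
    # intercept absorb the baseline rate instead of competing with the
    # penalized slopes.
    X_centered = X - np.median(X, axis=0)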

The question is whether it's possible to improve the estimator by
additionally adjusting the mean prediction or the threshold for 0-1
predictions. It might depend on the criterion used to choose the
penalization. I don't know, and I have no idea what scikit-learn does.
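
One conceivable post-hoc adjustment (purely a sketch of the idea above;
the helper recalibrate_intercept is hypothetical, not anything
scikit-learn provides): keep the penalized slopes but shift the intercept
so the in-sample mean prediction matches the observed success rate.

    import numpy as np
    from scipy.optimize import brentq

    def recalibrate_intercept(model, X, y):
        """Offset for the fitted intercept so that the mean predicted
        probability equals the observed success rate (hypothetical)."""
        eta = X @ model.coef_.ravel() + model.intercept_[0]
        def gap(delta):
            p = 1.0 / (1.0 + np.exp(-(eta + delta)))
            return p.mean() - y.mean()
        # gap() is increasing in delta and changes sign on a wide bracket
        return brentq(gap, -20.0, 20.0)

    # Usage sketch: model.intercept_ += recalibrate_intercept(model, X, y)

Whether such an adjustment actually helps out of sample would, again,
depend on the criterion used to choose the penalization.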

Josef

On Thu, Dec 15, 2016 at 11:30 PM, Stuart Reynolds <stuart at stuartreynolds.net> wrote:

> Here's a discussion
>
> http://stats.stackexchange.com/questions/6067/does-an-unbalanced-sample-matter-when-doing-logistic-regression
>
> See the King and Zeng reference.
> It would be nice to have these methods in scikit.
>
> On Thu, Dec 15, 2016 at 7:05 PM Rachel Melamed <melamed at uchicago.edu> wrote:
>
>> Stuart,
>>
>> Yes, the data is quite imbalanced (this is what I meant by p(success) < .05).
>>
>> To be clear, I calculate
>>
>>     \sum_i \hat{y}_i,  where  \hat{y} = logregN.predict_proba(design)[:,1] * success_fail.sum(axis=1)
>>
>> and compare that number to the observed number of successes. I find the
>> predicted number to always be higher (I think because of the intercept).
>>
>> I was not aware of a bias for imbalanced data. Can you tell me more? Why
>> does it not appear with the relaxed regularization? Also, using the same
>> data with statsmodels LR, which has no regularization, this doesn't seem
>> to be a problem. Any suggestions for how I could fix this are welcome.
>>
>> Thank you
>>
>> On Dec 15, 2016, at 4:41 PM, Stuart Reynolds <stuart at stuartreynolds.net> wrote:
>>
>> LR is biased with imbalanced datasets. Is your dataset unbalanced (e.g.,
>> is there one class that has a much smaller prevalence in the data than
>> the other)?
>>
>> On Thu, Dec 15, 2016 at 1:02 PM, Rachel Melamed <melamed at uchicago.edu> wrote:
>>
>> I just tried it and it did not appear to change the results at all?
>> I ran it as follows:
>>
>> 1) Normalize dummy variables (by subtracting the median) to make a matrix
>> of about 10000 x 5.
>>
>> 2) For each of the 1000 output variables:
>>
>> a. Each output variable uses the same dummy variables, but not all
>> settings of covariates are observed for all output variables. So I create
>> the design matrix using patsy per output variable to include pairwise
>> interactions. I then have a design matrix of around 10000 x 350, and a
>> matrix I call "success_fail" that has, for each setting, the number of
>> successes and the number of failures, so it is of size 10000 x 2.
>>
>> b. Run the regression using:
>>
>>     import numpy as np
>>     from sklearn import linear_model
>>
>>     # Stack the design twice: once for the success rows (label 1) and
>>     # once for the fail rows (label 0); the per-row counts enter as
>>     # sample weights, giving a weighted binomial fit.
>>     skdesign = np.vstack((design, design))
>>     sklabel = np.hstack((np.ones(success_fail.shape[0]),
>>                          np.zeros(success_fail.shape[0])))
>>     skweight = np.hstack((success_fail['success'], success_fail['fail']))
>>
>>     logregN = linear_model.LogisticRegression(C=1, solver='lbfgs',
>>                                               fit_intercept=False)
>>     logregN.fit(skdesign, sklabel, sample_weight=skweight)
>>
>> On Dec 15, 2016, at 2:16 PM, Alexey Dral <aadral at gmail.com> wrote:
>>
>> Could you try normalizing the dataset after the dummy encoding of the
>> features and see if the behavior is reproducible?
>>
>>
>> 2016-12-15 22:03 GMT+03:00 Rachel Melamed <melamed at uchicago.edu>:
>>
>> Thanks for the reply. The covariates ("X") are all dummy/categorical
>> variables. So I guess no, nothing is normalized.
>>
>>
>> On Dec 15, 2016, at 1:54 PM, Alexey Dral <aadral at gmail.com> wrote:
>>
>> Hi Rachel,
>>
>> Do you have your data normalized?
>>
>> 2016-12-15 20:21 GMT+03:00 Rachel Melamed <melamed at uchicago.edu>:
>>
>>
>> Hi all,
>>
>> Does anyone have any suggestions for this problem:
>>
>> http://stackoverflow.com/questions/41125342/sklearn-logistic-regression-gives-biased-results
>>
>> I am running around 1000 similar logistic regressions, with the same
>> covariates but slightly different data and response variables. All of my
>> response variables have sparse successes (p(success) < .05, usually).
>>
>> I noticed that with the regularized regression, the results are
>> consistently biased to predict more "successes" than are observed in the
>> training data. When I relax the regularization, this bias goes away. The
>> observed bias is unacceptable for my use case, but the more-regularized
>> model does otherwise seem a bit better.
>>
>>
>> Below, I plot the results for the 1000 different regressions for 2
>> different values of C: <https://i.stack.imgur.com/1cbrC.png>
>>
>> I looked at the parameter estimates for one of these regressions; below,
>> each point is one parameter. It seems like the intercept (the point on
>> the bottom left) is too high for the C=1 model:
>> <https://i.stack.imgur.com/NTFOY.png>
>>
>>
>> --
>> Yours sincerely,
>> Alexey A. Dral
>>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn