[scikit-learn] Using logistic regression with count proportions data

Michael Eickenberg michael.eickenberg at gmail.com
Mon Oct 10 10:46:17 EDT 2016


Here is a possibly useful comment by larsmans on Stack Overflow about
exactly this procedure:

http://stackoverflow.com/questions/26604175/how-to-predict-a-continuous-dependent-variable-that-expresses-target-class-proba/26614131#comment41846816_26614131
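For concreteness, here is a minimal sketch (with made-up data) of the
"two rows per feature configuration" trick discussed in the thread below:
each observed proportion p[i] out of n[i] trials becomes one row labelled 1,
weighted by the positive count, and one row labelled 0, weighted by the
negative count.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: 3 feature configurations with observed proportions.
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
p = np.array([0.3, 0.8, 0.5])   # observed proportion of positives
n = np.array([10, 1000, 50])    # total count per configuration

# Duplicate each configuration: once as a positive row, once as a negative
# row, with the corresponding counts as sample weights.
X_rep = np.vstack([X, X])
y_rep = np.concatenate([np.ones(len(X)), np.zeros(len(X))])
w_rep = np.concatenate([p * n, (1 - p) * n])

clf = LogisticRegression().fit(X_rep, y_rep, sample_weight=w_rep)
probs = clf.predict_proba(X)[:, 1]  # fitted probability per configuration
```

Note that the totals n enter through the weights, so a configuration seen
1000 times pulls the fit much harder than one seen 10 times.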


On Mon, Oct 10, 2016 at 4:04 PM, Sean Violante <sean.violante at gmail.com>
wrote:

> Sorry, yes, there was a misunderstanding:
>
> I meant that for each feature configuration you should pass in two rows (one
> for the positive cases and one for the negative), with the sample weight
> being the corresponding count for that configuration and class.
>
> And I am saying that the total count is important because you could have a
> situation where one feature combination occurs 10 times and another feature
> combination 1000 times.
>
>
> On Mon, Oct 10, 2016 at 3:48 PM, Raphael C <drraph at gmail.com> wrote:
>
>> On 10 October 2016 at 12:22, Sean Violante <sean.violante at gmail.com>
>> wrote:
>> > No (but please check!)
>> >
>> > Sample weights should be the counts for the respective label (0/1).
>> >
>> > [I am actually puzzled about the glm help file - proportions lose how
>> > often an input data 'row' was present relative to the others - though you
>> > could do this by repeating the row 'n' times.]
>>
>> I think we might be talking at cross purposes.
>>
>> I have a matrix X where each row is a feature vector. I also have an
>> array y where y[i] is a real number between 0 and 1. I would like to
>> build a regression model that predicts the y values given the X rows.
>>
>> Now each y[i] value in fact comes from simply counting the number of
>> positively labelled elements in a particular set (set i) and dividing by
>> the number of elements in that set. So I can easily fit this into the
>> model given by the R package glm, by replacing each y[i] value with a
>> pair of "number of positives" and "number of negatives" (this is case
>> 2 in the docs I quoted), or by using case 3, which asks for y[i] plus
>> the total number of elements in set i.
>>
>> I don't see how a single integer for sample_weight[i] would cover this
>> information, but I am sure I must have misunderstood. At best it seems
>> it could cover the number of positive values, but this is missing half
>> the information.
>>
>> Raphael
>>
>> >
>> > On Mon, Oct 10, 2016 at 1:15 PM, Raphael C <drraph at gmail.com> wrote:
>> >>
>> >> How do I use sample_weight for my use case?
>> >>
>> >> In my case, is "y" an array of 0s and 1s, and sample_weight then an
>> >> array of real numbers between 0 and 1, where I should make sure to set
>> >> sample_weight[i] = 0 when y[i] = 0?
>> >>
>> >> Raphael
>> >>
>> >> On 10 October 2016 at 12:08, Sean Violante <sean.violante at gmail.com>
>> >> wrote:
>> >> > It should be the sample_weight parameter of fit:
>> >> >
>> >> >
>> >> > http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
>> >> >
>> >> > On Mon, Oct 10, 2016 at 1:03 PM, Raphael C <drraph at gmail.com> wrote:
>> >> >>
>> >> >> I just noticed this about the glm package in R.
>> >> >> http://stats.stackexchange.com/a/26779/53128
>> >> >>
>> >> >> "
>> >> >> The glm function in R allows 3 ways to specify the formula for a
>> >> >> logistic regression model.
>> >> >>
>> >> >> The most common is that each row of the data frame represents a
>> >> >> single observation and the response variable is either 0 or 1 (or a
>> >> >> factor with 2 levels, or another variable with only 2 unique values).
>> >> >>
>> >> >> Another option is to use a 2 column matrix as the response variable
>> >> >> with the first column being the counts of 'successes' and the second
>> >> >> column being the counts of 'failures'.
>> >> >>
>> >> >> You can also specify the response as a proportion between 0 and 1,
>> >> >> then specify another column as the 'weight' that gives the total
>> >> >> number that the proportion is from (so a response of 0.3 and a
>> >> >> weight of 10 is the same as 3 'successes' and 7 'failures')."
>> >> >>
>> >> >> Either of the last two options would do for me.  Does scikit-learn
>> >> >> support either of these last two options?
>> >> >>
>> >> >> Raphael
>> >> >>
>> >> >> On 10 October 2016 at 11:55, Raphael C <drraph at gmail.com> wrote:
>> >> >> > I am trying to perform regression where my dependent variable is
>> >> >> > constrained to be between 0 and 1. This constraint comes from the
>> >> >> > fact that it represents a count proportion - that is, counts in
>> >> >> > some category divided by a total count.
>> >> >> >
>> >> >> > In the literature it seems that one common way to tackle this is
>> >> >> > to use logistic regression. However, it appears that in
>> >> >> > scikit-learn logistic regression is only available as a classifier
>> >> >> > (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).
>> >> >> > Is that right?
>> >> >> >
>> >> >> > Is there another way to perform regression using scikit-learn,
>> >> >> > where the dependent variable is a count proportion?
>> >> >> >
>> >> >> > Thanks for any help.
>> >> >> >
>> >> >> > Raphael
>> >> >> _______________________________________________
>> >> >> scikit-learn mailing list
>> >> >> scikit-learn at python.org
>> >> >> https://mail.python.org/mailman/listinfo/scikit-learn
>> >
>> >
>> >
>
>
>