[scikit-learn] Query regarding parameter class_weight in Random Forest Classifier

Josh Vredevoogd cleverless at gmail.com
Sun Jan 22 23:26:23 EST 2017


If you undersample, taking only 10% of the negative class, the classifier
will see different combinations of attributes and produce a different fit
to explain those distributions. In the worst case, imagine you are
classifying birds and your sampling happens to eliminate all `red` examples:
your classifier will likely no longer learn that red objects can be birds.
That's an overly simple example, but given a classifier capable of
exploring and explaining feature combinations, less obvious versions of
this are bound to happen.
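
As a rough illustration (a sketch on synthetic data, not your dataset):
undersampling one class and then re-weighting it generally does not
reproduce the fit you get on the full data, because the dropped samples are
simply never seen:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Hypothetical imbalanced data: roughly 1% positives.
    X, y = make_classification(n_samples=20000, weights=[0.99, 0.01],
                               random_state=0)

    # Fit on the full data, with no class weights.
    full = RandomForestClassifier(random_state=0).fit(X, y)

    # Keep only 10% of the negative class, then give it 10x weight.
    neg = np.where(y == 0)[0]
    keep = np.concatenate([neg[:len(neg) // 10], np.where(y == 1)[0]])
    sub = RandomForestClassifier(class_weight={0: 10, 1: 1},
                                 random_state=0).fit(X[keep], y[keep])

    # The two forests typically disagree on a noticeable fraction of
    # points, because the second one never saw 90% of the negatives.
    print((full.predict(X) != sub.predict(X)).mean())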

The equivalence only works in the other direction: if you manually
duplicate samples by the weighting factor, you should get the exact same
fit as if you had increased the class weight (see the sketch below).
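
A minimal sketch on made-up data. A single decision tree is used so that
forest bootstrapping doesn't add extra randomness; also note that
non-default stopping criteria such as min_samples_leaf count raw rows,
which can break the equivalence:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=5000, weights=[0.9, 0.1],
                               random_state=0)

    # Fit with a 10x weight on the positive class.
    weighted = DecisionTreeClassifier(class_weight={0: 1, 1: 10},
                                      random_state=0).fit(X, y)

    # Fit on data where every positive sample appears 10 times.
    reps = np.where(y == 1, 10, 1)
    duplicated = DecisionTreeClassifier(random_state=0)
    duplicated.fit(np.repeat(X, reps, axis=0), np.repeat(y, reps))

    # With default settings the two trees should make identical
    # predictions.
    print(np.array_equal(weighted.predict(X), duplicated.predict(X)))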

Hope that helps,
Josh


On Sun, Jan 22, 2017 at 5:00 AM, Debabrata Ghosh <mailfordebu at gmail.com>
wrote:

> Thanks Josh !
>
> I have used the parameter class_weight={0: 1, 1: 10} and the model code
> ran successfully. However, just to get further clarity around its concept,
> I have another question for you. I ran the following 2 tests:
>
> 1. In my dataset, I have 1 million negative classes and 10,000 positive
> classes. First, I ran my model code without supplying any class_weight
> parameter, and it gave me certain True Positive and False Positive results.
>
> 2. In the second test, I kept the same 1 million negative classes but
> reduced the positive classes to 1,000. This time, I supplied the parameter
> class_weight={0: 1, 1: 10} and again recorded the True Positive and False
> Positive results.
>
> My question is: when I multiply the results obtained from my second test
> by a factor of 10, they don't match the results obtained from my first
> test. For example, against a given threshold the true positives from the
> second test are 8, while the true positives from the first test against
> the same threshold are 260. I observe the same for the false positives:
> if I multiply the second test's results by 10, I don't come close to the
> first test's results.
>
> Is my expectation correct? Is my way of executing the test (i.e., reducing
> the positive classes by a factor of 10 and then giving the positive class
> 10 times the weight of the negative class) and comparing the results with
> a model run without any class_weight parameter valid?
>
> Please let me know at your convenience, as this will help me greatly in
> understanding the concept.
>
> Thanks in advance !
>
> On Sun, Jan 22, 2017 at 1:56 AM, Josh Vredevoogd <cleverless at gmail.com>
> wrote:
>
>> The class_weight parameter doesn't behave the way you're expecting.
>>
>> The value in class_weight is the weight applied to each sample in that
>> class - in your example, each class zero sample has weight 0.001 and each
>> class one sample has weight 0.999, so each class one sample carries 999
>> times the weight of a class zero sample.
>>
>> If you would like each class one sample to have ten times the weight, you
>> would set `class_weight={0: 1, 1: 10}` or, equivalently,
>> `class_weight={0: 0.1, 1: 1}`.
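>>
>> As a quick sanity check (a small sketch on synthetic data, not from your
>> run), the two settings should give identical forests for the same
>> random_state, because only the ratio between the weights matters:
>>
>>     from sklearn.datasets import make_classification
>>     from sklearn.ensemble import RandomForestClassifier
>>
>>     X, y = make_classification(n_samples=2000, weights=[0.9, 0.1],
>>                                random_state=0)
>>
>>     a = RandomForestClassifier(class_weight={0: 1, 1: 10},
>>                                random_state=0).fit(X, y)
>>     b = RandomForestClassifier(class_weight={0: 0.1, 1: 1},
>>                                random_state=0).fit(X, y)
>>
>>     # Proportional weights plus the same seed -> identical predictions.
>>     print((a.predict(X) == b.predict(X)).all())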
>>
>>
>> On Sat, Jan 21, 2017 at 10:18 AM, Debabrata Ghosh <mailfordebu at gmail.com>
>> wrote:
>>
>>> Hi All,
>>>              Greetings !
>>>
>>>               I have a very basic question regarding the usage of the
>>> class_weight parameter in scikit-learn's Random Forest Classifier.
>>>
>>>               I have a fairly unbalanced sample, with a positive class
>>> to negative class ratio of 1:100. In other words, I have a million
>>> records corresponding to the negative class and 10,000 records
>>> corresponding to the positive class. I have successfully trained the
>>> random forest classifier model on the above record set.
>>>
>>>               Further, for a different problem, I want to test the
>>> class_weight parameter. So I am setting class_weight={0: 0.001, 1: 0.999}
>>> and I have tried running my model on the same dataset as mentioned in the
>>> above paragraph, but with the positive class records reduced to 1,000
>>> [because now each positive class record is given approximately 10 times
>>> more weight than a negative class record]. However, the results are very
>>> different between the 2 runs (with and without class_weight), whereas I
>>> expected similar results.
>>>
>>>                 Would you please be able to let me know where I am going
>>> wrong? I know it's something silly, but I just want to improve my
>>> understanding of the concept.
>>>
>>> Thanks !
>>>


More information about the scikit-learn mailing list