[scikit-learn] Why does sci-kit learn's hashingvectorizer give negative values?

Roman Yurchak rth.yurchak at gmail.com
Sat Oct 1 10:17:40 EDT 2016


On 01/10/16 15:34, Moyi Dang wrote:
> However, I don't understand why the negatives are there in the first
> place, or what they mean. I'm not sure if the absolute values are
> corresponding to the token counts.
> 
> Can someone please help explain what the HashingVectorizer is doing? How
> do I get the HashingVectorizer to return token counts?

Hi Moyi,

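It's a mechanism to compensate for hash collisions; see
https://github.com/scikit-learn/scikit-learn/issues/7513. Each token
gets a sign (+1 or -1) from the hash function, so tokens that collide
in the same column tend to cancel out instead of accumulating. The
absolute values are token counts for most practical applications (if
you don't have too many collisions). There will be a PR shortly to make
this more consistent.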
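
To make the sign trick concrete, here is a minimal pure-Python sketch of
signed feature hashing. It is only an illustration, not scikit-learn's
actual code: the helper names (_hash32, signed_hash_counts) are made up,
and MD5 stands in for the MurmurHash3 function that scikit-learn uses, so
the column indices and signs will not match what HashingVectorizer
produces.

```python
import hashlib
from collections import defaultdict


def _hash32(token):
    # Stand-in 32-bit hash built from MD5 (an assumption for this sketch;
    # scikit-learn itself uses MurmurHash3).
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little")


def signed_hash_counts(tokens, n_features=16):
    # Hypothetical helper: map each token to a column and add +1 or -1.
    vec = defaultdict(int)
    for tok in tokens:
        h = _hash32(tok)
        index = (h & 0x7FFFFFFF) % n_features   # low 31 bits pick the column
        sign = -1 if (h >> 31) & 1 else 1       # top bit picks the sign
        vec[index] += sign
    return dict(vec)


if __name__ == "__main__":
    # One distinct token repeated three times: a single column holding
    # either +3 or -3, so the absolute value is the token count.
    print(signed_hash_counts(["spam", "spam", "spam"]))
```

Without collisions, each column holds plus or minus the count of a single
token, which is why taking the absolute value recovers the counts. When
two tokens land in the same column with opposite signs they partly cancel,
and the absolute value no longer equals either count.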



