[scikit-learn] Why does scikit-learn's HashingVectorizer give negative values?

Joel Nothman joel.nothman at gmail.com
Sat Oct 1 18:11:42 EDT 2016


Negative values are not really there to compensate for hash collisions.
They are there because the sign flipping makes the hashed vector space an
approximation of the full vector space under the inner product.
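To make that concrete, here is a small sketch (assuming a reasonably recent
scikit-learn; the tiny corpus and feature sizes are just for illustration):

    from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

    docs = ["the cat sat on the mat", "the cat and the dog sat"]

    # Inner product between the two documents in the full, explicit count space.
    counts = CountVectorizer().fit_transform(docs)
    print(counts[0].dot(counts[1].T).toarray())   # [[6]] for this toy corpus

    # Hashed features with the default alternating sign and no normalisation.
    # Colliding tokens are assigned +1/-1 at random, so the hashed inner
    # product is an unbiased estimate of the exact one above, rather than
    # being inflated by every collision.
    hashed = HashingVectorizer(n_features=2 ** 10, norm=None).transform(docs)
    print(hashed[0].dot(hashed[1].T).toarray())   # close to 6.0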

On 2 October 2016 at 00:17, Roman Yurchak <rth.yurchak at gmail.com> wrote:

> On 01/10/16 15:34, Moyi Dang wrote:
> > However, I don't understand why the negatives are there in the first
> > place, or what they mean. I'm not sure whether the absolute values
> > correspond to the token counts.
> >
> > Can someone please help explain what the HashingVectorizer is doing? How
> > do I get the HashingVectorizer to return token counts?
>
> Hi Moyi,
>
> It's a mechanism to compensate for hash collisions; see
> https://github.com/scikit-learn/scikit-learn/issues/7513. The absolute
> values are token counts for most practical applications (as long as you
> don't have too many collisions). There will be a PR shortly to make this
> more consistent.
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
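For the token-count question quoted above: on current releases you can turn
off both the sign flipping and the normalisation, and the entries are then
plain counts (tokens that hash to the same bucket simply have their counts
added together). The parameter names below postdate this thread, so treat it
as a sketch:

    from sklearn.feature_extraction.text import HashingVectorizer

    docs = ["the cat sat on the mat"]

    # alternate_sign=False keeps every entry non-negative; norm=None skips
    # the L2 normalisation, so each nonzero value is a raw token count
    # (colliding tokens, if any, sum into the shared bucket).
    vec = HashingVectorizer(n_features=2 ** 20, alternate_sign=False, norm=None)
    X = vec.transform(docs)
    print(sorted(X[0].data))   # [1.0, 1.0, 1.0, 1.0, 2.0] -> 'the' appears twice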