[scikit-learn] Why does scikit-learn's HashingVectorizer give negative values?

Moyi Dang moyi.dang at gmail.com
Sat Oct 1 09:34:10 EDT 2016


Hi,

I'm trying to make the HashingVectorizer work for online learning. To
do this, I need it to give actual token counts.

The HashingVectorizer in scikit-learn doesn't give token counts by
default; it returns normalized values (norm='l2' by default, or 'l1').

I need the raw token counts, so I set norm=None. After doing this, I no
longer get decimals, but I still get negative numbers.

It seems the negatives can be removed by setting non_negative=True,
which takes the absolute value of each entry.

However, I don't understand why the negatives are there in the first
place, or what they mean. I'm also not sure whether the absolute values
correspond to the token counts.

Can someone please help explain what the HashingVectorizer is doing?
How do I get the HashingVectorizer to return token counts?

You can replicate my results with the following code. I'm using the
20newsgroups dataset, which comes with scikit-learn:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import HashingVectorizer

twenty_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)

# default norm='l2': each row is scaled to unit l2 norm, so the values
# are decimals (both positive and negative)
cv = HashingVectorizer(stop_words='english')
X_train = cv.fit_transform(twenty_train.data)
print(X_train)

# produces integer-valued results, both positive and negative
cv = HashingVectorizer(stop_words='english', norm=None)
X_train = cv.fit_transform(twenty_train.data)
print(X_train)

# produces only non-negative results, but not sure if they correspond to counts
cv = HashingVectorizer(stop_words='english', norm=None, non_negative=True)
X_train = cv.fit_transform(twenty_train.data)
print(X_train)
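
A note on where the signs might come from: the output looks consistent
with a "signed" hashing trick, where a hash of each token picks both a
column and a sign (+1 or -1). The sketch below is only a toy
illustration of that idea in plain Python. The function name
signed_hash_counts and the use of MD5 are my own assumptions for the
example; this is not scikit-learn's actual code or hash function.

```python
import hashlib


def signed_hash_counts(tokens, n_features=2 ** 20):
    """Toy signed hashing trick (illustration only, not scikit-learn's code)."""
    row = {}  # sparse row: column index -> accumulated signed count
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # Treat the first 4 bytes as a signed 32-bit integer hash.
        h = int.from_bytes(digest[:4], "little", signed=True)
        index = abs(h) % n_features   # the magnitude picks the column
        sign = 1 if h >= 0 else -1    # the hash sign picks +1 or -1
        row[index] = row.get(index, 0) + sign
    return row


tokens = "the cat sat on the mat the end".split()
row = signed_hash_counts(tokens)
print(row)  # some values may be negative
print({i: abs(v) for i, v in row.items()})  # magnitudes only
```

In this toy version, a token that always hashes to a negative value
contributes -1 per occurrence, so the absolute value of an entry equals
the token count only as long as no two different tokens land in the same
column. With few columns, tokens with opposite signs can collide and
partly cancel, so the absolute value is then not a true count. If I read
the newer documentation correctly, later scikit-learn releases expose
this through an alternate_sign parameter (alternate_sign=False) in place
of non_negative, but please check that against your installed version.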