[scikit-learn] CountVectorizer: Additional Feature Suggestion

Mon Jan 29 10:39:35 EST 2018

Hi Folks,

Thank you all for the feedback and interesting discussion.

I do realize that adding a feature comes with risks, and that there should
really be compelling reasons to do so.

Let me try to address your comments here, and make one final case for the
value of this feature:

1) Use Normalizer, FunctionTransformer (or write custom code) to perform
normalization of the CountVectorizer result: That would require an additional
pass on the data. True, that's "only" O(N), but if there is a way to speed
up training an ML model, that'd be an advantage.

2) TfidfVectorizer(use_idf=False, norm='l1'): Yes, that would have the same
effect; but note that this is not TF-IDF any more, in that TF-IDF is a two-fold
normalization. If one needs TF-IDF (with document-length-normalized counts),
then 2 additional passes on the data (with TfidfVectorizer(use_idf=True)) would
be required to get IDF normalization, bringing us to a case similar to the
above.

3)
>> I wouldn't hate if length normalisation was added to TfidfTransformer,
>> if it was shown that normalising before IDF multiplication was more
>> effective than (or complementary to) norming afterwards.

I think this is one of the most important points here.
Though not a formal proof, I can for example refer to:

   - NLTK <http://www.nltk.org/_modules/nltk/text.html#TextCollection.tf>,
   which is using document-length-normalized term frequencies.

   -  Manning, Raghavan and Schütze's Introduction to Information Retrieval
   <https://nlp.stanford.edu/IR-book/html/htmledition/vector-space-classification-1.html>:
   "The same considerations that led us to prefer weighted representations, in
   particular length-normalized tf-idf representations, in Chapters 6
   <https://nlp.stanford.edu/IR-book/html/htmledition/scoring-term-weighting-and-the-vector-space-model-1.html#ch:tfidf>
   and 7
   <https://nlp.stanford.edu/IR-book/html/htmledition/computing-scores-in-a-complete-search-system-1.html#ch:cosine>
   also apply here."

On the other hand, applying this kind of normalization to a corpus where
the document lengths are similar (such as tweets) will probably not be of
any advantage.

4) This will be a handy feature as Sebastian mentioned, and the code change
will be very small (careful here...any code change brings risks).
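
To make points 1) to 3) concrete, here is a minimal sketch using only the
existing scikit-learn API. It is not the proposed change itself. The toy
corpus and all variable names are made up for illustration, and "normalize
before IDF" is emulated here by feeding L1-normalized counts to a
TfidfTransformer with norm=None, which is my assumption of what the
suggested behaviour would compute.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.preprocessing import Normalizer

# Toy corpus (illustrative only); document lengths differ on purpose.
docs = [
    "the cat sat",
    "the cat sat on the mat with the other cat",
    "dogs bark",
]

# Point 1: raw counts, then a second pass that L1-normalizes each row,
# i.e. divides every count by the document's total token count.
counts = CountVectorizer().fit_transform(docs)
tf_two_steps = Normalizer(norm="l1").fit_transform(counts)

# Point 2: a single estimator with IDF switched off, so only the
# L1 row normalization remains.
tf_one_step = TfidfVectorizer(use_idf=False, norm="l1").fit_transform(docs)

# Both routes should give the same length-normalized term frequencies.
same_tf = np.allclose(tf_two_steps.toarray(), tf_one_step.toarray())

# Point 3, variant A: length-normalize first, then multiply by IDF,
# with no further norming afterwards.
idf_only = TfidfTransformer(use_idf=True, norm=None)
tfidf_norm_before = idf_only.fit_transform(tf_two_steps)

# Point 3, variant B: what is available today in one step, i.e. multiply
# counts by IDF first and L1-normalize the rows afterwards.
tfidf_norm_after = TfidfTransformer(use_idf=True, norm="l1").fit_transform(counts)

print("Normalizer route == TfidfVectorizer(use_idf=False) route:", same_tf)
print("normalize before IDF:\n", tfidf_norm_before.toarray().round(3))
print("normalize after IDF:\n", tfidf_norm_after.toarray().round(3))
```

The two variants in point 3 generally differ: in variant B every row sums
to 1 after the IDF weighting, whereas in variant A the rows sum to 1 only
before it. Which of the two works better on a given corpus is exactly the
open question.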

What do you think?

Best regards,
Yacine.