[scikit-learn] CountVectorizer: Additional Feature Suggestion

Joel Nothman joel.nothman at gmail.com
Mon Jan 29 15:27:42 EST 2018


I don't think you can do this without an O(N) cost. The fact that it's
done in a second pass is moot.

My position stands: if this change happens, it should be to
TfidfTransformer (which should perhaps be called something like
CountVectorWeighter!) alone.
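
For what it's worth, a rough sketch of how that could look (the class
name is hypothetical, and this is an illustration rather than a concrete
API proposal):

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import normalize

class LengthNormalizingTfidfTransformer(TfidfTransformer):
    """Hypothetical sketch: l1-normalize raw counts before IDF weighting."""

    def fit(self, X, y=None):
        # IDF depends only on document frequencies, which l1 row
        # normalization preserves, so fitting on normalized counts is safe.
        return super().fit(normalize(X, norm='l1'), y)

    def transform(self, X, copy=True):
        # Divide each row by the document length, then weight by IDF as usual.
        return super().transform(normalize(X, norm='l1'), copy=copy)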

On 30 January 2018 at 02:39, Yacine MAZARI <y.mazari at gmail.com> wrote:

> Hi Folks,
>
> Thank you all for the feedback and interesting discussion.
>
> I do realize that adding a feature comes with risks, and that there should
> really be compelling reasons to do so.
>
> Let me try to address your comments here, and make one final case for the
> value of this feature:
>
> 1) Use Normalizer, FunctionTransformer (or custom code) to normalize the
> CountVectorizer result: that would require an additional pass over the
> data. True, that's "only" O(N), but if there is a way to speed up
> training an ML model, that'd be an advantage.
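>
> A minimal sketch of that route with the existing API ("docs" stands for
> an iterable of raw text documents):
>
> from sklearn.pipeline import make_pipeline
> from sklearn.feature_extraction.text import CountVectorizer
> from sklearn.preprocessing import Normalizer
>
> # Second pass: divide each count row by the document length (l1 norm).
> pipe = make_pipeline(CountVectorizer(), Normalizer(norm='l1'))
> X = pipe.fit_transform(docs)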
>
> 2) TfidfVectorizer(use_idf=False, norm='l1'): Yes, that would have the
> same effect; but note that this is no longer TF-IDF, in that TF-IDF is a
> two-fold normalization. If one needs TF-IDF (with normalized document
> counts), then two additional passes over the data (with
> TfidfVectorizer(use_idf=True)) would be required to get IDF
> normalization, bringing us to a case similar to the above.
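>
> For comparison, that variant is just:
>
> from sklearn.feature_extraction.text import TfidfVectorizer
>
> # Length-normalized term frequencies, with no IDF weighting at all.
> vec = TfidfVectorizer(use_idf=False, norm='l1')
> X = vec.fit_transform(docs)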
>
> 3)
> >> I wouldn't hate if length normalisation was added to TfidfTransformer,
> >> if it was shown that normalising before IDF multiplication was more
> >> effective than (or complementary to) norming afterwards.
> I think this is one of the most important points here.
> Though not a formal proof, I can for example refer to:
>
>    - NLTK <http://www.nltk.org/_modules/nltk/text.html#TextCollection.tf>,
>    which uses document-length-normalized term frequencies.
>
>    - Manning, Raghavan, and Schütze's Introduction to Information Retrieval
>    <https://nlp.stanford.edu/IR-book/html/htmledition/vector-space-classification-1.html>:
>    "The same considerations that led us to prefer weighted representations, in
>    particular length-normalized tf-idf representations, in Chapters 6
>    <https://nlp.stanford.edu/IR-book/html/htmledition/scoring-term-weighting-and-the-vector-space-model-1.html#ch:tfidf>
>    and 7
>    <https://nlp.stanford.edu/IR-book/html/htmledition/computing-scores-in-a-complete-search-system-1.html#ch:cosine>
>    also apply here."
>
> On the other hand, applying this kind of normalization to a corpus where
> the document lengths are similar (such as tweets) will probably not be of
> any advantage.
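>
> To make the "normalize before IDF" variant concrete, here is a sketch
> with the current API (two extra passes, as noted above):
>
> from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
> from sklearn.preprocessing import normalize
>
> counts = CountVectorizer().fit_transform(docs)
> tf = normalize(counts, norm='l1')  # tf(t, d) = count / document length
> # Multiply length-normalized tf by IDF; skip the final cosine norm.
> X = TfidfTransformer(norm=None).fit_transform(tf)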
>
> 4) As Sebastian mentioned, this would be a handy feature, and the code
> change would be very small (though careful here... any code change
> brings risks).
>
> What do you think?
>
> Best regards,
> Yacine.
>