[scikit-learn] CountVectorizer: Additional Feature Suggestion

Yacine MAZARI y.mazari at gmail.com
Tue Jan 30 10:19:32 EST 2018


Okay, thanks for the replies.

@Joel: Should I go ahead and send a PR with the change to TfidfTransformer?

On Tue, Jan 30, 2018 at 5:27 AM, Joel Nothman <joel.nothman at gmail.com>
wrote:

> I don't think you can do this without an O(N) cost. The fact that it's
> done with a second pass is moot.
>
> My position stands: if this change happens, it should be to
> TfidfTransformer (which should perhaps be called something like
> CountVectorWeighter!) alone.
>
> On 30 January 2018 at 02:39, Yacine MAZARI <y.mazari at gmail.com> wrote:
>
>> Hi Folks,
>>
>> Thank you all for the feedback and interesting discussion.
>>
>> I do realize that adding a feature comes with risks, and that there
>> should really be compelling reasons to do so.
>>
>> Let me try to address your comments here, and make one final case for the
>> value of this feature:
>>
>> 1) Using Normalizer, FunctionTransformer (or custom code) to normalize
>> the CountVectorizer result: that would require an additional pass over
>> the data. True, that's "only" O(N), but if there is a way to speed up
>> training an ML model, that would be an advantage.
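>>
>> For concreteness, here is a minimal sketch of that extra pass (the toy
>> corpus is made up purely for illustration):
>>
>> from sklearn.feature_extraction.text import CountVectorizer
>> from sklearn.preprocessing import Normalizer
>>
>> docs = ["a short document", "a somewhat longer document over here"]
>> counts = CountVectorizer().fit_transform(docs)     # first pass: raw counts
>> tf = Normalizer(norm='l1').fit_transform(counts)   # second O(N) pass: row-wise L1 scaling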
>>
>> 2) TfidfVectorizer(use_idf=False, norm='l1'): yes, that would have the
>> same effect; but note that this is no longer TF-IDF, since TF-IDF is a
>> two-fold normalization. If one needs TF-IDF proper (with normalized
>> document counts), then two additional passes over the data (with
>> TfidfVectorizer(use_idf=True)) would be required to get the IDF
>> weighting, bringing us back to a case similar to the above.
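>>
>> As a sketch (again on a made-up toy corpus):
>>
>> from sklearn.feature_extraction.text import TfidfVectorizer
>>
>> docs = ["a short document", "a somewhat longer document over here"]
>>
>> # Length-normalized term frequencies only -- no IDF, so not TF-IDF proper:
>> tf_l1 = TfidfVectorizer(use_idf=False, norm='l1').fit_transform(docs)
>>
>> # Getting the IDF weighting as well means vectorizing again, i.e.
>> # additional passes over the data:
>> tfidf = TfidfVectorizer(use_idf=True).fit_transform(docs)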
>>
>> 3)
>> >> I wouldn't hate it if length normalisation was added to TfidfTransformer,
>> >> if it was shown that normalising before IDF multiplication was more
>> >> effective than (or complementary to) norming afterwards.
>> I think this is one of the most important points here.
>> Though not a formal proof, I can, for example, refer to:
>>
>>    - NLTK <http://www.nltk.org/_modules/nltk/text.html#TextCollection.tf>,
>>      which uses document-length-normalized term frequencies.
>>
>>    - Manning, Raghavan, and Schütze's Introduction to Information Retrieval
>>      <https://nlp.stanford.edu/IR-book/html/htmledition/vector-space-classification-1.html>:
>>      "The same considerations that led us to prefer weighted representations, in
>>      particular length-normalized tf-idf representations, in Chapters 6
>>      <https://nlp.stanford.edu/IR-book/html/htmledition/scoring-term-weighting-and-the-vector-space-model-1.html#ch:tfidf>
>>      and 7
>>      <https://nlp.stanford.edu/IR-book/html/htmledition/computing-scores-in-a-complete-search-system-1.html#ch:cosine>
>>      also apply here."
>>
>> On the other hand, applying this kind of normalization to a corpus where
>> document lengths are similar (such as tweets) will probably offer no
>> advantage.
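>>
>> For anyone wanting to try "normalize before IDF" today, here is a rough
>> sketch that emulates it with existing components (not the proposed
>> implementation, just an approximation; parameters are left at their
>> defaults):
>>
>> from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
>> from sklearn.pipeline import make_pipeline
>> from sklearn.preprocessing import Normalizer
>>
>> docs = ["a short document", "a somewhat longer document over here"]
>>
>> # L1-normalize raw counts by document length first, then apply IDF.
>> # (IDF depends only on which entries are non-zero, so it is unaffected
>> # by the row scaling.)
>> pipe = make_pipeline(CountVectorizer(),
>>                      Normalizer(norm='l1'),
>>                      TfidfTransformer(use_idf=True))
>> X = pipe.fit_transform(docs)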
>>
>> 4) As Sebastian mentioned, this would be a handy feature, and the code
>> change would be very small (though, granted, any code change carries risk).
>>
>> What do you think?
>>
>> Best regards,
>> Yacine.
>>
>>
>>