[scikit-learn] CountVectorizer: Additional Feature Suggestion

Yacine MAZARI y.mazari at gmail.com
Sun Jan 28 02:31:16 EST 2018


Hi Jake,

Thanks for the quick reply.

What I meant is different from the TfidfVectorizer. Let me clarify:

In the TfidfVectorizer, the raw counts are multiplied by the IDF, which
basically means weighting the counts by inverse document frequency, tf * idf.
But still, tf is defined here as the raw count of a term in the document.

What I am suggesting is to add the possibility to use another definition
of tf: tf = relative frequency of a term in a document = raw counts /
document length.
On top of this, one could further normalize by IDF to get the TF-IDF (
https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Term_frequency_2).
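
For what it's worth, something very close to this can already be obtained
today by L1-normalizing the CountVectorizer output, e.g. with
sklearn.preprocessing.normalize. A minimal sketch (the toy documents are
made up):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.preprocessing import normalize

    docs = [
        "apple banana apple",
        "apple banana banana banana cherry cherry cherry cherry",
    ]

    # Raw term counts, shape (n_documents, n_terms).
    counts = CountVectorizer().fit_transform(docs)

    # Relative frequencies: L1-normalize each row so it sums to 1,
    # i.e. divide each count by the document length (number of tokens
    # kept by the vectorizer).
    rel_freq = normalize(counts, norm="l1", axis=1)

One could then feed rel_freq into TfidfTransformer to get the
length-normalized tf-idf variant; the proposal is essentially to make this
a built-in option of the vectorizer.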

When can this be useful? Here is an example:
Say term t occurs 5 times in document d1, and also 5 times in document d2.
At first glance, it seems that the term conveys the same information about
both documents. But if we also check document lengths, and find that the
length of d1 is 20 whereas the length of d2 is 200, then probably the
“importance” and information carried by the same term in the two documents
is not the same.
If we use relative frequency instead of absolute counts, then tf1=5/20=0.25
whereas tf2=5/200=0.025.
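
In code, that back-of-the-envelope comparison is simply:

    # Term t occurs 5 times in both documents.
    tf1 = 5 / 20     # d1 has 20 terms  -> 0.25
    tf2 = 5 / 200    # d2 has 200 terms -> 0.025

so the same raw count ends up ten times more "important" in the short
document.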

There are many practical cases (document similarity, document
classification, etc.) where using relative frequencies yields better
results, and it might be worth making CountVectorizer support this.

Regards,
Yacine.

On Sun, Jan 28, 2018 at 15:12 Jacob Vanderplas <jakevdp at cs.washington.edu>
wrote:

> Hi Yacine,
> If I'm understanding you correctly, I think what you have in mind is
> already implemented in scikit-learn in the TF-IDF vectorizer
> <http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html>
> .
>
> Best,
>    Jake
>
>  Jake VanderPlas
>  Senior Data Science Fellow
>  Director of Open Software
>  University of Washington eScience Institute
>
> On Sat, Jan 27, 2018 at 9:59 PM, Yacine MAZARI <y.mazari at gmail.com> wrote:
>
>> Hello,
>>
>> I would like to work on adding an additional feature to
>> "sklearn.feature_extraction.text.CountVectorizer".
>>
>> In the current implementation, the definition of term frequency is the
>> number of times a term t occurs in document d.
>>
>> However, another definition that is very commonly used in practice is the term
>> frequency adjusted for document length
>> <https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Term_frequency_2>, i.e.: tf
>> = raw counts / document length.
>>
>> I intend to implement this by adding an additional boolean parameter
>> "relative_frequency" to the constructor of CountVectorizer.
>> If the parameter is true, normalize X by document length (along axis=1) in
>> "CountVectorizer.fit_transform()".
>>
>> What do you think?
>> If this sounds reasonable and worth it, I will send a PR.
>>
>> Thank you,
>> Yacine.
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>