[scikit-learn] CountVectorizer: Additional Feature Suggestion

Sebastian Raschka se.raschka at gmail.com
Sun Jan 28 04:36:47 EST 2018


Good point Joel, and I actually forgot that you can set the norm parameter in TfidfVectorizer, so one could simply do

vect = TfidfVectorizer(use_idf=False, norm='l1')

to get the CountVectorizer behavior, but with counts normalized by document length.
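For instance, a minimal sketch (the sample documents here are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["apple banana apple", "banana cherry"]

# Raw term counts, as CountVectorizer produces them.
counts = CountVectorizer().fit_transform(docs)
print(counts.toarray())  # [[2 1 0]
                         #  [0 1 1]]

# Same counts, but each row divided by the document length (L1 norm).
rel = TfidfVectorizer(use_idf=False, norm='l1').fit_transform(docs)
print(rel.toarray())     # first row [2/3, 1/3, 0]; each row sums to 1
```

Both produce the same vocabulary; only the scaling of each row differs.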

Best,
Sebastian

> On Jan 28, 2018, at 1:29 AM, Joel Nothman <joel.nothman at gmail.com> wrote:
> 
> sklearn.preprocessing.Normalizer allows you to normalize any vector by its L1 or L2 norm. L1 would be equivalent to "document length" as long as you did not intend to count stop words in the length. sklearn.feature_extraction.text.TfidfTransformer offers similar norming, but does so only after accounting for IDF or TF transformation. Since the length normalisation transformation is stateless, it can also be computed with a sklearn.preprocessing.FunctionTransformer.
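> As a sketch of that equivalence (the sample documents are made up):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, Normalizer, normalize

docs = ["apple banana apple", "banana cherry"]

# CountVectorizer followed by a stateless L1 normalization: each row of raw
# counts is divided by the document length (the sum of its token counts).
pipe = make_pipeline(CountVectorizer(), Normalizer(norm='l1'))
rel = pipe.fit_transform(docs).toarray()
print(rel)  # each row sums to 1

# Because the transformation is stateless, a FunctionTransformer works too.
ft = FunctionTransformer(lambda X: normalize(X, norm='l1'))
pipe2 = make_pipeline(CountVectorizer(), ft)
assert np.allclose(pipe2.fit_transform(docs).toarray(), rel)
```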
> 
> I can't say it's especially obvious that these features are available, and improvements to the documentation are welcome, but CountVectorizer is complicated enough and we would rather avoid more parameters if we can. I wouldn't hate it if length normalisation were added to TfidfTransformer, if it were shown that normalising before IDF multiplication was more effective than (or complementary to) norming afterwards.
> 
> On 28 January 2018 at 18:31, Yacine MAZARI <y.mazari at gmail.com> wrote:
> Hi Jake,
> 
> Thanks for the quick reply.
> 
> What I meant is different from the TfIdfVectorizer. Let me clarify:
> 
> In the TfIdfVectorizer, the raw counts are multiplied by IDF, which basically means normalizing the counts by document frequencies, tf * idf.
> But still, tf is defined here as the raw count of a term in the document.
> 
> What I am suggesting is to add the possibility to use another definition of tf: tf = relative frequency of a term in a document = raw count / document length.
> On top of this, one could further normalize by IDF to get the TF-IDF (https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Term_frequency_2).
> 
> When can this be useful? Here is an example:
> Say term t occurs 5 times in document d1, and also 5 times in document d2.
> At first glance, it seems that the term conveys the same information about both documents. But if we also check document lengths, and find that the length of d1 is 20 whereas the length of d2 is 200, then probably the “importance” and information carried by the same term in the two documents are not the same.
> If we use relative frequency instead of absolute counts, then tf1=5/20=0.25 whereas tf2=5/200=0.025.
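> Spelled out in code (the counts and lengths are the hypothetical numbers from the example above):

```python
# Same raw count in both documents, but very different document lengths.
count = 5
tf_d1 = count / 20    # 0.25
tf_d2 = count / 200   # 0.025
print(tf_d1, tf_d2)   # the term carries 10x more relative weight in d1
```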
> 
> There are many practical cases (document similarity, document classification, etc.) where using relative frequencies yields better results, and it might be worth making CountVectorizer support this.
> 
> Regards,
> Yacine.
> 
> On Sun, Jan 28, 2018 at 15:12 Jacob Vanderplas <jakevdp at cs.washington.edu> wrote:
> Hi Yacine,
> If I'm understanding you correctly, I think what you have in mind is already implemented in scikit-learn in the TF-IDF vectorizer.
> 
> Best,
>    Jake
> 
>  Jake VanderPlas
>  Senior Data Science Fellow
>  Director of Open Software
>  University of Washington eScience Institute
> 
> On Sat, Jan 27, 2018 at 9:59 PM, Yacine MAZARI <y.mazari at gmail.com> wrote:
> Hello,
> 
> I would like to work on adding an additional feature to "sklearn.feature_extraction.text.CountVectorizer".
> 
> In the current implementation, the definition of term frequency is the number of times a term t occurs in document d.
> 
> However, another definition that is very commonly used in practice is the term frequency adjusted for document length, i.e: tf = raw counts / document length.
> 
> I intend to implement this by adding an additional boolean parameter "relative_frequency" to the constructor of CountVectorizer.
> If the parameter is True, X would be normalized by document length (along axis=1) in "CountVectorizer.fit_transform()".
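> Roughly, the proposed behavior could be sketched today outside CountVectorizer like this (the "relative_frequency" parameter itself is hypothetical and not part of scikit-learn; the sample documents are made up):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["apple banana apple", "banana cherry"]
X = CountVectorizer().fit_transform(docs)

# Divide each row by the document length (its row sum), i.e. along axis=1.
doc_lengths = np.asarray(X.sum(axis=1)).ravel()
X_rel = X.multiply(1.0 / doc_lengths[:, np.newaxis]).tocsr()
print(X_rel.toarray())  # [[2/3 1/3 0], [0 0.5 0.5]]
```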
> 
> What do you think?
> If this sounds reasonable and worth it, I will send a PR.
> 
> Thank you,
> Yacine.
> 
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
> 


