[scikit-learn] CountVectorizer: Additional Feature Suggestion

Sebastian Raschka se.raschka at gmail.com
Sun Jan 28 04:31:38 EST 2018


Hi, Yacine,

Just as a side note, you can set use_idf=False in the TfidfVectorizer and only normalize the vectors by their L2 norm.
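For instance, a minimal sketch of that (the IDF weighting is disabled, so only the L2 length normalization is applied to the raw counts):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

dataset = ['The sun is shining and the weather is sweet',
           'Hello World. The sun is shining and the weather is sweet']

# use_idf=False disables the IDF weighting; norm='l2' (the default)
# then rescales each document row to unit Euclidean length.
vect = TfidfVectorizer(use_idf=False, norm='l2')
X = vect.fit_transform(dataset)
```

Each row of X then has unit L2 norm, so documents of different lengths become directly comparable by dot product.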

But yeah, the normalization you suggest might be really handy in certain cases. I am not sure though if it's worth making this another parameter in the CountVectorizer (which already has quite a lot of parameters), as it can be computed quite easily if I am not misinterpreting something. Since the length of each document equals the sum of the word counts in its vector, one could simply normalize by the document length as follows:

> from sklearn.feature_extraction.text import CountVectorizer
> dataset = ['The sun is shining and the weather is sweet',
>            'Hello World. The sun is shining and the weather is sweet']
> 
> vect = CountVectorizer()
> vect.fit(dataset)
> transf = vect.transform(dataset)
> normalized_word_vectors = transf / transf.sum(axis=1)


It would get tricky, though, if you remove stop words during preprocessing but want to include them in the normalization. Then, you might have to do something like this:

> from sklearn.feature_extraction.text import CountVectorizer
> import numpy as np
> 
> dataset = ['The sun is shining and the weather is sweet',
>            'Hello World. The sun is shining and the weather is sweet']
> 
> counts = np.array([len(s.split()) for s in dataset]).reshape(-1, 1)
> vect = CountVectorizer(stop_words='english')
> vect.fit(dataset)
> transf = vect.transform(dataset)
> transf / counts
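One more sketch, assuming "document length" is taken to mean the sum of the counted tokens: the relative frequency is then the same thing as L1-normalizing each count row, which TfidfTransformer already supports via norm='l1' with use_idf=False, without touching the CountVectorizer:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

dataset = ['The sun is shining and the weather is sweet',
           'Hello World. The sun is shining and the weather is sweet']

counts = CountVectorizer().fit_transform(dataset)

# norm='l1' divides each row by the sum of its (non-negative) counts,
# i.e. each entry becomes raw count / number of counted tokens.
rel_freq = TfidfTransformer(use_idf=False, norm='l1').fit_transform(counts)

# each row of rel_freq now sums to 1
row_sums = np.asarray(rel_freq.sum(axis=1)).ravel()
```

Of course this only normalizes by the number of tokens that survive tokenization and stop-word filtering, which is exactly the caveat above.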


Best,
Sebastian





> On Jan 27, 2018, at 11:31 PM, Yacine MAZARI <y.mazari at gmail.com> wrote:
> 
> Hi Jake,
> 
> Thanks for the quick reply.
> 
> What I meant is different from the TfIdfVectorizer. Let me clarify:
> 
> In the TfIdfVectorizer, the raw counts are multiplied by IDF, which basically means normalizing the counts by document frequencies, tf * idf.
> But still, tf is defined here as the raw count of a term in the document.
> 
> What I am suggesting is to add the possibility to use another definition of tf: tf = relative frequency of a term in a document = raw counts / document length.
> On top of this, one could further normalize by IDF to get the TF-IDF (https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Term_frequency_2).
> 
> When can this be useful? Here is an example:
> Say term t occurs 5 times in document d1, and also 5 times in document d2.
> At first glance, it seems that the term conveys the same information about both documents. But if we also check document lengths, and find that the length of d1 is 20 whereas the length of d2 is 200, then probably the "importance" and information carried by the same term in the two documents is not the same.
> If we use relative frequency instead of absolute counts, then tf1 = 5/20 = 0.25 whereas tf2 = 5/200 = 0.025.
> 
> There are many practical cases (document similarity, document classification, etc.) where using relative frequencies yields better results, and it might be worth making the CountVectorizer support this.
> 
> Regards,
> Yacine.
> 
> On Sun, Jan 28, 2018 at 15:12 Jacob Vanderplas <jakevdp at cs.washington.edu> wrote:
> Hi Yacine,
> If I'm understanding you correctly, I think what you have in mind is already implemented in scikit-learn in the TF-IDF vectorizer.
> 
> Best,
>    Jake
> 
>  Jake VanderPlas
>  Senior Data Science Fellow
>  Director of Open Software
>  University of Washington eScience Institute
> 
> On Sat, Jan 27, 2018 at 9:59 PM, Yacine MAZARI <y.mazari at gmail.com> wrote:
> Hello,
> 
> I would like to work on adding an additional feature to "sklearn.feature_extraction.text.CountVectorizer".
> 
> In the current implementation, the definition of term frequency is the number of times a term t occurs in document d.
> 
> However, another definition that is very commonly used in practice is the term frequency adjusted for document length, i.e: tf = raw counts / document length.
> 
> I intend to implement this by adding an additional boolean parameter "relative_frequency" to the constructor of CountVectorizer.
> If the parameter is true, normalize X by document length (along axis=1) in "CountVectorizer.fit_transform()".
> 
> What do you think?
> If this sounds reasonable and worth it, I will send a PR.
> 
> Thank you,
> Yacine.
> 
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
> 
