[scikit-learn] CountVectorizer: Additional Feature Suggestion

Joel Nothman joel.nothman at gmail.com
Sun Jan 28 04:56:28 EST 2018


That's equivalent to Normalizer(norm='l1') or
FunctionTransformer(sklearn.preprocessing.normalize, kw_args={'norm': 'l1'}).

The problem is that length norm followed by TfidfTransformer now can't do
sublinear TF right... But that's alright if we know we can always do
FunctionTransformer(lambda X: calc_sublinear(X) / X.sum(axis=1, keepdims=True)),
perhaps
then followed by applying IDF from TfidfTransformer.

Yes, it's not straightforward, but it's very hard to provide a library that
suits everyone's needs... so FunctionTransformer and Pipeline are your
friends :)
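As a concrete sketch of the above (toy corpus assumed), the length
normalisation can be done with either Normalizer or FunctionTransformer,
and the two are interchangeable:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, Normalizer, normalize

docs = ["the cat sat", "the cat sat on the mat"]

# L1-normalising each row of the count matrix divides every count
# by the document's total token count (its "length").
pipe = make_pipeline(CountVectorizer(), Normalizer(norm="l1"))
X = pipe.fit_transform(docs)
print(X.sum(axis=1))  # each row sums to 1

# The same stateless step expressed as a FunctionTransformer.
ft = FunctionTransformer(normalize, kw_args={"norm": "l1"})
Xf = ft.fit_transform(CountVectorizer().fit_transform(docs))
```

(`calc_sublinear` above is the hypothetical sublinear-TF step from the
message; it is not part of scikit-learn.)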

On 28 January 2018 at 20:36, Sebastian Raschka <se.raschka at gmail.com> wrote:

> Good point Joel, and I actually forgot that you can set the norm param in
> the TfidfVectorizer, so one could basically do
>
> vect = TfidfVectorizer(use_idf=False, norm='l1')
>
> to have the CountVectorizer behavior but normalizing by the document
> length.
>
> Best,
> Sebastian
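A quick check on a toy corpus confirms that this one-liner matches the raw
counts divided by each document's tokenised length:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "the cat sat"]

# Raw counts divided by each document's length (number of extracted tokens)...
counts = CountVectorizer().fit_transform(docs).toarray()
rel = counts / counts.sum(axis=1, keepdims=True)

# ...match TfidfVectorizer with IDF switched off and an L1 norm.
vect = TfidfVectorizer(use_idf=False, norm="l1")
X = vect.fit_transform(docs).toarray()
print(np.allclose(X, rel))  # True
```

Note that "length" here means the number of tokens the vectorizer actually
extracts, so stop words or single-character tokens removed by the tokenizer
do not count towards it.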
>
> > On Jan 28, 2018, at 1:29 AM, Joel Nothman <joel.nothman at gmail.com>
> wrote:
> >
> > sklearn.preprocessing.Normalizer allows you to normalize any vector by
> its L1 or L2 norm. L1 would be equivalent to "document length" as long as
> you did not intend to count stop words in the length.
> sklearn.feature_extraction.text.TfidfTransformer offers similar norming,
> but does so only after accounting for IDF or TF transformation. Since the
> length normalisation transformation is stateless, it can also be computed
> with a sklearn.preprocessing.FunctionTransformer.
> >
> > I can't say it's especially obvious that these features are available, and
> improvements to the documentation are welcome, but CountVectorizer is
> complicated enough and we would rather avoid more parameters if we can. I
> wouldn't hate if length normalisation was added to TfidfTransformer, if it
> was shown that normalising before IDF multiplication was more effective
> than (or complementary to) norming afterwards.
> >
> > On 28 January 2018 at 18:31, Yacine MAZARI <y.mazari at gmail.com> wrote:
> > Hi Jake,
> >
> > Thanks for the quick reply.
> >
> > What I meant is different from the TfIdfVectorizer. Let me clarify:
> >
> > In the TfIdfVectorizer, the raw counts are multiplied by IDF, which
> basically means normalizing the counts by document frequencies, tf * idf.
> > But still, tf is defined here as the raw count of a term in the document.
> >
> > What I am suggesting is to add the possibility to use another
> definition of tf: tf = relative frequency of a term in a document = raw
> counts / document length.
> > On top of this, one could further normalize by IDF to get the TF-IDF (
> https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Term_frequency_2).
> >
> > When can this be useful? Here is an example:
> > Say term t occurs 5 times in document d1, and also 5 times in document
> d2.
> > At first glance, it seems that the term conveys the same information
> about both documents. But if we also check document lengths, and find that
> length of d1 is 20, whereas length of d2 is 200, then probably the
> “importance” and information carried by the same term in the two documents
> is not the same.
> > If we use relative frequency instead of absolute counts, then
> tf1 = 5/20 = 0.25, whereas tf2 = 5/200 = 0.025.
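In code, the two relative frequencies in the example above work out to:

```python
count_t = 5          # occurrences of term t in both documents
len_d1, len_d2 = 20, 200

tf1 = count_t / len_d1
tf2 = count_t / len_d2
print(tf1, tf2)  # 0.25 0.025
```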
> >
> > There are many practical cases (document similarity, document
> classification, etc...) where using relative frequencies yields better
> results, and it might be worth making the CountVectorizer support this.
> >
> > Regards,
> > Yacine.
> >
> > On Sun, Jan 28, 2018 at 15:12 Jacob Vanderplas <
> jakevdp at cs.washington.edu> wrote:
> > Hi Yacine,
> > If I'm understanding you correctly, I think what you have in mind is
> already implemented in scikit-learn in the TF-IDF vectorizer.
> >
> > Best,
> >    Jake
> >
> >  Jake VanderPlas
> >  Senior Data Science Fellow
> >  Director of Open Software
> >  University of Washington eScience Institute
> >
> > On Sat, Jan 27, 2018 at 9:59 PM, Yacine MAZARI <y.mazari at gmail.com>
> wrote:
> > Hello,
> >
> > I would like to work on adding an additional feature to
> "sklearn.feature_extraction.text.CountVectorizer".
> >
> > In the current implementation, the definition of term frequency is the
> number of times a term t occurs in document d.
> >
> > However, another definition that is very commonly used in practice is
> the term frequency adjusted for document length, i.e: tf = raw counts /
> document length.
> >
> > I intend to implement this by adding an additional boolean parameter
> "relative_frequency" to the constructor of CountVectorizer.
> If the parameter is true, normalize X by document length (along axis=1) in
> "CountVectorizer.fit_transform()".
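A rough sketch of the proposed behaviour (the `relative_frequency`
parameter is hypothetical and does not exist in scikit-learn; here it is
emulated with a small helper function):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

def count_vectorize(docs, relative_frequency=False, **vectorizer_kwargs):
    """Hypothetical helper emulating CountVectorizer(relative_frequency=...)."""
    vect = CountVectorizer(**vectorizer_kwargs)
    X = vect.fit_transform(docs)
    if relative_frequency:
        # Dividing each row by its sum normalizes counts by document length.
        X = normalize(X, norm="l1", copy=False)
    return X, vect

X, vect = count_vectorize(["the cat sat", "the cat sat on the mat"],
                          relative_frequency=True)
print(X.sum(axis=1))  # each row sums to 1
```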
> >
> > What do you think?
> > If this sounds reasonable and worth it, I will send a PR.
> >
> > Thank you,
> > Yacine.
> >
> > _______________________________________________
> > scikit-learn mailing list
> > scikit-learn at python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn