[scikit-learn] Recurrent questions about speed for TfidfVectorizer

Matthieu Brucher matthieu.brucher at gmail.com
Wed Dec 5 17:45:19 EST 2018


Hi all,

Sorry for the late reply, lots of things to work on currently.

I'll have a look at the roadmap and the pointers to see what could be done
to enhance the situation.

Cheers,

Matthieu

Le lun. 26 nov. 2018 à 20:09, Roman Yurchak via scikit-learn <
scikit-learn at python.org> a écrit :

> Tries are interesting, but it appears that while they use less memory
> than dicts/maps, they are generally slower than dicts for a large number
> of elements. See e.g.
> https://github.com/pytries/marisa-trie/blob/master/docs/benchmarks.rst.
> This is also consistent with the results in the below linked
> CountVectorizer PR that aimed to use tries, I think.
>
> Though maybe e.g. MARISA-Trie (and generally trie libraries available in
> python) did improve significantly in 5 years since
> https://github.com/scikit-learn/scikit-learn/issues/2639 was done.
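[To make the trie-versus-dict trade-off above concrete, here is a toy pure-Python trie: shared prefixes are stored once (the memory win), but every lookup walks one node per character, versus a single hash probe for a flat dict (the speed win for dicts). This is only an illustrative sketch; real libraries like marisa-trie use far more compact succinct structures.]

```python
# Toy trie: one dict hop per character of the key.
class TrieNode:
    __slots__ = ("children", "value")

    def __init__(self):
        self.children = {}  # char -> TrieNode
        self.value = None   # payload, e.g. a vocabulary index

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key, value):
        node = self.root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.value = value

    def get(self, key):
        # One children-dict lookup per character -- this per-character
        # walk is why tries tend to lose to a single dict hash lookup.
        node = self.root
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.value

trie = Trie()
for i, word in enumerate(["car", "cart", "care"]):  # shared "car" prefix
    trie.insert(word, i)
```

A flat dict stores each of "car", "cart", "care" as a full string key; the trie stores the "car" prefix once.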
>
> The thing is also that even HashingVectorizer, which doesn't need to
> handle the vocabulary, is only moderately faster, so using a better
> data structure for the vocabulary might give us its performance at best.
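[The hashing trick that lets HashingVectorizer skip the vocabulary can be sketched in pure Python: tokens are mapped straight to column indices by a hash function, so no token-to-index dict is ever built or stored. HashingVectorizer itself uses MurmurHash3 in Cython; `hashlib.md5` stands in here purely for illustration, and the names are hypothetical.]

```python
import hashlib
import re
from collections import defaultdict

# scikit-learn's default token_pattern: runs of 2+ word characters.
token_pattern = re.compile(r"(?u)\b\w\w+\b")
N_FEATURES = 2 ** 20  # fixed output dimensionality

def hash_count(doc):
    """Count tokens into a fixed-size index space -- no vocabulary needed."""
    counts = defaultdict(int)
    for token in token_pattern.findall(doc.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        idx = int.from_bytes(digest[:8], "little") % N_FEATURES
        counts[idx] += 1
    return counts

c = hash_count("the cat sat on the mat")
```

The trade-off: collisions can merge two distinct tokens into one column, and the mapping is not invertible, which is why CountVectorizer and its vocabulary dict are still needed when feature names matter.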
>
> --
> Roman
>
> > On 26/11/2018 16:28, Andreas Mueller wrote:
> > I think tries might be an interesting datastructure, but it really
> > depends on where the bottleneck is.
> > I'm really surprised they are not used more, but maybe that's just
> > because implementations are missing?
> >
> > On 11/26/18 8:39 AM, Roman Yurchak via scikit-learn wrote:
> >> Hi Matthieu,
> >>
> >> if you are interested in general questions regarding improving
> >> scikit-learn performance, you might want to have a look at the draft
> >> roadmap
> >> https://github.com/scikit-learn/scikit-learn/wiki/Draft-Roadmap-2018 --
> >> there are a lot of topics where suggestions / PRs on improving performance
> >> would be very welcome.
> >>
> >> For the particular case of TfidfVectorizer, it is a bit different from
> >> the rest of the scikit-learn code base in the sense that it's not
> >> limited by the performance of numerical calculation but rather that of
> >> string processing and counting. TfidfVectorizer is equivalent to
> >> CountVectorizer + TfidfTransformer, and the latter has only a marginal
> >> computational cost. As to CountVectorizer, last time I checked, its
> >> profiling was something along the lines of,
> >>     - part regexp for tokenization (see token_pattern.findall)
> >>     - part token counting (see CountVectorizer._count_vocab)
> >>     - and a comparable part for all the rest
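[The two hot paths in that profile -- regexp tokenization followed by dict-based token counting -- can be sketched in a few lines of pure Python. This mirrors roughly what CountVectorizer._count_vocab does internally; the names below are illustrative, not scikit-learn's actual internals.]

```python
import re
from collections import defaultdict

# scikit-learn's default token_pattern: runs of 2+ word characters.
token_pattern = re.compile(r"(?u)\b\w\w+\b")

def count_vocab(docs):
    """Build a vocabulary and per-document counts, both dict-based."""
    vocabulary = {}  # token -> column index (the structure discussed above)
    counts = []      # one {column index: count} dict per document
    for doc in docs:
        doc_counts = defaultdict(int)
        for token in token_pattern.findall(doc.lower()):  # hot path 1: regexp
            idx = vocabulary.setdefault(token, len(vocabulary))
            doc_counts[idx] += 1                          # hot path 2: dict counting
        counts.append(dict(doc_counts))
    return vocabulary, counts

vocab, counts = count_vocab(["the cat sat", "the cat ran away"])
```

Both hot paths stay in CPython (the `re` engine and dict operations), which is why a straight Cython port buys relatively little.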
> >>
> >> Because of that, porting it to Cython is not straightforward, as one is
> >> still going to use CPython regexp and token counting in a dict. For
> >> instance, HashingVectorizer implements token counting in Cython -- it's
> >> faster but not that much faster. Using C++ maps or some less common
> >> structures has been discussed in
> >> https://github.com/scikit-learn/scikit-learn/issues/2639
> >>
> >> Currently, I think, there are ~3 main ways performance could be
> >> improved:
> >>     1. Optimize the current implementation while remaining in Python.
> >> Possible, but IMO it would require some effort, because there is not
> >> much low-hanging fruit left there. Though a fresh look would definitely
> >> be good.
> >>
> >>     2. Parallelize computations. There was some earlier discussion about
> >> this in scikit-learn issues, but at present, the better way would
> >> probably be to add it in dask-ml (see
> >> https://github.com/dask/dask-ml/issues/5). HashingVectorizer is already
> >> supported. Someone would need to implement CountVectorizer.
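[The parallelization in option 2 boils down to a map-and-merge pattern: count tokens independently per chunk of documents, then merge the partial counts. A minimal stdlib sketch follows; a ThreadPoolExecutor is used only so the example is self-contained (CPU-bound counting would need processes, or dask as suggested above, to see real speedups under the GIL), and the function names are hypothetical.]

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

token_pattern = re.compile(r"(?u)\b\w\w+\b")

def count_chunk(docs):
    """Token counts for one chunk of documents (the 'map' step)."""
    counts = Counter()
    for doc in docs:
        counts.update(token_pattern.findall(doc.lower()))
    return counts

def parallel_count(docs, n_workers=2):
    # Split the corpus, count each chunk, merge the Counters ('reduce').
    chunks = [docs[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = pool.map(count_chunk, chunks)
    return sum(partials, Counter())

total = parallel_count(["the cat sat", "the dog ran", "a cat"])
```

Merging vocabularies (token -> index mappings) rather than raw counts is the extra wrinkle a parallel CountVectorizer would need to handle, since each chunk assigns indices independently.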
> >>
> >>     3. Rewrite part of the implementation in a lower-level language
> >> (e.g.
> >> Cython). The question is how maintainable that would be, and whether the
> >> performance gains would be worth it.  Now that Python 2 will be dropped,
> >> at least not having to deal with Py2/3 compatibility for strings in
> >> Cython might make things a bit easier. Though, if the processing is in
> >> Cython it might also make using custom tokenizers/analyzers more
> >> difficult.
> >>
> >>       On a related topic, I have been experimenting with implementing
> >> part of this processing in Rust lately:
> >> https://github.com/rth/text-vectorize. So far it looks promising.
> >> Though, of course, it will remain a separate project because of language
> >> constraints in scikit-learn.
> >>
> >> In general if you have thoughts on things that can be improved, don't
> >> hesitate to open issues.
> >
> > _______________________________________________
> > scikit-learn mailing list
> > scikit-learn at python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
> >
>


-- 
Quantitative analyst, Ph.D.
Blog: http://blog.audio-tk.com/
LinkedIn: http://www.linkedin.com/in/matthieubrucher