[scikit-learn] Recurrent questions about speed for TfidfVectorizer
Roman Yurchak
rth.yurchak at pm.me
Mon Nov 26 15:07:24 EST 2018
Tries are interesting, but it appears that while they use less memory
than dicts/maps, they are generally slower than dicts for a large number
of elements. See e.g.
https://github.com/pytries/marisa-trie/blob/master/docs/benchmarks.rst.
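As a rough stdlib-only illustration of that trade-off (a naive nested-dict
trie, not MARISA-Trie, so the gap is exaggerated -- the point is just that a
dict does one hash lookup per word while a trie does one hop per character):

```python
# Compare lookup cost: plain dict vocabulary vs a naive nested-dict trie.
# Illustrative only; real trie libraries are far more compact than this.
import time

words = ["token%06d" % i for i in range(50_000)]

# Plain dict vocabulary: one hash lookup per word.
vocab = {w: i for i, w in enumerate(words)}

# Naive trie: one dict hop per character.
trie = {}
for i, w in enumerate(words):
    node = trie
    for ch in w:
        node = node.setdefault(ch, {})
    node["\0"] = i  # sentinel key stores the word's index

def trie_get(w):
    node = trie
    for ch in w:
        node = node[ch]
    return node["\0"]

t0 = time.perf_counter()
d_sum = sum(vocab[w] for w in words)
t_dict = time.perf_counter() - t0

t0 = time.perf_counter()
t_sum = sum(trie_get(w) for w in words)
t_trie = time.perf_counter() - t0

assert d_sum == t_sum  # both structures agree on the indices
print("dict: %.4fs  trie: %.4fs" % (t_dict, t_trie))
```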
This is also consistent with the results in the below linked
CountVectorizer PR that aimed to use tries, I think.
Though maybe e.g. MARISA-Trie (and generally trie libraries available in
python) did improve significantly in 5 years since
https://github.com/scikit-learn/scikit-learn/issues/2639 was done.
The thing is also that even HashingVectorizer, which doesn't need to
handle the vocabulary, is only moderately faster, so using a better data
structure for the vocabulary might give us its performance at best.
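For reference, the reason HashingVectorizer can skip the vocabulary entirely
is the hashing trick: tokens are mapped straight to a fixed number of columns
by hashing, so no token -> column dict has to be built. A simplified stdlib
sketch (the real implementation uses MurmurHash3 and sparse matrices; the
function names below are illustrative):

```python
# Sketch: CountVectorizer-style counting (grows a vocabulary dict) vs
# HashingVectorizer-style counting (column = hash(token) % n_features).
import re
from collections import Counter

N_FEATURES = 2 ** 10  # fixed output width, as in HashingVectorizer

def count_with_vocab(docs, token_pattern=r"(?u)\b\w\w+\b"):
    """CountVectorizer-style: grow a vocabulary dict while counting."""
    vocab, rows = {}, []
    for doc in docs:
        counts = Counter(re.findall(token_pattern, doc.lower()))
        row = {}
        for tok, c in counts.items():
            row[vocab.setdefault(tok, len(vocab))] = c
        rows.append(row)
    return vocab, rows

def count_with_hashing(docs, token_pattern=r"(?u)\b\w\w+\b"):
    """HashingVectorizer-style: no vocabulary to build or keep in memory."""
    rows = []
    for doc in docs:
        row = Counter()
        for tok in re.findall(token_pattern, doc.lower()):
            row[hash(tok) % N_FEATURES] += 1
        rows.append(dict(row))
    return rows

docs = ["the cat sat", "the dog sat down"]
vocab, counted = count_with_vocab(docs)
hashed = count_with_hashing(docs)
# Total token counts agree; only the column indexing differs.
assert sum(sum(r.values()) for r in counted) == sum(sum(r.values()) for r in hashed)
```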
--
Roman
On 26/11/2018 16:28, Andreas Mueller wrote:
> I think tries might be an interesting datastructure, but it really
> depends on where the bottleneck is.
> I'm really surprised they are not used more, but maybe that's just
> because implementations are missing?
>
> On 11/26/18 8:39 AM, Roman Yurchak via scikit-learn wrote:
>> Hi Matthieu,
>>
>> if you are interested in general questions regarding improving
>> scikit-learn performance, you might want to have a look at the draft
>> roadmap
>> https://github.com/scikit-learn/scikit-learn/wiki/Draft-Roadmap-2018 --
>> there are a lot of topics where suggestions / PRs on improving performance
>> would be very welcome.
>>
>> For the particular case of TfidfVectorizer, it is a bit different from
>> the rest of the scikit-learn code base in the sense that it's not
>> limited by the performance of numerical calculation but rather that of
>> string processing and counting. TfidfVectorizer is equivalent to
>> CountVectorizer + TfidfTransformer, and the latter has only a marginal
>> computational cost. As to CountVectorizer, last time I checked, its
>> profiling was something along the lines of,
>> - one part regexp tokenization (see token_pattern.findall)
>> - one part token counting (see CountVectorizer._count_vocab)
>> - and a comparable part for all the rest
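That breakdown can be reproduced with a stdlib-only mock of the pipeline,
timing the regexp step separately from the dict-based counting step (the
corpus is illustrative, and the real CountVectorizer does more work, e.g.
preprocessing and sparse matrix assembly):

```python
# Time the two dominant steps of CountVectorizer-style processing
# separately: regexp tokenization vs dict/Counter-based vocabulary
# counting. Both stay in CPython, which is why a straight Cython port
# that still calls re and dict would gain little.
import re
import time
from collections import Counter

token_pattern = re.compile(r"(?u)\b\w\w+\b")  # CountVectorizer's default
docs = ["the quick brown fox jumps over the lazy dog"] * 20_000

t0 = time.perf_counter()
token_lists = [token_pattern.findall(d) for d in docs]   # tokenization
t_tokenize = time.perf_counter() - t0

t0 = time.perf_counter()
vocab = {}
for toks in token_lists:                                  # counting
    for tok in Counter(toks):
        vocab.setdefault(tok, len(vocab))
t_count = time.perf_counter() - t0

print("tokenize: %.3fs  count: %.3fs" % (t_tokenize, t_count))
```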
>>
>> Because of that, porting it to Cython is not that immediate, as one is
>> still going to use CPython regexp and token counting in a dict. For
>> instance, HashingVectorizer implements token counting in Cython -- it's
>> faster but not that much faster. Using C++ maps or some less common
>> structures has been discussed in
>> https://github.com/scikit-learn/scikit-learn/issues/2639
>>
>> Currently, I think, there are ~3 main ways performance could be improved,
>> 1. Optimize the current implementation while remaining in Python.
>> Possible, but IMO it would require some effort, because there is not much
>> low-hanging fruit left there. Though a fresh look would definitely be good.
>>
>> 2. Parallelize computations. There was some earlier discussion about
>> this in scikit-learn issues, but at present, the best way would
>> probably be to add it in dask-ml (see
>> https://github.com/dask/dask-ml/issues/5). HashingVectorizer is already
>> supported. Someone would need to implement CountVectorizer.
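The map/reduce structure such a parallel CountVectorizer would need might
look like the following sketch (sequential here; each map_count call is the
part that would run in a separate worker, and the helper names are
hypothetical, not dask-ml API):

```python
# Map/reduce sketch for parallelizing CountVectorizer-style counting:
# each worker counts its chunk against a local vocabulary, then the
# driver merges vocabularies and remaps column indices to be globally
# consistent.
import re
from collections import Counter

def map_count(chunk):
    """Per-chunk step: local vocabulary + rows of {local_idx: count}."""
    vocab, rows = {}, []
    for doc in chunk:
        row = {}
        for tok, c in Counter(re.findall(r"(?u)\b\w\w+\b", doc.lower())).items():
            row[vocab.setdefault(tok, len(vocab))] = c
        rows.append(row)
    return vocab, rows

def reduce_merge(results):
    """Driver step: build a global vocabulary and remap each chunk's rows."""
    global_vocab, all_rows = {}, []
    for local_vocab, rows in results:
        # local index -> global index translation table
        remap = {i: global_vocab.setdefault(tok, len(global_vocab))
                 for tok, i in local_vocab.items()}
        all_rows.extend({remap[i]: c for i, c in row.items()} for row in rows)
    return global_vocab, all_rows

chunks = [["red fish blue fish"], ["one fish two fish", "blue dog"]]
vocab, rows = reduce_merge(map_count(c) for c in chunks)
```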
>>
>> 3. Rewrite part of the implementation in a lower level language (e.g.
>> Cython). The question is how maintainable that would be, and whether the
>> performance gains would be worth it. Now that Python 2 will be dropped,
>> at least not having to deal with Py2/3 compatibility for strings in
>> Cython might make things a bit easier. Though moving the processing to
>> Cython might also make using custom tokenizers/analyzers more difficult.
>>
>> On a related topic, I have been experimenting with implementing part
>> of this processing in Rust lately:
>> https://github.com/rth/text-vectorize. So far it looks promising.
>> Though, of course, it will remain a separate project because of language
>> constraints in scikit-learn.
>>
>> In general if you have thoughts on things that can be improved, don't
>> hesitate to open issues.
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>