[scikit-learn] Recurrent questions about speed for TfidfVectorizer
Roman Yurchak
rth.yurchak at pm.me
Mon Nov 26 15:07:24 EST 2018
Tries are interesting, but it appears that while they use less memory
than dicts/maps, they are generally slower than dicts for a large number
of elements. See e.g.
https://github.com/pytries/marisa-trie/blob/master/docs/benchmarks.rst.
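As a rough stdlib-only illustration of that trade-off (a naive nested-dict
trie, not MARISA-Trie, so the gap is exaggerated -- the point is just that a
dict does one hash lookup per word while a trie does one hop per character):

```python
# Compare lookup cost: plain dict vocabulary vs a naive nested-dict trie.
# Illustrative only; real trie libraries are far more compact than this.
import time

words = ["token%06d" % i for i in range(50_000)]

# Plain dict vocabulary: one hash lookup per word.
vocab = {w: i for i, w in enumerate(words)}

# Naive trie: one dict hop per character.
trie = {}
for i, w in enumerate(words):
    node = trie
    for ch in w:
        node = node.setdefault(ch, {})
    node["\0"] = i  # sentinel key stores the word's index

def trie_get(w):
    node = trie
    for ch in w:
        node = node[ch]
    return node["\0"]

t0 = time.perf_counter()
d_sum = sum(vocab[w] for w in words)
t_dict = time.perf_counter() - t0

t0 = time.perf_counter()
t_sum = sum(trie_get(w) for w in words)
t_trie = time.perf_counter() - t0

assert d_sum == t_sum  # both structures agree on the indices
print("dict: %.4fs  trie: %.4fs" % (t_dict, t_trie))
```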
This is also consistent with the results in the below linked
CountVectorizer PR that aimed to use tries, I think.
Though maybe e.g. MARISA-Trie (and generally trie libraries available in
python) did improve significantly in 5 years since
https://github.com/scikit-learn/scikit-learn/issues/2639 was done.
The thing is also that even HashingVectorizer, which doesn't need to
handle the vocabulary, is only moderately faster, so using a better data
structure for the vocabulary might give us its performance at best.
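For reference, the reason HashingVectorizer can skip the vocabulary entirely
is the hashing trick: tokens are mapped straight to a fixed number of columns
by hashing, so no token -> column dict has to be built. A simplified stdlib
sketch (the real implementation uses MurmurHash3 and sparse matrices; the
function names below are illustrative):

```python
# Sketch: CountVectorizer-style counting (grows a vocabulary dict) vs
# HashingVectorizer-style counting (column = hash(token) % n_features).
import re
from collections import Counter

N_FEATURES = 2 ** 10  # fixed output width, as in HashingVectorizer

def count_with_vocab(docs, token_pattern=r"(?u)\b\w\w+\b"):
    """CountVectorizer-style: grow a vocabulary dict while counting."""
    vocab, rows = {}, []
    for doc in docs:
        counts = Counter(re.findall(token_pattern, doc.lower()))
        row = {}
        for tok, c in counts.items():
            row[vocab.setdefault(tok, len(vocab))] = c
        rows.append(row)
    return vocab, rows

def count_with_hashing(docs, token_pattern=r"(?u)\b\w\w+\b"):
    """HashingVectorizer-style: no vocabulary to build or keep in memory."""
    rows = []
    for doc in docs:
        row = Counter()
        for tok in re.findall(token_pattern, doc.lower()):
            row[hash(tok) % N_FEATURES] += 1
        rows.append(dict(row))
    return rows

docs = ["the cat sat", "the dog sat down"]
vocab, counted = count_with_vocab(docs)
hashed = count_with_hashing(docs)
# Total token counts agree; only the column indexing differs.
assert sum(sum(r.values()) for r in counted) == sum(sum(r.values()) for r in hashed)
```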
--
Roman
On 26/11/2018 16:28, Andreas Mueller wrote:
> I think tries might be an interesting datastructure, but it really
> depends on where the bottleneck is.
> I'm really surprised they are not used more, but maybe that's just
> because implementations are missing?
>
> On 11/26/18 8:39 AM, Roman Yurchak via scikit-learn wrote:
>> Hi Matthieu,
>>
>> if you are interested in general questions regarding improving
>> scikit-learn performance, you might want to have a look at the draft
>> roadmap
>> https://github.com/scikit-learn/scikit-learn/wiki/Draft-Roadmap-2018 --
>> there are a lot of topics where suggestions / PRs on improving performance
>> would be very welcome.
>>
>> For the particular case of TfidfVectorizer, it is a bit different from
>> the rest of the scikit-learn code base in the sense that it's not
>> limited by the performance of numerical calculation but rather that of
>> string processing and counting. TfidfVectorizer is equivalent to
>> CountVectorizer + TfidfTransformer, and the latter has only a marginal
>> computational cost. As to CountVectorizer, last time I checked, its
>> profiling was something along the lines of,
>> - one part regexp tokenization (see token_pattern.findall)
>> - one part token counting (see CountVectorizer._count_vocab)
>> - and a comparable part for all the rest
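That breakdown can be reproduced with a stdlib-only mock of the pipeline,
timing the regexp step separately from the dict-based counting step (the
corpus is illustrative, and the real CountVectorizer does more work, e.g.
preprocessing and sparse matrix assembly):

```python
# Time the two dominant steps of CountVectorizer-style processing
# separately: regexp tokenization vs dict/Counter-based vocabulary
# counting. Both stay in CPython, which is why a straight Cython port
# that still calls re and dict would gain little.
import re
import time
from collections import Counter

token_pattern = re.compile(r"(?u)\b\w\w+\b")  # CountVectorizer's default
docs = ["the quick brown fox jumps over the lazy dog"] * 20_000

t0 = time.perf_counter()
token_lists = [token_pattern.findall(d) for d in docs]   # tokenization
t_tokenize = time.perf_counter() - t0

t0 = time.perf_counter()
vocab = {}
for toks in token_lists:                                  # counting
    for tok in Counter(toks):
        vocab.setdefault(tok, len(vocab))
t_count = time.perf_counter() - t0

print("tokenize: %.3fs  count: %.3fs" % (t_tokenize, t_count))
```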
>>
>> Because of that, porting it to Cython is not that immediate, as one is
>> still going to use CPython regexp and token counting in a dict. For
>> instance, HashingVectorizer implements token counting in Cython -- it's
>> faster but not that much faster. Using C++ maps or some less common
>> structures has been discussed in
>> https://github.com/scikit-learn/scikit-learn/issues/2639
>>
>> Currently, I think, there are ~3 main ways performance could be improved,
>> 1. Optimize the current implementation while remaining in Python.
>> Possible, but IMO it would require some effort, because there is not much
>> low-hanging fruit left there. Though a fresh look would definitely be good.
>>
>> 2. Parallelize computations. There was some earlier discussion about
>> this in scikit-learn issues, but at present, the best way would
>> probably be to add it in dask-ml (see
>> https://github.com/dask/dask-ml/issues/5). HashingVectorizer is already
>> supported. Someone would need to implement CountVectorizer.
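The map/reduce structure such a parallel CountVectorizer would need might
look like the following sketch (sequential here; each map_count call is the
part that would run in a separate worker, and the helper names are
hypothetical, not dask-ml API):

```python
# Map/reduce sketch for parallelizing CountVectorizer-style counting:
# each worker counts its chunk against a local vocabulary, then the
# driver merges vocabularies and remaps column indices to be globally
# consistent.
import re
from collections import Counter

def map_count(chunk):
    """Per-chunk step: local vocabulary + rows of {local_idx: count}."""
    vocab, rows = {}, []
    for doc in chunk:
        row = {}
        for tok, c in Counter(re.findall(r"(?u)\b\w\w+\b", doc.lower())).items():
            row[vocab.setdefault(tok, len(vocab))] = c
        rows.append(row)
    return vocab, rows

def reduce_merge(results):
    """Driver step: build a global vocabulary and remap each chunk's rows."""
    global_vocab, all_rows = {}, []
    for local_vocab, rows in results:
        # local index -> global index translation table
        remap = {i: global_vocab.setdefault(tok, len(global_vocab))
                 for tok, i in local_vocab.items()}
        all_rows.extend({remap[i]: c for i, c in row.items()} for row in rows)
    return global_vocab, all_rows

chunks = [["red fish blue fish"], ["one fish two fish", "blue dog"]]
vocab, rows = reduce_merge(map_count(c) for c in chunks)
```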
>>
>> 3. Rewrite part of the implementation in a lower level language (e.g.
>> Cython). The question is how maintainable that would be, and whether the
>> performance gains would be worth it. Now that Python 2 will be dropped,
>> at least not having to deal with Py2/3 compatibility for strings in
>> Cython might make things a bit easier. Though moving the processing to
>> Cython might also make using custom tokenizers/analyzers more difficult.
>>
>> On a related topic, I have been experimenting with implementing part
>> of this processing in Rust lately:
>> https://github.com/rth/text-vectorize. So far it looks promising.
>> Though, of course, it will remain a separate project because of language
>> constraints in scikit-learn.
>>
>> In general if you have thoughts on things that can be improved, don't
>> hesitate to open issues.
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>