[scikit-learn] Adding BM25 to sklearn.feature_extraction.text (Update)

Sebastian Raschka mail at sebastianraschka.com
Thu Jun 30 18:33:49 EDT 2016


Hi, Basil,

I’d say runtime may not be the main concern regarding sparse vs. dense. In my opinion, the main reason to use sparse arrays is memory usage: text data is typically rather large (especially when represented as high-dimensional, sparse feature vectors). So one limitation with scikit-learn is typically memory capacity, especially if you are using multiprocessing during cross-validation (the n_jobs parameter), where the data may be copied to each worker process.
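
To put rough numbers on the memory argument, here is a minimal sketch that compares the footprint of a CSR matrix with that of the equivalent dense array. The shape and density are made-up values for illustration only (they are not the 20newsgroups figures), and csr_nbytes is a hypothetical helper, not part of scipy.

```python
import numpy as np
from scipy import sparse

# Hypothetical sizes, chosen only for illustration (not 20newsgroups)
n_docs, n_terms, density = 2000, 5000, 0.01

# Random term-count-like matrix in CSR format with ~1% nonzero entries
X_sparse = sparse.random(
    n_docs, n_terms, density=density, format="csr",
    dtype=np.float64, random_state=0,
)


def csr_nbytes(X):
    """Bytes held by the three arrays that back a CSR matrix."""
    return X.data.nbytes + X.indices.nbytes + X.indptr.nbytes


sparse_mb = csr_nbytes(X_sparse) / 1e6

# Size of the equivalent dense float64 array, computed without allocating it
dense_mb = n_docs * n_terms * np.dtype(np.float64).itemsize / 1e6

print(f"sparse: {sparse_mb:.1f} MB, dense: {dense_mb:.1f} MB")
```

The gap grows with the vocabulary size, and it is multiplied again if several worker processes each hold their own copy of the data.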

PS:

> regular numpy matrix

I think you mean "numpy array"? (There is a numpy matrix data structure in numpy as well; however, almost no one uses it.)
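
In case the distinction is useful, here is a small sketch of the two types. It relies on the fact that a scipy csr_matrix returns a numpy.ndarray from .toarray() and a numpy.matrix from .todense(); the example values are arbitrary.

```python
import numpy as np
from scipy import sparse

X = sparse.csr_matrix(np.array([[1, 2], [0, 3]]))

as_array = X.toarray()    # numpy.ndarray
as_matrix = X.todense()   # numpy.matrix (a 2-D-only subclass of ndarray)

# The * operator behaves differently for the two types:
elementwise = as_array * as_array    # elementwise product for ndarray
matmul = as_matrix * as_matrix       # matrix product for numpy.matrix

print(type(as_array).__name__, type(as_matrix).__name__)
print(elementwise.tolist())
print(matmul.tolist())
```

Because of that difference in operator semantics, returning an ndarray (or staying sparse) is usually the less surprising choice for downstream code.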

Best,
Sebastian

> On Jun 30, 2016, at 6:23 PM, Basil Beirouti <basilbeirouti at gmail.com> wrote:
> 
> Hello everyone, 
> 
> I have successfully created a few versions of the BM25Transformer. I looked at TFIDFTransformer for guidance and I noticed that it outputs a sparse matrix when given a sparse termcount matrix as an input. 
> 
> Unfortunately, the fastest implementation of BM25Transformer that I have been able to come up with does NOT output a sparse matrix, it will return a regular numpy matrix. 
> 
> Benchmarked against the entire 20newsgroups corpus, here is how they perform (assuming input is csr_matrix for all):
> 
> 1.) finishes in 4 seconds, outputs a regular numpy matrix
> 2.) finishes in 30 seconds, outputs a dok_matrix
> 3.) finishes in 130 seconds, outputs a regular numpy matrix
> 
> It's worth noting that using algorithm 1 and converting the output to a sparse matrix still takes less time than 3, and takes about as long as 2. 
> 
> So my question is, how important is it that my BM25Transformer outputs a sparse matrix? 
> 
> I'm going to try another implementation which looks directly at the data, indices, and indptr attributes of the inputted csr_matrix. I just wanted to check in and see what people thought.
> 
> Sincerely,
> Basil Beirouti
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn


