[scikit-learn] memory efficient feature extraction

Joel Nothman joel.nothman at gmail.com
Mon Jun 6 08:29:49 EDT 2016


>  - concatenation of these arrays into a single CSR array appears to be
> non-trivial given the memory constraints (e.g. scipy.sparse.vstack
> transforms all arrays to COO sparse representation internally).


There is a fast path for stacking a series of CSR matrices: when every
block is CSR and the output format is CSR, the underlying arrays are
concatenated directly instead of going through COO.
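
If you want the principle by hand, a minimal sketch (not scipy's actual
implementation, just the idea) is: for row stacking, the data and
indices arrays concatenate directly, and the indptr arrays chain with a
running non-zero offset.

    import numpy as np
    import scipy.sparse as sp

    def csr_vstack(blocks):
        """Row-stack CSR matrices without a COO round-trip (sketch)."""
        n_cols = blocks[0].shape[1]
        assert all(b.shape[1] == n_cols for b in blocks)
        data = np.concatenate([b.data for b in blocks])
        indices = np.concatenate([b.indices for b in blocks])
        # chain the indptr arrays, shifting each by the number of
        # non-zeros seen so far
        indptr_parts = [blocks[0].indptr]
        offset = blocks[0].indptr[-1]
        for b in blocks[1:]:
            indptr_parts.append(b.indptr[1:] + offset)  # drop leading 0
            offset += b.indptr[-1]
        indptr = np.concatenate(indptr_parts)
        n_rows = sum(b.shape[0] for b in blocks)
        return sp.csr_matrix((data, indices, indptr),
                             shape=(n_rows, n_cols))

Only numpy concatenations are involved, so peak memory is roughly the
inputs plus the output, with no intermediate COO copy.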

On 6 June 2016 at 22:19, Roman Yurchak <rth.yurchak at gmail.com> wrote:

> Dear all,
>
> I was wondering if somebody could advise on the best way to generate
> and store large sparse feature sets that do not fit in memory?
> In particular, I have the following workflow,
>
> Large text dataset -> HashingVectorizer -> Feature set in a sparse CSR
> array on disk -> Training a classifier -> Predictions
>
> where the generated feature set is too large to fit in RAM; however,
> the classifier training can be done in one step (as it uses only
> certain rows of the CSR array) and the prediction can be split into
> several steps, all of which fit in memory. Since the training can be
> performed in one step, I'm not looking for incremental out-of-core
> learning approaches, and saving the features to disk for later
> processing is definitely useful.
>
> For instance, if it were possible to save the output of the
> HashingVectorizer to a single file on disk (using e.g. joblib.dump) and
> then load that file as a memory map (using e.g. joblib.load(..,
> mmap_mode='r')), everything would work great. Due to memory constraints
> this cannot be done directly, and the best-case scenario is applying
> the HashingVectorizer to chunks of the dataset, which produces a series
> of sparse CSR arrays on disk (see the sketch after the list below).
> Then,
>  - concatenation of these arrays into a single CSR array appears to be
> non-trivial given the memory constraints (e.g. scipy.sparse.vstack
> transforms all arrays to COO sparse representation internally).
>  - I was not able to find an abstraction layer that would allow
> representing these sparse arrays as a single array. For instance, dask
> can do this for dense arrays (
> http://dask.pydata.org/en/latest/array-stack.html ); however, support
> for sparse arrays is only planned at this point (
> https://github.com/dask/dask/issues/174 ).
>  - Finally, it is not possible to pre-allocate the full array on disk
> (and access it as a memory map), because the number of non-zero
> elements in the sparse array is not known before running the feature
> extraction.
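>
> For concreteness, the chunked extraction I have in mind looks roughly
> like this (iter_text_chunks() and the file names are hypothetical,
> for illustration only):
>
>     import joblib
>     from sklearn.feature_extraction.text import HashingVectorizer
>
>     vect = HashingVectorizer(n_features=2 ** 20)
>
>     # iter_text_chunks() is a hypothetical generator yielding lists
>     # of documents, each list small enough to vectorize in RAM
>     filenames = []
>     for i, chunk in enumerate(iter_text_chunks()):
>         X_chunk = vect.transform(chunk)  # scipy.sparse CSR matrix
>         fname = 'features_%04d.joblib' % i
>         joblib.dump(X_chunk, fname)
>         filenames.append(fname)
>
>     # later: joblib memory-maps the dense buffers (.data, .indices,
>     # .indptr) inside each stored CSR matrix
>     X_parts = [joblib.load(f, mmap_mode='r') for f in filenames]
>
> which leaves open the question of how to present X_parts to the
> estimator as a single array.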
>
>   Of course, it is possible to overcome all these difficulties by
> using a machine with more memory, but my point is rather to have a
> memory-efficient workflow.
>
>   I would really appreciate any advice on this and would be happy to
> contribute to a project in the scikit-learn environment aiming to
> address similar issues,
>
> Thank you,
> Best,
> --
> Roman