[scikit-learn] best way to scale on the random forest for text w bag of words ...

Sasha Kacanski skacanski at gmail.com
Thu Mar 16 08:38:03 EDT 2017


Thanks Joel, what would be your approach?



Sasha Kacanski

On Mar 15, 2017 9:46 PM, "Joel Nothman" <joel.nothman at gmail.com> wrote:

> Trees are not a traditional choice for bag of words models, but you should
> make sure you are at least using the parameters of the random forest to
> limit the size (depth, branching) of the trees.
>
> On 16 March 2017 at 12:20, Sasha Kacanski <skacanski at gmail.com> wrote:
>
>> Hi,
>> As soon as the number of trees and features goes higher, 70 GB of RAM is
>> gone and I am getting out-of-memory errors.
>> The file size is 700 MB. The dataframe quickly shrinks from 14 to 2
>> columns, but there is a ton of text ...
>> With 10 estimators and 100 features per word I can't tackle ~900k
>> records ...
>> The training set, about 15% of the data, does perfectly fine, but when
>> the test set comes, that is it.
>>
>> I can split the data and multiprocess it, but I believe that will simply
>> skew the results...
>>
>> Any ideas?
>>
>>
>> --
>> Aleksandar Kacanski
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>
>
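
The suggestion quoted above, limiting tree size through the random forest's own parameters, might look roughly like the sketch below. It is not code from either poster. The toy corpus, the parameter values, and the helper predict_in_batches are hypothetical and only illustrative. The sketch also assumes the bag-of-words matrix is kept sparse rather than converted to a dense array. It predicts on the test set in slices, because a fitted forest scores each row independently, so batching at prediction time should not change the results (unlike training on separate splits).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the real text column (hypothetical data).
train_texts = ["cheap pills online", "meeting at noon",
               "win cheap prizes", "lunch meeting today"]
train_labels = [1, 0, 1, 0]
test_texts = ["cheap prizes online", "noon lunch", "win pills", "meeting today"]

# Cap the vocabulary; fit_transform returns a scipy sparse matrix,
# which the forest accepts directly (no .toarray() needed).
vectorizer = CountVectorizer(max_features=100)
X_train = vectorizer.fit_transform(train_texts)

# Bound the size of every tree. The values are illustrative assumptions,
# not recommendations from the thread.
forest = RandomForestClassifier(
    n_estimators=10,
    max_depth=20,          # limit depth
    max_leaf_nodes=1000,   # limit branching
    min_samples_leaf=2,    # avoid tiny leaves
    random_state=0,
)
forest.fit(X_train, train_labels)


def predict_in_batches(model, vec, texts, batch_size=2):
    """Vectorize and predict one slice at a time to bound peak memory."""
    parts = []
    for start in range(0, len(texts), batch_size):
        X_batch = vec.transform(texts[start:start + batch_size])
        parts.append(model.predict(X_batch))
    return np.concatenate(parts)


batched = predict_in_batches(forest, vectorizer, test_texts)
full = forest.predict(vectorizer.transform(test_texts))
print(np.array_equal(batched, full))
```

On real data the batch size would be far larger than 2. The point is only that peak memory at prediction time scales with the batch, not with the whole test set.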