[scikit-learn] One-hot encoding

Sun Feb 4 23:31:21 EST 2018

Hi Joel -

20 million categorical variables.  They come from segmenting the genome into
20 million parts.  Genomes are big :)  As for n_values, I am a bit confused.
Is the input the same as the output for n_values?  Originally, I thought it
was just the number of levels per column, but it seems to be more like
the highest integer value among the levels.
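
To make the distinction concrete, here is a minimal sketch in plain Python.
It is not scikit-learn's implementation; the helper name
one_hot_column_indices and the toy data are made up for illustration.  It
assumes the documented meaning of n_values for integer-coded input: each
value in column j must lie in range(n_values[j]), so the smallest n_values
that fits the data is the largest value plus one, not the count of distinct
levels.

```python
def one_hot_column_indices(rows, n_values):
    """Return, for each row, the output columns that would be set to 1.

    rows     : list of rows, each a list of non-negative integer codes
    n_values : per-feature size of the value range; feature j may only
               take values in range(n_values[j])
    """
    # offsets[j] is the first output column reserved for feature j.
    offsets = [0]
    for n in n_values:
        offsets.append(offsets[-1] + n)

    encoded = []
    for row in rows:
        cols = []
        for j, value in enumerate(row):
            if not 0 <= value < n_values[j]:
                raise ValueError(
                    "value %d in column %d is outside range(%d)"
                    % (value, j, n_values[j])
                )
            cols.append(offsets[j] + value)
        encoded.append(cols)
    # The total output width is the sum of all per-feature ranges.
    return encoded, offsets[-1]


# Hypothetical toy data: two integer-coded categorical columns.
rows = [[0, 5], [2, 1], [0, 5]]

# Number of distinct levels per column.
n_levels = [len({row[j] for row in rows}) for j in range(2)]

# Highest value per column, plus one: the smallest range that fits.
n_values = [max(row[j] for row in rows) + 1 for j in range(2)]

encoded, width = one_hot_column_indices(rows, n_values)
print(n_levels, n_values, width)
print(encoded)
```

Passing n_levels instead of n_values to this helper raises ValueError,
because the value 5 in the second column does not fit in range(2).  With
gaps in the integer codes, the range-based width (9 here) is larger than
the number of levels actually present (4 here).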

Cheers,
Sarah

On Sun, Feb 4, 2018 at 11:27 PM, Joel Nothman <joel.nothman at gmail.com>
wrote:

> 20 million categories, or 20 million categorical variables?
>
> OneHotEncoder is pretty efficient if you specify n_values.
>
> On 5 February 2018 at 15:10, Sarah Wait Zaranek <sarah.zaranek at gmail.com>
> wrote:
>
>> Hello -
>>
>> I was just wondering if there was a way to improve performance on the
>> one-hot encoder.  Or, are there any plans to do so in the future?  I am
>> working with a matrix that will ultimately have 20 million categorical
>> variables, and my bottleneck is the one-hot encoder.
>>
>> Let me know if this isn't the place to inquire.  My code is very simple
>> when using the encoder, but I cut and pasted it here for completeness.
>>
>>     from sklearn.preprocessing import OneHotEncoder
>>
>>     enc = OneHotEncoder(sparse=True)
>>     Xtrain = enc.fit_transform(tiledata)
>>
>>
>> Thanks,
>> Sarah
>>
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>