[scikit-learn] One-hot encoding

Sarah Wait Zaranek sarah.zaranek at gmail.com
Mon Feb 5 21:24:46 EST 2018


Hi Joel -

I am also seeing a huge memory overhead when calling the one-hot encoder.
I have worked around it by splitting my matrix into 4-5 smaller
matrices (by columns) and then concatenating the results.  But I am still
seeing upwards of 100 GB of overhead.  Should I file a bug report?  Or is
this to be expected?
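
For reference, the workaround looks roughly like this (a simplified
sketch, with a random toy matrix standing in for my real data):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for my large integer matrix
X = np.random.randint(0, 5, size=(1000, 20))

# Split the columns into 4 groups, encode each group separately,
# then concatenate the encoded pieces back together.
column_groups = np.array_split(np.arange(X.shape[1]), 4)
encoded_parts = [OneHotEncoder(sparse=False).fit_transform(X[:, cols])
                 for cols in column_groups]
X_onehot = np.hstack(encoded_parts)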

Cheers,
Sarah

On Mon, Feb 5, 2018 at 1:05 AM, Sarah Wait Zaranek <sarah.zaranek at gmail.com>
wrote:

> Great.  Thank you for all your help.
>
> Cheers,
> Sarah
>
> On Mon, Feb 5, 2018 at 12:56 AM, Joel Nothman <joel.nothman at gmail.com>
> wrote:
>
>> If you specify n_values=[list_of_vals_for_column1,
>> list_of_vals_for_column2], you should be able to engineer it to how you
>> want.
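>>
>> For instance, for the toy matrix in the messages below (highest values
>> 7, 2 and 3 in the three columns), passing the highest value + 1 per
>> column should give the full matrix.  A minimal sketch, assuming the
>> 0.19 API where n_values accepts an array of ints:
>>
>> from sklearn.preprocessing import OneHotEncoder
>>
>> # n_values[i] must exceed the largest value in column i, so it is
>> # highest value + 1, not the number of distinct levels.
>> enc = OneHotEncoder(sparse=False, n_values=[8, 3, 4])
>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]])
>> # test has 8 + 3 + 4 = 15 columns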
>>
>> On 5 February 2018 at 16:31, Sarah Wait Zaranek <sarah.zaranek at gmail.com>
>> wrote:
>>
>>> If I use the n+1 approach (n_values=[8, 3, 4], i.e. the highest value
>>> + 1 for each column), then I get the correct matrix, except with the
>>> extra columns of zeros for values that never occur (2 through 6 in the
>>> first column):
>>>
>>> >>> test
>>> array([[0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1.],
>>>        [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0.],
>>>        [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0.],
>>>        [0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0.]])
>>>
>>>
>>> On Mon, Feb 5, 2018 at 12:25 AM, Sarah Wait Zaranek <
>>> sarah.zaranek at gmail.com> wrote:
>>>
>>>> Hi Joel -
>>>>
>>>> Conceptually, that makes sense.  But when I assign n_values myself, I
>>>> can't make the result match what I get when I don't specify them.  See
>>>> below.  I used the number of unique levels per column.
>>>>
>>>> >>> enc = OneHotEncoder(sparse=False)
>>>> >>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]])
>>>> >>> test
>>>> array([[0., 0., 1., 1., 0., 0., 0., 0., 1.],
>>>>        [0., 1., 0., 0., 1., 1., 0., 0., 0.],
>>>>        [1., 0., 0., 0., 1., 0., 1., 0., 0.],
>>>>        [0., 1., 0., 1., 0., 0., 0., 1., 0.]])
>>>> >>> enc = OneHotEncoder(sparse=False,n_values=[3,2,4])
>>>> >>> test = enc.fit_transform([[7, 0, 3], [1, 2, 0], [0, 2, 1], [1, 0, 2]])
>>>> >>> test
>>>> array([[0., 0., 0., 1., 0., 0., 0., 1., 1.],
>>>>        [0., 1., 0., 0., 0., 2., 0., 0., 0.],
>>>>        [1., 0., 0., 0., 0., 1., 1., 0., 0.],
>>>>        [0., 1., 0., 1., 0., 0., 0., 1., 0.]])
>>>>
>>>> Cheers,
>>>> Sarah
>>>>
>>>> On Mon, Feb 5, 2018 at 12:02 AM, Joel Nothman <joel.nothman at gmail.com>
>>>> wrote:
>>>>
>>>>> If each input column is encoded as a value from 0 to (number of
>>>>> possible values for that column - 1), then n_values for that column
>>>>> should be the highest value + 1, which under that encoding is also
>>>>> the number of levels for the column.  Does that make sense?
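>>>>>
>>>>> A tiny sketch of the distinction, in case it helps: a column whose
>>>>> values are {0, 2} has only two distinct levels, but still needs
>>>>> n_values = 3, because every value must lie in range(n_values):
>>>>>
>>>>> from sklearn.preprocessing import OneHotEncoder
>>>>>
>>>>> enc = OneHotEncoder(sparse=False, n_values=[3])
>>>>> print(enc.fit_transform([[0], [2], [2]]))
>>>>> # [[1. 0. 0.]
>>>>> #  [0. 0. 1.]
>>>>> #  [0. 0. 1.]]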
>>>>>
>>>>> Actually, I've realised there's a somewhat slow and unnecessary bit
>>>>> of code in the one-hot encoder, where the COO matrix is converted to
>>>>> CSR.  I suspect this was done because most of our ML algorithms
>>>>> perform better on CSR, or else to maintain backwards compatibility
>>>>> with an earlier implementation.
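>>>>>
>>>>> Roughly the pattern involved (a sketch of the general idiom, not the
>>>>> encoder's actual code): the encoded matrix is assembled in COO
>>>>> format, and the tocsr() conversion allocates a second copy of the
>>>>> data on top of it:
>>>>>
>>>>> import numpy as np
>>>>> import scipy.sparse as sp
>>>>>
>>>>> rows = np.array([0, 1, 2, 3])
>>>>> cols = np.array([7, 1, 0, 1])
>>>>> data = np.ones(4)
>>>>> coo = sp.coo_matrix((data, (rows, cols)), shape=(4, 8))
>>>>> csr = coo.tocsr()  # the extra allocation happens here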
>>>>>
>>>>
>>>
>>
>