[scikit-learn] behaviour of OneHotEncoder somewhat confusing

Lee Zamparo zamparo at gmail.com
Mon Sep 19 20:15:33 EDT 2016


Hi Sebastian,

Great, thanks!

The docstring doesn’t make it very clear that the default
n_values='auto' infers the number of distinct values column-wise; maybe I
could do a quick PR to update it?  Or maybe I could turn your example into
an example for the online documentation?

Alternatively, if you think this case is too far from the intended usage
of OneHotEncoder, maybe doing nothing is the best course?

Thanks,

-- 
Lee Zamparo

On September 19, 2016 at 6:08:15 PM, Sebastian Raschka (se.raschka at gmail.com)
wrote:

Hi, Lee,

maybe set `n_values=4`; this seems to do the job. I think the problem you
encountered is due to the fact that the one-hot encoder infers the number
of values for each feature (column) from the dataset. In your example, each
column had only 1 unique value:

> array([[0, 1, 2, 3],
> [0, 1, 2, 3],
> [0, 1, 2, 3]])

If you had an array like

> array([[0],
> [1],
> [2],
> [3]])

it should work though. Alternatively, set n_values to 4:


> >>> from sklearn.preprocessing import OneHotEncoder
> >>> import numpy as np
>
> >>> enc = OneHotEncoder(n_values=4)
> >>> X = np.array([[0, 1, 2, 3]])
> >>> enc.fit_transform(X).toarray()


array([[ 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0.,
0., 0., 1.]])

and

> X2 = np.array([[0, 1, 2, 3],
> [0, 1, 2, 3],
> [0, 1, 2, 3]])
>
> enc.transform(X2).toarray()



array([[ 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0.,
0., 0., 1.],
[ 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0.,
0., 0., 1.],
[ 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0.,
0., 0., 1.]])
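
To make the column-wise inference concrete, here is a minimal plain-Python
sketch of the behaviour described above. It is not scikit-learn's
implementation: `one_hot_columns` is a hypothetical helper written only for
illustration, and its "auto" mode simply allots one output slot per distinct
value seen in each column, which reproduces the outputs shown in this thread.

```python
def one_hot_columns(rows, n_values="auto"):
    """Illustrative column-wise one-hot encoder (hypothetical helper).

    n_values="auto": one slot per distinct value seen in each column.
    n_values=k:      k slots per column, for the values 0..k-1.
    """
    n_cols = len(rows[0])
    if n_values == "auto":
        # Infer the categories of each column from the data itself.
        categories = [sorted({row[j] for row in rows}) for j in range(n_cols)]
    else:
        # Use the same fixed range of values for every column.
        categories = [list(range(n_values)) for _ in range(n_cols)]

    encoded = []
    for row in rows:
        out = []
        for j, value in enumerate(row):
            out.extend(1.0 if value == c else 0.0 for c in categories[j])
        encoded.append(out)
    return encoded


X = [[0, 1, 2, 3]] * 3

# Each column holds a single distinct value, so each column yields one slot.
auto = one_hot_columns(X)

# Four slots per column, 16 per row, as with n_values=4.
fixed = one_hot_columns(X, n_values=4)

# One column holding four distinct values gives a 4x4 identity pattern.
column = one_hot_columns([[0], [1], [2], [3]])
```

Here `auto` has four entries per row, all 1.0, because every column
contains exactly one distinct value; `fixed` has 16 entries per row; and
`column` is the positional encoding Lee was after.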


Best,
Sebastian


> On Sep 19, 2016, at 5:45 PM, Lee Zamparo <zamparo at gmail.com> wrote:
>
> Hi sklearners,
>
> A lab-mate came to me with a problem about encoding DNA sequences using
preprocessing.OneHotEncoder, and I find it to produce confusing results.
>
> Suppose I have a DNA string: myguide = ‘ACGT’
>
> He’d like to use OneHotEncoder to transform DNA strings, character by
character, into a one hot encoded representation like this: [[1,0,0,0],
[0,1,0,0], [0,0,1,0], [0,0,0,1]]. The use-case seems to be solved in pandas
using the dubiously named get_dummies method (
http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.get_dummies.html).
I thought that it would be trivial to do with OneHotEncoder, but it seems
strangely difficult:
>
> In [23]: myarray = le.fit_transform([c for c in myguide])
>
> In [24]: myarray
> Out[24]: array([0, 1, 2, 3])
>
> In [27]: myarray = le.transform([[c for c in myguide],[c for c in
myguide],[c for c in myguide]])
>
> In [28]: myarray
> Out[28]:
> array([[0, 1, 2, 3],
> [0, 1, 2, 3],
> [0, 1, 2, 3]])
>
> In [29]: ohe.fit_transform(myarray)
> Out[29]:
> array([[ 1., 1., 1., 1.],
> [ 1., 1., 1., 1.],
> [ 1., 1., 1., 1.]]) <— ????
>
> So this is not at all what I expected. I read the documentation for
OneHotEncoder (
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder),
but did not find it clear how it worked (also I found the example using
integers confusing). Neither FeatureHasher nor DictVectorizer seem to be
more appropriate for transforming strings into positional OneHot encoded
arrays. Am I missing something, or is this operation not supported in
sklearn?
>
> Thanks,
>
> --
> Lee Zamparo
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
