[scikit-learn] behaviour of OneHotEncoder somewhat confusing

Lee Zamparo zamparo at gmail.com
Mon Sep 19 17:45:39 EDT 2016


Hi sklearners,

A lab-mate came to me with a problem about encoding DNA sequences using
preprocessing.OneHotEncoder, and I find that it produces confusing results.

Suppose I have a DNA string:  myguide = 'ACGT'

He'd like to use OneHotEncoder to transform DNA strings, character by
character, into a one-hot encoded representation like this: [[1,0,0,0],
[0,1,0,0], [0,0,1,0], [0,0,0,1]].  This use case seems to be handled in
pandas by the dubiously named get_dummies method (
http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.get_dummies.html).
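For concreteness, something like this in pandas gives (if I remember
correctly) exactly the matrix above, one indicator column per letter
(depending on the pandas version the values may print as 1.0 rather than 1,
but the structure is the same):

In [1]: import pandas as pd

In [2]: pd.get_dummies(list(myguide))
Out[2]:
   A  C  G  T
0  1  0  0  0
1  0  1  0  0
2  0  0  1  0
3  0  0  0  1
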
I thought that it would be trivial to do with OneHotEncoder, but it seems
strangely difficult (in the session below, le is a preprocessing.LabelEncoder
and ohe is a preprocessing.OneHotEncoder, which I believe was constructed with
sparse=False, since the output comes back dense):

In [23]: myarray = le.fit_transform([c for c in myguide])

In [24]: myarray
Out[24]: array([0, 1, 2, 3])

In [27]: myarray = le.transform([[c for c in myguide], [c for c in myguide], [c for c in myguide]])

In [28]: myarray
Out[28]:
array([[0, 1, 2, 3],
       [0, 1, 2, 3],
       [0, 1, 2, 3]])

In [29]: ohe.fit_transform(myarray)
Out[29]:
array([[ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.]])    <— ????

So this is not at all what I expected.  I read the documentation for
OneHotEncoder (
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder),
but did not find it clear how the encoder works (I also found the example
using integers confusing).  Neither FeatureHasher nor DictVectorizer seems
more appropriate for transforming strings into positional one-hot encoded
arrays.  Am I missing something, or is this operation not supported in
sklearn?
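
For what it's worth, the exact output I'm after is easy enough to build by
hand from the LabelEncoder output with numpy (a quick sketch, assuming the
four-letter A/C/G/T alphabet and the le and myguide objects from above):

In [30]: import numpy as np

In [31]: labels = le.fit_transform([c for c in myguide])  # array([0, 1, 2, 3])

In [32]: np.eye(4)[labels]  # one row of the 4x4 identity per label
Out[32]:
array([[ 1.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  1.]])

But I'd much rather stay within the sklearn preprocessing machinery if it
supports this directly.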

Thanks,

-- 
Lee Zamparo