[scikit-learn] behaviour of OneHotEncoder somewhat confusing
Lee Zamparo
zamparo at gmail.com
Mon Sep 19 17:45:39 EDT 2016
Hi sklearners,
A lab-mate came to me with a problem about encoding DNA sequences using
preprocessing.OneHotEncoder, and I find it produces confusing results.
Suppose I have a DNA string: myguide = 'ACGT'
He'd like to use OneHotEncoder to transform DNA strings, character by
character, into a one-hot encoded representation like this: [[1,0,0,0],
[0,1,0,0], [0,0,1,0], [0,0,0,1]]. The use case seems to be solved in
pandas by the dubiously named get_dummies method (
http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.get_dummies.html).
I thought this would be trivial to do with OneHotEncoder, but it seems
strangely difficult:
(Here le is a preprocessing.LabelEncoder instance and ohe is a
preprocessing.OneHotEncoder instance.)

In [23]: myarray = le.fit_transform([c for c in myguide])

In [24]: myarray
Out[24]: array([0, 1, 2, 3])

In [27]: myarray = le.transform([[c for c in myguide],
                                 [c for c in myguide],
                                 [c for c in myguide]])

In [28]: myarray
Out[28]:
array([[0, 1, 2, 3],
       [0, 1, 2, 3],
       [0, 1, 2, 3]])

In [29]: ohe.fit_transform(myarray)
Out[29]:
array([[ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.]])   <-- ????
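(A minimal, self-contained sketch of what appears to be happening here:
OneHotEncoder treats each *column* of its input as a separate categorical
feature, and each column in this array contains only one distinct value, so
every feature encodes to a single always-on bit.)

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Same integer array as above: three identical rows.
myarray = np.array([[0, 1, 2, 3],
                    [0, 1, 2, 3],
                    [0, 1, 2, 3]])

# OneHotEncoder encodes column-wise. Column 0 only ever contains 0,
# column 1 only 1, and so on, so each column has a single category
# and contributes exactly one output column of ones.
ohe = OneHotEncoder()
encoded = ohe.fit_transform(myarray).toarray()
print(encoded)
# [[1. 1. 1. 1.]
#  [1. 1. 1. 1.]
#  [1. 1. 1. 1.]]
```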
So this is not at all what I expected. I read the documentation for
OneHotEncoder (
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder),
but did not find it clear how it worked (I also found the example using
integers confusing). Neither FeatureHasher nor DictVectorizer seems any
more appropriate for transforming strings into positional one-hot encoded
arrays. Am I missing something, or is this operation not supported in
sklearn?
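(One possible route, sketched here for the archive and assuming the
alphabet is exactly A/C/G/T: preprocessing.LabelBinarizer operates on a
1-d sequence of labels directly, so it produces the per-character
positional encoding described above without the column-wise behaviour of
OneHotEncoder.)

```python
from sklearn.preprocessing import LabelBinarizer

myguide = 'ACGT'

# LabelBinarizer one-hot encodes a 1-d sequence of labels,
# so each character becomes one row of the encoded matrix.
lb = LabelBinarizer()
lb.fit(list('ACGT'))  # fix the alphabet: classes_ becomes A, C, G, T

encoded = lb.transform(list(myguide))
print(encoded)
# [[1 0 0 0]
#  [0 1 0 0]
#  [0 0 1 0]
#  [0 0 0 1]]
```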
Thanks,
--
Lee Zamparo