[scikit-learn] New Transformer to Support Multiple Column Pipelines & One Hot Encoding
Dale Jacques
djacques at uwalumni.com
Tue Feb 20 13:06:06 EST 2018
Hello all,
Long time lurker, first time emailer.
I have two small contributions I would like to propose to the email list.
I was working on a project this weekend that was using both categorical and
numerical columns to predict a final output. I needed to save my
transformations to make future predictions and grid search over multiple
models and parameters, so sklearn pipelines were the obvious answer. I
set up a pipeline, ran a grid search, then pickled the best model to use for
future predictions.
This worked well, but I ran into two issues.
*1). I needed a transformer to select individual columns in my pipeline. *I
needed to apply unique transformations to each column in my data, then
recombine the results with a FeatureUnion. I realized there is no built-in
transformer that extracts a specific column within a pipeline. See this
Stack Overflow question for an example
<https://stackoverflow.com/questions/39001956/sklearn-pipeline-how-to-apply-different-transformations-on-different-columns?rq=1>.
I created a transformer that explicitly extracts the columns of interest for
use in a pipeline with FeatureUnion. A FunctionTransformer can solve this
issue, but I feel that sklearn should directly and explicitly support this
functionality. I believe this will make pipelines significantly more
intuitive and accessible for most users.
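
As a rough sketch of what I have in mind (the class name ColumnSelector and
the toy DataFrame are placeholders for illustration, not existing sklearn
API, and I am assuming the input is a pandas DataFrame):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: keep only the named DataFrame columns."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Stateless: there is nothing to learn from the data.
        return self

    def transform(self, X):
        # Indexing with a list keeps the result 2-D, which is what
        # downstream transformers and FeatureUnion expect.
        return X[list(self.columns)].to_numpy()


# Toy data, made up for this example.
df = pd.DataFrame({
    "age": [20.0, 30.0, 40.0],
    "income": [1.0, 2.0, 3.0],
    "city": ["a", "b", "a"],
})

# Each branch selects its own column(s), applies its own transformation,
# and FeatureUnion stacks the results side by side.
union = FeatureUnion([
    ("age", Pipeline([
        ("select", ColumnSelector(["age"])),
        ("scale", StandardScaler()),
    ])),
    ("income", Pipeline([
        ("select", ColumnSelector(["income"])),
    ])),
])

features = union.fit_transform(df)
```

Because the selector is an ordinary estimator, the whole union can be grid
searched and pickled along with the rest of the pipeline.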
*2). One hot encoding requires arrays that are already integers.* You can
find a similar issue here
<https://stackoverflow.com/questions/40456867/labelbinarizer-for-multiple-columns-in-data-frame>.
This can be accomplished using pandas.get_dummies() (where the
transformation cannot be saved to apply to future predictions) or by using
a scikit-learn LabelBinarizer
<http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html>
transformation. However, LabelBinarizer is designed to transform y: its fit
and transform methods accept only a single argument, so it fails when a
pipeline passes both X and y. I built a LabelBinarizer transformer that can
be used with
FeatureUnion in pipelines. This issue may be moot with the new
CategoricalEncoder
<http://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.CategoricalEncoder.html>
that is about to be released.
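
A minimal sketch of the kind of wrapper I mean (the name
PipelineLabelBinarizer is hypothetical, and I am assuming each instance
receives a single categorical column, e.g. from a column selector):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer


class PipelineLabelBinarizer(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper giving LabelBinarizer the (X, y) signature."""

    def fit(self, X, y=None):
        # Accept y so a Pipeline can call fit(X, y), but ignore it.
        # ravel() flattens a single-column 2-D input to the 1-D shape
        # that LabelBinarizer expects.
        self.binarizer_ = LabelBinarizer().fit(np.asarray(X).ravel())
        return self

    def transform(self, X):
        # Note: with exactly two classes LabelBinarizer returns one
        # column instead of two.
        return self.binarizer_.transform(np.asarray(X).ravel())


# Toy single-column input, made up for this example.
colors = np.array([["red"], ["green"], ["blue"], ["green"]])
target = np.array([0, 1, 0, 1])

pipe = Pipeline([("onehot", PipelineLabelBinarizer())])
encoded = pipe.fit_transform(colors, target)

# The fitted classes are stored, so the same encoding can be applied
# to new data later (for example after unpickling the pipeline).
new_encoded = pipe.transform(np.array([["blue"]]))
```

The columns follow the sorted class order (blue, green, red here), which is
the part that pandas.get_dummies() cannot remember between calls.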
Does the community believe I should pursue contributing either of these?
--
Cheers,
DJ