[scikit-learn] New Transformer to Support Multiple Column Pipelines & One Hot Encoding

Tue Feb 20 13:06:06 EST 2018

Hello all,

Long time lurker, first time emailer.

I have two small contributions I would like to propose to the email list.

I was working on a project this weekend that was using both categorical and
numerical columns to predict a final output. I needed to save my
transformations to make future predictions and grid search over multiple
models and parameters, so sklearn pipelines were the obvious answer.  I
set up a pipeline, grid searched, then pickled the best model to use for
future predictions.

This worked well, but I ran into two issues.
*1).  I needed a transformer to select individual columns in my pipeline.*  I
needed to apply unique transformations to each column in my data, then
recombine with a FeatureUnion.  I realized there is no supported
transformer to extract a specific column within a pipeline.  See this
Stack Overflow question for an example
<https://stackoverflow.com/questions/39001956/sklearn-pipeline-how-to-apply-different-transformations-on-different-columns?rq=1>.
I created a transformation that explicitly extracts columns of interest for
use in a pipeline with FeatureUnion.  A FunctionTransformer can also solve
this issue, but I feel sklearn should support this functionality directly
and explicitly.  I believe this will make pipelines significantly more
intuitive and accessible for most users.
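
To make the idea concrete, here is a minimal sketch of such a column
selector.  It is not the transformer I wrote for my project; the class name
ColumnSelector, the example DataFrame and its column names are all
hypothetical, and it assumes the input is a pandas DataFrame.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: extract named columns from a DataFrame."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn; selection is stateless.
        return self

    def transform(self, X):
        # Return a 2-D NumPy array so downstream steps and FeatureUnion
        # can stack the result with other branches.
        return X[self.columns].values


# Made-up data, for illustration only.
df = pd.DataFrame({
    "age": [20.0, 30.0, 40.0],
    "height": [150.0, 160.0, 170.0],
    "city": ["a", "b", "a"],
})

# Each branch selects its own column and applies its own transformation.
union = FeatureUnion([
    ("age", Pipeline([
        ("select", ColumnSelector(["age"])),
        ("scale", StandardScaler()),
    ])),
    ("height", Pipeline([
        ("select", ColumnSelector(["height"])),
    ])),
])

features = union.fit_transform(df)
```

Because the selector is an ordinary estimator, the whole FeatureUnion can be
grid searched and pickled like any other pipeline step.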

*2).  One hot encoding requires arrays that are already integers.*  You can
find a similar issue here
<https://stackoverflow.com/questions/40456867/labelbinarizer-for-multiple-columns-in-data-frame>.
This can be accomplished using pandas.get_dummies() (where the
transformation cannot be saved to apply to future predictions) or by using
a scikit-learn LabelBinarizer
<http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html>
transformation.  However, LabelBinarizer is designed to transform y: its
fit and transform methods accept a single array, so a pipeline step that
passes both X and y to it fails.  This breaks scikit-learn pipelines.  I
built a LabelBinarizer-based transformer that can be used with
FeatureUnion in pipelines.  This issue may be moot with the new
CategoricalEncoder
<http://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.CategoricalEncoder.html>
that is about to be released.
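
A minimal sketch of the wrapper idea follows.  It is not the exact
transformer I built; the class name PipelineLabelBinarizer and the sample
data are hypothetical, and it assumes a single categorical column is passed
in as X.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelBinarizer


class PipelineLabelBinarizer(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper giving LabelBinarizer the fit(X, y) signature
    that pipelines expect. Assumes X holds one categorical column."""

    def fit(self, X, y=None):
        # Flatten the single column to 1-D, as LabelBinarizer expects.
        self.binarizer_ = LabelBinarizer().fit(np.asarray(X).ravel())
        return self

    def transform(self, X):
        return self.binarizer_.transform(np.asarray(X).ravel())


# Made-up data, for illustration only.
colors = np.array([["red"], ["green"], ["blue"], ["green"]])

encoder = PipelineLabelBinarizer().fit(colors)
encoded = encoder.transform(colors)
```

The fitted classes are stored on the wrapped LabelBinarizer, so the encoding
is pickled along with the rest of the pipeline.  Note that LabelBinarizer
produces a single column when there are only two classes.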

Does the community believe I should pursue contributing either of these?

-- 
Cheers,

DJ