[scikit-learn] transform categorical data to numerical representation

Georg Heiler georg.kf.heiler at gmail.com
Mon Aug 7 02:40:18 EDT 2017


I will need to look into factorize. Here is the result from profiling the
transform method on a single new observation:
https://codereview.stackexchange.com/q/171622/132999


Best Georg
Sebastian Raschka <se.raschka at gmail.com> wrote on Sun., Aug. 6, 2017 at
20:39:

> > performance of prediction is pretty lame when there are around 100-150
> columns used as the input.
>
> Are you talking about computational performance when you are calling the
> "transform" method? Have you done some profiling to find out where your
> bottleneck (in the for loop) is? Just from a very quick look, I think this
>
> data.loc[~data[column].isin(fittedLabels), column] =
> str(replacementForUnseen)
>
> is already very slow, because fittedLabels is an array, which gives you O(n)
> lookup instead of the average O(1) you would get with a hash table. Or does
> the isin function convert it to a hash table/set/dict internally?
>
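[A minimal sketch of the set-based lookup Sebastian suggests: membership tests
against a Python set are O(1) on average, while tests against a list or array
are O(n). All column and label names below are made up for illustration.]

```python
import pandas as pd

# Hypothetical fitted labels and new data; names are illustrative only.
fitted_labels = {"red", "green", "blue"}    # set -> O(1) membership test
data = pd.DataFrame({"color": ["red", "purple", "blue"]})

# Replace values not seen during fitting with a placeholder.
replacement_for_unseen = "UNSEEN"
mask = ~data["color"].isin(fitted_labels)   # isin accepts a set directly
data.loc[mask, "color"] = replacement_for_unseen

print(data["color"].tolist())               # -> ['red', 'UNSEEN', 'blue']
```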
> In general, would it maybe help to use pandas' factorize?
> https://pandas.pydata.org/pandas-docs/stable/generated/pandas.factorize.html
> At prediction time, say you have only 1 sample that needs to be converted:
> you could append prototypes of all possible values that could occur, do the
> transformation, and then pass only that 1 transformed sample to the
> classifier. I guess that could be slow as well, though ...
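[A hedged sketch of the prototype-appending trick Sebastian describes here,
using pandas.factorize; the values and variable names are illustrative only.]

```python
import pandas as pd

# Prototypes of all values seen during fitting, plus one new observation.
fitted = pd.Series(["small", "medium", "large"])
new_obs = pd.Series(["medium"])

# Append the new observation to the prototypes, factorize the combined
# series, then keep only the integer code of the new observation.
combined = pd.concat([fitted, new_obs], ignore_index=True)
codes, uniques = pd.factorize(combined)
new_code = codes[-1]                       # code for the single new sample

print(new_code, list(uniques))             # -> 1 ['small', 'medium', 'large']
```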
>
> Best,
> Sebastian
>
> > On Aug 6, 2017, at 6:30 AM, Georg Heiler <georg.kf.heiler at gmail.com>
> wrote:
> >
> > @sebastian: thanks. Indeed, I am aware of this problem.
> >
> > I developed something here:
> https://gist.github.com/geoHeil/5caff5236b4850d673b2c9b0799dc2ce but
> realized that the performance of prediction is pretty lame when there are
> around 100-150 columns used as the input.
> > Do you have some ideas how to speed this up?
> >
> > Regards,
> > Georg
> >
> > Joel Nothman <joel.nothman at gmail.com> wrote on Sun., Aug. 6, 2017 at
> 00:49:
> > We are working on CategoricalEncoder in
> https://github.com/scikit-learn/scikit-learn/pull/9151 to help users more
> with this kind of thing. Feedback and testing is welcome.
> >
> > On 6 August 2017 at 02:13, Sebastian Raschka <se.raschka at gmail.com>
> wrote:
> > Hi, Georg,
> >
> > I bring this up every time here on the mailing list :), and you are
> probably aware of this issue, but it makes a difference whether your
> categorical data is nominal or ordinal. For instance, if you have an
> ordinal variable with values like {small, medium, large}, you probably want
> to encode it as {1, 2, 3} or {1, 20, 100} or whatever is appropriate based
> on your domain knowledge of the variable. If you have something like {blue,
> red, green}, it may make more sense to do a one-hot encoding so that the
> classifier doesn't assume a relationship between the values such as blue >
> red > green.
> >
> > Now, the DictVectorizer and OneHotEncoder both do one-hot encoding. The
> LabelEncoder does convert a variable to integer values, but if you have
> something like {small, medium, large}, it wouldn't know the order (if
> that's an ordinal variable) and would just assign arbitrary integers in
> increasing order. Thus, if you are dealing with ordinal variables, there's
> no way around doing this manually; for example, you could create mapping
> dictionaries for that (most conveniently done in pandas).
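[A small sketch of both approaches Sebastian mentions: an explicit mapping
dictionary for an ordinal column, and one-hot encoding for a nominal one. The
column names and the ordering are illustrative assumptions.]

```python
import pandas as pd

df = pd.DataFrame({
    "size": ["small", "large", "medium"],   # ordinal
    "color": ["blue", "red", "green"],      # nominal
})

# Ordinal: a mapping dictionary encodes the order known from domain knowledge.
size_map = {"small": 1, "medium": 2, "large": 3}
df["size_encoded"] = df["size"].map(size_map)

# Nominal: one-hot encoding avoids implying an order between colors.
df = pd.get_dummies(df, columns=["color"])

print(df.columns.tolist())
```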
> >
> > Best,
> > Sebastian
> >
> > > On Aug 5, 2017, at 5:10 AM, Georg Heiler <georg.kf.heiler at gmail.com>
> wrote:
> > >
> > > Hi,
> > >
> > > the LabelEncoder is only meant for a single column, i.e. the target
> variable. Is the DictVectorizer or a manual chaining of multiple
> LabelEncoders (one per categorical column) the desired way to get values
> which can be fed into a subsequent classifier?
> > >
> > > Is there some way I have overlooked which works better and possibly
> also can handle unseen values by applying most frequent imputation?
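[One possible reading of the question, as a hedged sketch rather than an
established scikit-learn API: one LabelEncoder per categorical column, with
values unseen at transform time replaced by the column's most frequent
training value. All names here are illustrative.]

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train = pd.DataFrame({"color": ["red", "red", "blue"]})
test = pd.DataFrame({"color": ["blue", "purple"]})   # "purple" is unseen

encoders, most_frequent = {}, {}
for col in train.columns:
    encoders[col] = LabelEncoder().fit(train[col])
    most_frequent[col] = train[col].mode()[0]        # most-frequent imputation

def transform(df):
    out = df.copy()
    for col, enc in encoders.items():
        seen = set(enc.classes_)
        # Replace unseen values before encoding, then map to integer codes.
        out[col] = out[col].where(out[col].isin(seen), most_frequent[col])
        out[col] = enc.transform(out[col])
    return out

print(transform(test)["color"].tolist())
```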
> > >
> > > regards,
> > > Georg
> > > _______________________________________________
> > > scikit-learn mailing list
> > > scikit-learn at python.org
> > > https://mail.python.org/mailman/listinfo/scikit-learn
> >
>
>