[scikit-learn] transform categorical data to numerical representation

Georg Heiler georg.kf.heiler at gmail.com
Mon Aug 7 02:41:39 EDT 2017


To my understanding, pandas.factorize only works for the static case,
where no unseen categories can occur.
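
For illustration, a minimal sketch of what I mean (codes are assigned by
order of appearance, and a later batch has no built-in way to reuse the
fitted categories):

    import pandas as pd

    train = pd.Series(["red", "blue", "red", "green"])
    codes, uniques = pd.factorize(train)
    # codes -> [0, 1, 0, 2]; uniques -> ["red", "blue", "green"]

    # a new batch factorized on its own gets a different, incompatible
    # coding, and an unseen value silently receives a code of its own
    new = pd.Series(["green", "purple"])
    new_codes, new_uniques = pd.factorize(new)
    # new_codes -> [0, 1]
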
On Mon, Aug 7, 2017 at 08:40, Georg Heiler <georg.kf.heiler at gmail.com>
wrote:

> I will need to look into factorize. Here is the result from profiling the
> transform method on a single new observation:
> https://codereview.stackexchange.com/q/171622/132999
>
>
> Best Georg
> On Sun, Aug 6, 2017 at 20:39, Sebastian Raschka <se.raschka at gmail.com>
> wrote:
>
>> > performance of prediction is pretty lame when there are around 100-150
>> > columns used as the input.
>>
>> Are you talking about computational performance when you are calling the
>> "transform" method? Have you done some profiling to find out where your
>> bottleneck (in the for loop) is? Just from a very quick look, I think this
>>
>>     data.loc[~data[column].isin(fittedLabels), column] = \
>>         str(replacementForUnseen)
>>
>> is already very slow, because fittedLabels is an array, so each lookup is
>> O(n) instead of the average O(1) you would get with a hash table. Or does
>> the isin function convert it to a hash table/set/dict internally?
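>>
>> If not, something like this might already help (an untested sketch, with
>> fittedLabels, data, column, and replacementForUnseen as in your gist):
>>
>>     fitted_set = set(fittedLabels)  # average O(1) membership lookups
>>     unseen_mask = ~data[column].isin(fitted_set)
>>     data.loc[unseen_mask, column] = str(replacementForUnseen)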
>>
>> In general, would it maybe help to use pandas' factorize?
>> https://pandas.pydata.org/pandas-docs/stable/generated/pandas.factorize.html
>> At predict time, say you have only 1 example that needs to be converted:
>> you could append prototypes of all possible values that could occur, do
>> the transformation, and then pass only the 1 transformed sample to the
>> classifier. I guess that could be slow as well, though ...
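>>
>> Roughly what I have in mind (an untested sketch; uniques stands for the
>> categories kept from pd.factorize at fit time):
>>
>>     import pandas as pd
>>
>>     # categories kept from fit time (made-up example values)
>>     uniques = pd.Index(["red", "blue", "green"])
>>
>>     # single new sample at predict time
>>     sample = pd.Series(["blue"])
>>
>>     # prepend one prototype per known category so the coding matches
>>     # fit time, then keep only the code of the actual sample
>>     augmented = pd.concat([pd.Series(uniques), sample],
>>                           ignore_index=True)
>>     codes, _ = pd.factorize(augmented)
>>     sample_code = codes[-1]  # -> 1, the fit-time code of "blue"
>>
>> (Alternatively, uniques.get_indexer(sample) would give the fit-time codes
>> directly, with -1 for unseen values.)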
>>
>> Best,
>> Sebastian
>>
>> > On Aug 6, 2017, at 6:30 AM, Georg Heiler <georg.kf.heiler at gmail.com>
>> wrote:
>> >
>> > @sebastian: thanks. Indeed, I am aware of this problem.
>> >
>> > I developed something here:
>> > https://gist.github.com/geoHeil/5caff5236b4850d673b2c9b0799dc2ce but
>> > realized that the performance of prediction is pretty lame when there
>> > are around 100-150 columns used as the input.
>> > Do you have any ideas on how to speed this up?
>> >
>> > Regards,
>> > Georg
>> >
>> > On Sun, Aug 6, 2017 at 00:49, Joel Nothman <joel.nothman at gmail.com>
>> > wrote:
>> > We are working on CategoricalEncoder in
>> > https://github.com/scikit-learn/scikit-learn/pull/9151 to help users
>> > more with this kind of thing. Feedback and testing are welcome.
>> >
>> > On 6 August 2017 at 02:13, Sebastian Raschka <se.raschka at gmail.com>
>> > wrote:
>> > Hi, Georg,
>> >
>> > I bring this up every time here on the mailing list :), and you are
>> > probably aware of this issue, but it makes a difference whether your
>> > categorical data is nominal or ordinal. For instance, if you have an
>> > ordinal variable with values like {small, medium, large}, you probably
>> > want to encode it as {1, 2, 3} or {1, 20, 100} or whatever is
>> > appropriate based on your domain knowledge regarding the variable. If
>> > you have something like {blue, red, green}, it may make more sense to
>> > do a one-hot encoding so that the classifier doesn't assume a
>> > relationship between the values like blue > red > green.
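>> >
>> > To make this concrete, a quick sketch (column and value names are just
>> > examples):
>> >
>> >     import pandas as pd
>> >
>> >     df = pd.DataFrame({"color": ["blue", "red", "green", "blue"]})
>> >
>> >     # one-hot encoding: one 0/1 indicator column per category,
>> >     # with no order implied between the categories
>> >     onehot = pd.get_dummies(df["color"], prefix="color")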
>> >
>> > Now, the DictVectorizer and OneHotEncoder both do one-hot encoding.
>> > The LabelEncoder does convert a variable to integer values, but if you
>> > have something like {small, medium, large}, it wouldn't know the order
>> > (if that's an ordinal variable) and would just assign arbitrary
>> > integers in increasing order. Thus, if you are dealing with ordinal
>> > variables, there's no way around doing this manually; for example, you
>> > could create mapping dictionaries for that (most conveniently done in
>> > pandas).
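>> >
>> > E.g., something like this (a sketch; the integer values are up to your
>> > domain knowledge):
>> >
>> >     import pandas as pd
>> >
>> >     df = pd.DataFrame({"size": ["small", "large", "medium"]})
>> >
>> >     # an explicit mapping preserves the known order, which
>> >     # LabelEncoder cannot infer from the strings
>> >     size_mapping = {"small": 1, "medium": 2, "large": 3}
>> >     df["size"] = df["size"].map(size_mapping)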
>> >
>> > Best,
>> > Sebastian
>> >
>> > > On Aug 5, 2017, at 5:10 AM, Georg Heiler <georg.kf.heiler at gmail.com>
>> > > wrote:
>> > >
>> > > Hi,
>> > >
>> > > the LabelEncoder is only meant for a single column, i.e. the target
>> > > variable. Is the DictVectorizer, or a manual chaining of multiple
>> > > LabelEncoders (one per categorical column), the intended way to get
>> > > values that can be fed into a subsequent classifier?
>> > >
>> > > Is there some way I have overlooked that works better and could also
>> > > handle unseen values by applying most-frequent imputation?
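>> > >
>> > > To illustrate what I mean (a sketch with made-up data):
>> > >
>> > >     import pandas as pd
>> > >     from sklearn.feature_extraction import DictVectorizer
>> > >
>> > >     df = pd.DataFrame({"color": ["blue", "red"],
>> > >                        "size": ["small", "large"]})
>> > >
>> > >     # DictVectorizer one-hot encodes the string features of the
>> > >     # record dicts
>> > >     vec = DictVectorizer(sparse=False)
>> > >     X = vec.fit_transform(df.to_dict(orient="records"))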
>> > >
>> > > regards,
>> > > Georg