[scikit-learn] Imputers and DataFrame objects

Ram Rachum ram at rachum.com
Tue Aug 18 14:41:10 EDT 2020


On Tue, Aug 18, 2020 at 6:53 PM Kevin Markham <kevin at dataschool.io> wrote:

> Hi Ram,
>
> > For a column with numbers written like "one", "two" and missing values
> "?", I had to do two things: Change them to numbers (1, 2), and then,
> instead of the missing values, add the most common element, or mean or
> whatever. When I tried to use LabelEncoder to do the first part, it
> complained about the missing values.
>
> LabelEncoder is not the right tool for this task. It does map strings to
> integers, but it's not a tool for mapping *particular* strings to
> *particular* integers. More generally: LabelEncoder is a tool for encoding
> a label, not a tool for data cleaning (which is how I would describe your
> task).
>
> > all the while I'm thinking "It would be so much simpler to just write my
> own logic in a for-loop rather than try to get Pandas and scikit-learn
> working together."
>
> I wouldn't describe this as a case in which "pandas and scikit-learn
> aren't working well together." Rather, I would describe this as a case of
> trying to use a scikit-learn function when what you actually need is a
> pandas function.
>
> Here's a solution to your problem in two lines of pandas code:
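> import numpy as np  # provides np.nan for the missing-value marker
> # Map each word to its number; '?' becomes NaN
> df['col'] = df['col'].map({'one': 1, 'two': 2, '?': np.nan})
> df['col'] = df['col'].fillna(df['col'].mean())  # fill NaN with the mean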
>
> Showing you that there is a simple solution is not a critique of you.
> Rather, pandas and scikit-learn are complex tools with huge APIs, and it
> takes time to master them. And to be clear, I'm not critiquing the tools
> either: they are complex tools with huge APIs because they are addressing
> complex problems with lots of functional areas.
>

I understand, that makes sense. Thank you.
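
For the archive, here is a minimal self-contained sketch of the two-step
approach quoted above. The toy data and the names `word_to_number`,
`mean_filled` and `mode_filled` are my own assumptions for illustration; only
the `map` and `fillna` calls come from Kevin's message. It also shows the
"most common element" variant I mentioned, using `Series.mode()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: numbers written as words, "?" marks a missing value.
df = pd.DataFrame({"col": ["one", "two", "?", "two", "one", "two"]})

# Step 1: map particular strings to particular numbers; "?" becomes NaN.
word_to_number = {"one": 1, "two": 2, "?": np.nan}
df["col"] = df["col"].map(word_to_number)

# Step 2a: fill the missing value with the column mean (NaN is skipped
# when the mean is computed).
mean_filled = df["col"].fillna(df["col"].mean())

# Step 2b: alternatively, fill with the most common value. mode() can return
# several values on a tie, so take the first one.
mode_filled = df["col"].fillna(df["col"].mode().iloc[0])

print(mean_filled.tolist())
print(mode_filled.tolist())
```

Note that any string missing from the dictionary also maps to NaN, so it is
worth checking `df["col"].isna().sum()` after step 1 to catch unexpected
spellings before imputing.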


>
> > But it kind of felt like... What am I using a framework for to begin
> with?
>
> I think you will find that pandas and scikit-learn can save you a lot of
> code, but it does require finding the right function or class. Learning
> these tools requires an investment of time, and many people have found that
> this investment is well worth it.
>
> However, solving your problems with custom code is always an option, and
> it's totally fine if that is your preferred option!
>
> Hope that helps,
>
> Kevin
>
>
Thanks for your help Kevin.