[scikit-learn] Imputers and DataFrame objects

Kevin Markham kevin at dataschool.io
Tue Aug 18 11:39:19 EDT 2020


Hi Ram,

> For a column with numbers written like "one", "two" and missing values
"?", I had to do two things: Change them to numbers (1, 2), and then,
instead of the missing values, add the most common element, or mean or
whatever. When I tried to use LabelEncoder to do the first part, it
complained about the missing values.

LabelEncoder is not the right tool for this task. It does map strings to
integers, but it's not a tool for mapping *particular* strings to
*particular* integers. More generally: LabelEncoder is a tool for encoding
a label, not a tool for data cleaning (which is how I would describe your
task).
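
To illustrate the point, here is a minimal sketch (the sample values are made up for illustration) showing that LabelEncoder assigns integers by the sorted order of the distinct values, so you do not get to choose which string maps to which integer:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample values, not taken from the original data
words = ["one", "two", "three", "two"]

encoder = LabelEncoder()
codes = encoder.fit_transform(words)

# classes_ holds the distinct values in sorted order; each value's
# integer code is simply its position in that sorted array.
learned_mapping = {str(label): code for code, label in enumerate(encoder.classes_)}

print(learned_mapping)  # {'one': 0, 'three': 1, 'two': 2}
print(list(codes))      # [0, 2, 1, 2]
```

Note that "three" ends up as 1 and "two" as 2, which is alphabetical order rather than numeric meaning.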

> all the while I'm thinking "It would be so much simpler to just write my
own logic in a for-loop rather than try to get Pandas and scikit-learn
working together.

I wouldn't describe this as a case in which "pandas and scikit-learn aren't
working well together." Rather, I would describe this as a case of trying
to use a scikit-learn function when what you actually need is a pandas
function.

Here's a solution to your problem in two lines of pandas code (with NumPy
imported as np):
df['col'] = df['col'].map({'one':1, 'two':2, '?':np.nan})
df['col'] = df['col'].fillna(df['col'].mean())
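
For anyone who wants to run this end to end, here is a self-contained sketch of the same two steps. The sample DataFrame is invented for illustration, and the second variant (filling with the most common value instead of the mean) is an assumption about what you might want rather than part of the two-liner:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: numbers spelled out, with "?" marking missing values
df = pd.DataFrame({"col": ["one", "two", "?", "two", "one", "two"]})

# Step 1: map particular strings to particular numbers ("?" becomes NaN).
# Any value that is absent from the dict would also become NaN.
numeric = df["col"].map({"one": 1, "two": 2, "?": np.nan})

# Step 2a: fill missing values with the mean of the observed values
filled_mean = numeric.fillna(numeric.mean())

# Step 2b: alternatively, fill with the most common observed value
filled_mode = numeric.fillna(numeric.mode().iloc[0])

print(filled_mean.tolist())  # [1.0, 2.0, 1.6, 2.0, 1.0, 2.0]
print(filled_mode.tolist())  # [1.0, 2.0, 2.0, 2.0, 1.0, 2.0]
```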

Showing you that there is a simple solution is not a critique of you.
Rather, pandas and scikit-learn are complex tools with huge APIs, and it
takes time to master them. And to be clear, I'm not critiquing the tools
either: they are complex tools with huge APIs because they are addressing
complex problems with lots of functional areas.

> But it kind of felt like... What am I using a framework for to begin with?

I think you will find that pandas and scikit-learn can save you a lot of
code, but it does require finding the right function or class. Learning
these tools requires an investment of time, and many people have found that
this investment is well worth it.
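
As one example of finding the right class, here is a rough sketch of applying SimpleImputer to a single column, following the points quoted further down in this thread (2-dimensional input, NumPy array output). The DataFrame and column names are invented for illustration, and this assumes scikit-learn's default output settings:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: "col" is already numeric but has a missing value
df = pd.DataFrame({"col": [1.0, 2.0, np.nan, 2.0], "other": ["a", "b", "c", "d"]})

imputer = SimpleImputer(strategy="most_frequent")

# Double brackets select a 1-column DataFrame (2-dimensional),
# which is the shape the imputer expects.
imputed = imputer.fit_transform(df[["col"]])

# The result is a 2-dimensional array of shape (4, 1); flatten it
# and assign it back so the rest of the DataFrame is left untouched.
df["col"] = np.asarray(imputed).ravel()

print(df["col"].tolist())  # [1.0, 2.0, 2.0, 2.0]
```

Assigning the flattened array back to the column avoids a separate concatenation step, at the cost of overwriting the original values.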

However, solving your problems with custom code is always an option, and
it's totally fine if that is your preferred option!

Hope that helps,

Kevin

On Tue, Aug 18, 2020 at 7:56 AM Ram Rachum <ram at rachum.com> wrote:

>
>
> On Mon, Aug 17, 2020 at 8:55 PM Kevin Markham <kevin at dataschool.io> wrote:
>
>> Hi Ram,
>>
>> These are great questions!
>>
>
> Thank you for the detailed answers.
>
>>
>> > The task was to remove these irregularities. So for the "?" items,
>> replace them with mean, and for the "one", "two" etc. replace with a
>> numerical value.
>>
>> If your primary task is "data cleaning", then pandas is usually the
>> optimal tool. If "preprocessing your data for Machine Learning" is your
>> primary task, then scikit-learn is usually the optimal tool. There is some
>> overlap between what is considered "cleaning" and "preprocessing", but I
>> mention this distinction because it can help you decide what tool to use.
>>
>
> Okay, but here's one example where it gets tricky. For a column with
> numbers written like "one", "two" and missing values "?", I had to do two
> things: Change them to numbers (1, 2), and then, instead of the missing
> values, add the most common element, or mean or whatever. When I tried to
> use LabelEncoder to do the first part, it complained about the missing
> values. I couldn't fix these missing values until the labels were changed
> to ints. So that put me in a frustrating Catch-22 situation, and all the
> while I'm thinking "It would be so much simpler to just write my own logic
> in a for-loop rather than try to get Pandas and scikit-learn working
> together.
>
> Any insights about that?
>
>
>> > For one, I couldn't figure out how to apply SimpleImputer on just one
>> column in the DataFrame, and then get the results in the form of a
>> dataframe.
>>
>> Like most scikit-learn transformers, SimpleImputer expects 2-dimensional
>> input. In your case, this would be a 1-column DataFrame (such as
>> df[['col']]) rather than a Series (such as df['col']).
>>
>> Also like most scikit-learn transformers, SimpleImputer outputs a NumPy
>> array. If you need the output to be a DataFrame, one option is to convert
>> the array to a pandas object and concatenate it to the original DataFrame.
>>
>
> Well, I did do that in the `process_column` helper function in the code I
> linked to above. But it kind of felt like... What am I using a framework
> for to begin with? Because that kind of logistics is the reason I want to
> use a framework instead of managing my own arrays and imputing logic.
>
> Thanks for your help Kevin.
>


-- 
Kevin Markham
Founder, Data School
https://www.dataschool.io
https://www.youtube.com/dataschool
https://www.patreon.com/dataschool