[scikit-learn] Imputers and DataFrame objects

Kevin Markham kevin at dataschool.io
Mon Aug 17 13:53:29 EDT 2020


Hi Ram,

These are great questions!

> The task was to remove these irregularities. So for the "?" items,
> replace them with mean, and for the "one", "two" etc. replace with a
> numerical value.

If your primary task is "data cleaning", then pandas is usually the optimal
tool. If "preprocessing your data for Machine Learning" is your primary
task, then scikit-learn is usually the optimal tool. There is some overlap
between what is considered "cleaning" and "preprocessing", but I mention
this distinction because it can help you decide what tool to use.

> she told me I should use the tools that come with sklearn: SimpleImputer,
> OneHotEncoder, BinaryEncoder for the "one" "two" "three".

Just for clarification, BinaryEncoder is not part of scikit-learn. Instead,
it's part of the Category Encoders library, which is a related project to
scikit-learn.

> For one, I couldn't figure out how to apply SimpleImputer on just one
> column in the DataFrame, and then get the results in the form of a
> dataframe.

Like most scikit-learn transformers, SimpleImputer expects 2-dimensional
input. In your case, this would be a 1-column DataFrame (such as
df[['col']]) rather than a Series (such as df['col']).
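
To illustrate the difference, here is a minimal sketch. The DataFrame and
the column name "col" are made up for the example and are not taken from
your data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: "col" stands in for a numeric column with a missing value.
df = pd.DataFrame({"col": [1.0, np.nan, 3.0], "other": [10, 20, 30]})

print(df["col"].shape)    # (3,)   -> Series, 1-dimensional
print(df[["col"]].shape)  # (3, 1) -> DataFrame, 2-dimensional

# Pass the 1-column DataFrame (double brackets) to the imputer.
imputer = SimpleImputer(strategy="mean")
imputed = imputer.fit_transform(df[["col"]])
```

Here the missing value is filled with the mean of the other two values in
the column.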

Also like most scikit-learn transformers, SimpleImputer outputs a NumPy
array. If you need the output to be a DataFrame, one option is to convert
the array to a pandas object and concatenate it to the original DataFrame.
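
A rough sketch of that conversion follows. The column names "price",
"make", and "price_imputed" are invented for illustration, and this is only
one of several ways to do it:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing numeric value.
df = pd.DataFrame({"price": [100.0, np.nan, 300.0], "make": ["a", "b", "c"]})

imputer = SimpleImputer(strategy="mean")
arr = imputer.fit_transform(df[["price"]])  # NumPy array of shape (3, 1)

# Wrap the array in a DataFrame, reusing the original index so rows line up.
price_imputed = pd.DataFrame(arr, columns=["price_imputed"], index=df.index)

# Concatenate column-wise to the original DataFrame.
result = pd.concat([df, price_imputed], axis=1)
```

If you would rather overwrite the column than add a new one, assigning the
array back with df[["price"]] = imputer.fit_transform(df[["price"]]) should
also work.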

> Also, when trying to use BinaryEncoder for "one" "two" "three", it raised
> an exception because there were NaN values there.

Neither OneHotEncoder nor BinaryEncoder will help you to replace these
string values with the corresponding numbers. Instead, I recommend using
the pandas Series map method on the column in question.
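
As a small sketch of the map approach, applied to a single column: the
column name "doors" and the dictionary below are assumptions for the
example, so adjust them to whatever words appear in your data.

```python
import pandas as pd

# Hypothetical column mixing number words with the "?" placeholder.
df = pd.DataFrame({"doors": ["two", "four", "?", "two"]})

# Dictionary from word to number; anything not listed (such as "?") maps to NaN.
word_to_number = {"one": 1, "two": 2, "three": 3, "four": 4}
df["doors_num"] = df["doors"].map(word_to_number)

# The resulting NaN values can then be filled with the column mean.
filled = df["doors_num"].fillna(df["doors_num"].mean())
```

Because unmapped values become NaN, the "?" entries are handled in the same
step and can be imputed afterwards.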

Alternatively, if you need to do this mapping operation within
scikit-learn, you could wrap the pandas functionality into a custom
scikit-learn transformer using FunctionTransformer. That is a bit more
complicated, though it does have the benefit that you can chain it into a
Pipeline with a SimpleImputer. But again, this is more complicated and is
not the recommended approach unless you are already fluent with the
scikit-learn API.
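
For reference, one possible shape of that approach is sketched below. The
helper words_to_numbers, the WORD_TO_NUMBER dictionary, and the "doors"
column are all hypothetical names chosen for the example:

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical mapping; extend it to cover the words in your data.
WORD_TO_NUMBER = {"one": 1, "two": 2, "three": 3, "four": 4}

def words_to_numbers(X):
    """Map number words to floats in every column; unknown values become NaN."""
    X = pd.DataFrame(X)
    return X.apply(lambda col: col.map(WORD_TO_NUMBER)).astype(float)

# Step 1 converts words to numbers, step 2 fills the NaN values with the mean.
pipe = make_pipeline(
    FunctionTransformer(words_to_numbers),
    SimpleImputer(strategy="mean"),
)

df = pd.DataFrame({"doors": ["two", "four", "?", "two"]})
out = pipe.fit_transform(df[["doors"]])  # NumPy array of shape (4, 1)
```

The pipeline still needs 2-dimensional input, which is why it is given
df[["doors"]] rather than df["doors"].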

> Any insight you could give me would be useful.

It sounds like using pandas for the tasks you described is the optimal
approach, but I'm basing that opinion purely on what I know from your email.

Hope that helps!

Kevin

On Mon, Aug 17, 2020 at 3:54 AM Ram Rachum <ram at rachum.com> wrote:

> Hey guys,
>
> This is a bit of a complicated question.
>
> I was helping my friend do a task with Pandas/sklearn for her data science
> class. I figured it'll be a breeze, since I'm fancy-pancy Python
> programmer. Oh wow, it was so not.
>
> I was trying to do things that felt simple to me, but there were so many
> problems, I spent 2 hours and only had a partial solution. I'm wondering
> whether I'm missing something.
>
> She got a CSV with lots of data about cars. Some of the data had missing
> values (marked with "?"). Additionally, some columns had small numbers
> written as strings like "one", "two", "three", etc. There were maybe a few
> more issues like these.
>
> The task was to remove these irregularities. So for the "?" items, replace
> them with mean, and for the "one", "two" etc. replace with a numerical
> value.
>
> I could easily write my own logic that does that, but she told me I should
> use the tools that come with sklearn: SimpleImputer, OneHotEncoder,
> BinaryEncoder for the "one" "two" "three".
>
> They gave me so, so many problems. For one, I couldn't figure out how to
> apply SimpleImputer on just one column in the DataFrame, and then get the
> results in the form of a dataframe. (Either changing in-place or creating a
> new DataFrame.) I think I spent an hour on this problem alone. Eventually I found
> a way <https://www.dropbox.com/preview/Desktop/Shani/floof.py>, but it
> definitely felt like I was doing something wrong, like this is supposed to
> be simpler.
>
> Also, when trying to use BinaryEncoder for "one" "two" "three", it raised
> an exception because there were NaN values there. Well, I wanted to first
> convert them to real numbers and then use the same SimpleImputer to fix
> these. But I couldn't, because of the exception.
>
> Any insight you could give me would be useful.
>
>
> Thanks,
> Ram.


-- 
Kevin Markham
Founder, Data School
https://www.dataschool.io
https://www.youtube.com/dataschool
https://www.patreon.com/dataschool