Pandas cat.categories.isin list, is this a bug?

Matt Ruffalo matt.ruffalo at gmail.com
Tue May 15 08:53:48 EDT 2018


On 2018-05-15 06:23, Zoran Ljubišić wrote:
> Matt,
>
> thanks for the info about pydata mailing group. I didn't know it exists.
> Because comp.lang.python is not appropriate group for this question, I
> will continue our conversation on gmail.
>
> I have put len(df.CRM_assetID.cat.categories.isin(['V1254748',
> 'V805722', 'V1105400'])) = 55418 in next message, after I noticed that
> this information is missing.
>
> If I want to select all rows that have categories from the list, how
> to do that?
>
> Regards,
>
> Zoran
>

Hi Zoran-

(Including python-list again, for lack of a reason not to. This
conversation is still relevant and appropriate for the general Python
mailing list -- I just meant that the pydata list likely has many more
Pandas users/experts, so you're more likely to get a better answer,
faster, from a more specialized group.)

Selecting all rows whose category is in your list is a bit simpler than
what you are doing -- your issue is that you are working with the *set of
distinct categories*, and not the actual vector of category values (one
per row) corresponding to your data.

You can select items you're interested in with something like the following:

"""
In [1]: import pandas as pd

In [2]: s = pd.Series(['apple', 'banana', 'apple', 'pear', 'banana',
'cherry', 'pear', 'cherry']).astype('category')

In [3]: s
Out[3]:
0     apple
1    banana
2     apple
3      pear
4    banana
5    cherry
6      pear
7    cherry
dtype: category
Categories (4, object): [apple, banana, cherry, pear]

In [4]: s.isin({'apple', 'pear'})
Out[4]:
0     True
1    False
2     True
3     True
4    False
5    False
6     True
7    False
dtype: bool

In [5]: s.loc[s.isin({'apple', 'pear'})]
Out[5]:
0    apple
2    apple
3     pear
6     pear
dtype: category
Categories (4, object): [apple, banana, cherry, pear]
"""
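
The same boolean mask works for selecting rows of a whole DataFrame. Here
is a minimal sketch on a small made-up frame: the column name CRM_assetID
and three of the IDs are taken from your message, while 'V0000001' and the
`value` column are invented purely for illustration.

```python
import pandas as pd

# Made-up data; only the column name and three of the IDs come from the thread.
df = pd.DataFrame({
    "CRM_assetID": ["V1254748", "V0000001", "V805722", "V1105400", "V0000001"],
    "value": [10, 20, 30, 40, 50],
})
df["CRM_assetID"] = df["CRM_assetID"].astype("category")

wanted = ["V1254748", "V805722", "V1105400"]

# Row-wise mask: one boolean per row of df, not one per distinct category.
mask = df["CRM_assetID"].isin(wanted)

# Keep only the rows whose CRM_assetID is in the wanted list.
selected = df.loc[mask]
print(selected)
```

Here `mask` has the same length as `df`, so it can be used directly as a
row selector.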

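(Note that I'm also passing a set to `isin` instead of a list -- this
doesn't matter when looking for two or three values. As far as I know,
pandas converts whatever list-like you pass into its own array or hash
table before testing membership, so the container type should matter
little even for 10_000 or 1_000_000 values; a set is still the natural
choice for plain-Python `in` tests, where a list means linear-time
lookups.)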

By contrast, you are accessing the vector of the *unique categories* in
that column, like this:

"""
In [6]: s.cat.categories
Out[6]: Index(['apple', 'banana', 'cherry', 'pear'], dtype='object')

In [7]: s.cat.categories.isin({'apple', 'pear'})
Out[7]: array([ True, False, False,  True])
"""

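The vector `s.cat.categories` has one element for each distinct category
in your column, not one per row. Your column apparently contains 55418
different categories, which is why `len(...)` of that `isin` result gave
55418 regardless of how many rows matched.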
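
A quick way to see the difference is to compare lengths. The sketch below
rebuilds the toy fruit series so it runs on its own. The last step, mapping
a per-category mask back to rows through `s.cat.codes`, is an extra I'm
adding for illustration, and it assumes the column has no missing values
(those get code -1).

```python
import pandas as pd

s = pd.Series(["apple", "banana", "apple", "pear", "banana",
               "cherry", "pear", "cherry"]).astype("category")

row_mask = s.isin(["apple", "pear"])                 # one entry per row
cat_mask = s.cat.categories.isin(["apple", "pear"])  # one entry per distinct category

print(len(s), len(row_mask))                 # number of rows, twice
print(len(s.cat.categories), len(cat_mask))  # number of distinct categories, twice

# Each row stores an integer code pointing into s.cat.categories, so indexing
# the per-category mask by those codes gives a per-row mask again.
rows_from_cats = cat_mask[s.cat.codes.to_numpy()]
print((rows_from_cats == row_mask.to_numpy()).all())
```

In practice calling `isin` on the series itself is simpler; the codes trick
is only there to show how the two vectors relate.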

MMR...


