Pandas cat.categories.isin list, is this a bug?

Matt Ruffalo matt.ruffalo at gmail.com
Mon May 14 08:34:51 EDT 2018


On 2018-05-14 07:05, zljubisic at gmail.com wrote:
> Hi,
>
> I have dataframe with CRM_assetID column as category dtype:
>
> df.info()
>
> <class 'pandas.core.frame.DataFrame'>
> RangeIndex: 1435952 entries, 0 to 1435951
> Data columns (total 75 columns):
> startTime                            1435952 non-null object
> CRM_assetID                          1435952 non-null category
>
> searching a dataframe for each of three categories:
>
> df[df.CRM_assetID == 'V1254748'].shape
> (35, 75)
> df[df.CRM_assetID == 'V805722'].shape
> (45, 75)
> df[df.CRM_assetID == 'V1105400'].shape
> (34, 75)
>
>
> len(df.CRM_assetID.cat.categories.isin(['V1254748', 'V805722', 'V1105400']))
>
> Why this len is not equal to 114 (35 + 45 + 34)?
>
> Regards.

Hello-

First, this is a general Python group; not everyone here is necessarily
an expert in or user of Pandas. In the future you might have more
success with the pydata mailing list/group.

When you say that `len(df.CRM_assetID.cat.categories.isin(['V1254748',
'V805722', 'V1105400']))` is not equal to 114, it would be helpful to
say what this length actually is.

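Your usage of `df.CRM_assetID.cat.categories` refers to the *unique
categories in that column*, not the actual values in that column.
Calling `isin` on it gives a boolean vector with one entry per distinct
category, signifying whether that category is in your list, so its
length is the total number of categories no matter how many of them
match. Presumably you have more categories in that column than the
three you are checking. To count the matching rows instead, call `isin`
on the column itself and sum the result:
`df.CRM_assetID.isin(['V1254748', 'V805722', 'V1105400']).sum()`.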
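
Here is a minimal sketch of the difference, using a small made-up
dataframe rather than your data (the values 'V888' and 'V999' and the
variable names are only for illustration). It compares the length of the
per-category mask with a count of the rows that match:

```python
import pandas as pd

# Hypothetical toy data standing in for the real CRM_assetID column.
df = pd.DataFrame({
    "CRM_assetID": pd.Categorical(
        ["V1254748", "V805722", "V805722", "V1105400",
         "V999", "V999", "V888"]
    )
})
wanted = ["V1254748", "V805722", "V1105400"]

# One boolean per *distinct category*: the length is the number of
# categories, regardless of how many of them are in `wanted`.
category_mask = df.CRM_assetID.cat.categories.isin(wanted)
n_categories = len(category_mask)

# One boolean per *row*: summing it counts the rows whose value is
# one of the wanted IDs.
row_mask = df.CRM_assetID.isin(wanted)
n_matching_rows = int(row_mask.sum())

print(n_categories, n_matching_rows)  # prints: 5 4
```

With your data, the second form should give 114 if the three counts you
posted are right.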

MMR...


