what is happening in panda "where" clause

Sat Sep 23 16:10:14 EDT 2017

On 9/22/17, Peter Otten <__peter__ at web.de> wrote:
> Exposito, Pedro (RIS-MDW) wrote:
>
>> This code does a "where" clause on a panda data frame...
>>
>> Code:
>> import pandas as pd;
>> col_names = ['Name', 'Age', 'Weight', "Education"];
>> # create panda dataframe
>> x = pd.read_csv('test.dat', sep='|', header=None, names = col_names);
>>                 # apply "where" condition
>> z = x[ (x['Age'] == 55) ]
>> # prints row WHERE age == 55
>> print (z);
>>
>> What is happening in this statement:
>> z = x[ (x['Age'] == 55) ]
>>
>> Thanks,
>
> Let's take it apart into individual steps:
>
> Make up example data:
>
>>>> import pandas as pd
>>>> x = pd.DataFrame([["Jim", 44], ["Sue", 55], ["Alice", 66]],
> columns=["Name", "Age"])
>>>> x
>     Name  Age
> 0    Jim   44
> 1    Sue   55
> 2  Alice   66
>
> Have a look at the inner expression:
>
>>>> x["Age"] == 55
> 0    False
> 1     True
> 2    False
>
> So this is a basically vector of boolean values. If you want more details:
> in numpy operations involving a a scalar and an array work via
> "broadcasting". In pure Python you would write something similar as
>
>>>> [v == 55 for v in x["Age"]]
> [False, True, False]
>
> Use the result as an index:
>
>>>> x[[False, True, True]]
>     Name  Age
> 1    Sue   55
> 2  Alice   66
>
> [2 rows x 2 columns]
>
> This is again in line with numpy arrays -- if you pass an array of boolean
> values as an index the values in the True positions are selected. In pure
> Python you could achieve that with
>
>>>> index = [v == 55 for v in x["Age"]]
>>>> index
> [False, True, False]
>>>> [v for b, v in zip(index, x["Age"]) if b]
> [55]

You could also write:

>>> x[index]
x[index]
  Name  Age
1  Sue   55

As Peter told, understanding numpy or "numpy like" behavior behind
curtain could be key to understanding what is happening.

Look at this:

>>> def half_age(a):
>>>     return a/2

we could use this function in where() ->

>>> x[half_age(x.Age)<30]
  Name  Age
0  Jim   44
1  Sue   55

You could think that any function could be used, but it is not true.
a/2 (where a is numpy array) is defined. (I am not sure that it is
really numpy under the hood or if it will be in future pandas versions
but it seems so).

But what if we have more universal function?

>>> def first_char(a):
>>>     return a[0].lower()

>>> x[first_char(x.Name)=='a']
... ERROR ... (first_char could not work with pandas serie as a
argument. It means pandas Series doesn't have lower() function)

But you could create index like Peter described. It could be simpler
to remember if you are using pandas not often:

>>> x[[first_char(i)=='a' for i in x.Name]]
    Name  Age
2  Alice   66

It is not (IMHO) best solution because you are creating list in memory
(which could be big) . Unfortunately my version of pandas (0.20.3)
doesn't support generator

>>> x[(first_char(i)=='a' for i in x.Name)]
... ERROR...

But you could apply more complex function to Series!

>>> x[x.Name.apply(first_char)=='a']
    Name  Age
2  Alice   66

x.Name.apply(first_char) is Series where first_char is applied on each
value. And serie (and numpy.array) understand == operator (and return
value is Series. Which is acceptable as where function parameter)

Although I am not sure that actual version of pandas doesn't create
whole Series in memory :) I expect it is more optimized (and could be
even more in future).

Be aware that I am not expert and I am using pandas only very seldom
just for my little statistics!