[Pandas-dev] Chained filtering with lazy evaluation ("where")

Pietro Battiston ml at pietrobattiston.it
Thu Mar 15 13:36:38 EDT 2018


Dear pandas devs,

like most (I think) of you, I love how pandas supports method
chaining.

And like several other users, I get frustrated when I have to break
some chained sequence of calls because a given operation cannot be
included. See for instance
https://stackoverflow.com/q/11869910/2858145
https://stackoverflow.com/q/40028500/2858145
https://stackoverflow.com/q/44912692/2858145

I ended up noticing that most of the time the problematic operation is
filtering, since it is typically done as

df.loc[condition_on(df)]

e.g.

df.loc[df['a'] > 3]

In R, we would do (something more similar to)

df.loc[a>3]

... but we can't in Python syntax. This is not usually a huge deal -
one could even claim that "df[df['a'] > 3]" is nicer because it's more
explicit.
Still, when it's not df but rather the result of a five-line chain of
calls, one needs to assign that result to a name first, and only then
filter it, which is annoying.
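
To make the annoyance concrete, here is a small illustration (the data
and column names are made up for the example):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [5, 4, 3, 2, 1]})

# The chain has to stop here so that the intermediate result gets a name...
tmp = (
    df.assign(total=lambda d: d["a"] + d["b"], ratio=lambda d: d["a"] / d["b"])
      .sort_values("ratio")
)
# ...because the filter condition must refer to that name.
filtered = tmp.loc[tmp["ratio"] > 1]
```

The "ratio" column exists only on the intermediate result, so the
condition cannot be written in terms of the original df.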

There are a couple of other solutions: df.filter, adding an ad-hoc
method to pandas objects... but I never found any of them general
and/or pythonic enough. So I tried with an alternative: lazy
evaluation. It took relatively few lines of code, and after some weeks
of use, I'm really satisfied with the result:
https://github.com/toobaz/generic_utils/blob/master/generic_utils/pandas/where.py
(do not bother about the rest of the repo, the file works as a
standalone module).

This allows one to replace
  df.loc[df['a'] > 2]
with
  df.loc[W['a'] > 2]
... and to apply virtually any operation one would apply to df (more
precisely, any operation... which is chainable).¹
As a bonus, one can write a condition and reuse it to filter several
pandas objects.
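
For readers who want the gist without opening the file, below is a
minimal sketch of the lazy evaluation idea. It is NOT the code in
where.py: the class name Lazy and the helper functions are hypothetical,
only item access and a few comparison/boolean operators are recorded,
and instead of monkey patching it relies on the fact that .loc accepts a
callable key and calls it with the object being indexed.

```python
import operator

import pandas as pd


class Lazy:
    """Hypothetical placeholder that records operations for later replay."""

    def __init__(self, fn=lambda root: root):
        # fn maps the object being filtered to the value of this expression.
        self._fn = fn

    def __call__(self, root):
        # .loc calls any callable key with the object being indexed.
        return self._fn(root)

    def __getitem__(self, key):
        return Lazy(lambda root: self(root)[key])


def _resolve(value, root):
    # Operands may themselves be lazy, e.g. (W["a"] > 2) & (W["c"] < 40).
    return value(root) if isinstance(value, Lazy) else value


def _add_operator(dunder, op):
    def method(self, other):
        return Lazy(lambda root: op(self(root), _resolve(other, root)))
    setattr(Lazy, dunder, method)


for _dunder, _op in [
    ("__gt__", operator.gt), ("__ge__", operator.ge),
    ("__lt__", operator.lt), ("__le__", operator.le),
    ("__eq__", operator.eq), ("__ne__", operator.ne),
    ("__and__", operator.and_), ("__or__", operator.or_),
]:
    _add_operator(_dunder, _op)

W = Lazy()

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})
other = pd.DataFrame({"a": [5, 0]})

big_a = W["a"] > 2  # nothing is evaluated yet

# The condition is evaluated against whatever object .loc is indexing,
# so it can sit in the middle of a chain...
chained = df.assign(c=lambda d: d["a"] + d["b"]).loc[big_a & (W["c"] < 40)]
# ...and be reused on a different object.
reused = other.loc[big_a]
```

Making the placeholder itself callable keeps this sketch short, but it
also means method calls on the placeholder cannot be recorded this way,
so it covers far fewer operations than described above.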

I'm writing this email to ask:
- whether you have in mind some alternative solution I did not consider
to the problem of "unchainable filterings"
- whether you have suggestions on how to improve my solution
- whether you think this is worth merging in pandas (the amount of
monkey patching required is so small that it is not burdensome to keep
it separate - it just means one more dependency for users who want to
use it)

For the record: it currently works only in .loc... and I don't expect
this to change: I guess pd.{Series,DataFrame}.__getitem__ already
supports too many different mechanisms.

Supporting .loc as a setter should instead be pretty straightforward -
it is just lower priority, since setting is not used in chaining.

Pietro


¹ Only exception (I know of) at the moment: W.loc(axis=1)[.] won't
work, because I "taught" it that "loc" is not a callable. Shouldn't be
hard to fix.

