[Pandas-dev] Chained filtering with lazy evaluation ("where")

Tom Augspurger tom.augspurger88 at gmail.com
Thu Mar 15 13:58:23 EDT 2018


FYI, `df.loc[lambda x: x['a'] > 3]` is valid. loc takes a callable, and
evaluates it with the NDFrame as the first (only) argument.
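
As a minimal sketch of this (the frame and column values below are made up
for illustration, not taken from the thread), the callable receives whatever
object the chain has produced at that point, so no named temporary is needed:

```python
import pandas as pd

# Hypothetical example data, only for illustration.
df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [10, 20, 30, 40, 50]})

result = (
    df.assign(a=lambda x: x["a"] * 2)  # "a" becomes 2, 4, 6, 8, 10
      .loc[lambda x: x["a"] > 3]       # x is the frame returned by assign, not df
      .reset_index(drop=True)
)
```

Here the filter sees the doubled column, which `df.loc[df['a'] > 3]` could
not do without first binding the intermediate frame to a name.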

So the downside is now that `lambda x:` is a bit more to type than `W`, but
it's not so bad.

And if you have a pre-defined method for filtering, it's
`df.loc[condition_on]`, which is the shortest (but maybe not clearest) way
of spelling that.
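
A small sketch of that spelling, assuming a hypothetical `condition_on` that
takes the object being filtered and returns a boolean mask (the body and the
data are illustrative only):

```python
import pandas as pd

def condition_on(obj):
    """Hypothetical reusable filter: keep rows where column 'a' exceeds 3."""
    return obj["a"] > 3

# Made-up frames, only to show that the same condition can be reused.
df1 = pd.DataFrame({"a": [1, 4, 7]})
df2 = pd.DataFrame({"a": [5, 2, 3, 9]})

# The function itself is passed, not its result; .loc calls it with the frame.
filtered1 = df1.loc[condition_on]
filtered2 = df2.loc[condition_on]
```

Because the function is only called when `.loc` is evaluated, the same
condition can be applied to several pandas objects.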

- Tom

On Thu, Mar 15, 2018 at 12:36 PM, Pietro Battiston <ml at pietrobattiston.it>
wrote:

> Dear pandas devs,
>
> like most (I think) of you, I love how pandas supports method
> chaining.
>
> And like several other users, I get frustrated when I have to break
> some chained sequence of calls because a given operation cannot be
> included. See for instance
> https://stackoverflow.com/q/11869910/2858145
> https://stackoverflow.com/q/40028500/2858145
> https://stackoverflow.com/q/44912692/2858145
>
> I ended up noticing that most of the time, the problematic operation is
> a filtering, since it is typically done as
>
> df.loc[condition_on(df)]
>
> e.g.
>
> df.loc[df['a'] > 3]
>
> In R, we would do (something more similar to)
>
> df.loc[a>3]
>
> ... but we can't in Python syntax. This is not usually a huge deal -
> one could even claim that "df[df['a'] > 3]" is nicer because it's more
> explicit.
> Still, when it's not df but rather a 5-line chain of method calls, one
> needs to create the df first and then filter it, which is annoying.
>
> There are a couple of other solutions: df.filter, adding an ad-hoc
> method to pandas objects... but I never found any of them general
> and/or pythonic enough. So I tried with an alternative: lazy
> evaluation. It took relatively few lines of code, and after some weeks
> of use, I'm really satisfied with the result:
> https://github.com/toobaz/generic_utils/blob/master/generic_utils/pandas/where.py
> (do not bother about the rest of the repo, the file works as a
> standalone module).
>
> This allows one to replace
>   df.loc[df['a'] > 2]
> with
>   df.loc[W['a'] > 2]
> ... and to apply virtually any operation one would apply to df (more
> precisely, any operation... which is chainable).¹
> As a bonus, one can write a condition and reuse it to filter several
> pandas objects.
>
> I'm writing this email to ask:
> - whether you have in mind some alternative solution I did not consider
> to the problem of "unchainable filterings"
> - whether you have suggestions on how to improve my solution
> - whether you think this is worth merging in pandas (the amount of
> monkey patching required is so small that it is not burdensome to keep
> it separated - it just means one more dependency for users who want to
> use it)
>
> For the record: it currently works only in .loc... and I don't expect
> this to change: I guess pd.{Series,DataFrame}.__getitem__ already
> support too many different mechanisms.
>
> Supporting .loc as a setter should instead be pretty straightforward - it
> is just lower priority, as it is not used in chaining.
>
> Pietro
>
>
> ¹ Only exception (I know of) at the moment: W.loc(axis=1)[.] won't
> work, because I "taught" it that "loc" is not a callable. Shouldn't be
> hard to fix.