[Pandas-dev] Colon available everywhere
Pietro Battiston
me at pietrobattiston.it
Thu Jul 19 11:17:41 EDT 2018
On Wed, 18/07/2018 at 09.01 +0200, Pietro Battiston wrote:
> On Tue, 17/07/2018 at 16.10 -0700, William Ayd wrote:
> > > - if, after creating all my columns, I want to e.g. select all
> > > columns
> > > that contain sums, I need to do some sort of
> > > "df[[col for col in df.columns if col.startswith("Sum of")]]".
> > > Compare to "df.loc[:, ('Sum',)]"
> >
> > Unless I am mistaken you would have to do something like
> > "df.groupby('a').agg([sum]).loc[:, slice(None, 'sum')]" to get that
> > to work.
>
> Yeah, I had swapped the levels, it is
>
> df.groupby('a').agg([sum]).loc[:, (slice(None), 'sum')]
>
>
> > I don’t think that syntax really is that clean
>
> In my code I always start by defining
>
> WE = slice(None) # WhatEver
>
> and we could advertise this as a way to make the syntax shorter, but
> regardless of that, it definitely is cleaner than any string
> manipulation.
Related to this, I'm curious about some opinion from pandas devs on an
idea which I think would simplify our users' life (and by that, I don't
only mean current users of current pandas API) at (almost) no cost.
The colon in Python is meant for:
1) logical blocks:
if True:
2) separating args and body of a lambda:
lambda x : x**2
3) assignment expressions (since 3.8):
if (a := True):
4) separating key and value in dict:
{1 : 'a'}
5) defining slices:
a_series.loc['2018-06-01':'2018-07-03']
The last example is entirely indistinguishable from
a_series.loc[slice('2018-06-01','2018-07-03')]
... but unfortunately, only works inside __getitem__ calls.
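To make the point concrete, here is a minimal sketch (KeyRecorder and parses are hypothetical helpers, written only for this illustration) showing that the colon form and the explicit slice() form hand the same object to __getitem__, and that the colon form is rejected by the parser outside a subscript:

```python
class KeyRecorder:
    """Hypothetical helper: returns whatever key the subscript passes in."""

    def __getitem__(self, key):
        return key


rec = KeyRecorder()

# Inside a subscript, the colon form builds a slice object.
colon_key = rec["2018-06-01":"2018-07-03"]
explicit_key = rec[slice("2018-06-01", "2018-07-03")]
full_key = rec[:]


def parses(source):
    """Return True if `source` compiles as a Python expression."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True


# The same colon syntax is accepted in a subscript but not elsewhere.
inside_subscript = parses("rec['a':'b']")
outside_subscript = parses("('a':'b')")

print(colon_key == explicit_key)   # the two spellings produce equal slices
print(full_key == slice(None))     # obj[:] passes slice(None)
print(inside_subscript, outside_subscript)
```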
My idea is: there is no obvious reason why it should be so, that is,
why
'2018-06-01':'2018-07-03'
couldn't just be parsed as slice('2018-06-01','2018-07-03').
The alternative uses 1)-4) of the colon imply that some precaution must
be taken, but:
1) should not create ambiguity, as the ":" is always matched with a
control flow statement
2) should not create ambiguity, as the ":" is always matched with the
"lambda" keyword
3) should not create ambiguity, as the ":" is always present close to
"=", while the "slice interpretation" of ":" would never appear (unless
nested) in the left part of an assignment
4) is the only potential problematic case, as
{2 : 3}
could be interpreted as
{slice(2, 3)}
but is currently interpreted as
dict([(2, 3)])
However, the solution could be to just prioritize the current
interpretation, and use
{(2 : 3)}
to force the second.
If this proposal were implemented,
df.loc[:, (slice(None), 'sum')]
would finally just become
df.loc[:, (:, 'sum')]
at the cost of a minimal ambiguity (in the case shown above), which is
easy to solve (and no more grave, I guess, than the fact that {} is an
empty dict and not an empty set).
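For reference, a small sketch of what is possible today, assuming a toy DataFrame invented for this example (the column names and values are not from the thread). It compares the explicit slice(None) spelling, the WE shorthand mentioned above, and pd.IndexSlice, which pandas already provides to get the colon syntax by routing it through a subscript:

```python
import pandas as pd

# Toy data, assumed for illustration only.
df = pd.DataFrame({"a": [1, 1, 2], "b": [10, 20, 30], "c": [1.0, 2.0, 3.0]})

# Aggregating with a list of functions yields MultiIndex columns
# such as ('b', 'mean'), ('b', 'sum'), ('c', 'mean'), ('c', 'sum').
agg = df.groupby("a").agg(["mean", "sum"])

# Explicit spelling: any first-level label, 'sum' on the second level.
by_slice = agg.loc[:, (slice(None), "sum")]

# Same selection with the WE ("WhatEver") shorthand.
WE = slice(None)
by_shorthand = agg.loc[:, (WE, "sum")]

# pd.IndexSlice allows the colon because it appears inside a subscript.
idx = pd.IndexSlice
by_indexslice = agg.loc[:, idx[:, "sum"]]

print(list(by_slice.columns))
```

All three selections should pick the same columns; the proposal would simply let the tuple be written with a bare ":" and no helper object.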
For Python beginners, it would probably even simplify the understanding
of slices (today, it is not trivial, I think, to understand that obj[:]
is exactly equivalent to obj[slice(None)] - but that ":" does not per
se mean anything).
Moreover, it would mimic "...", which is instead available also
outside of __getitem__ calls.
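As a quick illustration of that difference (the variable names are arbitrary), "..." is already an ordinary expression that evaluates to the Ellipsis singleton, so it can be assigned or placed in a tuple with no subscript involved:

```python
# "..." is valid anywhere an expression is allowed.
placeholder = ...
key = (..., "sum")

print(placeholder is Ellipsis)  # the literal is the Ellipsis singleton
print(key[0] is Ellipsis)       # and it can sit inside a plain tuple
```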
Would it be crazy to propose a PEP with this?
A milder form would be to allow ":" to be used only inside __getitem__
calls, but also nested: I think however this would be more confusing
and probably difficult to implement.
Thoughts?
Pietro