[Pandas-dev] Colon available everywhere

Pietro Battiston me at pietrobattiston.it
Thu Jul 19 11:17:41 EDT 2018


On Wed, 18/07/2018 at 09.01 +0200, Pietro Battiston wrote:
> On Tue, 17/07/2018 at 16.10 -0700, William Ayd wrote:
> > > - if, after creating all my columns, I want to e.g. select all
> > > columns that contain sums, I need to do some sort of
> > > "df[[col for col in df.columns if col.startswith("Sum of")]]".
> > > Compare to "df.loc[:, ('Sum',)]"
> > 
> > Unless I am mistaken you would have to do something like
> > "df.groupby('a').agg([sum]).loc[:, slice(None, 'sum')]" to get that
> > to work.
> 
> Yeah, I had swapped the levels, it is
> 
> df.groupby('a').agg([sum]).loc[:, (slice(None), 'sum')]
> 
> 
> > I don’t think that syntax really is that clean
> 
> In my code I always start by defining
> 
> WE = slice(None) # WhatEver
> 
> and we could advertise this as a way to make the syntax shorter, but
> regardless of that, it definitely is cleaner than any string
> manipulation.


Related to this, I'd like to hear opinions from pandas devs on an idea
which I think would simplify our users' lives (and by that, I don't
only mean current users of the current pandas API) at (almost) no cost.

The colon in Python is meant for:

1) logical blocks:
  if True:

2) separating args and body of a lambda:
  lambda x : x**2

3) assignment expressions (since 3.8):
  if (a := True):

4) separating key and value in dict:
  {1 : 'a'}

5) define slices:
  a_series.loc['2018-06-01':'2018-07-03']

The last example is exactly equivalent to
a_series.loc[slice('2018-06-01','2018-07-03')]
... but unfortunately, the colon form only works inside __getitem__ calls.
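
To make the equivalence concrete, here is a minimal sketch in plain
Python (no pandas needed). KeyEcho and parses are hypothetical helpers
introduced only for this illustration: the first hands back whatever
key the subscript receives, the second reports whether a piece of
source compiles as an expression.

```python
class KeyEcho:
    """Hypothetical helper: hands back the key passed to obj[...]."""

    def __getitem__(self, key):
        return key


def parses(source):
    """Return True if `source` compiles as a Python expression."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True


echo = KeyEcho()

# Inside a subscript, the colon form builds the same slice object
# as an explicit slice(...) call.
colon_key = echo['2018-06-01':'2018-07-03']
explicit_key = echo[slice('2018-06-01', '2018-07-03')]
same_key = colon_key == explicit_key

# Outside a subscript, the colon form is rejected by the parser.
# (compile() only parses here, so `obj` does not need to exist.)
ok_inside = parses("obj['2018-06-01':'2018-07-03']")
ok_outside = parses("'2018-06-01':'2018-07-03'")

print(same_key, ok_inside, ok_outside)
```

On current Python versions this should report that the two keys compare
equal and that only the subscript form parses.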

My idea is: there is no obvious reason why it should be so, that is,
why

'2018-06-01':'2018-07-03'

couldn't just be parsed as slice('2018-06-01','2018-07-03').

The alternative uses 1)-4) of the colon imply that some precaution must
be taken, but:

1) should not create ambiguity, as the ":" is always matched with a
control flow statement

2) should not create ambiguity, as the ":" is always matched with the
"lambda" keyword

3) should not create ambiguity, as the ":" is always present close to
"=", while the "slice interpretation" of ":" would never appear (unless
nested) in the left part of an assignment

4) is the only potentially problematic case, as
  {2 : 3}
could be interpreted as
  {slice(2, 3)}
but is currently interpreted as
  dict([(2, 3)])

However, the solution could be to just prioritize the current
interpretation, and use 
  {(2 : 3)}
to force the second.
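
A quick check of the status quo, as a sketch: the braces-with-colon
form is a dict today, and the parenthesized form suggested above does
not currently parse, so (assuming nothing else changes) giving it the
slice meaning should not alter any existing code. The parses helper is
hypothetical and only used for this illustration.

```python
def parses(source):
    """Return True if `source` compiles as a Python expression."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True


# Today the colon inside braces always means "key : value".
current = {2 : 3}
is_dict = type(current) is dict
same_as_dict_call = current == dict([(2, 3)])

# The parenthesized form is a syntax error in current Python,
# so no existing program can depend on its meaning.
paren_form_parses = parses("{(2 : 3)}")

print(is_dict, same_as_dict_call, paren_form_parses)
```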


If this proposal were implemented,

  df.loc[:, (slice(None), 'sum')]

would finally just become

  df.loc[:, (:, 'sum')]

at the cost of a minimal ambiguity (in the case shown above), which is
easy to solve (and no more grave, I guess, than the fact that {} is an
empty dict and not an empty set).
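
Until something like this exists, a similar effect can be approximated
with a small helper object whose __getitem__ simply returns its key,
which is the same idea as numpy's np.s_ and pandas' pd.IndexSlice. The
sketch below is plain Python; the names SliceMaker and S are made up
for the example.

```python
class SliceMaker:
    """Hypothetical helper: lets the colon syntax be used to build keys."""

    def __getitem__(self, key):
        return key


S = SliceMaker()
WE = slice(None)  # WhatEver, the alias mentioned earlier in the thread

# Three spellings of the same key for the column MultiIndex.
explicit = (slice(None), 'sum')
with_alias = (WE, 'sum')
with_helper = S[:, 'sum']
all_equal = explicit == with_alias == with_helper

# Ranges work the same way.
date_range = S['2018-06-01':'2018-07-03']

print(all_equal, date_range)
```

With pandas, the resulting tuple could then be passed along as
df.loc[:, S[:, 'sum']].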

For Python beginners, it would probably even simplify the understanding
of slices (today, it is not trivial, I think, to understand that obj[:]
is exactly equivalent to obj[slice(None)] - but that ":" does not per
se mean anything).
Moreover, it would mimic "...", which by contrast is also available
outside of __getitem__ calls.
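
A short sketch of both points, using only built-in types: obj[:] and
obj[slice(None)] go through the same __getitem__ call, while "..." is
an ordinary expression that can be assigned or placed in a tuple.

```python
data = [10, 20, 30]

# The colon is only syntax: the list receives a slice(None) object.
via_colon = data[:]
via_slice = data[slice(None)]
via_dunder = data.__getitem__(slice(None))
copies_match = via_colon == via_slice == via_dunder

# "..." is a standalone expression, usable outside any subscript.
marker = ...
is_ellipsis = marker is Ellipsis
in_tuple = (..., 'sum')

print(copies_match, is_ellipsis, in_tuple)
```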

Would it be crazy to propose a PEP with this?

A milder form would be to allow ":" to be used only inside __getitem__
calls, but also nested: I think however this would be more confusing
and probably difficult to implement.

Thoughts?

Pietro

