[Python-ideas] History on proposals for Macros?

Tue Mar 31 08:21:50 CEST 2015

Macros would be an extremely useful feature for pandas, the main data
analysis library for Python (for which I'm a core developer).

Why? Well, right now, R has better syntax than Python for writing data
analysis code. The difference comes down to two macros that R developers
have written within the past few years.

Here's an example borrowed from the documentation for the dplyr R package
[1]:

flights %>%
  group_by(year, month, day) %>%
  select(arr_delay, dep_delay) %>%
  summarise(
    arr = mean(arr_delay),
    dep = mean(dep_delay)
  ) %>%
  filter(arr > 30 | dep > 30)

Here "flights" is a dataframe, similar to a table in a spreadsheet. It is
also the only global variable in the analysis -- variables like "year" and
"arr_delay" are actually columns in the dataframe. R evaluates variables
lazily, in the context of the provided frame. In Python, functions like
group_by would need to be macros.

The other macro is the "pipe" or chaining operator %>%. This operator is
used to avoid the need for many temporary variables or highly nested
expressions. The result is quite readable, but again, it needs to be a
macro, because group_by and filter are simply functions that take a
dataframe as their first argument. The fact that chaining works with plain
functions means that it works even on libraries that weren't designed for
it. We could do function chaining in Python by abusing an existing binary
operator like >> or |, but all the objects on which it works would need to
be custom types.
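
To make the operator-abuse idea concrete, here is a minimal sketch. The
Pipe class is hypothetical (it is not part of pandas or any library I know
of), and it only works because every function on the right-hand side is
wrapped in a custom type:

```python
class Pipe:
    """Hypothetical wrapper so a plain function can sit on the right of >>."""

    def __init__(self, func, *args, **kwargs):
        self.func = func
        self.args = args
        self.kwargs = kwargs

    def __rrshift__(self, value):
        # Python falls back to this for `value >> Pipe(...)` only when
        # type(value) does not implement >> for a Pipe operand itself.
        return self.func(value, *self.args, **self.kwargs)


def take(items, n):
    # An ordinary function that takes the data as its first argument.
    return items[:n]


result = [3, 1, 2] >> Pipe(sorted, reverse=True) >> Pipe(take, 2)
```

The limitation shows up right away: each step has to be wrapped, and any
left-hand object that defines >> for arbitrary operands (an array type, for
instance) would intercept the operator before Pipe sees it.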

What does this example look like using pandas? Well, it's not as nice, and
there's not much we can do about it because of the limitations of Python
syntax:

(flights
 .group_by('year', 'month', 'day')
 .select('arr_delay', 'dep_delay')
 .summarize(
    arr = lambda df: mean(df.arr_delay),
    dep = lambda df: mean(df.dep_delay))
 .filter(lambda df: (df.arr > 30) | (df.dep > 30)))

(Astute readers will note that I've taken a few liberties with pandas
syntax to make it more similar to dplyr.)

Instead of evaluating expressions in the delayed context of a dataframe,
we use strings or functions. With all the lambdas there's a lot more noise
than in the R example, and it's harder to keep track of what's going on. In
principle we could simplify the lambda expressions to not use any arguments
(Matthew linked to the GitHub comment where I showed what that would look
like [2]), but the code remains awkwardly verbose.
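
As a rough illustration of how functions stand in for delayed evaluation,
here is a toy sketch. Table and its summarize method are made up for this
example and are not pandas API; the point is only that each lambda is
called later, with the table passed in explicitly:

```python
from statistics import mean


class Table:
    """Toy stand-in for a dataframe: a mapping of column name to values."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def __getattr__(self, name):
        # Only reached when normal lookup fails; treat the name as a column.
        if name == "_columns":
            raise AttributeError(name)
        try:
            return self._columns[name]
        except KeyError:
            raise AttributeError(name) from None

    def summarize(self, **aggregations):
        # Each callable is evaluated here, against this table, rather than
        # at the call site. That is the "delayed context".
        return Table({name: [func(self)] for name, func in aggregations.items()})


flights = Table({"arr_delay": [10, 50, 60], "dep_delay": [5, 40, 45]})
summary = flights.summarize(
    arr=lambda df: mean(df.arr_delay),
    dep=lambda df: mean(df.dep_delay),
)
```

Every column reference has to go through the explicit df argument, which
is exactly the noise that R's lazy evaluation avoids.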

For chaining, instead of using functions and the pipe operator, we use
methods. This works fine as long as users are only using pandas, but it
means that unlike R, the Python dataframe is a closed ecosystem. Python
developers (rightly) frown upon monkey-patching, so there's no way for
external libraries to add their own functions (e.g., for custom plotting or
file formats) on an equal footing with the methods built into pandas.
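
For comparison, here is what chaining plain functions looks like today
without any new syntax or monkey-patching. The chain helper and the two
step functions are hypothetical names, used only for this sketch:

```python
from functools import reduce


def chain(value, *steps):
    """Hypothetical helper: pass value through each step, left to right."""
    return reduce(lambda acc, step: step(acc), steps, value)


def drop_missing(rows):
    # Stand-in for a function supplied by an external library.
    return [r for r in rows if r is not None]


def scale(factor):
    # Extra arguments have to be bound up front, here via a closure.
    return lambda rows: [r * factor for r in rows]


total = chain([1, None, 3], drop_missing, scale(10), sum)
```

This puts external functions on the same footing as each other, but not on
the same footing as methods: the call reads differently from a method
chain, and arguments other than the data must be bound separately.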

I hope these use cases are illustrative. I don't have strong opinions on
the technical merits of particular proposals. The "light lambda" syntax
described by Andrew Barnert would at least solve the delayed evaluation
use-case nicely, though the colon character is not ideal because it would
rule out using light lambdas inside indexing brackets.

Best,
Stephan

[1]
http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html#chaining
[2] https://github.com/pydata/pandas/issues/9229#issuecomment-69691738