[Pandas-dev] Pandas Deferred Expressions

Tue May 30 17:19:08 EDT 2017

Hi all,

I'd like to fork part of the thread from Wes's original email about the
future of pandas and discuss all things deferred expressions. To start,
here are Wes's original thoughts, and a response from Chris Bartak that was
in a different thread. After I send this email I'm going to follow up with
my own thoughts in a different email so I can address any specific concerns
as well as offer up a list of advantages and disadvantages to this approach
and lessons learned about building DSLs in Python.

*Wes's post:*

*TOPIC THREE:* I think we should start developing a "deferred pandas API"
that is designed and directly developed by the pandas developer community.
From our respective experiences creating expression DSLs and other
computation frameworks on top of pandas, I believe this is something where
we can build something reasonable and useful. As one concrete problem this
would help with: addressing some of the awkwardness around complex
groupby-aggregate expressions (custom aggregations would simply be named
expressions).
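
To make "custom aggregations would simply be named expressions" concrete,
here is a minimal sketch in plain Python. It is only an illustration, not a
proposed pandas API: the helper names (mean_of, value_range, aggregate) are
hypothetical, and rows are modelled as plain dicts rather than a DataFrame.

```python
from collections import defaultdict

# Hypothetical helpers: each returns a deferred aggregation, i.e. a function
# that is built now and only evaluated later against a group's rows.
def mean_of(col):
    return lambda rows: sum(r[col] for r in rows) / len(rows)

def value_range(col):
    return lambda rows: max(r[col] for r in rows) - min(r[col] for r in rows)

def aggregate(rows, key, **named_exprs):
    """Group rows by `key`, then evaluate every named expression per group."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[key]].append(r)
    return {
        k: {name: expr(g) for name, expr in named_exprs.items()}
        for k, g in groups.items()
    }

rows = [
    {"city": "A", "temp": 10.0},
    {"city": "A", "temp": 14.0},
    {"city": "B", "temp": 20.0},
]

# Built-in-style and custom aggregations are passed the same way: as names
# bound to expressions.
summary = aggregate(
    rows, "city", avg_temp=mean_of("temp"), temp_span=value_range("temp")
)
```

In this sketch the output column names come from the keyword names, so a
custom aggregation needs no special treatment compared with a standard one.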

The idea of the deferred expression API would be similar to dplyr in R:

* "True" schemas (we'll have to work around pandas 0.x warts with implicit
casts, etc.)

* Immutable data structures / no mutation outside "amend" operations that
change values by returning new objects

* Less index-related stuff in this API (perhaps this is controversial, we
shall see)
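
The first two points above (explicit schemas, and no mutation outside
operations that return new objects) could look roughly like the sketch
below. Everything here is an assumption made for illustration: TableExpr,
mutate and execute are hypothetical names, and rows are plain dicts rather
than pandas objects.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass(frozen=True)
class TableExpr:
    """An immutable, deferred table: a schema plus the steps still to run."""
    schema: Tuple[Tuple[str, type], ...]  # ordered (column, type) pairs
    steps: Tuple[Tuple[str, Callable[[dict], object]], ...] = ()

    def mutate(self, name: str, dtype: type,
               fn: Callable[[dict], object]) -> "TableExpr":
        """Return a NEW expression with column `name`; self is unchanged."""
        kept = tuple((n, t) for n, t in self.schema if n != name)
        return TableExpr(kept + ((name, dtype),), self.steps + ((name, fn),))

    def execute(self, rows: List[dict]) -> List[dict]:
        """Run the recorded steps on copies of the input rows."""
        out = []
        for row in rows:
            row = dict(row)  # copy, so the caller's data is never modified
            for name, fn in self.steps:
                row[name] = fn(row)
            out.append(row)
        return out

base = TableExpr(schema=(("price", float), ("qty", int)))

# Nothing is computed here; the schema of the result is already known.
with_total = base.mutate("total", float, lambda r: r["price"] * r["qty"])

rows = [{"price": 2.5, "qty": 4}, {"price": 1.0, "qty": 3}]
result = with_total.execute(rows)
```

Because the dataclass is frozen, attempting to assign to base.schema raises
an error, and the schema of with_total can be inspected before any data is
touched.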

We can create an in-memory backend for "pandas expressions" on pandas
0.x/1.0 and separately create an alternative backend using libpandas (once
that is more fully baked / functional) -- this will also help provide a
forcing function for implementing analytics that are required for
implementing the backend.

Distributed execution for us is almost certainly out of scope, and even if
it were in scope we would probably want to offload onto prior art in Dask or
elsewhere.
So if the dask.dataframe API and the pandas expression API look different
in ways that are unpleasant, we could either compile from pandas -> dask
under the hood, or make API changes to make the semantics more conforming.

When libpandas / pandas 2.0 is more mature we can consider building
stronger out-of-core execution (plenty of prior art we can learn from here,
e.g. SFrame).

As far as tools to implement the deferred expression API -- I will leave
this to discussion. I spent a considerable amount of time making a
pandas-like expression API for SQL in Ibis (see
https://github.com/cloudera/ibis/tree/master/ibis/expr) while I was at
Cloudera, so there are some ideas there (like separating the "internal" AST
from the "external" user expressions) that we can learn from, or fork or
use some of that expression code in some way. I don't have a strong opinion
as long as the expressions are as strongly-typed as possible (i.e. tables
have schemas, operations have checked input and output types) and catch
user errors as soon as feasible.
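
One way to picture the split between an "internal" AST and "external" user
expressions, with types checked when the expression is built, is the sketch
below. It is not Ibis code and does not use the Ibis API; all class names
(Node, Column, Add, Expr, Table) and the string dtype labels are
hypothetical.

```python
from dataclasses import dataclass
from typing import Dict

# Internal AST: plain immutable data, no operator overloading.
@dataclass(frozen=True)
class Node:
    dtype: str

@dataclass(frozen=True)
class Column(Node):
    name: str

@dataclass(frozen=True)
class Add(Node):
    left: Node
    right: Node

NUMERIC = {"int64", "float64"}

class Expr:
    """External, user-facing wrapper that builds internal nodes."""
    def __init__(self, node: Node):
        self.node = node

    def __add__(self, other: "Expr") -> "Expr":
        left, right = self.node, other.node
        # Input types are checked at construction time, before any data exists.
        if left.dtype not in NUMERIC or right.dtype not in NUMERIC:
            raise TypeError(f"cannot add {left.dtype} and {right.dtype}")
        out = "float64" if "float64" in (left.dtype, right.dtype) else "int64"
        return Expr(Add(out, left, right))

class Table:
    """A table expression that knows its schema up front."""
    def __init__(self, schema: Dict[str, str]):
        self.schema = dict(schema)

    def __getitem__(self, name: str) -> Expr:
        if name not in self.schema:
            raise KeyError(f"no column {name!r}; schema has {sorted(self.schema)}")
        return Expr(Column(self.schema[name], name))

t = Table({"a": "int64", "b": "float64", "label": "string"})
total = t["a"] + t["b"]  # an Add node typed float64; nothing is computed

try:
    t["a"] + t["label"]  # rejected while building the expression
    caught = None
except TypeError as exc:
    caught = str(exc)
```

Keeping the nodes as plain data means a backend can walk them without
knowing anything about the user-facing operators, while the wrapper is free
to reject a bad expression as soon as it is written.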

*Chris B's response:*

Deferred API

Mixed thoughts about this.  On the one hand, it's obviously a good thing:
it enables smarter execution, typing/schemas could make code much easier
and safer to write, etc.

On the other hand, the pandas API is already massive and reasonably
difficult to master, and it's a big ask to learn a new one.  Dask is a good
example of how NOT having a new API can be very valuable.  All this to say
I think adoption might be pretty low?  Could be my own biases - coming from
a "smallish data" user of pandas, I've never found the "write once, execute
on different backends" argument especially compelling because I've never
had the need.