[Pandas-dev] Pandas Deferred Expressions

Tue May 30 18:28:14 EDT 2017

On Tue, May 30, 2017 at 5:19 PM Phillip Cloud <cpcloud at gmail.com> wrote:

> Hi all,
>
> I'd like to fork part of the thread from Wes's original email about the
> future of pandas and discuss all things deferred expressions. To start,
> here are Wes's original thoughts and a response from Chris Bartak that was
> in a different thread. After I send this email I'm going to follow up with
> my own thoughts in a different email so I can address any specific concerns
> as well as offer up a list of advantages and disadvantages to this approach
> and lessons learned about building DSLs in Python.
>
> *Wes's post:*
>
> *TOPIC THREE:* I think we should start developing a "deferred pandas API"
> that is designed and directly developed by the pandas developer community.
> From our respective experiences creating expression DSLs and other
> computation frameworks on top of pandas, I believe this is something where
> we can build something reasonable and useful. As one concrete problem this
> would help with: addressing some of the awkwardness around complex
> groupby-aggregate expressions (custom aggregations would simply be named
> expressions).
>
> The idea of the deferred expression API would be similar to dplyr in R:
>
> * "True" schemas (we'll have to work around pandas 0.x warts with implicit
> casts, etc.)
>
> * Immutable data structures / no mutation outside "amend" operations that
> change values by returning new objects
>
> * Less index-related stuff in this API (perhaps this is controversial, we
> shall see)
>
> We can create an in-memory backend for "pandas expressions" on pandas
> 0.x/1.0 and separately create an alternative backend using libpandas (once
> that is more fully baked / functional) -- this will also help provide a
> forcing function for implementing analytics that are required for
> implementing the backend.
>
> Distributed execution for us is almost certainly out of scope, and even if
> so we would probably want to offload onto prior art in Dask or elsewhere.
> So if the dask.dataframe API and the pandas expression API look different
> in ways that are unpleasant, we could either compile from pandas -> dask
> under the hood, or make API changes to make the semantics more conforming.
>
> When libpandas / pandas 2.0 is more mature we can consider building
> stronger out-of-core execution (plenty of prior art we can learn from here,
> e.g. SFrame).
>
> As far as tools to implement the deferred expression API -- I will leave
> this to discussion. I spent a considerable amount of time making a
> pandas-like expression API for SQL in Ibis (see
> https://github.com/cloudera/ibis/tree/master/ibis/expr) while I was at
> Cloudera, so there are some ideas there (like separating the "internal" AST
> from the "external" user expressions) that we can learn from, or fork or
> use some of that expression code in some way. I don't have a strong
> opinion as long as the expressions are as strongly-typed as possible
> (i.e. tables have schemas, operations have checked input and output types)
> and catch user errors as soon as feasible.
>
> *Chris B's response:*
>
> Deferred API
>
> Mixed thoughts about this.  On the one hand, it's obviously a good thing:
> it enables smarter execution, and typing/schemas could result in code that
> is much easier and safer to write.
>
> On the other hand, the pandas API is already massive and reasonably
> difficult to master, and it's a big ask to learn a new one.  Dask is a good
> example of how NOT having a new API can be very valuable.  All this to say
> I think adoption might be pretty low?  Could be my own biases - coming from
> a "smallish data" user of pandas, I've never found the "write once, execute
> on different backends" argument especially compelling because I've never
> had the need.
>
I agree with the underlying sentiment in Chris’s post. If we are going to
build something new, there need to be very compelling reasons to switch,
enough to offset the switching costs.

Benefits I see from using expressions that individual users may find
convincing:

   1. Code correctness guarantees and API clarity using schemas and types.
      1. Operations fail very early and tab completion shows you exactly
      what operations are valid on a particular object.
   2. Optimizations through expression rewriting (column pruning, predicate
   pushdown).
      1. We don’t need to read every column to select just one. Last time I
      checked nearly all of our IO APIs require reading in all columns to do an
      operation on just a few.
   3. Somewhat ironically, a much smaller API to learn.
      1. No indexes, no extremely complex slicing, and no functions that
      offer many different ways to do the same thing (like our old friend
      replace).
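
To make the first two benefits concrete, here is a minimal sketch in plain
Python. Every name in it (Table, Column, Sum, required_columns, execute) is
hypothetical and made up for illustration; none of this is the ibis or
pandas API. It assumes a table expression that carries only a schema, so
that an invalid column or operation fails while the expression is being
built, and so that the executor can ask the reader for just the columns the
expression touches.

```python
from dataclasses import dataclass


# Hypothetical classes for illustration only; not the ibis or pandas API.
@dataclass(frozen=True)
class Table:
    """A deferred table: a name and a schema, but no data."""
    name: str
    schema: dict  # column name -> type name, e.g. {"fare": "float64"}

    def __getitem__(self, column):
        # Fail while building the expression, before any data is read.
        if column not in self.schema:
            raise KeyError(f"{column!r} is not a column of {self.name!r}")
        return Column(self, column)


@dataclass(frozen=True)
class Column:
    table: Table
    name: str

    @property
    def dtype(self):
        return self.table.schema[self.name]

    def sum(self):
        # Type-check the operation against the schema.
        if self.dtype not in ("int64", "float64"):
            raise TypeError(f"sum() is not defined for {self.dtype} columns")
        return Sum(self)


@dataclass(frozen=True)
class Sum:
    arg: Column


def required_columns(expr):
    """Column pruning: collect the columns an expression actually reads."""
    if isinstance(expr, Sum):
        return required_columns(expr.arg)
    if isinstance(expr, Column):
        return {expr.name}
    raise TypeError(f"unsupported expression: {expr!r}")


def execute(expr, reader):
    """Run an expression; reader(table_name, columns) returns {column: list}."""
    column = expr.arg if isinstance(expr, Sum) else expr
    needed = sorted(required_columns(expr))
    data = reader(column.table.name, needed)  # only the pruned columns
    values = data[column.name]
    return sum(values) if isinstance(expr, Sum) else values


# Usage with a fake reader that records which columns were requested.
requested = []


def fake_reader(table_name, columns):
    requested.append((table_name, columns))
    full = {"fare": [1.5, 2.5], "city": ["a", "b"], "riders": [1, 2]}
    return {name: full[name] for name in columns}


trips = Table("trips", {"fare": "float64", "city": "string", "riders": "int64"})
result = execute(trips["fare"].sum(), fake_reader)
```

In this sketch, trips["missing"] raises a KeyError and trips["city"].sum()
raises a TypeError before the reader is ever called, and the reader is asked
for the "fare" column only.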

Reasons that I think individual users will not find convincing:

   1. The ability to run on multiple backends. Many people do not have this
   problem. I suspect the majority of pandas users do *not* have this
   problem. We shouldn’t try to convince our users that this is why they
   should switch, nor should we prioritize this aspect of pandas2.

Potential pitfalls to adoption with using expressions to build pandas2:

   1. Too dissimilar from current pandas.
   2. Development getting bogged down in lowest common denominator problems
   (i.e., requiring that every backend implement every operation) resulting in
   an extremely limited API.
   3. More abstract execution model, and therefore more difficult to
   understand and debug errors.

I personally think we should do the following:

   1. Draft a list of “must-have” operations on DataFrames
   2. Use ibis as a base for building experimental pandas deferred
   expressions.
   3. Forget about supporting “all the backends” and focus on SQL and
   pandas. Make sure that most of our users don’t have to care about this
   aspect of pandas. The fact that operations are delayed should be almost
   invisible unless desired. For example, even though we are delaying
   operations internally, the result should appear to be eagerly evaluated.
   The model would be: “write once, execute on pandas only by default, nearly
   invisible to the user”
   4. Go deep on pandas expressions and add non-SQL-compatible ones if
   necessary to preserve as much of the spec’d-out API as we can.
   5. Try not to break backwards compatibility with SQL backends, but don’t
   insist on it if breaking it is needed for pandas2. Alternatively, we
   could build the pandas backend on top of ibis instead of inside it so
   that we have even more freedom.
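
As a rough illustration of item 3, here is one way delayed operations could
look eager to the user. This is only a sketch under my own assumptions: the
Deferred class and its methods are hypothetical, not part of ibis or pandas.
Operations are recorded when they are called and only run when the result is
displayed, which is when an interactive user would look at it anyway.

```python
class Deferred:
    """Hypothetical wrapper: records operations and runs them when viewed."""

    def __init__(self, source, ops=()):
        self._source = source  # zero-argument callable returning a list
        self._ops = ops        # tuple of (kind, function) pairs

    def filter(self, predicate):
        # Record the step; nothing is computed here.
        return Deferred(self._source, self._ops + (("filter", predicate),))

    def map(self, fn):
        return Deferred(self._source, self._ops + (("map", fn),))

    def execute(self):
        values = list(self._source())
        for kind, fn in self._ops:
            if kind == "filter":
                values = [v for v in values if fn(v)]
            else:
                values = [fn(v) for v in values]
        return values

    def __repr__(self):
        # Displaying the object triggers execution, so it looks eager.
        return repr(self.execute())


calls = []


def load():
    calls.append("load")
    return [1, 2, 3, 4]


expr = Deferred(load).filter(lambda v: v % 2 == 0).map(lambda v: v * 10)
calls_before_display = list(calls)  # still empty: nothing has run yet
shown = repr(expr)                  # what a REPL would print for `expr`
```

Until the expression is displayed, the data source has not been touched, so
there is still room to rewrite or prune the recorded operations first.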

I’ve got a patch up that implements some of the pandas API in ibis here
<https://github.com/pandas-dev/ibis/pull/981>, if anyone would like to
follow along.

-Phillip
