[Pandas-dev] Pandas Deferred Expressions

Thu Jun 15 20:55:28 EDT 2017

Probably one of the least invasive places where a deferred syntax could be
introduced would be in pandas's IO / data access layer. Then we could start
to think about simple predicate pushdown in a uniform way, and in some
cases this could help avoid materializing huge datasets in memory only to
immediately filter them down.
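
To make the pushdown idea concrete, here is a minimal sketch in plain
Python rather than pandas. The `scan_csv` helper and its `columns` and
`predicate` parameters are hypothetical names invented for illustration,
not an existing or proposed pandas API. It only shows the shape of the
idea: the reader applies the projection and the filter while rows stream
past, so the full dataset is never materialized.

```python
import csv
import io


def scan_csv(source, columns, predicate):
    """Hypothetical reader: yield only `columns` of rows passing `predicate`.

    The filter runs while rows stream past, so rejected rows are never
    accumulated in memory.
    """
    for row in csv.DictReader(source):
        if predicate(row):
            # Project down to the requested columns only.
            yield {name: row[name] for name in columns}


# Toy input; csv.DictReader yields every value as a string.
csv_text = "a,b,c\n1,x,10\n2,y,20\n3,z,30\n4,w,40\n"
rows = list(
    scan_csv(
        io.StringIO(csv_text),
        columns=["a", "c"],
        predicate=lambda row: int(row["c"]) > 15,
    )
)
```

In pandas today, the `usecols` and `chunksize` arguments of `read_csv`
allow a manual version of this; a deferred expression would presumably
let the IO layer derive both from the query itself.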

On Tue, May 30, 2017 at 6:51 PM, Matthew Rocklin <mrocklin at gmail.com> wrote:

> *(My apologies for chiming in here without intending to do any of the
> actual work.)*
>
> I wonder if there is a half-solution where a small subset of operations
> are lazy much in the same way that the current groupby operations are lazy
> in Pandas 0.x.  If this laziness were extended to a small set of mostly
> linear operations (element-wise, filters, aggregations, column projections,
> groupbys) then that might hit a few of the bigger optimizations that people
> care about without going down the full lazy-relational-algebra-in-python
> path.  Once you do an operation that is not one of these, we collapse the
> lazy dataframe and replace it with a concrete one.  Slowly extending a
> small set of operations may also be doable in an incremental fashion as
> needed, which might be an easier transition for a community of users.
>
> Of course, half-measures can also cause more maintenance costs long term
> and may lack optimizations that Pandas devs find valuable.  I'm unqualified
> to judge the merits of any of these solutions, just thought I'd bring this
> up.  Feel free to ignore.
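
A rough sketch of that half-solution, assuming a toy model where a
"dataframe" is just a list of dict rows. `LazyRows` and its methods are
hypothetical names, not pandas code: `filter` and `select` stand in for
the small lazy subset, and `sort_by` stands in for any operation outside
it, which collapses the pending steps into concrete data first.

```python
class LazyRows:
    """Hypothetical lazy wrapper over a list of dict rows (not a pandas API)."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)  # pending (kind, argument) pairs

    def filter(self, predicate):
        # Lazy: record the step, do no work yet.
        return LazyRows(self._rows, self._steps + (("filter", predicate),))

    def select(self, *columns):
        # Lazy: record the projection.
        return LazyRows(self._rows, self._steps + (("select", columns),))

    def collect(self):
        # Run all pending steps in a single pass over the data.
        out = []
        for row in self._rows:
            for kind, arg in self._steps:
                if kind == "filter":
                    if not arg(row):
                        break  # row rejected; skip the remaining steps
                else:  # "select"
                    row = {name: row[name] for name in arg}
            else:
                out.append(row)  # reached only if no filter rejected the row
        return out

    def sort_by(self, column):
        # Not in the lazy subset: collapse to concrete rows first.
        return sorted(self.collect(), key=lambda row: row[column])


data = [{"a": 3, "b": "x"}, {"a": 1, "b": "y"}, {"a": 2, "b": "z"}]
lazy = LazyRows(data).filter(lambda r: r["a"] > 1).select("a")
concrete = lazy.sort_by("a")  # a plain list from here on
```

Adding another lazy operation means adding one more recorded step kind,
which is roughly the incremental path described above.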
>
> On Tue, May 30, 2017 at 6:28 PM, Phillip Cloud <cpcloud at gmail.com> wrote:
>
>> On Tue, May 30, 2017 at 5:19 PM Phillip Cloud <cpcloud at gmail.com> wrote:
>>
>> Hi all,
>>>
>>> I'd like to fork part of the thread from Wes's original email about the
>>> future of pandas and discuss all things deferred expressions. To start,
>>> here are Wes's original thoughts, and a response from Chris Bartak that was
>>> in a different thread. After I send this email I'm going to follow up with
>>> my own thoughts in a different email so I can address any specific concerns
>>> as well as offer up a list of advantages and disadvantages to this approach
>>> and lessons learned about building DSLs in Python.
>>>
>>> *Wes's post:*
>>>
>>> *TOPIC THREE:* I think we should start developing a "deferred pandas
>>> API" that is designed and directly developed by the pandas developer
>>> community. From our respective experiences creating expression DSLs and
>>> other computation frameworks on top of pandas, I believe this is something
>>> where we can build something reasonable and useful. As one concrete problem
>>> this would help with: addressing some of the awkwardness around complex
>>> groupby-aggregate expressions (custom aggregations would simply be named
>>> expressions).
>>>
>>> The idea of the deferred expression API would be similar to dplyr in R:
>>>
>>>
>>> * "True" schemas (we'll have to work around pandas 0.x warts with
>>> implicit casts, etc.)
>>>
>>> * Immutable data structures / no mutation outside "amend" operations
>>> that change values by returning new objects
>>>
>>> * Less index-related stuff in this API (perhaps this is controversial,
>>> we shall see)
>>>
>>> We can create an in-memory backend for "pandas expressions" on pandas
>>> 0.x/1.0 and separately create an alternative backend using libpandas (once
>>> that is more fully baked / functional) -- this will also help provide a
>>> forcing function for implementing analytics that are required for
>>> implementing the backend.
>>>
>>> Distributed execution for us is almost certainly out of scope, and even
>>> if it were in scope we would probably want to offload onto prior art in Dask or
>>> elsewhere. So if the dask.dataframe API and the pandas expression API
>>> look different in ways that are unpleasant, we could either compile from
>>> pandas -> dask under the hood, or make API changes to make the semantics
>>> more conforming.
>>>
>>> When libpandas / pandas 2.0 is more mature we can consider building
>>> stronger out-of-core execution (plenty of prior art we can learn from here,
>>> e.g. SFrame).
>>>
>>> As far as tools to implement the deferred expression API -- I will
>>> leave this to discussion. I spent a considerable amount of time making a
>>> pandas-like expression API for SQL in Ibis (see
>>> https://github.com/cloudera/ibis/tree/master/ibis/expr) while I was at
>>> Cloudera, so there are some ideas there (like separating the "internal" AST
>>> from the "external" user expressions) that we can learn from, or we could
>>> fork or reuse some of that expression code in some way. I don't have a strong
>>> opinion as long as the expressions are as strongly-typed as possible
>>> (i.e. tables have schemas, operations have checked input and output types)
>>> and catch user errors as soon as feasible.
>>>
>>> *Chris B's response:*
>>>
>>> Deferred API
>>>
>>> Mixed thoughts about this.  On the one hand, it's obviously a good
>>> thing: it enables smarter execution, typing/schemas could result in code
>>> that is much easier and safer to write, etc.
>>>
>>>
>>> On the other hand, the pandas API is already massive and reasonably
>>> difficult to master, and it's a big ask to learn a new one.  Dask is a good
>>> example of how NOT having a new API can be very valuable.  All this to say
>>> I think adoption might be pretty low?  Could be my own biases - coming from
>>> a "smallish data" user of pandas, I've never found the "write once, execute
>>> on different backends" argument especially compelling because I've never
>>> had the need.
>>>
>> I agree with the underlying sentiment in Chris’s post. If we are going to
>> build something new, there need to be very compelling reasons to switch so
>> that there’s some offset to the switching costs.
>> Benefits I see from using expressions that individual users may find
>> convincing:
>>
>>    1. Code correctness guarantees and API clarity using schemas and
>>    types.
>>       1. Operations fail very early and tab completion shows you exactly
>>       what operations are valid on a particular object.
>>    2. Optimizations through expression rewriting (column pruning,
>>    predicate pushdown).
>>       1. We don’t need to read every column to select just one. Last
>>       time I checked nearly all of our IO APIs require reading in all columns to
>>       do an operation on just a few.
>>    3. Somewhat ironically, a much smaller API to learn.
>>       1. No indexes, extremely complex slicing or functions that have
>>       many different ways to do the same thing (like our old friend
>>       replace).
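
As an illustration of the first benefit, here is a small sketch of an
expression that carries a schema and fails before any data is touched.
`Table`, `Column`, and the string type names are hypothetical, chosen
only for this example; this is not ibis or pandas code.

```python
class Table:
    """Hypothetical typed table expression; holds a schema, not data."""

    def __init__(self, schema):
        self.schema = dict(schema)  # column name -> type name

    def __getattr__(self, name):
        # Only called for names not found normally, i.e. column lookups.
        schema = self.__dict__.get("schema", {})
        if name not in schema:
            raise AttributeError(f"no column {name!r}; have {sorted(schema)}")
        return Column(name, schema[name])

    def __dir__(self):
        # Lets tab completion list the valid columns.
        return list(super().__dir__()) + list(self.schema)


class Column:
    """Hypothetical column expression with a checked type."""

    def __init__(self, name, dtype):
        self.name, self.dtype = name, dtype

    def sum(self):
        # Type check happens when the expression is built, not when it runs.
        if self.dtype not in ("int64", "float64"):
            raise TypeError(f"cannot sum {self.dtype} column {self.name!r}")
        return ("sum", self.name)  # an unevaluated expression node


t = Table({"price": "float64", "city": "string"})
expr = t.price.sum()  # valid: builds an expression, computes nothing

try:
    t.city.sum()  # invalid: rejected at build time
except TypeError as exc:
    type_error = str(exc)

try:
    t.prcie  # typo in the column name: rejected immediately
except AttributeError as exc:
    name_error = str(exc)
```

Both mistakes surface when the expression is written, and `dir(t)`
includes the column names, which is what tab completion relies on.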
>>
>> Reasons that I think individual users will not find convincing:
>>
>>    1. The ability to run on multiple backends. Many people do not have
>>    this problem. I suspect the majority of pandas users do *not* have
>>    this problem. We shouldn’t try to convince our users that this is why they
>>    should switch, nor should we prioritize this aspect of pandas2.
>>
>> Potential pitfalls to adoption with using expressions to build pandas2:
>>
>>    1. Too dissimilar from current pandas.
>>    2. Development getting bogged down in lowest common denominator
>>    problems (i.e., requiring that every backend implement every operation)
>>    resulting in an extremely limited API.
>>    3. More abstract execution model, and therefore more difficult to
>>    understand and debug errors.
>>
>> I personally think we should do the following:
>>
>>    1. Draft a list of “must-have” operations on DataFrames
>>    2. Use ibis as a base for building experimental pandas deferred
>>    expressions.
>>    3. Forget about supporting “all the backends” and focus on SQL and
>>    pandas. Make sure that most of our users don’t have to care about this
>>    aspect of pandas. The fact that operations are delayed should be almost
>>    invisible unless desired. For example, even though we are delaying
>>    operations internally, the result should appear to be eagerly evaluated.
>>    The model would be: “write once, execute on pandas only by default, nearly
>>    invisible to the user”
>>    4. Go deep on pandas expressions and add non SQL compatible ones if
>>    necessary to preserve as much of the spec’d-out API as we can.
>>    5. Try not to break backwards compatibility with SQL backends, but
>>    don’t require it where pandas2 needs to break it. Alternatively, we build the
>>    pandas backend on top of ibis instead of inside so that we have even more
>>    freedom.
>>
>> I’ve got a patch up that implements some of the pandas API in ibis here
>> <https://github.com/pandas-dev/ibis/pull/981>, if anyone would like to
>> follow along.
>>
>> -Phillip
>> 
>>
>> _______________________________________________
>> Pandas-dev mailing list
>> Pandas-dev at python.org
>> https://mail.python.org/mailman/listinfo/pandas-dev
>>
>>