From wesmckinn at gmail.com Wed May 3 19:09:43 2017 From: wesmckinn at gmail.com (Wes McKinney) Date: Wed, 3 May 2017 19:09:43 -0400 Subject: [Pandas-dev] Developing libpandas as a separate codebase + Componentization + Deferred "pandas expressions" Message-ID: hi folks, Bit of a multi-tiered discussion, but it's all somewhat related so putting it all in one e-mail. *TOPIC ONE:* I have been thinking about how to proceed with pandas 2.0 development in a sane way with the following goals: * Delivering some incrementally valuable functionality to production pandas users (e.g. faster CSV reader, some faster algorithms). There might be faster multithreaded code we can make available via a memory layout conversion layer (NaNs to bitmaps, etc.) * Being able to install subcomponents of pandas 2.0 (like libpandas) alongside production pandas to get feedback from users, particularly around low-level data semantics (copy-on-write, etc.) * Migrating compiled code and utility algorithms out of current pandas codebase Changing the internals of Series and DataFrame is going to be a difficult process (and frankly, it would be easier to build a brand new project, but I am not going to advocate for that). But I think one way we can make things easier is by developing "libpandas" and its Python bindings as a separate codebase. What goes in libpandas? In my view: * The semantic contents of pandas._lib * New "guts" of Series and DataFrame, what I've been colloquially calling pandas.Array (Series with no Index) and pandas.Table (DataFrame with no index) * New implementations of Index, based on libpandas.Array * Computational libraries that presume a particular memory layout: pandas.core.algorithms, pandas.core.ops, pandas.core.nanops, etc. * Low-level IO code (moving data from other formats into new pandas data structures) The idea is that libpandas would (someday) be a hard dependency of pandas, and contain most or all of the compiled code in pandas. To simplify things for most contributors, we could publish nightly dev wheels or conda packages so that you can update libpandas in your dev environment and proceed with developing pure Python code. Let me know what you think. I've spent the majority of my net development time over the last year hardening Apache Arrow as a C++ library we can use in libpandas for physical columnar in-memory management, so I'm ready (now with the Arrow 0.3 release about ready to drop) to start making some more progress on this. *TOPIC TWO:* We discussed this on the last dev meeting call, but I wanted to see what others think and if there's some action items. To help with more frequent pandas releases, particularly of subcomponents which are pure Python, I wonder if we could move toward a release model of "pandas" as a metapackage for a series of subcomponents which are packaged independently. As an example pandas depends on pandas_display (Display for humans) pandas_io pandas_plotting pandas_core and so forth. I think it would be better to go with a single codebase for this; I don't have a strong opinion about having separate release cycles, it's more to help establish cleaner boundaries about use of private and public APIs. Effectively the codebase is already organized like this, so I'm not sure concretely what we would want to do around this. *TOPIC THREE:* I think we should start developing a "deferred pandas API" that is designed and directly developed by the pandas developer community. 
>From our respective experiences creating expression DSLs and other computation frameworks on top of pandas, I believe this is something where we can build something reasonable and useful. As one concrete problem this would help with: addressing some of the awkwardness around complex groupby-aggregate expressions (custom aggregations would simply be named expressions). The idea of the deferred expression API would be similar to dplyr in R: * "True" schemas (we'll have to work around pandas 0.x warts with implicit casts, etc.) * Immutable data structures / no mutation outside "amend" operations that change values by returning new objects * Less index-related stuff in this API (perhaps this is controversial, we shall see) We can create an in-memory backend for "pandas expressions" on pandas 0.x/1.0 and separately create an alternative backend using libpandas (once that is more fully baked / functional) -- this will also help provide a forcing function for implementing analytics that are required for implementing the backend. Distributed execution for us is almost certainly out of scope, and even if so we would probably want to offload onto prior art in Dask or elsewhere. So if the dask.dataframe API and the pandas expression API look different in ways that are unpleasant, we could either compile from pandas -> dask under the hood, or make API changes to make the semantics more conforming. When libpandas / pandas 2.0 is more mature we can consider building stronger out-of-core execution (plenty of prior art we can learn from here, e.g. SFrame). As far as tools to implement the deferred expression API -- I will leave this to discussion. I spent a considerable amount of time making a pandas-like expression API for SQL in Ibis (see https://github.com/cloudera/ibis/tree/master/ibis/expr) while I was at Cloudera, so there's some ideas there (like separating the "internal" AST from the "external" user expressions) that we can learn from, or fork or use some of that expression code in some way. I don't have a strong opinion as long as the expressions are as strongly-typed as possible (i.e. tables have schemas, operations have checked input and output types) and catch user errors as soon as feasible. This ended up being more text than I planned. If we want to discuss these things independently, feel free to send a reply with an altered subject line. Looking forward to see what everyone thinks. Thanks! Wes -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Wed May 3 19:42:34 2017 From: shoyer at gmail.com (Stephan Hoyer) Date: Wed, 3 May 2017 16:42:34 -0700 Subject: [Pandas-dev] Developing libpandas as a separate codebase + Componentization + Deferred "pandas expressions" In-Reply-To: References: Message-ID: On Wed, May 3, 2017 at 4:09 PM, Wes McKinney wrote: > *TOPIC ONE:* I have been thinking about how to proceed with pandas 2.0 > development in a sane way with the following goals: > > ... > > Changing the internals of Series and DataFrame is going to be a difficult > process (and frankly, it would be easier to build a brand new project, but > I am not going to advocate for that). But I think one way we can make > things easier is by developing "libpandas" and its Python bindings as a > separate codebase. > I'm strongly supportive of a separate "libpandas", but do consider going further and making "pandas2" a separate thing. 
If users have to switch from "import pandas" to "import pandas2", it would give us the freedom to do some important API clean-up/simplification (e.g., for indexing and other pandas methods that don't have well defined type signatures). Also, we will have the option to leave old stuff behind rather immediately porting everything to pandas2 with complete backport support, which is rather ambitious. > *TOPIC THREE:* I think we should start developing a "deferred pandas API" > that is designed and directly developed by the pandas developer community. > ... > > * "True" schemas (we'll have to work around pandas 0.x warts with implicit > casts, etc.) > > * Immutable data structures / no mutation outside "amend" operations that > change values by returning new objects > > * Less index-related stuff in this API (perhaps this is controversial, we > shall see) > This all sounds fantastic, but could you clarify a little bit what you mean by schemas? -------------- next part -------------- An HTML attachment was scrubbed... URL: From tom.w.augspurger at gmail.com Fri May 5 16:20:09 2017 From: tom.w.augspurger at gmail.com (Tom Augspurger) Date: Fri, 5 May 2017 15:20:09 -0500 Subject: [Pandas-dev] ANN: pandas v0.20.1 released Message-ID: Hi all, I'm happy to announce that pandas 0.20.0 and 0.20.1 have been released. Pandas 0.20.1 contains a single additional change from 0.20.0 for backwards compatibility with projects using pandas' utils methods. See https://github.com/pandas-dev/pandas/pull/16250. The full release notes for 0.20.0 are below. This is a major release from 0.19.2 and includes a number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. See the Whatsnew file for more information: http://pandas.pydata.org/pandas-docs/version/0.20/whatsnew.html We recommend that all users upgrade to this version. This release includes 897 commits over 5 months of development by 204 contributors. A big thank you to all contributors! Tom --- *## What is it:* pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with ?relational? or ?labeled? data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. 
*## Highlights of the 0.20.0 release include:* - new .agg() API for Series/DataFrame similar to the groupby-rolling-resample API's, see here - Integration with the feather-format, including a new top-level pd.read_feather() and DataFrame.to_feather() method, see here - The .ix indexer has been deprecated, see here - Panel has been deprecated, see here - Addition of an IntervalIndex and Interval scalar type, see here - Improved user API when accessing levels in .groupby(), see here - Improved support for UInt64 dtypes, see here - A new orient for JSON serialization, orient='table', that uses the Table Schema spec, see here - Experimental support for exporting DataFrame.style formats to Excel, see here - Window Binary Corr/Cov operations now return a MultiIndexed DataFrame rather than a Panel, as Panel is now deprecated, see here - Support for S3 handling now uses s3fs, see here - Google BigQuery support now uses the pandas-gbq library, see here - Switched the test framework to use pytest *## How to get it:* Source tarballs and Windows / Mac / Linux wheels are available on PyPI (thanks to Christoph Gohlke for the windows wheels, and to Matthew Brett for setting up the Mac / Linux wheels) pip install --upgrade pip setuptools pip install --upgrade --upgrade-strategy=only-if-needed pandas Conda packages currently building, and will be available via the conda-forge channel (conda install pandas -c conda-forge). It will be available on the default channel soon. conda install -c conda-forge pandas *## Issues:* Please report any issues on our issue tracker: https://github.com/pydata/ pandas/issues/ *## Thanks to all the contributors:* - Adam J. Stewart - Adrian - Ajay Saxena - Akash Tandon - Albert Villanova del Moral - Aleksey Bilogur - Alexis Mignon - Amol Kahat - Andreas Winkler - Andrew Kittredge - Anthonios Partheniou - Arco Bast - Ashish Singal - Baurzhan Muftakhidinov - Ben Kandel - Ben Thayer - Ben Welsh - Bill Chambers - Brandon M. Burroughs - Brian - Brian McFee - Carlos Souza - Chris - Chris Ham - Chris Warth - Christoph Gohlke - Christoph Paulik - Christopher C. Aycock - Clemens Brunner - D.S. McNeil - DaanVanHauwermeiren - Daniel Himmelstein - Dave Willmer - David Cook - David Gwynne - David Hoffman - David Krych - Diego Fernandez - Dimitris Spathis - Dmitry L - Dody Suria Wijaya - Dominik Stanczak - Dr-Irv - Dr. Irv - Elliott Sales de Andrade - Ennemoser Christoph - Francesc Alted - Fumito Hamamura - Giacomo Ferroni - Graham R. Jeffries - Greg Williams - Guilherme Beltramini - Guilherme Samora - Hao Wu - Harshit Patni - Ilya V. Schurov - Iv?n Vall?s P?rez - Jackie Leng - Jaehoon Hwang - James Draper - James Goppert - James McBride - James Santucci - Jan Schulz - Jeff Carey - Jeff Reback - JennaVergeynst - Jim - Jim Crist - Joe Jevnik - Joel Nothman - John - John Tucker - John W. O'Brien - John Zwinck - Jon M. Mease - Jon Mease - Jonathan Whitmore - Jonathan de Bruin - Joost Kranendonk - Joris Van den Bossche - Joshua Bradt - Julian Santander - Julien Marrec - Jun Kim - Justin Solinsky - Kacawi - Kamal Kamalaldin - Kerby Shedden - Kernc - Keshav Ramaswamy - Kevin Sheppard - Kyle Kelley - Larry Ren - Leon Yin - Line Pedersen - Lorenzo Cestaro - Luca Scarabello - Lukasz - Mahmoud Lababidi - Mark Mandel - Matt Roeschke - Matthew Brett - Matthew Roeschke - Matti Picus - Maximilian Roos - Michael Charlton - Michael Felt - Michael Lamparski - Michiel Stock - Mikolaj Chwalisz - Min RK - Miroslav ?ediv? 
- Mykola Golubyev - Nate Yoder - Nathalie Rud - Nicholas Ver Halen - Nick Chmura - Nolan Nichols - Pankaj Pandey - Pawel Kordek - Pete Huang - Peter - Peter Csizsek - Petio Petrov - Phil Ruffwind - Pietro Battiston - Piotr Chromiec - Prasanjit Prakash - Rob Forgione - Robert Bradshaw - Robin - Rodolfo Fernandez - Roger Thomas - Rouz Azari - Sahil Dua - Sam Foo - Sami Salonen - Sarah Bird - Sarma Tangirala - Scott Sanderson - Sebastian Bank - Sebastian Gsänger - Shawn Heide - Shyam Saladi - Sinhrks - Stephen Rauch - Sébastien de Menten - Tara Adiseshan - Thiago Serafim - Thoralf Gutierrez - Thrasibule - Tobias Gustafsson - Tom Augspurger - Tong SHEN - Tong Shen - TrigonaMinima - Uwe - Wes Turner - Wiktor Tomczak - WillAyd - Yaroslav Halchenko - Yimeng Zhang - abaldenko - adrian-stepien - alexandercbooth - atbd - bastewart - bmagnusson - carlosdanielcsantos - chaimdemulder - chris-b1 - dickreuter - discort - dr-leo - dubourg - dwkenefick - funnycrab - gfyoung - goldenbull - hesham.shabana - jojomdt - linebp - manu - manuels - mattip - maxalbert - mcocdawc - nuffe - paul-mannino - pbreach - sakkemo - scls19fr - sinhrks - stijnvanhoey - the-nose-knows - themrmax - tomrod - tzinckgraf - wandersoncferreira - watercrossing - wcwagner - xgdgsc - yui-knk
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From robinfishbein at yahoo.com Thu May 11 03:07:59 2017
From: robinfishbein at yahoo.com (Robin Fishbein)
Date: Thu, 11 May 2017 07:07:59 +0000 (UTC)
Subject: [Pandas-dev] Performance drop with v0.20 - changes between two dataframes
References: <88358759.8764046.1494486479452.ref@mail.yahoo.com>
Message-ID: <88358759.8764046.1494486479452@mail.yahoo.com>

I apologize in advance, I'm not sure where to ask about this, so I thought perhaps the dev list.

The function below returns a tuple of dataframes to identify the rows added, changed, and deleted between two dataframes.* Adds and deletes should be an easy index comparison, so it should spend most of its time distinguishing changed rows from unchanged rows. With Python 3.6.0 and pandas 0.19.2 it runs on test files** in around 22 seconds, with get_loc and __getitem__ at the top of %prun. With pandas 0.20.1 it's unusably slow. I believe there's something about this expression

    pd.Index(i for i in index_both if not df1t[i].equals(df2t[i]))

that is handled differently in 0.20. I'm not sure how, though, or what I could do to make this effective in v0.20.

* I used equals() to address missing values and handle any potential dtype; if we guarantee no missing values, the solution is much easier. I transposed because I couldn't work out a better way to convert the rows to Series, allowing the use of equals().
** About 54k rows, 4 index columns (all object), and 8 other columns (4 int, 3 obj, 1 datetime).

Thanks!
-Robin

def delta(left, right, index_cols=None, suffixes=('_1', '_2'),
          reset_indexes=True, value_counts=True):
    df1 = left.copy()
    df2 = right.copy()
    if isinstance(index_cols, pd.Index):
        index_cols = list(index_cols)
    if index_cols:
        df1 = df1.set_index(index_cols)
        df2 = df2.set_index(index_cols)
    full = (pd.merge(df1, df2, left_index=True, right_index=True,
                     how='outer', suffixes=suffixes, indicator=True)
            .astype({'_merge': object}))
    index_both = full[full._merge == 'both'].index
    df1t = df1.reindex(index_both).T
    df2t = df2.reindex(index_both).T
    index_changes = pd.Index(i for i in index_both
                             if not df1t[i].equals(df2t[i]))
    if index_changes.size > 0:
        full.loc[index_changes, '_merge'] = 'c'
    mappings = {
        'both': 'm',        # match
        'right_only': 'a',  # add
        'c': 'c',           # change
        'left_only': 'd',   # delete
    }
    full._merge = full._merge.map(mappings)
    add = df2.reindex(full.loc[full._merge == 'a'].index)
    change = full.loc[full._merge == 'c'].drop('_merge', axis=1)
    delete = df1.reindex(full.loc[full._merge == 'd'].index)
    if reset_indexes:
        full = full.reset_index()
        add = add.reset_index()
        change = change.reset_index()
        delete = delete.reset_index()
    if value_counts:
        print(full._merge.value_counts())
    return full, add, change, delete
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jorisvandenbossche at gmail.com Thu May 11 08:35:39 2017
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Thu, 11 May 2017 14:35:39 +0200
Subject: [Pandas-dev] Performance drop with v0.20 - changes between two dataframes
In-Reply-To: <88358759.8764046.1494486479452@mail.yahoo.com>
References: <88358759.8764046.1494486479452.ref@mail.yahoo.com> <88358759.8764046.1494486479452@mail.yahoo.com>
Message-ID:

Hi Robin,

I didn't yet look into your code example (and it might be that it can be optimized in other ways to also prevent the slowdown), but as it seems you have multi-indexes and you mention getitem, this is probably the same issue as reported here: https://github.com/pandas-dev/pandas/issues/16319, and with already a PR open to try to fix it: https://github.com/pandas-dev/pandas/pull/16324

You are always welcome to try out the PR and give feedback on whether that fixed the performance regression for you.

Regards,
Joris

2017-05-11 9:07 GMT+02:00 Robin Fishbein via Pandas-dev <pandas-dev at python.org>:

> I apologize in advance, I'm not sure where to ask about this, so I thought
> perhaps the dev list.
>
> The function below returns a tuple of dataframes to identify the rows
> added, changed, and deleted between two dataframes.* Adds and deletes
> should be an easy index comparison, so it should spend most of its time
> distinguishing changed rows from unchanged rows. With Python 3.6.0 and
> pandas 0.19.2 it runs on test files** in around 22 seconds, with get_loc
> and __getitem__ at the top of %prun. With pandas 0.20.1 it's unusably slow.
> I believe there's something about this expression
>
> pd.Index(i for i in index_both if not df1t[i].equals(df2t[i]))
>
> that is handled differently in 0.20. I'm not sure how, though, or what I
> could do to make this effective in v0.20.
>
> * I used equals() to address missing values and handle any potential
> dtype; if we guarantee no missing values, the solution is much easier. I
> transposed because I couldn't work out a better way to convert the rows to
> Series, allowing the use of equals().
> ** About 54k rows, 4 index columns (all object), and 8 other columns (4
> int, 3 obj, 1 datetime).
>
> Thanks!
> -Robin > > def delta(left, right, index_cols=None, suffixes=('_1', '_2'), > reset_indexes=True, value_counts=True): > df1 = left.copy() > df2 = right.copy() > if isinstance(index_cols, pd.Index): > index_cols = list(index_cols) > if index_cols: > df1 = df1.set_index(index_cols) > df2 = df2.set_index(index_cols) > full = (pd.merge(df1, df2, left_index=True, right_index=True, > how='outer', suffixes=suffixes, indicator=True) > .astype({'_merge': object})) > index_both = full[full._merge == 'both'].index > df1t = df1.reindex(index_both).T > df2t = df2.reindex(index_both).T > * index_changes = pd.Index(i for i in index_both* > * if not df1t[i].equals(df2t[i]))* > if index_changes.size > 0: > full.loc[index_changes, '_merge'] = 'c' > mappings = { > 'both': 'm', # match > 'right_only': 'a', # add > 'c': 'c', # change > 'left_only': 'd', # delete > } > full._merge = full._merge.map(mappings) > add = df2.reindex(full.loc[full._merge == 'a'].index) > change = full.loc[full._merge == 'c'].drop('_merge', axis=1) > delete = df1.reindex(full.loc[full._merge == 'd'].index) > if reset_indexes: > full = full.reset_index() > add = add.reset_index() > change = change.reset_index() > delete = delete.reset_index() > if value_counts: > print(full._merge.value_counts()) > return full, add, change, delete > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From robinfishbein at yahoo.com Thu May 11 10:41:08 2017 From: robinfishbein at yahoo.com (Robin Fishbein) Date: Thu, 11 May 2017 09:41:08 -0500 Subject: [Pandas-dev] Performance drop with v0.20 - changes between two dataframes In-Reply-To: References: <88358759.8764046.1494486479452.ref@mail.yahoo.com> <88358759.8764046.1494486479452@mail.yahoo.com> Message-ID: Thanks, Joris! I'll take a look at the PR. Turns out optimizing with something to the effect of ((pd.isnull(df1) & pd.isnull(df2)) | (df1 == df2)).all(axis=1) avoids this issue and runs the test files in 0.9 seconds with either v0.19 or v0.20. -Robin Sent from my iPhone > On May 11, 2017, at 7:35 AM, Joris Van den Bossche wrote: > > Hi Robin, > > I didn't yet look into your code example (and it might be that it can be optimized in other ways to also prevent the slowdown), but as it seems you have multi-indexes and you mention getitem, this is probably the same issue as reported here: https://github.com/pandas-dev/pandas/issues/16319, and with already a PR open to try to fix it: https://github.com/pandas-dev/pandas/pull/16324 > > You are always welcome to try out the PR and give feedback on whether that fixed the performance regression for you. > > Regards, > Joris > > 2017-05-11 9:07 GMT+02:00 Robin Fishbein via Pandas-dev : >> I apologize in advance, I'm not sure where to ask about this, so I thought perhaps the dev list. >> >> The function below returns a tuple of dataframes to identify the rows added, changed, and deleted between two dataframes.* Adds and deletes should be an easy index comparison, so it should spend most of its time distinguishing changed rows from unchanged rows. With Python 3.6.0 and pandas 0.19.2 it runs on test files** in around 22 seconds, with get_loc and __getitem__ at the top of %prun. With pandas 0.20.1 it's unusably slow. I believe there's something about this expression? >> >> pd.Index(i for i in index_both if not df1t[i].equals(df2t[i])) >> >> ?that is handled differently in 0.20. 
I'm not sure how, though, or what I could do to make this effective in v0.20. >> >> * I used equals() to address missing values and handle any potential dtype; if we guarantee no missing values, the solution is much easier. I transposed because I couldn't work out a better way to convert the rows to Series, allowing the use of equals(). >> ** About 54k rows, 4 index columns (all object), and 8 other columns (4 int, 3 obj, 1 datetime). >> >> Thanks! >> -Robin >> >> def delta(left, right, index_cols=None, suffixes=('_1', '_2'), >> reset_indexes=True, value_counts=True): >> df1 = left.copy() >> df2 = right.copy() >> if isinstance(index_cols, pd.Index): >> index_cols = list(index_cols) >> if index_cols: >> df1 = df1.set_index(index_cols) >> df2 = df2.set_index(index_cols) >> full = (pd.merge(df1, df2, left_index=True, right_index=True, >> how='outer', suffixes=suffixes, indicator=True) >> .astype({'_merge': object})) >> index_both = full[full._merge == 'both'].index >> df1t = df1.reindex(index_both).T >> df2t = df2.reindex(index_both).T >> index_changes = pd.Index(i for i in index_both >> if not df1t[i].equals(df2t[i])) >> if index_changes.size > 0: >> full.loc[index_changes, '_merge'] = 'c' >> mappings = { >> 'both': 'm', # match >> 'right_only': 'a', # add >> 'c': 'c', # change >> 'left_only': 'd', # delete >> } >> full._merge = full._merge.map(mappings) >> add = df2.reindex(full.loc[full._merge == 'a'].index) >> change = full.loc[full._merge == 'c'].drop('_merge', axis=1) >> delete = df1.reindex(full.loc[full._merge == 'd'].index) >> if reset_indexes: >> full = full.reset_index() >> add = add.reset_index() >> change = change.reset_index() >> delete = delete.reset_index() >> if value_counts: >> print(full._merge.value_counts()) >> return full, add, change, delete >> >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Thu May 11 10:50:32 2017 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Thu, 11 May 2017 16:50:32 +0200 Subject: [Pandas-dev] Performance drop with v0.20 - changes between two dataframes In-Reply-To: References: <88358759.8764046.1494486479452.ref@mail.yahoo.com> <88358759.8764046.1494486479452@mail.yahoo.com> Message-ID: Good to hear. That is indeed often that case, that a bunch of indexing operations can be easily avoided with a smarter implementation. I think (but did not try out) that you can even further simplyfy to df1.equals(df2).all(axis=1) as equals should take care of those NaNs in the same places. Joris 2017-05-11 16:41 GMT+02:00 Robin Fishbein : > Thanks, Joris! I'll take a look at the PR. Turns out optimizing with > something to the effect of > ((pd.isnull(df1) & pd.isnull(df2)) | (df1 == df2)).all(axis=1) > avoids this issue and runs the test files in 0.9 seconds with either v0.19 > or v0.20. 
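(Illustrative sketch, not part of the original messages: the quoted one-liner marks *unchanged* rows, so the changed labels are just its negation. Assuming df1 and df2 are already aligned to the same index and columns, the per-row equals() loop in delta() could be replaced by something along these lines.)

import numpy as np
import pandas as pd

def changed_labels(df1, df2):
    # True where the two frames agree cell by cell, counting NaN == NaN as equal
    same = (pd.isnull(df1) & pd.isnull(df2)) | (df1 == df2)
    # a row counts as "changed" as soon as any cell disagrees
    return df1.index[~same.all(axis=1)]

# hypothetical data just to show the shape of the result
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [np.nan, 'x', 'y']}, index=['r1', 'r2', 'r3'])
df2 = pd.DataFrame({'a': [1, 2, 4], 'b': [np.nan, 'x', 'y']}, index=['r1', 'r2', 'r3'])
print(changed_labels(df1, df2))   # only 'r3' differs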
> > -Robin > > Sent from my iPhone > > On May 11, 2017, at 7:35 AM, Joris Van den Bossche < > jorisvandenbossche at gmail.com> wrote: > > Hi Robin, > > I didn't yet look into your code example (and it might be that it can be > optimized in other ways to also prevent the slowdown), but as it seems you > have multi-indexes and you mention getitem, this is probably the same issue > as reported here: https://github.com/pandas-dev/pandas/issues/16319, and > with already a PR open to try to fix it: https://github.com/pandas-dev/ > pandas/pull/16324 > > You are always welcome to try out the PR and give feedback on whether that > fixed the performance regression for you. > > Regards, > Joris > > 2017-05-11 9:07 GMT+02:00 Robin Fishbein via Pandas-dev < > pandas-dev at python.org>: > >> I apologize in advance, I'm not sure where to ask about this, so I >> thought perhaps the dev list. >> >> The function below returns a tuple of dataframes to identify the rows >> added, changed, and deleted between two dataframes.* Adds and deletes >> should be an easy index comparison, so it should spend most of its time >> distinguishing changed rows from unchanged rows. With Python 3.6.0 and >> pandas 0.19.2 it runs on test files** in around 22 seconds, with get_loc >> and __getitem__ at the top of %prun. With pandas 0.20.1 it's unusably slow. >> I believe there's something about this expression? >> >> pd.Index(i for i in index_both if not df1t[i].equals(df2t[i])) >> >> ?that is handled differently in 0.20. I'm not sure how, though, or what I >> could do to make this effective in v0.20. >> >> * I used equals() to address missing values and handle any potential >> dtype; if we guarantee no missing values, the solution is much easier. I >> transposed because I couldn't work out a better way to convert the rows to >> Series, allowing the use of equals(). >> ** About 54k rows, 4 index columns (all object), and 8 other columns (4 >> int, 3 obj, 1 datetime). >> >> Thanks! 
>> -Robin >> >> def delta(left, right, index_cols=None, suffixes=('_1', '_2'), >> reset_indexes=True, value_counts=True): >> df1 = left.copy() >> df2 = right.copy() >> if isinstance(index_cols, pd.Index): >> index_cols = list(index_cols) >> if index_cols: >> df1 = df1.set_index(index_cols) >> df2 = df2.set_index(index_cols) >> full = (pd.merge(df1, df2, left_index=True, right_index=True, >> how='outer', suffixes=suffixes, indicator=True) >> .astype({'_merge': object})) >> index_both = full[full._merge == 'both'].index >> df1t = df1.reindex(index_both).T >> df2t = df2.reindex(index_both).T >> * index_changes = pd.Index(i for i in index_both* >> * if not df1t[i].equals(df2t[i]))* >> if index_changes.size > 0: >> full.loc[index_changes, '_merge'] = 'c' >> mappings = { >> 'both': 'm', # match >> 'right_only': 'a', # add >> 'c': 'c', # change >> 'left_only': 'd', # delete >> } >> full._merge = full._merge.map(mappings) >> add = df2.reindex(full.loc[full._merge == 'a'].index) >> change = full.loc[full._merge == 'c'].drop('_merge', axis=1) >> delete = df1.reindex(full.loc[full._merge == 'd'].index) >> if reset_indexes: >> full = full.reset_index() >> add = add.reset_index() >> change = change.reset_index() >> delete = delete.reset_index() >> if value_counts: >> print(full._merge.value_counts()) >> return full, add, change, delete >> >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Fri May 12 18:51:35 2017 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Sat, 13 May 2017 00:51:35 +0200 Subject: [Pandas-dev] Componentization Message-ID: 2017-05-04 1:09 GMT+02:00 Wes McKinney : > > *TOPIC TWO:* We discussed this on the last dev meeting call, but I wanted > to see what others think and if there's some action items. To help with > more frequent pandas releases, particularly of subcomponents which are pure > Python, I wonder if we could move toward a release model of "pandas" as a > metapackage for a series of subcomponents which are packaged independently. > As an example > > pandas depends on > pandas_display (Display for humans) > pandas_io > pandas_plotting > pandas_core > > and so forth. I think it would be better to go with a single codebase for > this; I don't have a strong opinion about having separate release cycles, > it's more to help establish cleaner boundaries about use of private and > public APIs. Effectively the codebase is already organized like this, so > I'm not sure concretely what we would want to do around this. > > Would your idea be to have those as separate python packages, but in a single (github) repo? If we want to go in this direction, I would first like to see some more detailed practicalities, to be able to better evaluate if this would not needlessly make the contributing cycle more complex. Eg what would separate release cycles in practice mean? Would the subpackages then need to support multiple versions of the base package (at least master + latest stable, otherwise there is not really a benefit in a separate release cycle)? That would also mean more testing builds? How would it work with Travis with multiple packages in a single repo? (can it run separate tests depending on the PR?) 
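(Illustrative sketch, not part of the original messages: under the metapackage model being discussed, the top-level "pandas" distribution would be little more than a list of dependencies. The subpackage names are the hypothetical ones from Wes's mail; none of them exist today.)

# setup.py for a hypothetical "pandas" metapackage
from setuptools import setup

setup(
    name='pandas',
    version='2.0.0.dev0',
    description='Metapackage that pulls in the pandas subcomponents',
    install_requires=[
        'pandas_core',      # Series/DataFrame guts and compiled code (libpandas bindings)
        'pandas_io',        # readers and writers
        'pandas_plotting',  # plotting layer
        'pandas_display',   # "Display for humans"
    ],
    packages=[],            # the metapackage itself ships no code
)

(Separate release cycles would then mostly be a question of how the pins in install_requires are managed: exact pins for a locked release train, lower bounds for looser coupling.)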
Trying to establish cleaner boundaries between the private and public API's is good goal, but in principle we could also try to do this in the current directory layout (eg remove usage of private API's in pandas.plotting, pandas.io, ..). I am certainly not against a potential reorganisation on this front, but I would like to see some more advantages in doing so (as I am not familiar with such workflows from other project). Regards, Joris -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Fri May 12 19:07:45 2017 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Sat, 13 May 2017 01:07:45 +0200 Subject: [Pandas-dev] Dev meeting doodle Message-ID: The last dev meeting has been more than a month ago, so maybe time to at least try to find a date for the next one. Therefore, a doodle! https://doodle.com/poll/dp6s7r3q2u6wcyup Fixed the doodle at 2pm EST / 8pm GMT / 11am PST, like the last meeting (if we want to try another time, eg like two meetings ago, please let it know) Enough content to discuss from Wes' mail and the 0.21 / 1.0 topics listed in the notes from last meeting (we should probably pick some in advance to discuss). Joris -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Tue May 30 09:33:31 2017 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Tue, 30 May 2017 15:33:31 +0200 Subject: [Pandas-dev] Dev meeting doodle In-Reply-To: References: Message-ID: A bit late to notify this list, but the doodle decided on a meeting today. You are certainly still welcome to attend! Meeting at 2pm EST / 6pm GMT today Video Link: https://appear.in/pandas-dev Doc Link: https://docs.google.com/document/d/1tGbTiYORHiSPgVMXawiweGJlBw5dO kVJLY-licoBmBU/edit# 2017-05-13 1:07 GMT+02:00 Joris Van den Bossche < jorisvandenbossche at gmail.com>: > The last dev meeting has been more than a month ago, so maybe time to at > least try to find a date for the next one. Therefore, a doodle! > > https://doodle.com/poll/dp6s7r3q2u6wcyup > > Fixed the doodle at 2pm EST / 8pm GMT / 11am PST, like the last meeting > (if we want to try another time, eg like two meetings ago, please let it > know) > > Enough content to discuss from Wes' mail and the 0.21 / 1.0 topics listed > in the notes > > from last meeting (we should probably pick some in advance to discuss). > > Joris > -------------- next part -------------- An HTML attachment was scrubbed... URL: From cpcloud at gmail.com Tue May 30 17:19:08 2017 From: cpcloud at gmail.com (Phillip Cloud) Date: Tue, 30 May 2017 21:19:08 +0000 Subject: [Pandas-dev] Pandas Deferred Expressions Message-ID: Hi all, I'd like to fork part of the thread from Wes's original email about the future of pandas and discuss all things deferred expressions. To start, here's Wes's original thoughts, and a response from Chris Bartak that was in a different thread. After I send this email I'm going to follow up with my own thoughts in a different email so I can address any specific concerns as well as offer up a list of advantages and disadvantages to this approach and lessons learned about building DSLs in Python. *Wes's post:* *TOPIC THREE:* I think we should start developing a "deferred pandas API" that is designed and directly developed by the pandas developer community. 
>From our respective experiences creating expression DSLs and other computation frameworks on top of pandas, I believe this is something where we can build something reasonable and useful. As one concrete problem this would help with: addressing some of the awkwardness around complex groupby-aggregate expressions (custom aggregations would simply be named expressions). The idea of the deferred expression API would be similar to dplyr in R: * "True" schemas (we'll have to work around pandas 0.x warts with implicit casts, etc.) * Immutable data structures / no mutation outside "amend" operations that change values by returning new objects * Less index-related stuff in this API (perhaps this is controversial, we shall see) We can create an in-memory backend for "pandas expressions" on pandas 0.x/1.0 and separately create an alternative backend using libpandas (once that is more fully baked / functional) -- this will also help provide a forcing function for implementing analytics that are required for implementing the backend. Distributed execution for us is almost certainly out of scope, and even if so we would probably want to offload onto prior art in Dask or elsewhere. So if the dask.dataframe API and the pandas expression API look different in ways that are unpleasant, we could either compile from pandas -> dask under the hood, or make API changes to make the semantics more conforming. When libpandas / pandas 2.0 is more mature we can consider building stronger out-of-core execution (plenty of prior art we can learn from here, e.g. SFrame). As far as tools to implement the deferred expression API -- I will leave this to discussion. I spent a considerable amount of time making a pandas-like expression API for SQL in Ibis (see https://github.com/cloudera/ibis/tree/master/ibis/expr) while I was at Cloudera, so there's some ideas there (like separating the "internal" AST from the "external" user expressions) that we can learn from, or fork or use some of that expression code in some way. I don't have a strong opinion as long as the expressions are as strongly-typed as possible (i.e. tables have schemas, operations have checked input and output types) and catch user errors as soon as feasible. *Chris B's response:* Deferred API Mixed thoughts about this. On the one hand, it's obviously a good thing, enables smarter execution, typing/schemas could result in much easier/safer to write code, etc. On the other hand, the pandas API is already massive and reasonably difficult to master, and it's a big ask to learn a new one. Dask is a good example of how NOT having a new API can be very valuable. All this to say I think adoption might be pretty low? Could be my own biases - coming from a "smallish data" user of pandas, I've never found the "write once, execute on different backends" argument especially compelling because I've never had the need. -------------- next part -------------- An HTML attachment was scrubbed... URL: From cpcloud at gmail.com Tue May 30 18:28:14 2017 From: cpcloud at gmail.com (Phillip Cloud) Date: Tue, 30 May 2017 22:28:14 +0000 Subject: [Pandas-dev] Pandas Deferred Expressions In-Reply-To: References: Message-ID: On Tue, May 30, 2017 at 5:19 PM Phillip Cloud wrote: Hi all, > > I'd like to fork part of the thread from Wes's original email about the > future of pandas and discuss all things deferred expressions. To start, > here's Wes's original thoughts, and a response from Chris Bartak that was > in a different thread. 
After I send this email I'm going to follow up with > my own thoughts in a different email so I can address any specific concerns > as well as offer up a list of advantages and disadvantages to this approach > and lessons learned about building DSLs in Python. > > *Wes's post:* > > *TOPIC THREE:* I think we should start developing a "deferred pandas API" > that is designed and directly developed by the pandas developer community. > From our respective experiences creating expression DSLs and other > computation frameworks on top of pandas, I believe this is something where > we can build something reasonable and useful. As one concrete problem this > would help with: addressing some of the awkwardness around complex > groupby-aggregate expressions (custom aggregations would simply be named > expressions). > > The idea of the deferred expression API would be similar to dplyr in R: > > * "True" schemas (we'll have to work around pandas 0.x warts with implicit > casts, etc.) > > * Immutable data structures / no mutation outside "amend" operations that > change values by returning new objects > > * Less index-related stuff in this API (perhaps this is controversial, we > shall see) > > We can create an in-memory backend for "pandas expressions" on pandas > 0.x/1.0 and separately create an alternative backend using libpandas (once > that is more fully baked / functional) -- this will also help provide a > forcing function for implementing analytics that are required for > implementing the backend. > > Distributed execution for us is almost certainly out of scope, and even if > so we would probably want to offload onto prior art in Dask or elsewhere. > So if the dask.dataframe API and the pandas expression API look different > in ways that are unpleasant, we could either compile from pandas -> dask > under the hood, or make API changes to make the semantics more conforming. > > When libpandas / pandas 2.0 is more mature we can consider building > stronger out-of-core execution (plenty of prior art we can learn from here, > e.g. SFrame). > > As far as tools to implement the deferred expression API -- I will leave > this to discussion. I spent a considerable amount of time making a > pandas-like expression API for SQL in Ibis (see > https://github.com/cloudera/ibis/tree/master/ibis/expr) while I was at > Cloudera, so there's some ideas there (like separating the "internal" AST > from the "external" user expressions) that we can learn from, or fork or > use some of that expression code in some way. I don't have a strong > opinion as long as the expressions are as strongly-typed as possible > (i.e. tables have schemas, operations have checked input and output types) > and catch user errors as soon as feasible. > > *Chris B's response:* > > Deferred API > > Mixed thoughts about this. On the one hand, it's obviously a good thing, > enables smarter execution, typing/schemas could result in much easier/safer > to write code, etc. > > On the other hand, the pandas API is already massive and reasonably > difficult to master, and it's a big ask to learn a new one. Dask is a good > example of how NOT having a new API can be very valuable. All this to say > I think adoption might be pretty low? Could be my own biases - coming from > a "smallish data" user of pandas, I've never found the "write once, execute > on different backends" argument especially compelling because I've never > had the need. > I agree with the underlying sentiment in Chris?s post. 
If we are going to build something new, there needs to be very compelling reasons to switch so that there's some offset to the switching costs.

Benefits I see from using expressions that individual users may find convincing:

   1. Code correctness guarantees and API clarity using schemas and types.
      1. Operations fail very early and tab completion shows you exactly what operations are valid on a particular object.
   2. Optimizations through expression rewriting (column pruning, predicate pushdown).
      1. We don't need to read every column to select just one. Last time I checked nearly all of our IO APIs require reading in all columns to do an operation on just a few.
   3. Somewhat ironically, a much smaller API to learn.
      1. No indexes, extremely complex slicing or functions that have many different ways to do the same thing (like our old friend replace).

Reasons that I think individual users will not find convincing:

   1. The ability to run on multiple backends. Many people do not have this problem. I suspect the majority of pandas users do *not* have this problem. We shouldn't try to convince our users that this is why they should switch, nor should we prioritize this aspect of pandas2.

Potential pitfalls to adoption with using expressions to build pandas2:

   1. Too dissimilar from current pandas.
   2. Development getting bogged down in lowest common denominator problems (i.e., requiring that every backend implement every operation) resulting in an extremely limited API.
   3. More abstract execution model, and therefore more difficult to understand and debug errors.

I personally think we should do the following:

   1. Draft a list of "must-have" operations on DataFrames
   2. Use ibis as a base for building experimental pandas deferred expressions.
   3. Forget about supporting "all the backends" and focus on SQL and pandas. Make sure that most of our users don't have to care about this aspect of pandas. The fact that operations are delayed should be almost invisible unless desired. For example, even though we are delaying operations internally, the result should appear to be eagerly evaluated. The model would be: "write once, execute on pandas only by default, nearly invisible to the user"
   4. Go deep on pandas expressions and add non SQL compatible ones if necessary to preserve as much of the spec'd-out API that we can.
   5. Try not to break backwards compatibility with SQL backends, but don't require it if it's needed for pandas2. Alternatively, we build the pandas backend on top of ibis instead of inside so that we have even more freedom.

I've got a patch up that implements some of the pandas API in ibis here, if anyone would like to follow along.

-Phillip
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mrocklin at gmail.com Tue May 30 18:51:35 2017
From: mrocklin at gmail.com (Matthew Rocklin)
Date: Tue, 30 May 2017 18:51:35 -0400
Subject: [Pandas-dev] Pandas Deferred Expressions
In-Reply-To: References: Message-ID:

*(My apologies for chiming in here without intending to do any of the actual work.)*

I wonder if there is a half-solution where a small subset of operations are lazy much in the same way that the current groupby operations are lazy in Pandas 0.x. If this laziness were extended to a small set of mostly linear operations (element-wise, filters, aggregations, column projections, groupbys) then that might hit a few of the bigger optimizations that people care about without going down the full lazy-relational-algebra-in-python path.
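(Illustrative sketch, not part of the original message: one way to picture the half-lazy wrapper just described is a thin object that queues a whitelisted set of operations -- column projection, row filters, groupby aggregation -- and only touches the real DataFrame when the result is needed. Every name below is invented for the example; it is not a proposal for an actual pandas API.)

import pandas as pd


class LazyFrame:
    """Queues a small set of deferred operations and runs them against the
    wrapped DataFrame only when the result is actually needed."""

    def __init__(self, df, ops=None):
        self._df = df
        self._ops = list(ops or [])          # queued (kind, payload) steps

    def _queue(self, kind, payload):
        return LazyFrame(self._df, self._ops + [(kind, payload)])

    # --- the small deferred subset --------------------------------------
    def select(self, columns):               # column projection
        return self._queue('select', columns)

    def filter(self, predicate):             # predicate: DataFrame -> bool Series
        return self._queue('filter', predicate)

    def groupby_agg(self, keys, aggs):       # e.g. ('city', {'price': 'mean'})
        return self._queue('groupby_agg', (keys, aggs))

    # --- materialization -------------------------------------------------
    def collect(self):
        out = self._df
        for kind, payload in self._ops:
            if kind == 'select':
                out = out[payload]
            elif kind == 'filter':
                out = out[payload(out)]
            elif kind == 'groupby_agg':
                keys, aggs = payload
                out = out.groupby(keys).agg(aggs)
        return out

    def __getattr__(self, name):
        # anything outside the deferred subset collapses to a concrete DataFrame
        return getattr(self.collect(), name)


df = pd.DataFrame({'city': ['NY', 'NY', 'SF'], 'price': [1.0, 3.0, 5.0]})
expr = (LazyFrame(df)
        .filter(lambda d: d['price'] > 1)
        .groupby_agg('city', {'price': 'mean'}))
print(expr.collect())   # nothing executed until this line

(The point of the sketch is only that the deferral can stay invisible: the user writes what looks like eager pandas, while the queued steps leave room for column pruning or predicate pushdown before anything runs.)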
Once you do an operation that is not one of these, we collapse the lazy dataframe and replace it with a concrete one. Slowing extending a small set of operations may also be doable in an incremental fashion as needed, which might be an easier transition for a community of users. Of course, half-measures can also cause more maintenance costs long term and may lack optimizations that Pandas devs find valuable. I'm unqualified to judge the merits of any of these solutions, just thought I'd bring this up. Feel free to ignore. On Tue, May 30, 2017 at 6:28 PM, Phillip Cloud wrote: > On Tue, May 30, 2017 at 5:19 PM Phillip Cloud wrote: > > Hi all, >> >> I'd like to fork part of the thread from Wes's original email about the >> future of pandas and discuss all things deferred expressions. To start, >> here's Wes's original thoughts, and a response from Chris Bartak that was >> in a different thread. After I send this email I'm going to follow up with >> my own thoughts in a different email so I can address any specific concerns >> as well as offer up a list of advantages and disadvantages to this approach >> and lessons learned about building DSLs in Python. >> >> *Wes's post:* >> >> *TOPIC THREE:* I think we should start developing a "deferred pandas >> API" that is designed and directly developed by the pandas developer >> community. From our respective experiences creating expression DSLs and >> other computation frameworks on top of pandas, I believe this is something >> where we can build something reasonable and useful. As one concrete problem >> this would help with: addressing some of the awkwardness around complex >> groupby-aggregate expressions (custom aggregations would simply be named >> expressions). >> >> The idea of the deferred expression API would be similar to dplyr in R: >> > >> * "True" schemas (we'll have to work around pandas 0.x warts with >> implicit casts, etc.) >> >> * Immutable data structures / no mutation outside "amend" operations that >> change values by returning new objects >> >> * Less index-related stuff in this API (perhaps this is controversial, we >> shall see) >> >> We can create an in-memory backend for "pandas expressions" on pandas >> 0.x/1.0 and separately create an alternative backend using libpandas (once >> that is more fully baked / functional) -- this will also help provide a >> forcing function for implementing analytics that are required for >> implementing the backend. >> >> Distributed execution for us is almost certainly out of scope, and even >> if so we would probably want to offload onto prior art in Dask or >> elsewhere. So if the dask.dataframe API and the pandas expression API >> look different in ways that are unpleasant, we could either compile from >> pandas -> dask under the hood, or make API changes to make the semantics >> more conforming. >> >> When libpandas / pandas 2.0 is more mature we can consider building >> stronger out-of-core execution (plenty of prior art we can learn from here, >> e.g. SFrame). >> >> As far as tools to implement the deferred expression API -- I will leave >> this to discussion. I spent a considerable amount of time making a >> pandas-like expression API for SQL in Ibis (see https://github.com/ >> cloudera/ibis/tree/master/ibis/expr) while I was at Cloudera, so there's >> some ideas there (like separating the "internal" AST from the "external" >> user expressions) that we can learn from, or fork or use some of that >> expression code in some way. 
I don't have a strong opinion as long as the >> expressions are as strongly-typed as possible (i.e. tables have >> schemas, operations have checked input and output types) and catch user >> errors as soon as feasible. >> >> *Chris B's response:* >> >> Deferred API >> >> Mixed thoughts about this. On the one hand, it's obviously a good thing, >> enables smarter execution, typing/schemas could result in much easier/safer >> to write code, etc. >> > >> On the other hand, the pandas API is already massive and reasonably >> difficult to master, and it's a big ask to learn a new one. Dask is a good >> example of how NOT having a new API can be very valuable. All this to say >> I think adoption might be pretty low? Could be my own biases - coming from >> a "smallish data" user of pandas, I've never found the "write once, execute >> on different backends" argument especially compelling because I've never >> had the need. >> > I agree with the underlying sentiment in Chris?s post. If we are going to > build something new, there needs to be very compelling reasons to switch so > that there?s some offset to the switching costs. > Benefits I see from using expressions that individual users may find > convincing: > > 1. Code correctness guarantees and API clarity using schemas and types. > 1. Operations fail very early and tab completion shows you exactly > what operations are valid on a particular object. > 2. Optimizations through expression rewriting (column pruning, > predicate pushdown). > 1. We don?t need to read every column to select just one. Last time > I checked nearly all of our IO APIs require reading in all columns to do an > operation on just a few. > 3. Somewhat ironically, a much smaller API to learn. > 1. No indexes, extremely complex slicing or functions that have > many different ways to do the same thing (like our old friend > replace). > > Reasons that I think individual users will not find convincing: > > 1. The ability to run on multiple backends. Many people do not have > this problem. I suspect the majority of pandas users do *not* have > this problem. We shouldn?t try to convince our users that this is why they > should switch, nor should we prioritize this aspect of pandas2. > > Potential pitfalls to adoption with using expressions to build pandas2: > > 1. Too dissimilar from current pandas. > 2. Development getting bogged down in lowest common denominator > problems (i.e., requiring that every backend implement every operation) > resulting in an extremely limited API. > 3. More abstract execution model, and therefore more difficult to > understand and debug errors. > > I personally think we should do the following: > > 1. Draft a list of ?must-have? operations on DataFrames > 2. Use ibis as a base for building experimental pandas deferred > expressions. > 3. Forget about supporting ?all the backends? and focus on SQL and > pandas. Make sure that most of our users don?t have to care about this > aspect of pandas. The fact that operations are delayed should be almost > invisible unless desired. For example, even though we are delaying > operations internally, the result should appear to be eagerly evaluated. > The model would be: ?write once, execute on pandas only by default, nearly > invisible to the user? > 4. Go deep on pandas expressions and add non SQL compatible ones if > necessary to preserve as much of the spec?d-out API that we can. > 5. Try not to break backwards compatibility with SQL backends, but > don?t require it if it?s needed for pandas2. 
Alternatively, we build the > pandas backend on top of ibis instead of inside so that we have even more > freedom. > > I?ve got a patch up that implements some of the pandas API in ibis here > , if anyone would like to > follow along. > > -Phillip > ? > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From remi.denise at pasteur.fr Mon May 22 05:11:39 2017 From: remi.denise at pasteur.fr (=?utf-8?B?UsOpbWkgIERFTklTRQ==?=) Date: Mon, 22 May 2017 09:11:39 +0000 Subject: [Pandas-dev] Question Message-ID: To whom it may concern, I wanted to say that I really like pandas, it is the best for what I?m doing. For one of my script I had to use rpy2 and that create a matrix but I needed to have this matrix to pandas so I use "import pandas.rpy.common as com" and a message came : FutureWarning: The pandas.rpy module is deprecated and will be removed in a future version. We refer to external packages like rpy2. See here for a guide on how to port your code to rpy2: http://pandas.pydata.org/pandas-docs/stable/r_interface.html import pandas.rpy.common as com So I check how to port my code to rpy2 instead of use it. But when I compare the two methods, it?s not the same. In rpy2 the method return me a numpy array, but your method return me a pandas DataFrame which have the name of the rows and columns as I want. So my question is how can I have exactly the same result as "com.convert_robj" with rpy2 because it?s not the same now and I want that pandas keep this method that does easily what I want and rpy2 doesn?t. Thank you for your answer R?mi -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Wed May 31 18:05:54 2017 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Thu, 1 Jun 2017 00:05:54 +0200 Subject: [Pandas-dev] Question In-Reply-To: References: Message-ID: Can you give an actual small reproducible code example that shows the problem? Otherwise it will be difficult to help. Joris 2017-05-22 11:11 GMT+02:00 R?mi DENISE : > To whom it may concern, > > I wanted to say that I really like pandas, it is the best for what I?m > doing. > > For one of my script I had to use rpy2 and that create a matrix but I > needed to have this matrix to pandas so I use "import pandas.rpy.common as > com" and a message came : > > FutureWarning: The pandas.rpy module is deprecated and will be removed in a future version. We refer to external packages like rpy2. > See here for a guide on how to port your code to rpy2: http://pandas.pydata.org/pandas-docs/stable/r_interface.html > import pandas.rpy.common as com > > So I check how to port my code to rpy2 instead of use it. > > But when I compare the two methods, it?s not the same. In rpy2 the method > return me a numpy array, but your method return me a pandas DataFrame which > have the name of the rows and columns as I want. > > So my question is how can I have exactly the same result as > "com.convert_robj" with rpy2 because it?s not the same now and I want that > pandas keep this method that does easily what I want and rpy2 doesn?t. 
> > Thank you for your answer > > Rémi > > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > > -------------- next part -------------- An HTML attachment was scrubbed... URL:
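(Illustrative sketch, not part of the thread: for the rpy2 question above, something in the spirit of the removed pandas.rpy.common.convert_robj can be written directly against rpy2 and pandas. This assumes an rpy2 2.x-style pandas2ri converter and an R matrix that actually carries dimnames; treat it as a starting point rather than a supported recipe.)

import pandas as pd
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri

def r_matrix_to_dataframe(r_mat):
    # the rpy2 converters hand back a plain 2-D ndarray for an R matrix,
    # so the row and column names are re-attached by hand
    values = pandas2ri.ri2py(r_mat)
    index = list(ro.r['rownames'](r_mat))
    columns = list(ro.r['colnames'](r_mat))
    return pd.DataFrame(values, index=index, columns=columns)

# r_mat = ro.r('matrix(1:6, nrow=2, dimnames=list(c("r1", "r2"), c("a", "b", "c")))')
# r_matrix_to_dataframe(r_mat)   # DataFrame with index ['r1', 'r2'] and columns ['a', 'b', 'c']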