From tom.augspurger88 at gmail.com Fri Jul 6 08:40:22 2018
From: tom.augspurger88 at gmail.com (Tom Augspurger)
Date: Fri, 6 Jul 2018 07:40:22 -0500
Subject: [Pandas-dev] ANN: Pandas 0.23.2 Released

Hi all,

I'm happy to announce that pandas 0.23.2 has been released. This is a minor
bug-fix release in the 0.23.x series and includes some regression fixes, bug
fixes, and performance improvements. We recommend that all users upgrade to
this version. See the full whatsnew for a list of all the changes.

The release can be installed with conda from the default channel and
conda-forge:

    conda install pandas

Or via PyPI:

    python -m pip install --upgrade pandas

A total of 17 people contributed to this release. People with a "+" by their
names contributed a patch for the first time.

- David Krych
- Jacopo Rota +
- Jeff Reback
- Jeremy Schendel
- Joris Van den Bossche
- Kalyan Gokhale
- Matthew Roeschke
- Michael Odintsov +
- Ming Li
- Pietro Battiston
- Tom Augspurger
- Uddeshya Singh
- Vu Le +
- alimcmaster1 +
- david-liu-brattle-1 +
- gfyoung
- jbrockmendel

From tom.w.augspurger at gmail.com Sun Jul 8 17:26:58 2018
From: tom.w.augspurger at gmail.com (Tom Augspurger)
Date: Sun, 8 Jul 2018 16:26:58 -0500
Subject: [Pandas-dev] Pandas Sprint Recap

Hi all,

This week, a group of the pandas maintainers sat down in Austin to talk
through the status of the project, and its future direction.

I've posted a document on our wiki with a summary of the topics discussed.
https://github.com/pandas-dev/pandas/wiki/Pandas-Sprint-(July,-2018)

If people have questions or comments, feel free to post here and we'll
clarify that document.

Tom

From wesmckinn at gmail.com Mon Jul 9 14:25:47 2018
From: wesmckinn at gmail.com (Wes McKinney)
Date: Mon, 9 Jul 2018 14:25:47 -0400
Subject: [Pandas-dev] Pandas Sprint Recap

Thanks Tom! To all readers: please have a look. We are really interested in
the community's feedback on the upcoming roadmap and the future work the
team is contemplating

On Sun, Jul 8, 2018 at 5:26 PM, Tom Augspurger wrote:
> [...]

From shoyer at gmail.com Mon Jul 9 14:41:50 2018
From: shoyer at gmail.com (Stephan Hoyer)
Date: Mon, 9 Jul 2018 11:41:50 -0700
Subject: [Pandas-dev] Pandas Sprint Recap

Hi Wes and Tom,

I'm sorry I missed the sprint, but this looks like a great plan!

My main concern would be figuring out how to smoothly transition from
exposing the NumPy API directly to users to a separate abstraction layer.
The "mixed" semantics of dtype and values in pandas (sometimes matching
NumPy, sometimes not) is highly confusing, especially with incremental
changes for data types in new releases of pandas. There is still a lot of
code that relies on the internal representation of pandas.Series as a
wrapper over a NumPy array.
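To illustrate the kind of inconsistency I mean (a minimal sketch with a
made-up toy Series, not exhaustive):

    import numpy as np
    import pandas as pd

    # Some Series carry plain NumPy dtypes and ndarray values...
    s1 = pd.Series([1, 2, 3])
    isinstance(s1.dtype, np.dtype)  # True
    type(s1.values)                 # numpy.ndarray

    # ...while others carry pandas extension dtypes that NumPy knows
    # nothing about
    s2 = pd.Series(pd.Categorical(["a", "b", "a"]))
    isinstance(s2.dtype, np.dtype)  # False
    type(s2.values)                 # pandas.Categorical

Users have to know which regime they are in before touching .dtype or
.values.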
I wonder if it would be possible to do a hard break in 1.0 that switches
everything over to a pandas dtype system, even if it is currently just a
wrapper over NumPy. Or maybe that would be too large of a change to do right
now, and it would be enough to merely stabilize the set of numpy vs pandas
dtypes for the 1.x series.

Cheers,
Stephan

On Mon, Jul 9, 2018 at 11:26 AM Wes McKinney wrote:
> [...]

From jorisvandenbossche at gmail.com Mon Jul 9 15:20:51 2018
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Mon, 9 Jul 2018 14:20:51 -0500
Subject: [Pandas-dev] Pandas Sprint Recap

2018-07-09 13:41 GMT-05:00 Stephan Hoyer:

> My main concern would be figuring out how to smoothly transition from
> exposing the NumPy API directly to users to a separate abstraction layer.
> [...]
>
> I wonder if it would be possible to do a hard break in 1.0 that switches
> everything over to a pandas dtype system, even if it is currently just a
> wrapper over NumPy.

That's a good point, and I agree that the mixture of numpy dtypes and pandas
dtypes (which are not fully compatible with each other) is confusing. And in
the long term I also think we should probably have our own pandas dtype
system (one that hopefully interoperates better with other dtype systems by
that time), so that we have a consistent user experience within pandas.

But I personally don't think we should do that for 1.0. Of course we can
choose to keep on going with the 0.x releases for another few years, but
personally I would rather go for a 1.0 that basically is the current state
of pandas (with some clean-ups like removing Panel and deprecated stuff, but
no fundamental changes). Then we can discuss doing such a change for 2.0
regarding the dtypes (redesigning the internal BlockManager to something
simpler will also break some stuff). Designing our own type system and the
functionality around it will take some time.

> Or maybe that would be too large of a change to do right now, and it
> would be enough to merely stabilize the set of numpy vs pandas dtypes for
> the 1.x series.

I don't think we plan to have additional pandas dtypes for now (for 1.0),
except for:

- interval, period and datetimetz are already pandas dtypes, but will get
  more publicly exposed once we allow those to be stored in Series (now
  they only live in an Index; in a Series you see them as object-dtyped
  arrays)
- the integer dtype with NA support, which will only be an experimental
  opt-in feature

I personally think we should keep it at that for the 1.x series.
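(To spell out the Series-vs-Index discrepancy mentioned above -- a small
sketch; behavior as of 0.23:)

    import pandas as pd

    idx = pd.period_range("2018-01", periods=3, freq="M")
    idx.dtype          # period[M] -- a pandas extension dtype on the Index

    s = pd.Series(idx)
    s.dtype            # object -- the Series silently falls back to an
                       # object array of Period scalars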
Cheers,
Joris

From shoyer at gmail.com Mon Jul 9 19:47:05 2018
From: shoyer at gmail.com (Stephan Hoyer)
Date: Mon, 9 Jul 2018 16:47:05 -0700
Subject: [Pandas-dev] Pandas Sprint Recap

On Mon, Jul 9, 2018 at 12:21 PM Joris Van den Bossche wrote:
> I don't think we plan to have additional pandas dtypes for now (for 1.0)
> [...]
>
> I personally think we should keep it at that for the 1.x series.

I agree, this sounds good to me.

From william.ayd at icloud.com Mon Jul 9 20:18:26 2018
From: william.ayd at icloud.com (William Ayd)
Date: Mon, 09 Jul 2018 17:18:26 -0700
Subject: [Pandas-dev] GroupBy Overhaul Proposal

Hi All,

I've been thinking through what a redesigned GroupBy module could look like
in 1.0. The main problems I am trying to address are:

- The current module is relatively convoluted, making contribution and
  debugging challenging
- Behavior is sometimes non-obvious and buggy (see here, here and here as
  some examples) AND
- We violate the mantra of there being "only one obvious way to do things"

Along those lines, here were four things I thought could be of immense
value:
- Removal of the apply method
- Removal of the DataFrameGroupBy and SeriesGroupBy classes
- Explicit default column naming
- Removal of the axis argument

These are easier said than done and admittedly controversial. I've pieced
together my reasoning, and what I think the counter arguments could be, in
the attached document. I'd be curious to hear everyone's feedback.

- Will

-------------- next part --------------
A non-text attachment was scrubbed...
Name: GroupByOverhaul.pdf
Type: application/pdf
Size: 126160 bytes

From jorisvandenbossche at gmail.com Tue Jul 10 18:24:55 2018
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Tue, 10 Jul 2018 17:24:55 -0500
Subject: [Pandas-dev] Welcome Brock Mendel and Marc Garcia to the team

We are happy to announce that Brock Mendel (@jbrockmendel) and Marc Garcia
(@datapythonista) have been added to the pandas core team.

Amongst many other things, Brock has been working a lot on refactoring the
time series related code, and Marc has done amazing work on the
documentation.

Thanks both, and congratulations!

Joris

From wesmckinn at gmail.com Tue Jul 10 18:26:45 2018
From: wesmckinn at gmail.com (Wes McKinney)
Date: Tue, 10 Jul 2018 18:26:45 -0400
Subject: [Pandas-dev] Welcome Brock Mendel and Marc Garcia to the team

Thanks to both of you!

On Tue, Jul 10, 2018 at 6:24 PM, Joris Van den Bossche wrote:
> [...]

From tom.augspurger88 at gmail.com Thu Jul 12 15:06:52 2018
From: tom.augspurger88 at gmail.com (Tom Augspurger)
Date: Thu, 12 Jul 2018 14:06:52 -0500
Subject: [Pandas-dev] Pandas Sprint Recap

Updated the wiki page with an attempt to summarize Stephan's and Joris'
points:
https://github.com/pandas-dev/pandas/wiki/Pandas-Sprint-(July,-2018)/_compare/ac9aaf55348b8c62eb2ddb020d4be1dec6e7896b

On Mon, Jul 9, 2018 at 6:47 PM, Stephan Hoyer wrote:
> [...]
From me at pietrobattiston.it Fri Jul 13 13:12:04 2018
From: me at pietrobattiston.it (Pietro Battiston)
Date: Fri, 13 Jul 2018 19:12:04 +0200
Subject: [Pandas-dev] Pandas Sprint Recap

Hi Tom,

first, thanks to all those who participated in the sprint, and for the
recap.

Il giorno dom, 08/07/2018 alle 16.26 -0500, Tom Augspurger ha scritto:
> [...]
> I've posted a document on our wiki with a summary of the topics
> discussed.
> https://github.com/pandas-dev/pandas/wiki/Pandas-Sprint-(July,-2018)
>
> If people have questions or comments, feel free to post here and
> we'll clarify that document.

Something that scares me - but maybe because I'm missing something obvious -
is what exactly qualifies as "deprecation". Is it something which was once
presented as a distinct feature and is then disabled, or any general change
to what any API call performs (that is, anything requiring a deprecation
cycle)?

There are many bugs - in particular, in indexing code - which might
potentially break existing code when fixed. Some of them will have
non-trivial deprecation paths/detection strategies. The first ones that come
to my mind are #18631, #12827 and #9519. The last one, in particular,
implies changing the result of potentially tons of calls to .loc on a
non-unique index.

My view is that those (and many more, including several that will be found)
will be best fixed through a total rewrite of the indexing code (i.e., all
code in indexing.py, and some code in internals.py), which I assumed would
happen before 1.0, and which I certainly won't be able to do before 0.24.0
(September 2018).

I'm clearly not claiming that nobody else can do it (nor that the bugs can
necessarily only be fixed through a complete rewrite)... but since I did not
get any feedback on
https://github.com/pandas-dev/pandas/wiki/(Tentative)-rules-for-restructuring-indexing-code
... I assume that nobody is focusing/planning to focus on this in the near
future (or was it somehow discussed in the sprint?).

I perfectly understand the desire to stop postponing 1.0 to a vague future,
if it's just a matter of recognizing that pandas is worth using. But if it's
a statement/commitment about code robustness/quality, and relatedly about
API stability... then I think it is risky to leave the indexing API, and
more in general the core codebase (as opposed to important but more lateral
features such as new dtypes), out of the picture (e.g. out of #21894).

Cheers,

Pietro

From tom.augspurger88 at gmail.com Fri Jul 13 13:45:36 2018
From: tom.augspurger88 at gmail.com (Tom Augspurger)
Date: Fri, 13 Jul 2018 12:45:36 -0500
Subject: [Pandas-dev] Pandas Sprint Recap

Thanks Pietro,

We didn't discuss indexing much, beyond agreeing that there's work to be
done, and that fixing it was too large a task for 1.0.

As for whether an individual issue is a bug or a feature, we'll have to
continue using our judgement. I think we'll inevitably break users' code in
a 1.x release as we fix bugs.

We'll need to discuss workflows for these large changes (e.g. ripping out
the block manager) that will be API breaking, but may take some time to
land. Keeping a separate branch in sync is a pain, but may be the least
painful alternative.
One thing I want to reiterate: it's not going to take another 11 years to
reach pandas 2.0 :) Just because we don't solve indexing for 1.0 doesn't
mean we won't ever be able to fix it.

Tom

On Fri, Jul 13, 2018 at 12:12 PM, Pietro Battiston wrote:
> [...]

From wesmckinn at gmail.com Mon Jul 16 13:50:29 2018
From: wesmckinn at gmail.com (Wes McKinney)
Date: Mon, 16 Jul 2018 13:50:29 -0400
Subject: [Pandas-dev] Pandas Sprint Recap

> One thing I want to reiterate: it's not going to take another 11 years to
> reach pandas 2.0 :) Just because we don't solve indexing for 1.0 doesn't
> mean we won't ever be able to fix it.

One point on this that we discussed some in the sprint and during SciPy: to
undertake a major overhaul of pandas, at some point it may require a shift
to a "new codebase".
This could cohabit the same pandas-dev/pandas git repository, which can
serve as a monorepo for several Python package artifacts. This would make it
much easier to refactor out reusable components and share code. The test
suite could also be refactored to be able to run against "future-pandas" and
"pandas" (or whatever we want to call them).

I'm skeptical whether the kinds of significant / breaking changes we've
discussed the last 3 years can happen in an iterative / organic fashion
within the current pandas codebase. I'd like to avoid getting stuck in place
for a decade; if we haven't made much progress toward some of these major
changes by, say, the beginning of 2020 or 2021, we might want to take a step
back and evaluate our situation.

I spent some time at the sprint looking through pandas.core.internals,
pandas.core.generic, and some of the other low level pieces, and my feeling
is that it would be easier to start over.

All of this is made much more difficult by pandas's spartan funding
situation (Joris and Tom supported at ~50% time, rest of maintainers are
volunteers AFAIK).

In the meantime, personally my efforts will continue to be focused on
building portable, front end agnostic, reusable computational libraries for
in-memory computing on large tabular datasets (i.e.
https://www.slideshare.net/wesm/apache-arrow-crosslanguage-development-platform-for-inmemory-data-105427919).
I believe that bootstrapping a much larger community to work on these
problems will reduce our collective maintenance burden (though it is likely
to take a number of years for this to pay off).

- Wes

On Fri, Jul 13, 2018 at 1:45 PM, Tom Augspurger wrote:
> [...]
From me at pietrobattiston.it Mon Jul 16 18:35:55 2018
From: me at pietrobattiston.it (Pietro Battiston)
Date: Tue, 17 Jul 2018 00:35:55 +0200
Subject: [Pandas-dev] Pandas Sprint Recap

Il giorno lun, 16/07/2018 alle 13.50 -0400, Wes McKinney ha scritto:
> One point on this that we discussed some in the sprint and during
> SciPy: to undertake a major overhaul of pandas, at some point it may
> require a shift to a "new codebase".
> [...]
> I'm skeptical whether the kinds of significant / breaking changes
> we've discussed the last 3 years can happen in an iterative / organic
> fashion within the current pandas codebase.

Let me reverse the question: how much progress has/will have the pandas 2.0
codebase made in the meanwhile? :-)

Joking apart, it's not that if current pandas progresses slowly, then pandas
2.0 has any guarantee to progress more quickly.
There is a thing I never understood about pandas 2.0, and I understand even
less now that 1.0 gets closer: if you/we have clear plans to rewrite the
codebase, then why aren't we doing it now? Why are we wasting time on the
current one?! Why are we releasing a pandas 1.0 with its "illusion of
maturity"?!

Rewriting a large project takes effort but can be worth it; _planning_ a
(not so close) future rewrite seems to me just a sort of perversion.

I do respect the desire to improve the API under several aspects - and I see
this as the main reason for having something called pandas 2.0.

But I think this discussion of the API could and should be decoupled from
the idea to rewrite/reorganize the internals.

Indexing code and many internals badly need a rewrite, regardless of whether
we change the API, regardless of whether we call it "2.0", and regardless of
whether we change the entire codebase all at once or refactor bit by bit.
They firstly need it because there are 313 open bugs labeled "Indexing", and
some of them are very difficult to solve because the code is unnecessarily
complicated. But I think this rewrite is basically what is happening daily.

More generally, if our plan is to close, sooner or later, 2400+ bugs by
basically saying "pandas 1.0 is obsolete, long live pandas 2.0"... then we
are not doing a great service to our users in releasing pandas 1.0 as such.

> I spent some time at the sprint looking through pandas.core.internals,
> pandas.core.generic, and some of the other low level pieces, and my
> feeling is that it would be easier to start over.

I have a different feeling on what is easier, but I might very well be
wrong, or it might be a matter of personal taste (e.g. it is true that when
I reformat some code in current pandas I more often than not end up finding
bugs in some other place, but at least I can immediately test the code I am
writing, because that "other place" exists).

Something we all agree on is that, be it in a new or in the same codebase,
rewriting the internals takes time and effort. It requires that some of the
already few devs divert effort from improving the current codebase to
focusing on the new one. If we plan to do it, shouldn't we be doing it now?

(And if we don't do it now, isn't it because we don't really feel the urge
to do it?)

In any case, what your mail suggests to me is that we definitely need to
spend one (or more) dev talks looking through pandas.core.internals and
pandas.core.generic all together!

Cheers,

Pietro

From me at pietrobattiston.it Mon Jul 16 19:11:15 2018
From: me at pietrobattiston.it (Pietro Battiston)
Date: Tue, 17 Jul 2018 01:11:15 +0200
Subject: [Pandas-dev] GroupBy Overhaul Proposal

Hi Will,

there might be parts of your document I don't entirely understand, but I
definitely appreciate the desire to clean up the groupby module, and have
some comments on what I (think I) understood.

Il giorno lun, 09/07/2018 alle 17.18 -0700, William Ayd via Pandas-dev ha
scritto:
> Hi All,
>
> I've been thinking through what a redesigned GroupBy module could look
> like in 1.0. The main problems I am trying to address are:
> [...]
>
> Along those lines, here were four things I thought could be of immense
> value:
> - Removal of the apply method

I would not worry too much about the fact that apply's performance on
user-provided functions is bad (as long as it's documented), or that sum()
returns different results from .apply(sum) (again, as long as it's
documented in sum()'s docstring).

What I think we should definitely avoid is _any_ case of

    apply(a_func)

giving different results from

    apply(lambda x: a_func(x))

... that is, inference can be made on the result of a function, but not on
the function itself. However, this is an issue not specifically related to
.apply() - see #17035.
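(In other words -- a quick sketch of the invariant, with a made-up toy
frame:)

    import pandas as pd

    df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 3]})
    g = df.groupby("key")["val"]

    # These two must always agree; dispatching on the *identity* of the
    # function (e.g. special-casing the builtin `sum`) rather than on what
    # it returns would break the invariant, since the lambda hides the
    # identity.
    assert g.apply(sum).equals(g.apply(lambda x: sum(x)))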
So my (not very informed) opinion is that we could just simplify .apply() a
lot, reducing it to a few simple rules/cases on the kind of output returned
by the function, to be clearly documented, without suppressing it.

By the way, assuming we keep apply(), I really think it shouldn't be too
hard to avoid evaluating the first chunk twice.

Apart from this, I was assuming that apply() covered some cases that no
other aggregation method covers (e.g. when func returns a DataFrame of a
different shape than the original chunk)... but I might be wrong.

> - Removal of the DataFrameGroupBy and SeriesGroupBy classes

I miss the technical details, but I don't think we should force the output
of a DataFrame.groupby()[col].anything() to be a DataFrame; and most
importantly, force the "func" in

    DataFrame.groupby()[col].apply(func)

to accept DataFrame chunks. However there might be a lot of scope for code
simplification by having the above case _implemented_ as a DataFrameGroupBy
(or just code in .groupby()), of which we then extract the column.

> - Explicit default column naming

I understand the concern about the sum of A being called A, which is bad,
but I would never want "Sum of A" to appear in my DataFrame. I think this is
the typical task to be solved through a MultiIndex, consistently with
.agg().

> - Removal of the axis argument

I never used it indeed... but if it's really just a matter of transpose ->
operate -> transpose, couldn't we just do this under the hood (and maybe
warn the user in the docs about the performance/dtypes mess)?

Pietro

From wesmckinn at gmail.com Mon Jul 16 19:17:05 2018
From: wesmckinn at gmail.com (Wes McKinney)
Date: Mon, 16 Jul 2018 19:17:05 -0400
Subject: [Pandas-dev] Pandas Sprint Recap

hi Pietro,

On Mon, Jul 16, 2018 at 6:35 PM, Pietro Battiston wrote:
> Il giorno lun, 16/07/2018 alle 13.50 -0400, Wes McKinney ha scritto:
> [...]
> There is a thing I never understood about pandas 2.0, and I understand
> even less now that 1.0 gets closer: if you/we have clear plans to
> rewrite the codebase, then why aren't we doing it now? Why are we
> wasting time on the current one?!
> [...]
> Rewriting a large project takes effort but can be worth it; _planning_
> a (not so close) future rewrite seems to me just a sort of perversion.

So when you say "if you/we have clear plans to rewrite the codebase, then
why aren't we doing it now", why do you presume that I am not doing that
right now?

When I initiated conversations around improved internals for pandas and
projects like pandas in late 2015, I had just signed on a large group of
people to start the Apache Arrow project, and that's been about 90% of where
I've invested my time since then. My idea with Arrow has always been to
build a stronger memory management and computational underpinning for a
next-generation pandas-type library.

One of the "original sins" of pandas is that we own the full stack: data
structures, IO, deserialization and serialization, computation/algorithms,
visualization, and front end UI. We are (more or less) completely on our
own. What I am proposing is to share the burden of developing the low-level
stuff with a vastly larger group of developers, at least 10x as large as we
currently have in pandas. The wheels are already well in motion for this to
happen. I don't see any way to do this without basing the work on top of
open standards developed with a community that extends beyond the walls of
Python. Maybe I'm going about it wrong, but I've invested 3 years of my life
in this at this point, and it's looking like a 7-10 year effort.

This is all to say: if the pandas community doesn't agree with my approach
to this problem, I'm not going to twist anyone's arm. Either we agree or we
go our separate ways; we are all volunteers after all. Unfortunately there
are still factions within the Python data world that do not collaborate with
each other very actively; I'm not sure what to do about that.

We're all getting a lot older; if it turns out that the Python / pandas
community doesn't want to leap forward in the ways that we've discussed
(where this "leap forward" requires a certain amount of funding and
activation energy) after, say, 20 years since the inception of the project,
then, as they say, that will be the story of us for the history books. At
some point the world could in all likelihood move on from us to something
else.

> I do respect the desire to improve the API under several aspects - and
> I see this as the main reason for having something called pandas 2.0.

Frankly, I would rather have a new project name, but retain affiliation with
the pandas community in spirit and governance.
> But I think this discussion of the API could and should be decoupled
> from the idea to rewrite/reorganize the internals.
> [...]
> More generally, if our plan is to close, sooner or later, 2400+ bugs by
> basically saying "pandas 1.0 is obsolete, long live pandas 2.0"... then
> we are not doing a great service to our users in releasing pandas 1.0
> as such.

I don't think pandas as it exists now will ever be obsolete, at least not on
a 10 year horizon. At some point I think we should close off the core to
anything new, and restrict changes to either bug fixes or deprecations /
removals. New functionality should come in the form of add-on libraries that
build off the core API.

In a way, what I would like to see is something more like what the R
community has -- data frames are "built into the language", and so many
libraries can work on the same data and be assured of interop. We are
already sort of doing this with pandas + statsmodels, sklearn, etc., but I
would argue it needs to be taken even further.

For the record, I am never going to argue that pandas should not be
maintained or that the user base should be abandoned. However, I question
whether the current core maintainers have a duty to be tethered to the issue
backlog for the rest of their lives. Perhaps maintenance could be taken up
by a for-profit company at some point? I do know that it is difficult to
impossible to innovate and build new software while simultaneously keeping
up with a bugfix/maintenance grind.

> Something we all agree on is that, be it in a new or in the same
> codebase, rewriting the internals takes time and effort. It requires
> that some of the already few devs divert effort from improving the
> current codebase to focusing on the new one. If we plan to do it,
> shouldn't we be doing it now?
>
> (And if we don't do it now, isn't it because we don't really feel the
> urge to do it?)

See above...

- Wes

From me at pietrobattiston.it Mon Jul 16 20:23:37 2018
From: me at pietrobattiston.it (Pietro Battiston)
Date: Tue, 17 Jul 2018 02:23:37 +0200
Subject: [Pandas-dev] Pandas Sprint Recap

Hi Wes,

thanks for the extensive reply.
But sorry - it's probably because I missed the sprint, but I really can't
follow you. Do you have any pointers to better understand the future pandas
(alternative) you have in mind? I know about Arrow, but I see it as a future
potentiality for pandas, not as an alternative, or even the germ of one (and
clearly not in the sense of "it's not powerful enough", but of "it has a
different scope"). Even less do I understand why pandas (or a "pandas-like
library") should change name, if we are mostly talking about
internals/implementation issues (rather than about API/features). Compared
to this, the decision to rewrite the codebase or not is admittedly minor...

I see a vision in your email, and certainly many political/community aspects
I must be missing... but I still mostly miss the technical details
supporting this vision, and apparently
https://pandas-dev.github.io/pandas2 won't help me. Again, talking all
together about what makes you think that the current codebase needs a
complete rewrite would be great. Hope we can do this in one of the next devs
calls.

In any case,

Il giorno lun, 16/07/2018 alle 19.17 -0400, Wes McKinney ha scritto:
> [...]
> For the record, I am never going to argue that pandas should not be
> maintained or that the user base should be abandoned. However, I
> question whether the current core maintainers have a duty to be
> tethered to the issue backlog for the rest of their lives. Perhaps
> maintenance could be taken up by a for-profit company at some point?

The idea that I should sooner or later pay to use (a working version of) the
code I'm helping to write is even more depressing, to me, than the idea that
such effort will go partly wasted in a rewrite.

I'm personally "tethered" to a software which changed the way I work every
day, and to which I occasionally try to contribute back. The "backlog" is
not just a pile of dirt: it signals that (net of some possible better
triaging) there are things to fix in the software. I see any change, and
even a rewrite, as good basically if and only if it allows us to reduce this
"backlog".

Your answer to the question "are we wasting time on pandas?" is basically
"I'm not, you are". I wonder whether it was discussed in these terms at the
sprint!

Cheers,

Pietro

From wesmckinn at gmail.com Mon Jul 16 21:04:21 2018
From: wesmckinn at gmail.com (Wes McKinney)
Date: Mon, 16 Jul 2018 21:04:21 -0400
Subject: [Pandas-dev] Pandas Sprint Recap

hi Pietro,

On Mon, Jul 16, 2018 at 8:23 PM, Pietro Battiston wrote:
> thanks for the extensive reply. But sorry - it's probably because I
> missed the sprint, but I really can't follow you. Do you have any
> pointers to better understand the future pandas (alternative) you have
> in mind?
> [...]
> I see a vision in your email, and certainly many political/community
> aspects I must be missing... but I still mostly miss the technical
> details supporting this vision.
Well, here is a document from more than 2 years ago now:
https://pandas-dev.github.io/pandas2/goals.html

The way I would summarize the big picture goals are:

* Simpler, more predictable and precise memory management
* Ability to work with memory-mapped, on-disk data (this part is essential)
* Substantially less memory use for non-numeric data
* More civilized copy-on-write semantics
* Improved interoperability with the rest of the world (being able to reuse
  libraries and algorithms for analytics more gracefully)

I have been working very hard to present a sound, working, non-hand-wavy
solution to these low-level problems. I am a mathematician by training, and
so I am allergic to hand-wavy solutions or "designs" lacking in rigor in the
fine details.

I wrote this blog post addressing some of these topics and more:
http://wesmckinney.com/blog/apache-arrow-pandas-internals/

I have spent a great deal of energy in blog posts, slide decks, etc. laying
out the technical details about how this can work and why it is a sound
approach. I am not sure what more I can do other than to hope that those of
like mind and inclination to work on systems engineering follow along.

Given the above requirements, I don't see a way forward that does not
involve, at minimum, scrapping pandas.core.internals.
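(To make just the non-numeric memory point above concrete -- an illustrative
sketch; exact numbers depend on platform and library versions:)

    import pandas as pd
    import pyarrow as pa

    words = ["pandas", "arrow"] * 500000

    # pandas object dtype: one heap-allocated PyObject per value
    s = pd.Series(words)
    s.memory_usage(deep=True)   # reported in the tens of megabytes

    # Arrow string array: contiguous offset + data buffers, no per-value
    # Python objects
    arr = pa.array(words)
    arr.nbytes                  # roughly an order of magnitude smaller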
> Your answer to the question "are we wasting time on pandas?" is
> basically "I'm not, you are". I wonder whether it was discussed in
> these terms at the sprint!

Whoa, I never said this, and I do not believe anyone is wasting their time.
Maintaining/supporting pandas in its current state is a valid way to spend
your time, but my concern is that the feelings of obligation toward keeping
the status quo afloat may stop the community from making progress on
fundamental issues in performance and scalability. Realistically we need to
find a way to do both sustainably (though it's arguable whether development
now is sustainable).

-W

From shoyer at gmail.com Mon Jul 16 21:14:46 2018
From: shoyer at gmail.com (Stephan Hoyer)
Date: Mon, 16 Jul 2018 18:14:46 -0700
Subject: [Pandas-dev] Pandas Sprint Recap

On Mon, Jul 16, 2018 at 7:23 PM Pietro Battiston wrote:
> Even less do I understand why pandas (or a "pandas-like library")
> should change name, if we are mostly talking about
> internals/implementation issues (rather than about API/features).
> Compared to this, the decision to rewrite the codebase or not is
> admittedly minor...

Indeed, I suspect that a big part of the reason for suggesting a new project
name rather than pandas2 is that pandas also needs a major overhaul of
API/features, in addition to new internals. There are several main issues:

1. Several core features in pandas (notably dtypes and indexing) are
   difficult to use correctly in their current state, and need a major
   overhaul/simplification of their functionality.
2. The indexed pandas.Series and pandas.DataFrame aren't the right
   abstraction for many tasks. A simpler, index-free DataFrame would be a
   better data model for many of them. For tasks that really need axis
   labels, a tool like xarray might be more appropriate.
3. Despite its flaws, pandas is extremely useful, so it has grown a large
   number of features/contributions. It would be difficult to reimplement
   all of these features immediately on top of a new implementation.

Given infinite manpower, all these things could be changed incrementally and
in a backwards compatible manner on top of current pandas. But the result
would look very different from the pandas we know today. Of course, we are
vastly under-resourced, so it will take quite a long time to get to a better
place. I don't think it would serve either users or developers well to make
such major changes in an incremental way over the course of multiple years.

For these reasons, I agree with Wes that it wouldn't make sense to call the
hypothetical Python library he is working towards pandas, at least in the
sense that you use it by writing "import pandas". At best, we should write
"import pandas2". Or perhaps, as Wes suggests, it would more appropriately
be given a new name to indicate its new design/scope.

Best,
Stephan

From william.ayd at icloud.com Mon Jul 16 22:45:24 2018
From: william.ayd at icloud.com (William Ayd)
Date: Mon, 16 Jul 2018 19:45:24 -0700
Subject: [Pandas-dev] GroupBy Overhaul Proposal

Thanks Pietro for your feedback - very much appreciated!

> I would not worry too much about the fact that apply's performance on
> user-provided functions is bad (as long as it's documented), or that
> sum() returns different results from .apply(sum) (again, as long as
> it's documented in sum()'s docstring).

Perfectly fine to ignore performance for now, but I disagree on the second
point you make. To a core developer it makes perfect sense that .sum() and
.apply(sum) may return different results, but I don't think that is as
apparent to newcomers or just casual users of pandas.
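(One concrete case where the two diverge today -- a small sketch with toy
data:)

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1.0, np.nan, 2.0]})
    g = df.groupby("key")["val"]

    g.sum()       # a: 1.0, b: 2.0 -- the groupby method skips NaN
    g.apply(sum)  # a: NaN, b: 2.0 -- a plain sum propagates NaN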
In fact I'd worry about a newcomer thinking "why bother with other methods
when I can just send everything on through apply?" Even if I'm overly
concerned about that, I don't think there's a simple explanation of when
these should differ, which is again why I think it's a mistake to offer two
very similar but actually slightly different ways of going about that
calculation.

> So my (not very informed) opinion is that we could just simplify
> .apply() a lot, reducing it to a few simple rules/cases on the kind of
> output returned by the function, to be clearly documented, without
> suppressing it.

Totally agree here. I think there's a lot of overlap between .apply and
other methods which obfuscates the need for the former. One example was
cited above, but another thing to consider is its overlap with .agg. IMO the
"cleanest" use of apply is sending it a function which reduces to a scalar,
but in that case you could arguably just use .agg. The other use cases would
cover Series, DataFrame, collections, etc., and it kind of "just works" with
those, but I think it's impossible to make guarantees about how to properly
piece those types of objects back together. DataFrames of differing
dimensions can easily create sparse objects (which may or may not be the
intention), and for things like collections I'd question how we will walk
the tightrope of expectations if / when we get some Extension Arrays in
place that support those as first class objects in pandas.

> I miss the technical details, but I don't think we should force the
> output of a DataFrame.groupby()[col].anything() to be a DataFrame...
>
> ...However there might be a lot of scope for code simplification by
> having the above case _implemented_ as a DataFrameGroupBy (or just
> code in .groupby()), of which we then extract the column.

Yea that's a valid point. The suggestion here may be extreme, but with your
last statement there I think we are aligned at a high level on the intention
and how it could simplify the code.

> I understand the concern about the sum of A being called A, which is
> bad, but I would never want "Sum of A" to appear in my DataFrame. I
> think this is the typical task to be solved through a MultiIndex,
> consistently with .agg().

The problem with that is we have a variety of issues that are trying to work
around the MultiIndex columns being returned. #18366 is probably the main
issue (with 15 upvotes!), but you'll also see this loosely manifested in
#20241, #19978 and potentially quite a few more.

What is the apprehension with something like "Sum of A"? I'm not tied to
that naming per se, but it at least mimics Excel and therefore isn't that
farfetched of a solution. From an end user perspective I can see the big
gripe that we kind of force a MultiIndex column on them when they often
don't have that to begin with, and it just adds more complexity and method
chaining to their pipeline. Something like "Sum of A" (or whatever else
really) could maintain the original dimensions of the columns being used
while also being a solution that might work across all of the various
aggregation / transformation methods and acceptable arguments.

> I never used it indeed... but if it's really just a matter of
> transpose -> operate -> transpose, couldn't we just do this under the
> hood (and maybe warn the user in the docs about the performance/dtypes
> mess)?

That could be an option as well. Curious to hear what others think.
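(For reference, the under-the-hood equivalence being suggested -- a rough
sketch on a made-up frame with duplicate column labels:)

    import pandas as pd

    df = pd.DataFrame([[1, 2, 3, 4]], columns=["a", "a", "b", "b"])

    by_axis = df.groupby(level=0, axis=1).sum()
    by_transpose = df.T.groupby(level=0).sum().T

    # the axis=1 path is just transpose -> operate -> transpose
    assert by_axis.equals(by_transpose)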
- Will
From me at pietrobattiston.it Tue Jul 17 03:59:33 2018
From: me at pietrobattiston.it (Pietro Battiston)
Date: Tue, 17 Jul 2018 09:59:33 +0200
Subject: [Pandas-dev] GroupBy Overhaul Proposal
In-Reply-To: <09F05CAB-B491-416B-AC63-486C9AD132CD@icloud.com>
References: <1531782675.15070.18.camel@pietrobattiston.it> <09F05CAB-B491-416B-AC63-486C9AD132CD@icloud.com>
Message-ID: <1531814373.15070.27.camel@pietrobattiston.it>

Quick reply on a couple of points.

Il giorno lun, 16/07/2018 alle 19.45 -0700, William Ayd ha scritto:
> [...] Even if I'm overly concerned about that, I don't think there's a simple explanation of when these should differ, which is again why I think it's a mistake to offer two very similar but actually slightly different ways of going about that calculation.

In fact, my preference for keeping apply is pretty weak as long as there are alternatives that cover each of its use cases. But again, I'm not sure this is true.

>> I understand the concern about the sum of A being called A, which is bad, but I would never want "Sum of A" to appear in my DataFrame. I think this is the typical task to be solved through a MultiIndex, consistently with .agg().
>
> The problem with that is we have a variety of issues that are trying to work around the MultiIndex columns being returned. #18366 is probably the main issue (with 15 upvotes!),

Unless I'm wrong, #18366 is orthogonal to what we are discussing: unnamed lambdas would remain unnamed lambdas. (And the obvious solution to my eyes is to use named methods instead.)

> but you'll also see this loosely manifested in #20241, #19978 and potentially quite a few more.

I think we do need a better ability to do in-line renaming of MultiIndexed DataFrames, regardless of whether they come from groupby().

> What is the apprehension with something like "Sum of A"? I'm not tied to that naming per se, but it at least mimics Excel and therefore isn't that far-fetched of a solution.

Problems I have with "Sum of A":
- if, after creating all my columns, I want to e.g. select all columns that contain sums, I need to do some sort of "df[[col if col.startswith("Sum of")]]". Compare to "df.loc[:, ('Sum',)]"
- it would be the only case in pandas in which we decide how to call a column on behalf of the user
- ... and this unexpected behavior is introduced to solve a relatively specific case of aggregation (1 column -> 1 scalar)
- if one wants to allow the user to name the columns according to her taste, it's pretty simple to introduce an argument which takes a string to be .format()ted with the name of the column (or even of the method), e.g. name="Sum of {}"
- ... although it is actually pretty simple to just do df.columns = "Sum of " + df.columns
- agg already returns MultiIndexes (when passed multiple functions)
- we would be following Excel as an example :-D

By the way, despite some related issues, I still think tuples can be first class citizens of flat indexes. So if one doesn't like MultiIndexes, or they do not fit one's needs, ("sum", "A") can well be a label in a regular index.
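To make the comparison concrete on a toy frame (nothing here is new API; the flattening at the end is just a list comprehension):

import pandas as pd

df = pd.DataFrame({"g": ["x", "x", "y"], "A": [1, 2, 3], "B": [4.0, 5.0, 6.0]})
res = df.groupby("g").agg(["sum", "min"])  # columns: MultiIndex [("A", "sum"), ...]

# select all the sums through the MultiIndex, no string matching needed:
sums = res.loc[:, (slice(None), "sum")]

# and flattening to user-chosen names afterwards is trivial:
res.columns = ["{} of {}".format(func, col) for col, func in res.columns]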
Pietro

From me at pietrobattiston.it Tue Jul 17 04:29:13 2018
From: me at pietrobattiston.it (Pietro Battiston)
Date: Tue, 17 Jul 2018 10:29:13 +0200
Subject: [Pandas-dev] Pandas Sprint Recap
In-Reply-To: 
References: <1531501924.11738.110.camel@pietrobattiston.it> <1531780555.15070.13.camel@pietrobattiston.it> <1531787017.15070.24.camel@pietrobattiston.it>
Message-ID: <1531816153.15070.29.camel@pietrobattiston.it>

Il giorno lun, 16/07/2018 alle 21.04 -0400, Wes McKinney ha scritto:
> hi Pietro,
>
> On Mon, Jul 16, 2018 at 8:23 PM, Pietro Battiston wrote:
>> Hi Wes,
>>
>> thanks for the extensive reply. But sorry, it's probably that I missed the sprint, but I really can't follow you. Do you have any pointers to better understand the future pandas (alternative) you have in mind? I know about Arrow, but I see it as a future potentiality for pandas, not as an alternative, or even the germ of it (and clearly not in the sense of "it's not powerful enough", but of "it has different scope"). Even less do I understand why pandas (or a "pandas-like library") should change name, if we are mostly talking about internals/implementation issues (rather than about API/features). Compared to this, the decision to rewrite the codebase or not is admittedly minor...
>>
>> I see a vision in your email, and certainly many political/community aspects I must be missing... but I still mostly miss the technical details supporting this vision, and apparently https://pandas-dev.github.io/pandas2 won't help me. Again, talking all together about what makes you think that the current codebase needs a complete rewrite would be great. Hope we can do this in one of the next devs calls.
>
> Well, here is a document from more than 2 years ago now:
>
> https://pandas-dev.github.io/pandas2/goals.html

Yes, that's what I was referring to...

> The way I would summarize the big picture goals are:
>
> * Simpler, more predictable and precise memory management
> * Ability to work with memory-mapped, on-disk data (this part is essential)
> * Substantially less memory use for non-numeric data
> * More civilized copy-on-write semantics
> * Improved interoperability with the rest of the world (being able to reuse libraries, algorithms for analytics more gracefully)

This is all very important stuff, but my rough guess is that it concerns no more than 25% of the pandas codebase, and no more than 15% of open bugs.

> I have been working very hard to present a sound, working, non-hand-wavy solution to these low-level problems. I am a mathematician by training, and so I am allergic to hand-wavy solutions or "designs" lacking in rigor in the fine details.
>
> I wrote this blog post addressing some of these topics and more:
> http://wesmckinney.com/blog/apache-arrow-pandas-internals/

I happened to have read this too, but my comment above more or less applies unchanged, with the exception of point 10, "Eager evaluation model, no query planning", which is (probably) incompatible with the current pandas codebase/API. That's in principle Dask's job. Now, I don't know Dask well enough to judge whether it's up to the task, but the criticisms in your "Addendum" are again mainly about pandas, not about how Dask extends pandas. And they are again about the memory management/IO, which again is where I thought we thought Arrow would improve pandas.

> I have spent a great deal of energy in blog posts, slide decks, etc.
> laying out the technical details about how it can work and why it is a sound approach. I am not sure what more I can do other than to hope that those of like mind and inclination to work on systems engineering follow along.

This conversation (which is the first one in which I see Arrow as an alternative, not a complement, to pandas) didn't start with blog posts, slide decks etc. (which I had partly read), it started with you saying "I spent some time at the sprint looking through pandas.core.internals, pandas.core.generic". I would really love to know/talk about that.

> Given the above requirements, I don't see a way forward that does not involve at minimum scrapping pandas.core.internals.

I'm not making claims on the amount of lines of code (or methods/features) we should save or drop. I'm just claiming that if we want to change "import pandas" to something else different enough to justify a change of name,
- it would be mostly because of changes in the API, not of implementation
- not having human forces enough to solve current pandas issues is not a valid reason (actually the opposite)
- it would be nice to have a clearer roadmap (again: the pandas2 docs are not clear at all about this) and discussion about the future of pandas (as in "current name and user requirements/expectations", not "current codebase")

> [...]
>> Your answer to the question "are we wasting time on pandas?" is basically "I'm not, you are". I wonder whether it was discussed in these terms at the sprint!
>
> Whoa, I never said this and I do not believe anyone is wasting their time.

I am a mathematician by training, give me the premises and I will naturally look for logical consequences ;-)

Pietro

From me at pietrobattiston.it Tue Jul 17 05:00:58 2018
From: me at pietrobattiston.it (Pietro Battiston)
Date: Tue, 17 Jul 2018 11:00:58 +0200
Subject: [Pandas-dev] Pandas Sprint Recap
In-Reply-To: 
References: <1531501924.11738.110.camel@pietrobattiston.it> <1531780555.15070.13.camel@pietrobattiston.it> <1531787017.15070.24.camel@pietrobattiston.it>
Message-ID: <1531818058.15070.31.camel@pietrobattiston.it>

Hi Stephan,

I appreciate that your email focuses on specific API changes (again, the only reason, in my view, to justify a pandas 2, or even a change of name)

Il giorno lun, 16/07/2018 alle 18.14 -0700, Stephan Hoyer ha scritto:
> On Mon, Jul 16, 2018 at 7:23 PM Pietro Battiston wrote:
> [...]
> 2. The indexed pandas.Series and pandas.DataFrame isn't the right abstraction for many tasks. A simpler, index-free DataFrame would be a better data model for many tasks. For tasks that really need axis labels, a tool like xarray might be more appropriate.

... but this makes me smile.

First, because labels/indexes are in my experience the main reason why people come to pandas (another important reason is having multiple dtypes in a single data structure, but numpy structured arrays also do this).

Second, because supporting a DataFrame with no index would be pretty easy in the current codebase/API (e.g. "index=False"). I know it would break some code, but it would be wrong code anyway (that is, code that doesn't decouple indexes from data storage).

Third, because now that the default index is RangeIndex(n) (which a user is free not to rely on in any way), and as long as broken code is fixed (see above), a DataFrame with no index wouldn't really be "simpler".
Fourth, because you cite xarray as an alternative... but unless I'm wrong, labels are now optional in xarray (precisely the path I suggest we could take).

More in general, in my view, asking users to choose between multiple dtypes and indexes would bring the state of data manipulation in Python backwards by several years (and probably behind the state of data manipulation in R).

> 3. Despite its flaws, pandas is extremely useful, so it has grown a large number of features/contributions. It would be difficult to reimplement all of these features immediately on top of a new implementation.
>
> Given infinite manpower, all these things could be changed incrementally and in a backwards compatible manner on top of current pandas. But the result would look very different from the pandas we know today. Of course, we are vastly under-resourced, so it will take quite a long time to get to a better place. I don't think it would serve either users or developers well to make such major changes in an incremental way over the course of multiple years.
>
> For these reasons, I agree with Wes that it wouldn't make sense to call the hypothetical Python library he is working towards pandas, at least in the sense that you use it by writing "import pandas". At best, we should write "import pandas2". Or perhaps, as Wes suggests, it would more appropriately be given a new name to indicate its new design/scope.

I understand the need to change the import name not to break people's code. But again, I fail to see a new "scope". I don't see an analysis of which (share) of the current pandas problems (=issues) would be solved.

Add to this the urge of rewriting the code base (while at the same time I'm having trouble getting any comments on _how_ to restructure - or structure, in a potential rewrite - an important part of it [1]), and I can't help but fear that we are trying very hard to reinvent the wheel.

Pietro

[1] https://github.com/pandas-dev/pandas/wiki/(Tentative)-rules-for-restructuring-indexing-code

From shoyer at gmail.com Tue Jul 17 18:28:56 2018
From: shoyer at gmail.com (Stephan Hoyer)
Date: Tue, 17 Jul 2018 15:28:56 -0700
Subject: [Pandas-dev] Pandas Sprint Recap
In-Reply-To: <1531818058.15070.31.camel@pietrobattiston.it>
References: <1531501924.11738.110.camel@pietrobattiston.it> <1531780555.15070.13.camel@pietrobattiston.it> <1531787017.15070.24.camel@pietrobattiston.it> <1531818058.15070.31.camel@pietrobattiston.it>
Message-ID: 

On Tue, Jul 17, 2018 at 2:01 AM Pietro Battiston wrote:

> First, because labels/indexes are in my experience the main reason why people come to pandas (another important reason is having multiple dtypes in a single data structure, but numpy structured arrays also do this).

Certainly the functionality of indexes is valuable (especially for some use-cases), but I don't think the particular way we expose them is optimal. In my experience, the need to call reset_index() or assign directly to .index or .columns is a frequent source of annoyance.

> Second because supporting a DataFrame with no index would be pretty easy in the current codebase/API (e.g. "index=False"). I know it would break some code, but it would be wrong code anyway (that is, code that doesn't decouple indexes from data storage).
>
> Third, because now that the default index is RangeIndex(n) (which a user is free not to rely on in any way), and as long as broken code is fixed (see above), a DataFrame with no index wouldn't really be "simpler".
> It would mostly amount to deciding whether to show the index or not when printing to screen/doing IO.

Sure, you *could* fix all this on top of the current pandas data model. But it would be quite a challenging effort, and the full pandas data model would remain quite complex.

The current pandas data model looks something like this:

DataFrame:
- values: BlockManager wrapping 1d and/or 2d NumPy arrays
- index: Index
- columns: Index

The data model I'd like to work with in the future for most use-cases involving tabular data is something closer to:

DataFrame:
- data: OrderedDict[str, Array]
- indexes: OrderedDict[str, Index]

Conveniently, this looks very similar to the data model of Arrow or R. Optional indexes would provide fast reverse lookup for some subset of dataframe columns.
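A toy sketch just to make the shape of this concrete (SimpleFrame and its method set are made up, and plain NumPy arrays stand in for whatever Array type this would use):

from collections import OrderedDict
import numpy as np

class SimpleFrame:
    # data: column name -> 1d array; indexes: column name -> lookup structure
    def __init__(self, data, indexes=None):
        self.data = OrderedDict(data)
        self.indexes = OrderedDict(indexes or {})

    def __getitem__(self, name):
        return self.data[name]

df = SimpleFrame({"id": np.array([10, 20]), "x": np.array([1.5, 2.5])})
df["x"]  # array([1.5, 2.5])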
This pretty obviously could not support everything pandas can do today. For example, you couldn't have a hierarchical index for column names. But in my experience, you're better off working with "tidy data" anyways, as popularized in R's tidyverse.

> But again, I fail to see a new "scope". I don't see an analysis of which (share) of the current pandas problems (=issues) would be solved.

One way in which we have reduced pandas' scope recently is the proposed deprecation of Panel.

This is an example of focusing pandas on tabular data rather than N-dimensional arrays.

From mrocklin at gmail.com Tue Jul 17 18:46:45 2018
From: mrocklin at gmail.com (Matthew Rocklin)
Date: Tue, 17 Jul 2018 18:46:45 -0400
Subject: [Pandas-dev] Pandas Sprint Recap
In-Reply-To: 
References: <1531501924.11738.110.camel@pietrobattiston.it> <1531780555.15070.13.camel@pietrobattiston.it> <1531787017.15070.24.camel@pietrobattiston.it> <1531818058.15070.31.camel@pietrobattiston.it>
Message-ID: 

Has Pandas ever done a user survey?

I would be curious to know the answer to the question "do you make heavy use of the Pandas index" among users, and how that correlates with different domain/industry.

On Tue, Jul 17, 2018 at 6:29 PM Stephan Hoyer wrote:

>> First, because labels/indexes are in my experience the main reason why people come to pandas (another important reason is having multiple dtypes in a single data structure, but numpy structured arrays also do this).
>
> Certainly the functionality of indexes is valuable (especially for some use-cases), but I don't think the particular way we expose them is optimal. In my experience, the need to call reset_index() or assign directly to .index or .columns is a frequent source of annoyance.
>
>> Second because supporting a DataFrame with no index would be pretty easy in the current codebase/API (e.g. "index=False"). I know it would break some code, but it would be wrong code anyway (that is, code that doesn't decouple indexes from data storage).
>>
>> Third, because now that the default index is RangeIndex(n) (which a user is free not to rely on in any way), and as long as broken code is fixed (see above), a DataFrame with no index wouldn't really be "simpler". It would mostly amount to deciding whether to show the index or not when printing to screen/doing IO.
>
> Sure, you *could* fix all this on top of the current pandas data model. But it would be quite a challenging effort, and the full pandas data model would remain quite complex.
>
> The current pandas data model looks something like this:
>
> DataFrame:
> - values: BlockManager wrapping 1d and/or 2d NumPy arrays
> - index: Index
> - columns: Index
>
> The data model I'd like to work with in the future for most use-cases involving tabular data is something closer to:
>
> DataFrame:
> - data: OrderedDict[str, Array]
> - indexes: OrderedDict[str, Index]
>
> Conveniently, this looks very similar to the data model of Arrow or R. Optional indexes would provide fast reverse lookup for some subset of dataframe columns.
>
> This pretty obviously could not support everything pandas can do today. For example, you couldn't have a hierarchical index for column names. But in my experience, you're better off working with "tidy data" anyways, as popularized in R's tidyverse.
>
>> But again, I fail to see a new "scope". I don't see an analysis of which (share) of the current pandas problems (=issues) would be solved.
>
> One way in which we have reduced pandas' scope recently is the proposed deprecation of Panel.
>
> This is an example of focusing pandas on tabular data rather than N-dimensional arrays.

From william.ayd at icloud.com Tue Jul 17 19:10:47 2018
From: william.ayd at icloud.com (William Ayd)
Date: Tue, 17 Jul 2018 16:10:47 -0700
Subject: [Pandas-dev] GroupBy Overhaul Proposal
In-Reply-To: <1531814373.15070.27.camel@pietrobattiston.it>
References: <1531782675.15070.18.camel@pietrobattiston.it> <09F05CAB-B491-416B-AC63-486C9AD132CD@icloud.com> <1531814373.15070.27.camel@pietrobattiston.it>
Message-ID: 

> In fact, my preference for keeping apply is pretty weak as long as there are alternatives that cover each of its use cases. But again, I'm not sure this is true.

Just to clarify my position:

1. .apply() + UDF reducing to a scalar should be replaceable with .agg() + same UDF (even though there are differences today?)
2. .apply() + UDF returning Series / DataFrame / collection doesn't have anything else to cover it

But with #2 above I think it's dangerous to assume that .apply can always do the "right thing" with those types of inputs. We don't make any assertions about the indexing / labeling of returned Series and DataFrames. As far as collections are concerned I'm not sure if there will be a clear answer on how to handle those assuming we start getting EAs that add first-class support for those.

> Unless I'm wrong, #18366 is orthogonal to what we are discussing: unnamed lambdas would remain unnamed lambdas. (And the obvious solution to my eyes is to use named methods instead.)

I don't think this is orthogonal. Your concern is valid on lambdas and I don't know what the solution there is (perhaps some kind of keyword argument) but without getting tripped up on that I don't think it's immediately apparent that the returned object for a DataFrame with columns 'a', 'b', 'c' will have a single level of columns when called as follows:

- df.groupby('a').agg(sum)
- df.groupby('a').agg({'b': sum, 'c': min})

Yet the following will yield a MultiIndex column:

- df.groupby('a').agg([sum])
- df.groupby('a').agg({'b': [sum], 'c': min})
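Spelled out on a toy frame (purely illustrative):

import pandas as pd

df = pd.DataFrame({"a": ["x", "x", "y"], "b": [1, 2, 3], "c": [4, 5, 6]})

df.groupby("a").agg(sum).columns    # flat: Index(['b', 'c'])
df.groupby("a").agg([sum]).columns  # MultiIndex: [('b', 'sum'), ('c', 'sum')]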
If you reduce the returned columns to "sum of b" and "min of c" you can ensure that the returned columns have the same number of levels regardless of call signature, AND have the added bonus of not obfuscating what type of aggregation was performed in the former two examples.

Of course the end user may ultimately decide that they don't like those labels at all and completely override them, but that effort becomes much easier if they can make guarantees around the number of levels of the returned object (especially if it's just one!).

> - if, after creating all my columns, I want to e.g. select all columns that contain sums, I need to do some sort of "df[[col if col.startswith("Sum of")]]". Compare to "df.loc[:, ('Sum',)]"

Unless I am mistaken you would have to do something like "df.groupby('a').agg([sum]).loc[:, slice(None, 'sum')]" to get that to work. I don't think that syntax really is that clean and it starts taking us down the path of advanced indexing for what may start off to the end user as a very simple aggregation exercise.

> - it would be the only case in pandas in which we decide how to call a column on behalf of the user

Well we have to do something to reduce ambiguity... I think a consistent naming convention and dimension for the columns across all invocations is strongly preferable to inserting a column level some of the time.

> - if one wants to allow the user to name the columns according to her taste, it's pretty simple to introduce an argument which takes a string to be .format()ted with the name of the column (or even of the method), e.g. name="Sum of {}"

Agreed. In my head I feel like this defaults to something like f"{fname} of {colname}" but gives the user potentially the option to override. By default keep the same number of levels as what is being passed in, though maybe None as an argument reverts to the old style behavior of simply inserting a new column index level.

> By the way, despite some related issues, I still think tuples can be first class citizens of flat indexes. So if one doesn't like MultiIndexes, or they do not fit one's needs, ("sum", "A") can well be a label in a regular index.

You know better than I do here, but again I don't think it makes for a good user experience to convert columns with one level into multiple levels after a GroupBy operation regardless of how you could subsequently access those values.

William Ayd
william.ayd at icloud.com

From me at pietrobattiston.it Wed Jul 18 03:01:22 2018
From: me at pietrobattiston.it (Pietro Battiston)
Date: Wed, 18 Jul 2018 09:01:22 +0200
Subject: [Pandas-dev] GroupBy Overhaul Proposal
In-Reply-To: 
References: <1531782675.15070.18.camel@pietrobattiston.it> <09F05CAB-B491-416B-AC63-486C9AD132CD@icloud.com> <1531814373.15070.27.camel@pietrobattiston.it>
Message-ID: <1531897282.3286.6.camel@pietrobattiston.it>

Il giorno mar, 17/07/2018 alle 16.10 -0700, William Ayd ha scritto:
>>> In fact, my preference for keeping apply is pretty weak as long as there are alternatives that cover each of its use cases. But again, I'm not sure this is true.
>>
>> Just to clarify my position:
>>
>> 1. .apply() + UDF reducing to a scalar should be replaceable with .agg() + same UDF (even though there are differences today?)
>> 2. .apply() + UDF returning Series / DataFrame / collection doesn't have anything else to cover it

.transform() at least covers the case in which the shape of the chunk is unchanged.
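A toy illustration of the three cases, with a frame having a group column "g" and a value column "v":

import pandas as pd

df = pd.DataFrame({"g": ["x", "x", "y"], "v": [1.0, 4.0, 9.0]})

df.groupby("g")["v"].agg("sum")                         # one scalar per group: agg
df.groupby("g")["v"].transform(lambda s: s - s.mean())  # same shape as input: transform
df.groupby("g")["v"].apply(lambda s: s.nlargest(1))     # group-dependent shape: only apply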
> But with #2 above I think it's dangerous to assume that .apply can always do the "right thing" with those types of inputs. We don't make any assertions about the indexing / labeling of returned Series and DataFrames.

There is a simple way to stop throwing magic at users, and it is to clearly document which cases .apply() covers (and which should be covered by .agg() or transform()), reflecting the actual guesswork taking place in the code.

By the way, my understanding (without having looked at the code) is that

UDF returns Series -> concat in a new Series
UDF returns DataFrame -> concat in a new DataFrame

and the guesswork mostly concerns understanding whether the new index is the same as the old. Am I missing anything relevant?

Now, I would be all for suppressing a complicated function by replacing it with simpler ways to do the same thing. But for instance I would like the following to still work with groupby().something():

def remove_group_outliers(group):
    outliers = ...  # code to identify them
    return group[~group.index.isin(outliers)]

... and I currently don't see any way but .apply().

> As far as collections are concerned I'm not sure if there will be a clear answer on how to handle those assuming we start getting EAs that add first-class support for those.

Do you have any pointer/example? I'm missing the relation between collections and .apply().

>> Unless I'm wrong, #18366 is orthogonal to what we are discussing: unnamed lambdas would remain unnamed lambdas. (And the obvious solution to my eyes is to use named methods instead.)
>
> I don't think this is orthogonal. Your concern is valid on lambdas and I don't know what the solution there is (perhaps some kind of keyword argument) but without getting tripped up on that I don't think it's immediately apparent that the returned object for a DataFrame with columns 'a', 'b', 'c' will have a single level of columns when called as follows:
>
> - df.groupby('a').agg(sum)
> - df.groupby('a').agg({'b': sum, 'c': min})
>
> Yet the following will yield a MultiIndex column:
>
> - df.groupby('a').agg([sum])
> - df.groupby('a').agg({'b': [sum], 'c': min})

The rule is not very complicated either (if correctly documented), but anyway, the inconsistency would disappear by just having the first two examples also return a MultiIndex.

... and maybe provide the users a very simple way to flatten MultiIndexes (see below).

> If you reduce the returned columns to "sum of b" and "min of c" you can ensure that the returned columns have the same number of levels regardless of call signature, AND have the added bonus of not obfuscating what type of aggregation was performed in the former two examples.

Both can be solved through a MI, or through an Index(dtype=object) containing tuples.

> Of course the end user may ultimately decide that they don't like those labels at all and completely override them, but that effort becomes much easier if they can make guarantees around the number of levels of the returned object

I agree on this

> (especially if it's just one!).

... not on that. MI (or tuples) -> arbitrary strings is much simpler/cleaner to do than arbitrary strings -> MI (or tuples)

>> - if, after creating all my columns, I want to e.g. select all columns that contain sums, I need to do some sort of "df[[col if col.startswith("Sum of")]]". Compare to "df.loc[:, ('Sum',)]"
>
> Unless I am mistaken you would have to do something like "df.groupby('a').agg([sum]).loc[:, slice(None, 'sum')]" to get that to work.
Yeah, I had swapped the levels, it is

df.groupby('a').agg([sum]).loc[:, (slice(None), 'sum')]

> I don't think that syntax really is that clean

In my code I always start by defining

WE = slice(None)  # WhatEver

and we could advertise this as a way to make the syntax shorter, but regardless of that, it definitely is cleaner than any string manipulation.

> and it starts taking us down the path of advanced indexing for what may start off to the end user as a very simple aggregation exercise.

On this I agree with you. I'm all for providing
- a MultiIndex.flatten() method which allows me to do res.columns = res.columns.flatten("{} of {}".format)
- a simple way to do the above in-line (which is already being discussed, regardless of groupby)

> [...]
>> - it would be the only case in pandas in which we decide how to call a column on behalf of the user
>
> Well we have to do something to reduce ambiguity... I think a consistent naming convention and dimension for the columns across all invocations is strongly preferable to inserting a column level some of the time.

Again, I agree on this.

>> - if one wants to allow the user to name the columns according to her taste, it's pretty simple to introduce an argument which takes a string to be .format()ted with the name of the column (or even of the method), e.g. name="Sum of {}"
>
> Agreed. In my head I feel like this defaults to something like f"{fname} of {colname}" but gives the user potentially the option to override. By default keep the same number of levels as what is being passed in, though maybe None as an argument reverts to the old style behavior of simply inserting a new column index level.

Agree on everything but the default, again, because it is arbitrary

>> By the way, despite some related issues, I still think tuples can be first class citizens of flat indexes. So if one doesn't like MultiIndexes, or they do not fit one's needs, ("sum", "A") can well be a label in a regular index.
>
> You know better than I do here, but again I don't think it makes for a good user experience to convert columns with one level into multiple levels after a GroupBy operation regardless of how you could subsequently access those values.

Notice that I'm not talking about a MultiIndex, but about a flat index. But it is an inferior solution, given the API we already expose, to the MultiIndex.

Pietro

From me at pietrobattiston.it Wed Jul 18 03:23:35 2018
From: me at pietrobattiston.it (Pietro Battiston)
Date: Wed, 18 Jul 2018 09:23:35 +0200
Subject: [Pandas-dev] Pandas Sprint Recap
In-Reply-To: 
References: <1531501924.11738.110.camel@pietrobattiston.it> <1531780555.15070.13.camel@pietrobattiston.it> <1531787017.15070.24.camel@pietrobattiston.it> <1531818058.15070.31.camel@pietrobattiston.it>
Message-ID: <1531898615.3286.8.camel@pietrobattiston.it>

Il giorno mar, 17/07/2018 alle 15.28 -0700, Stephan Hoyer ha scritto:
> On Tue, Jul 17, 2018 at 2:01 AM Pietro Battiston wrote:
>> First, because labels/indexes are in my experience the main reason why people come to pandas (another important reason is having multiple dtypes in a single data structure, but numpy structured arrays also do this).
>
> Certainly the functionality of indexes is valuable (especially for some use-cases), but I don't think the particular way we expose them is optimal.
> In my experience, the need to call reset_index() or assign directly to .index or .columns is a frequent source of annoyance.

I agree if you're referring to the impossibility to do this in-line... but that's not extremely difficult to solve.

>> Second because supporting a DataFrame with no index would be pretty easy in the current codebase/API (e.g. "index=False"). I know it would break some code, but it would be wrong code anyway (that is, code that doesn't decouple indexes from data storage).
>>
>> Third, because now that the default index is RangeIndex(n) (which a user is free not to rely on in any way), and as long as broken code is fixed (see above), a DataFrame with no index wouldn't really be "simpler". It would mostly amount to deciding whether to show the index or not when printing to screen/doing IO.
>
> Sure, you *could* fix all this on top of the current pandas data model. But it would be quite a challenging effort, and the full pandas data model would remain quite complex.
>
> The current pandas data model looks something like this:
>
> DataFrame:
> - values: BlockManager wrapping 1d and/or 2d NumPy arrays
> - index: Index
> - columns: Index
>
> The data model I'd like to work with in the future for most use-cases involving tabular data is something closer to:
>
> DataFrame:
> - data: OrderedDict[str, Array]
> - indexes: OrderedDict[str, Index]
>
> Conveniently, this looks very similar to the data model of Arrow or R. Optional indexes would provide fast reverse lookup for some subset of dataframe columns.

How is this different (API-wise) from a list of (same length, I assume) Series?

> This pretty obviously could not support everything pandas can do today. For example, you couldn't have a hierarchical index for column names. But in my experience, you're better off working with "tidy data" anyways, as popularized in R's tidyverse.

We have deprecated Panel (and I think it was the right choice) because you can always tell (and I've often told) users "work with MultiIndex and stack/unstack, that's efficient and much better and easier to understand".

... do you instead see all the stack/unstack machinery as just useless?! Do you have a feeling of what the user base (present and potential) thinks about this? (Do you think it matters in some way?)

>> But again, I fail to see a new "scope". I don't see an analysis of which (share) of the current pandas problems (=issues) would be solved.
>
> One way in which we have reduced pandas' scope recently is the proposed deprecation of Panel.
>
> This is an example of focusing pandas on tabular data rather than N-dimensional arrays.

See above, we have lost N-dimensional arrays (with N=3 or 4) but luckily not the concept of N-features data. I can't even think of retrieving data with pandaSDMX without MultiIndex columns, let alone manipulate it in any way.

We are constantly comparing pandas to R, but while so far I have always implicitly thought we were proud of the agile manipulation abilities that pandas has and R frames don't, only now I seem to understand we are envious, for some weird reason, of their lack of features. I understand we could be envious that they have a cleaner codebase... but since it's GPL covered, we actually have no idea :-D

And more seriously, we have talked very little (also in the sprint, from what I understand) of the possibility to improve/clean the internals code.
I think we agree on the fact that having data stored in a BlockManager based on as many arrays as there are dtypes does not per se really qualify as rocket science. But then, even if we decided that we are not good programmers enough to implement this cleanly, even the solution of storing stuff as single arrays but leaving the API (e.g. .columns as Index) as it is is superior to the DataFrame being a mere collection of (same length) Series.

Pietro

From shoyer at gmail.com Wed Jul 18 12:49:37 2018
From: shoyer at gmail.com (Stephan Hoyer)
Date: Wed, 18 Jul 2018 09:49:37 -0700
Subject: [Pandas-dev] Pandas Sprint Recap
In-Reply-To: 
References: <1531501924.11738.110.camel@pietrobattiston.it> <1531780555.15070.13.camel@pietrobattiston.it> <1531787017.15070.24.camel@pietrobattiston.it> <1531818058.15070.31.camel@pietrobattiston.it>
Message-ID: 

On Tue, Jul 17, 2018 at 3:47 PM Matthew Rocklin wrote:

> Has Pandas ever done a user survey?
>
> I would be curious to know the answer to the question "do you make heavy use of the Pandas index" among users, and how that correlates with different domain/industry.

This is a great question. I don't think we've ever done this sort of research.

My suspicion is that most of the time users ignore the index, and find the way it is used heavily in pandas more annoying than helpful. But certainly there are some use-cases for which automatic alignment with an index is fantastic.

From shoyer at gmail.com Wed Jul 18 13:16:17 2018
From: shoyer at gmail.com (Stephan Hoyer)
Date: Wed, 18 Jul 2018 10:16:17 -0700
Subject: [Pandas-dev] Pandas Sprint Recap
In-Reply-To: <1531898615.3286.8.camel@pietrobattiston.it>
References: <1531501924.11738.110.camel@pietrobattiston.it> <1531780555.15070.13.camel@pietrobattiston.it> <1531787017.15070.24.camel@pietrobattiston.it> <1531818058.15070.31.camel@pietrobattiston.it> <1531898615.3286.8.camel@pietrobattiston.it>
Message-ID: 

On Wed, Jul 18, 2018 at 12:23 AM Pietro Battiston wrote:

>> The data model I'd like to work with in the future for most use-cases involving tabular data is something closer to:
>>
>> DataFrame:
>> - data: OrderedDict[str, Array]
>> - indexes: OrderedDict[str, Index]
>>
>> Conveniently, this looks very similar to the data model of Arrow or R. Optional indexes would provide fast reverse lookup for some subset of dataframe columns.
>
> How is this different (API-wise) from a list of (same length, I assume) Series?

It does sound very similar to me -- the DataFrame just provides a nice way to do collective operations.

> See above, we have lost N-dimensional arrays (with N=3 or 4) but luckily not the concept of N-features data. I can't even think of retrieving data with pandaSDMX without MultiIndex columns, let alone manipulate it in any way.
> ...
> But then, even if we decided that we are not good programmers enough to implement this cleanly, even the solution of storing stuff as single arrays but leaving the API (e.g. .columns as Index) as it is is superior to the DataFrame being a mere collection of (same length) Series.

To be entirely clear, I'm only speaking for myself -- not Wes or the entire pandas development team. I wasn't even at the sprint!

I certainly find stacking/unstacking useful, but it isn't the only way to manipulate multi-dimensional tabular data. I do think R's tidyverse shows an alternative viable path.
Without having used it extensively, it appears to be more consistent and easier to use than pandas.

For multi-dimensional data analysis, these days I generally prefer to use xarray (disclaimer: my project) instead of a pandas.MultiIndex. I find it more satisfying to have indexed N-D arrays (in an xarray.Dataset) rather than indexed 2D dataframes. The way that pandas.DataFrame uses an Index for both row and column labels makes it in some ways similar to the fixed 2D numpy.matrix, which personally I find less useful. It also makes all pandas operations more complex to implement than those on index-free "simple" dataframes.

These are certainly not mutually exclusive options -- there is room for packages that provide all of these data models (simple dataframes like R, indexed dataframes like pandas, N-D labeled arrays like xarray, and even N-D arrays without labels like NumPy). I do hope that one day all of them can share the same foundation -- that would have major benefits for the ecosystem.

From wesmckinn at gmail.com Wed Jul 18 13:30:01 2018
From: wesmckinn at gmail.com (Wes McKinney)
Date: Wed, 18 Jul 2018 13:30:01 -0400
Subject: [Pandas-dev] Pandas Sprint Recap
In-Reply-To: 
References: <1531501924.11738.110.camel@pietrobattiston.it> <1531780555.15070.13.camel@pietrobattiston.it> <1531787017.15070.24.camel@pietrobattiston.it> <1531818058.15070.31.camel@pietrobattiston.it> <1531898615.3286.8.camel@pietrobattiston.it>
Message-ID: 

> These are certainly not mutually exclusive options -- there is room for packages that provide all of these data models (simple dataframes like R, indexed dataframes like pandas, N-D labeled arrays like xarray, and even N-D arrays without labels like NumPy). I do hope that one day all of them can share the same foundation -- that would have major benefits for the ecosystem.

On this subject -- one of the objectives of the next years is to enable dplyr / tidyverse expressions to run atop Arrow-based data frames (in addition to R's native data frames). While this will require some type coercion in some cases (for strings, non-numeric data) the net benefits in terms of SIMD / parallelization / out-of-core computing should be well worth it.

The tidyverse developers have created much cleaner boundaries between the expression / API semantics and the implementation details than we have, and this has all happened in the last 5 years.

On Wed, Jul 18, 2018 at 1:16 PM, Stephan Hoyer wrote:
> On Wed, Jul 18, 2018 at 12:23 AM Pietro Battiston wrote:
>>> The data model I'd like to work with in the future for most use-cases involving tabular data is something closer to:
>>>
>>> DataFrame:
>>> - data: OrderedDict[str, Array]
>>> - indexes: OrderedDict[str, Index]
>>>
>>> Conveniently, this looks very similar to the data model of Arrow or R. Optional indexes would provide fast reverse lookup for some subset of dataframe columns.
>>
>> How is this different (API-wise) from a list of (same length, I assume) Series?
>
> It does sound very similar to me -- the DataFrame just provides a nice way to do collective operations.
>
>> See above, we have lost N-dimensional arrays (with N=3 or 4) but luckily not the concept of N-features data. I can't even think of retrieving data with pandaSDMX without MultiIndex columns, let alone manipulate it in any way.
>> ...
>> But then, even if we decided that we are not good programmers enough to implement this cleanly, even the solution of storing stuff as single arrays but leaving the API (e.g. .columns as Index) as it is is superior to the DataFrame being a mere collection of (same length) Series.
>
> To be entirely clear, I'm only speaking for myself -- not Wes or the entire pandas development team. I wasn't even at the sprint!
>
> I certainly find stacking/unstacking useful, but it isn't the only way to manipulate multi-dimensional tabular data. I do think R's tidyverse shows an alternative viable path. Without having used it extensively, it appears to be more consistent and easier to use than pandas.
>
> For multi-dimensional data analysis, these days I generally prefer to use xarray (disclaimer: my project) instead of a pandas.MultiIndex. I find it more satisfying to have indexed N-D arrays (in an xarray.Dataset) rather than indexed 2D dataframes. The way that pandas.DataFrame uses an Index for both row and column labels makes it in some ways similar to the fixed 2D numpy.matrix, which personally I find less useful. It also makes all pandas operations more complex to implement than those on index-free "simple" dataframes.
>
> These are certainly not mutually exclusive options -- there is room for packages that provide all of these data models (simple dataframes like R, indexed dataframes like pandas, N-D labeled arrays like xarray, and even N-D arrays without labels like NumPy). I do hope that one day all of them can share the same foundation -- that would have major benefits for the ecosystem.

From me at pietrobattiston.it Wed Jul 18 13:48:46 2018
From: me at pietrobattiston.it (Pietro Battiston)
Date: Wed, 18 Jul 2018 19:48:46 +0200
Subject: [Pandas-dev] Pandas Sprint Recap
In-Reply-To: 
References: <1531501924.11738.110.camel@pietrobattiston.it> <1531780555.15070.13.camel@pietrobattiston.it> <1531787017.15070.24.camel@pietrobattiston.it> <1531818058.15070.31.camel@pietrobattiston.it> <1531898615.3286.8.camel@pietrobattiston.it>
Message-ID: <1531936126.3286.17.camel@pietrobattiston.it>

Il giorno mer, 18/07/2018 alle 13.30 -0400, Wes McKinney ha scritto:
>> These are certainly not mutually exclusive options -- there is room for packages that provide all of these data models (simple dataframes like R, indexed dataframes like pandas, N-D labeled arrays like xarray, and even N-D arrays without labels like NumPy). I do hope that one day all of them can share the same foundation -- that would have major benefits for the ecosystem.
>
> On this subject -- one of the objectives of the next years is to enable dplyr / tidyverse expressions to run atop Arrow-based data frames (in addition to R's native data frames).

You mean Arrow-based R data frames, right? Or are you thinking about a sort of cross-language dplyr?

(My understanding is that in Python we can't go much closer to dplyr than with .pipe(), see the sketch below, but consistent terminology helps users, and certainly dplyr developers have done great work on this)
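For instance, something like the following (clean_names and add_ratio are made-up user functions, just to show the chaining):

import pandas as pd

def clean_names(df):
    # dplyr-ish "verb": normalize column names
    return df.rename(columns=str.lower)

def add_ratio(df):
    # dplyr-ish "mutate": derive a new column
    return df.assign(ratio=df.a / df.b)

df = pd.DataFrame({"A": [1.0, 2.0], "B": [4.0, 5.0]})
result = df.pipe(clean_names).pipe(add_ratio)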
Pietro

From wesmckinn at gmail.com Wed Jul 18 13:56:31 2018
From: wesmckinn at gmail.com (Wes McKinney)
Date: Wed, 18 Jul 2018 13:56:31 -0400
Subject: [Pandas-dev] Pandas Sprint Recap
In-Reply-To: <1531936126.3286.17.camel@pietrobattiston.it>
References: <1531501924.11738.110.camel@pietrobattiston.it> <1531780555.15070.13.camel@pietrobattiston.it> <1531787017.15070.24.camel@pietrobattiston.it> <1531818058.15070.31.camel@pietrobattiston.it> <1531898615.3286.8.camel@pietrobattiston.it> <1531936126.3286.17.camel@pietrobattiston.it>
Message-ID: 

> You mean Arrow-based R data frames, right? Or are you thinking about a sort of cross-language dplyr?

Precisely a cross-language computational system (I've been talking about this publicly for well over 3 years now, e.g. here in April 2015 https://www.slideshare.net/wesm/dataframes-the-good-bad-and-ugly). Same implementation (in C/C++/LLVM), different front end. dplyr already has interfaces to SQL, for example.

On Wed, Jul 18, 2018 at 1:52 PM, Brock Mendel wrote:
> ... and in conclusion, everyone at the sprint learned a valuable lesson about the power of friendship.
>
> The discussion so far seems to point towards making a push towards loose coupling. Regardless of release names or numbers, and regardless of exactly what happens with core.internals, there will be a need for something like the current `pandas.io`, `pandas.plotting`, `khash`, `tslibs`, etc. The more of this we can make independent of core.internals, the more room we'll have to customize/experiment with options discussed above.
>
> On Wed, Jul 18, 2018 at 10:48 AM, Pietro Battiston wrote:
>> Il giorno mer, 18/07/2018 alle 13.30 -0400, Wes McKinney ha scritto:
>>>> These are certainly not mutually exclusive options -- there is room for packages that provide all of these data models (simple dataframes like R, indexed dataframes like pandas, N-D labeled arrays like xarray, and even N-D arrays without labels like NumPy). I do hope that one day all of them can share the same foundation -- that would have major benefits for the ecosystem.
>>>
>>> On this subject -- one of the objectives of the next years is to enable dplyr / tidyverse expressions to run atop Arrow-based data frames (in addition to R's native data frames).
>>
>> You mean Arrow-based R data frames, right? Or are you thinking about a sort of cross-language dplyr?
>>
>> (My understanding is that in Python we can't go much closer to dplyr than with .pipe(), but consistent terminology helps users, and certainly dplyr developers have done great work on this)
>>
>> Pietro

From me at pietrobattiston.it Wed Jul 18 14:02:44 2018
From: me at pietrobattiston.it (Pietro Battiston)
Date: Wed, 18 Jul 2018 20:02:44 +0200
Subject: [Pandas-dev] Pandas Sprint Recap
In-Reply-To: 
References: <1531501924.11738.110.camel@pietrobattiston.it> <1531780555.15070.13.camel@pietrobattiston.it> <1531787017.15070.24.camel@pietrobattiston.it> <1531818058.15070.31.camel@pietrobattiston.it> <1531898615.3286.8.camel@pietrobattiston.it>
Message-ID: <1531936964.3286.19.camel@pietrobattiston.it>

Il giorno mer, 18/07/2018 alle 10.16 -0700, Stephan Hoyer ha scritto:
> [...]
> I certainly find stacking/unstacking useful, but it isn't the only way to manipulate multi-dimensional tabular data. I do think R's tidyverse shows an alternative viable path. Without having used it extensively, it appears to be more consistent and easier to use than pandas.

Probably, but (from the limited knowledge I gathered in the last months) it just doesn't seem as powerful, by far.

> For multi-dimensional data analysis, these days I generally prefer to use xarray (disclaimer: my project) instead of a pandas.MultiIndex. I find it more satisfying to have indexed N-D arrays (in an xarray.Dataset) rather than indexed 2D dataframes.

I certainly find xarray the closest thing to a pandas replacement; the main tradeoff is the single dtype and the waste of memory if dimensions are not aligned, right?

> The way that pandas.DataFrame uses an Index for both row and column labels makes it in some ways similar to the fixed 2D numpy.matrix, which personally I find less useful.

Tastes are tastes, but it is a fact that a 2D DataFrame + MultiIndex offers possibilities that only nD numpy arrays + a lot of manual effort would rival. (That's actually how I used to work before knowing pandas: I would build my own indexes and use them to access numpy arrays)

Please correct me if I'm wrong, but I think that even a Series with MultiIndex is quite close, in terms of manipulation abilities (that is, discarding e.g. efficiency, and cleanness of the API) to an xarray.Dataset.

Pietro

From me at pietrobattiston.it Wed Jul 18 14:08:20 2018
From: me at pietrobattiston.it (Pietro Battiston)
Date: Wed, 18 Jul 2018 20:08:20 +0200
Subject: [Pandas-dev] Pandas Sprint Recap
In-Reply-To: 
References: <1531501924.11738.110.camel@pietrobattiston.it> <1531780555.15070.13.camel@pietrobattiston.it> <1531787017.15070.24.camel@pietrobattiston.it> <1531818058.15070.31.camel@pietrobattiston.it> <1531898615.3286.8.camel@pietrobattiston.it> <1531936126.3286.17.camel@pietrobattiston.it>
Message-ID: <1531937300.3286.21.camel@pietrobattiston.it>

Il giorno mer, 18/07/2018 alle 13.56 -0400, Wes McKinney ha scritto:
>> You mean Arrow-based R data frames, right? Or are you thinking about a sort of cross-language dplyr?
>
> Precisely a cross-language computational system (I've been talking about this publicly for well over 3 years now, e.g. here in April 2015 https://www.slideshare.net/wesm/dataframes-the-good-bad-and-ugly). Same implementation (in C/C++/LLVM), different front end. dplyr already has interfaces to SQL, for example.

_Through R_, right?

Pietro

From wesmckinn at gmail.com Wed Jul 18 14:45:28 2018
From: wesmckinn at gmail.com (Wes McKinney)
Date: Wed, 18 Jul 2018 14:45:28 -0400
Subject: [Pandas-dev] Pandas Sprint Recap
In-Reply-To: <1531937300.3286.21.camel@pietrobattiston.it>
References: <1531501924.11738.110.camel@pietrobattiston.it> <1531780555.15070.13.camel@pietrobattiston.it> <1531787017.15070.24.camel@pietrobattiston.it> <1531818058.15070.31.camel@pietrobattiston.it> <1531898615.3286.8.camel@pietrobattiston.it> <1531936126.3286.17.camel@pietrobattiston.it> <1531937300.3286.21.camel@pietrobattiston.it>
Message-ID: 

On Wed, Jul 18, 2018, 2:08 PM Pietro Battiston wrote:
> Il giorno mer, 18/07/2018 alle 13.56 -0400, Wes McKinney ha scritto:
>>> You mean Arrow-based R data frames, right? Or are you thinking about a sort of cross-language dplyr?
>>
>> Precisely a cross-language computational system (I've been talking about this publicly for well over 3 years now, e.g. here in April 2015 https://www.slideshare.net/wesm/dataframes-the-good-bad-and-ugly). Same implementation (in C/C++/LLVM), different front end. dplyr already has interfaces to SQL, for example.
>
> _Through R_, right?

Replying-all this time

Well, dplyr is an R package. My point was that it was not designed around R-specific semantics per se. This is explained in the slide deck I linked.

> Pietro

From me at pietrobattiston.it Wed Jul 18 15:05:15 2018
From: me at pietrobattiston.it (Pietro Battiston)
Date: Wed, 18 Jul 2018 21:05:15 +0200
Subject: [Pandas-dev] Pandas Sprint Recap
In-Reply-To: 
References: <1531501924.11738.110.camel@pietrobattiston.it> <1531780555.15070.13.camel@pietrobattiston.it> <1531787017.15070.24.camel@pietrobattiston.it> <1531818058.15070.31.camel@pietrobattiston.it> <1531898615.3286.8.camel@pietrobattiston.it> <1531936126.3286.17.camel@pietrobattiston.it> <1531937300.3286.21.camel@pietrobattiston.it>
Message-ID: <1531940715.3286.24.camel@pietrobattiston.it>

Il giorno mer, 18/07/2018 alle 14.45 -0400, Wes McKinney ha scritto:
> On Wed, Jul 18, 2018, 2:08 PM Pietro Battiston wrote:
>> Il giorno mer, 18/07/2018 alle 13.56 -0400, Wes McKinney ha scritto:
>>>> You mean Arrow-based R data frames, right? Or are you thinking about a sort of cross-language dplyr?
>>>
>>> Precisely a cross-language computational system (I've been talking about this publicly for well over 3 years now, e.g. here in April 2015 https://www.slideshare.net/wesm/dataframes-the-good-bad-and-ugly). Same implementation (in C/C++/LLVM), different front end. dplyr already has interfaces to SQL, for example.
>>
>> _Through R_, right?
>
> Replying-all this time
>
> Well, dplyr is an R package. My point was that it was not designed around R-specific semantics per se. This is explained in the slide deck I linked.

I think clearly distinguishing API problems/solutions from implementation problems/solutions can only help this discussion (and I don't just mean this thread).

Your slides describe a nice plan for what concerns solutions. But my limited understanding is that the dplyr _syntax_ is more innovative than the dplyr _semantics_, from which pandas doesn't have that much to learn. Then, sure, a shared codebase is cool.

Pietro

From jbrockmendel at gmail.com Wed Jul 18 13:52:18 2018
From: jbrockmendel at gmail.com (Brock Mendel)
Date: Wed, 18 Jul 2018 10:52:18 -0700
Subject: [Pandas-dev] Pandas Sprint Recap
In-Reply-To: <1531936126.3286.17.camel@pietrobattiston.it>
References: <1531501924.11738.110.camel@pietrobattiston.it> <1531780555.15070.13.camel@pietrobattiston.it> <1531787017.15070.24.camel@pietrobattiston.it> <1531818058.15070.31.camel@pietrobattiston.it> <1531898615.3286.8.camel@pietrobattiston.it> <1531936126.3286.17.camel@pietrobattiston.it>
Message-ID: 

... and in conclusion, everyone at the sprint learned a valuable lesson about the power of friendship.

The discussion so far seems to point towards making a push towards loose coupling. Regardless of release names or numbers, and regardless of exactly what happens with core.internals, there will be a need for something like the current `pandas.io`, `pandas.plotting`, `khash`, `tslibs`, etc.
The more of this we can make independent of core.internals, the more room we'll have to customize/experiment with options discussed above.

On Wed, Jul 18, 2018 at 10:48 AM, Pietro Battiston wrote:
> Il giorno mer, 18/07/2018 alle 13.30 -0400, Wes McKinney ha scritto:
>>> These are certainly not mutually exclusive options -- there is room for packages that provide all of these data models (simple dataframes like R, indexed dataframes like pandas, N-D labeled arrays like xarray, and even N-D arrays without labels like NumPy). I do hope that one day all of them can share the same foundation -- that would have major benefits for the ecosystem.
>>
>> On this subject -- one of the objectives of the next years is to enable dplyr / tidyverse expressions to run atop Arrow-based data frames (in addition to R's native data frames).
>
> You mean Arrow-based R data frames, right? Or are you thinking about a sort of cross-language dplyr?
>
> (My understanding is that in Python we can't go much closer to dplyr than with .pipe(), but consistent terminology helps users, and certainly dplyr developers have done great work on this)
>
> Pietro

From irv at princeton.com Wed Jul 18 15:05:42 2018
From: irv at princeton.com (Irv Lustig)
Date: Wed, 18 Jul 2018 15:05:42 -0400
Subject: [Pandas-dev] Pandas Sprint Recap
Message-ID: 

> Stephan Hoyer wrote:
>
> On Tue, Jul 17, 2018 at 3:47 PM Matthew Rocklin wrote:
>> Has Pandas ever done a user survey?
>>
>> I would be curious to know the answer to the question "do you make heavy use of the Pandas index" among users, and how that correlates with different domain/industry.
>
> This is a great question. I don't think we've ever done this sort of research.
>
> My suspicion is that most of the time users ignore the index, and find the way it is used heavily in pandas more annoying than helpful. But certainly there are some use-cases for which automatic alignment with an index is fantastic.

For our team, we heavily use the MultiIndex capability for rows (but not columns). Our main use of pandas is to read in data from disparate data sources, and do data wrangling to reshape the data. We do lots of joins/merges of different DataFrames, and placing the keys in a MultiIndex makes it easier to track the join operations.

From our perspective, the MultiIndex on rows is akin to the primary keys of a data table. As we explore data, being able to slice the data along various dimensions is quite valuable (see the small example below). It is also quite natural that a groupby() operation returns a Series or DataFrame with a MultiIndex.

What I find a bit frustrating is the lack of symmetry in the API between dealing with the names of a MultiIndex and the names of a column. It's why I created this pull request (https://github.com/pandas-dev/pandas/pull/20046 [ENH: Allow rename_axis to specify index and columns arguments]) and opened this issue https://github.com/pandas-dev/pandas/issues/20421 [API: Allow MultiIndex.rename() to accept a dict as an argument]
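For example, the kind of slicing I mean, on a toy frame with made-up keys:

import pandas as pd

df = pd.DataFrame({
    "region": ["east", "east", "west"],
    "product": ["widget", "gadget", "widget"],
    "sales": [10, 20, 30],
}).set_index(["region", "product"]).sort_index()

# select one product across all regions, via the row MultiIndex:
df.loc[pd.IndexSlice[:, "widget"], :]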
From wesmckinn at gmail.com  Wed Jul 18 22:26:23 2018
From: wesmckinn at gmail.com (Wes McKinney)
Date: Wed, 18 Jul 2018 22:26:23 -0400
Subject: [Pandas-dev] Pandas Sprint Recap
In-Reply-To:
References:
Message-ID:

Just as an aside on the pandas internals discussion, want to draw
attention to two projects in development right now with the objective
of providing faster/more scalable pandas-type operations:

https://github.com/h2oai/datatable
https://github.com/maartenbreddels/vaex

Both utilize memory-mapped data representations of their own devising
(vs. using an open standard of some kind).

- Wes

On Wed, Jul 18, 2018 at 3:05 PM, Irv Lustig wrote:
> [...]
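For readers unfamiliar with the technique: a memory-mapped
representation keeps the data in a file and lets the operating system
page it in on demand, instead of loading everything into RAM. A toy
sketch with NumPy -- the file name is invented, and the two projects
above use their own formats rather than np.memmap:

import numpy as np

# Create a file-backed array; only touched pages are ever resident.
mm = np.memmap("demo.dat", dtype="float64", mode="w+", shape=(1000,))
mm[:] = np.arange(1000)
mm.flush()

# Reopening maps the same bytes read-only, with no upfront copy.
ro = np.memmap("demo.dat", dtype="float64", mode="r", shape=(1000,))
print(ro[:5])  # [0. 1. 2. 3. 4.]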
From jorisvandenbossche at gmail.com  Thu Jul 19 02:40:05 2018
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Thu, 19 Jul 2018 01:40:05 -0500
Subject: [Pandas-dev] GroupBy Overhaul Proposal
In-Reply-To: <1531897282.3286.6.camel@pietrobattiston.it>
References: <1531782675.15070.18.camel@pietrobattiston.it>
 <09F05CAB-B491-416B-AC63-486C9AD132CD@icloud.com>
 <1531814373.15070.27.camel@pietrobattiston.it>
 <1531897282.3286.6.camel@pietrobattiston.it>
Message-ID:

Will, thanks for starting this!
(After the sprint I was also thinking about the need to refactor the
groupby code :-))

Lots of discussion has happened, and it will need some time to digest,
but I already quickly want to react on the 'apply' discussion. IMO,
apply should basically be syntactic sugar for the following:

keys = []
results = []
for name, group in df.groupby(key):
    res = func(group)
    results.append(res)
    keys.append(name)
pd.concat(results, keys=keys)

(much simplified of course, as when the result for each group is a
Series and not a DataFrame, the default concat is not what we want)

And I personally think it is useful having something like the above as
a general apply method for UDFs in groupby. It is certainly true that
the current apply implementation has inconsistencies and magical
behaviours, but I think we can deprecate those instead of deprecating
the full method. See https://github.com/pandas-dev/pandas/issues/13056
for some comments about this (e.g. on deprecating the magical
'transform' behaviour).

Apart from that, it still is a fact that a user who doesn't know all
the details will quickly turn to apply (rather than to agg), just
because of its name, and then suffer e.g. bad performance. I am not
directly sure how to solve this. We could maybe warn in certain
obvious cases (like apply(np.sum))? Although warnings can also become
annoying.

Joris
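A runnable version of the sugar sketched above, with made-up data: for
a UDF that reduces each group to a scalar, agg and apply land in the
same place, but agg states the intent (and takes the explicit path)
while apply goes through the generic split-and-concat loop.

import pandas as pd

df = pd.DataFrame({"a": ["x", "x", "y"], "b": [1, 2, 3]})

def total(group):
    # a scalar-reducing UDF: receives each group as a Series
    return group.sum()

# agg: the explicit spelling for scalar-reducing UDFs
print(df.groupby("a")["b"].agg(total))    # x -> 3, y -> 3

# apply: reaches the same result via the generic concat loop
print(df.groupby("a")["b"].apply(total))  # x -> 3, y -> 3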
2018-07-18 2:01 GMT-05:00 Pietro Battiston :

> On Tue 17/07/2018 at 16.10 -0700, William Ayd wrote:
> > > In fact, my preference for keeping apply is pretty weak as long as
> > > there are alternatives that cover each of its use cases. But again,
> > > I'm not sure this is true.
> >
> > Just to clarify my position:
> >
> > 1. .apply() + UDF reducing to a scalar should be replaceable
> > with .agg() + same UDF (even though there are differences today?)
> > 2. .apply() + UDF returning Series / DataFrame / collection
> > doesn't have anything else to cover it
>
> .transform() at least covers the case in which the shape of the chunk
> is unchanged.
>
> > But with #2 above I think its dangerous to assume that .apply can
> > always do the "right thing" with those types of inputs. We don't make
> > any assertions about the indexing / labeling of returned Series and
> > DataFrames.
>
> There is a simple way to stop throwing magic at users, and it is to
> clearly document which cases .apply() covers (and which should be
> covered by .agg() or .transform()), reflecting the actual guesswork
> taking place in the code.
> By the way, my understanding (without having looked at the code) is
> that
>
> UDF returns Series -> concat in a new Series
> UDF returns DataFrame -> concat in a new DataFrame
>
> and the guesswork mostly concerns understanding whether the new index
> is the same as the old. Am I missing anything relevant?
>
> Now, I would be all for suppressing a complicated function by
> replacing it with simpler ways to do the same thing. But for instance
> I would like the following to still work with groupby().something():
>
> def remove_group_outliers(group):
>     outliers = ...  # code to identify the outlier labels
>     return group[~group.index.isin(outliers)]
>
> ... and I currently don't see any way but .apply().
>
> > As far as collections are concerned I'm not sure if there will be a
> > clear answer on how to handle those assuming we start getting EAs
> > that add first-class support for those.
>
> Do you have any pointer/example? I'm missing the relation between
> collections and .apply().
>
> > > Unless I'm wrong, #18366 is orthogonal to what we are discussing:
> > > unnamed lambdas would remain unnamed lambdas.
> > > (And the obvious solution to my eyes is to use named methods
> > > instead)
> >
> > I don't think this is orthogonal. Your concern is valid on lambdas
> > and I don't know what the solution there is (perhaps some kind of
> > keyword argument) but without getting tripped up on that I don't
> > think its immediately apparent that the returned object for a
> > DataFrame with columns 'a', 'b', 'c' will have a single column level
> > when called as follows:
> >
> > - df.groupby('a').agg(sum)
> > - df.groupby('a').agg({'b': sum, 'c': min})
> >
> > Yet the following will yield a MultiIndex column:
> >
> > - df.groupby('a').agg([sum])
> > - df.groupby('a').agg({'b': [sum], 'c': min})
>
> The rule is not very complicated either (if correctly documented), but
> anyway, the inconsistency would disappear by just having the first two
> examples also return a MultiIndex.
>
> ... and maybe provide the users a very simple way to flatten
> MultiIndexes (see below).
>
> > If you reduce the returned columns to "'sum' of 'b'" and "'min' of
> > 'c'" you can ensure that the returned columns have the same number
> > of levels regardless of call signature,
> > AND have the added bonus of not obfuscating what type of aggregation
> > was performed with the former two examples.
>
> Both can be solved through a MI, or through an Index(dtype=object)
> containing tuples.
>
> > Of course the end user may ultimately decide that they don't like
> > those labels at all and completely override them, but that effort
> > becomes much easier if they can make guarantees around the number of
> > levels of the returned object
>
> I agree on this
>
> > (especially if it's just one!).
>
> ... not on that.
>
> MI (or tuples) -> arbitrary strings
>
> is much simpler/cleaner to do than
>
> arbitrary strings -> MI (or tuples)
>
> > > - if, after creating all my columns, I want to e.g. select all
> > > columns that contain sums, I need to do some sort of "df[[col if
> > > col.startswith("Sum of")]]". Compare to "df.loc[:, ('Sum',)]"
> >
> > Unless I am mistaken you would have to do something like
> > "df.groupby('a').agg([sum]).loc[:, slice(None, 'sum')]" to get that
> > to work.
>
> Yeah, I had swapped the levels, it is
>
> df.groupby('a').agg([sum]).loc[:, (slice(None), 'sum')]
>
> > I don't think that syntax really is that clean
>
> In my code I always start by defining
>
> WE = slice(None)  # WhatEver
>
> and we could advertise this as a way to make the syntax shorter, but
> regardless of that, it definitely is cleaner than any string
> manipulation.
>
> > and it starts taking us down the path of advanced indexing for what
> > may start off to the end user as a very simple aggregation exercise.
>
> On this I agree with you. I'm all for providing
>
> - a MultiIndex.flatten() method which allows me to do
>   res.columns = res.columns.flatten("{} of {}".format)
>
> - a simple way to do the above in-line (which is already being
>   discussed, regardless of groupby)
>
> [...]
>
> > > - it would be the only case in pandas in which we decide how to
> > > call a column on behalf of the user
> >
> > Well we have to do something to reduce ambiguity; I think a
> > consistent naming convention and dimension for the columns across
> > all invocations is strongly preferable to inserting a column level
> > some of the time.
>
> Again, I agree on this.
> > > - if one wants to allow the user to name the columns according to
> > > her taste, it's pretty simple to introduce an argument which takes
> > > a string to be .format()ted with the name of the column (or even
> > > of the method), e.g. name="Sum of {}"
> >
> > Agreed. In my head I feel like this defaults to something like
> > f"{fname} of {colname}" but gives the user potentially the option to
> > override. By default keep the same number of levels as what is being
> > passed in, though maybe None as an argument reverts to the old style
> > behavior of simply inserting a new column index level.
>
> Agree on everything but the default, again, because it is arbitrary
>
> > > By the way, despite some related issues, I still think tuples can
> > > be first class citizens of flat indexes. So if one doesn't like
> > > MultiIndexes, or they do not fit one's needs, ("sum", "A") can
> > > well be a label in a regular index.
> >
> > You know better than I do here, but again I don't think it makes for
> > a good user experience to convert columns with one level into
> > multiple levels after a GroupBy operation regardless of how you
> > could subsequently access those values.
>
> Notice that I'm not talking about a MultiIndex, but about a flat
> index. But it is an inferior solution, given the API we already
> expose, to the MultiIndex.
>
> Pietro
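Both halves of this exchange are expressible today, for what it's
worth (the data below is hypothetical): pd.IndexSlice spells the
slice(None) selection more readably, and a list comprehension stands
in for the proposed MultiIndex.flatten().

import pandas as pd

df = pd.DataFrame({"a": ["x", "x", "y"], "b": [1, 2, 3],
                   "c": [4.0, 5.0, 6.0]})
res = df.groupby("a").agg(["sum", "min"])  # MultiIndex columns: (col, agg)

# Select all the 'sum' columns without writing slice(None) by hand:
sums = res.loc[:, pd.IndexSlice[:, "sum"]]

# Today's workaround for the flatten() idea: "<agg> of <col>" strings
res.columns = ["{} of {}".format(agg, col) for col, agg in res.columns]
print(res.columns.tolist())  # ['sum of b', 'min of b', 'sum of c', 'min of c']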
From me at pietrobattiston.it  Thu Jul 19 11:17:41 2018
From: me at pietrobattiston.it (Pietro Battiston)
Date: Thu, 19 Jul 2018 17:17:41 +0200
Subject: [Pandas-dev] Colon available everywhere
In-Reply-To: <1531897282.3286.6.camel@pietrobattiston.it>
References: <1531782675.15070.18.camel@pietrobattiston.it>
 <09F05CAB-B491-416B-AC63-486C9AD132CD@icloud.com>
 <1531814373.15070.27.camel@pietrobattiston.it>
 <1531897282.3286.6.camel@pietrobattiston.it>
Message-ID: <1532013461.3286.45.camel@pietrobattiston.it>

On Wed 18/07/2018 at 09.01 +0200, Pietro Battiston wrote:
> On Tue 17/07/2018 at 16.10 -0700, William Ayd wrote:
> > > - if, after creating all my columns, I want to e.g. select all
> > > columns that contain sums, I need to do some sort of "df[[col if
> > > col.startswith("Sum of")]]". Compare to "df.loc[:, ('Sum',)]"
> >
> > Unless I am mistaken you would have to do something like
> > "df.groupby('a').agg([sum]).loc[:, slice(None, 'sum')]" to get that
> > to work.
>
> Yeah, I had swapped the levels, it is
>
> df.groupby('a').agg([sum]).loc[:, (slice(None), 'sum')]
>
> > I don't think that syntax really is that clean
>
> In my code I always start by defining
>
> WE = slice(None)  # WhatEver
>
> and we could advertise this as a way to make the syntax shorter, but
> regardless of that, it definitely is cleaner than any string
> manipulation.

Related to this, I'm curious about some opinion from pandas devs on an
idea which I think would simplify our users' life (and by that, I don't
only mean current users of the current pandas API) at (almost) no cost.

The colon in Python is used for:

1) logical blocks:
   if True:

2) separating args and body of a lambda:
   lambda x: x**2

3) assignment expressions (since 3.8):
   if (a := True):

4) separating key and value in a dict:
   {1: 'a'}

5) defining slices:
   a_series.loc['2018-06-01':'2018-07-03']

The last example is entirely indistinguishable from
a_series.loc[slice('2018-06-01', '2018-07-03')]
... but unfortunately, it only works inside __getitem__ calls.

My idea is: there is no obvious reason why it should be so, that is,
why

'2018-06-01':'2018-07-03'

couldn't just be parsed as slice('2018-06-01', '2018-07-03').

The alternative uses 1)-4) of the colon imply that some precaution must
be taken, but:

1) should not create ambiguity, as the ":" is always matched with a
control flow statement

2) should not create ambiguity, as the ":" is always matched with the
"lambda" keyword

3) should not create ambiguity, as the ":" is always present close to
"=", while the "slice interpretation" of ":" would never appear (unless
nested) in the left part of an assignment

4) is the only potentially problematic case, as
{2 : 3}
could be interpreted as
{slice(2, 3)}
but is currently interpreted as
dict([(2, 3)])

However, the solution could be to just prioritize the current
interpretation, and use
{(2 : 3)}
to force the second.

If this proposal was implemented,

df.loc[:, (slice(None), 'sum')]

would finally just become

df.loc[:, (:, 'sum')]

at the cost of a minimal ambiguity (in the case shown above), which is
easy to solve (and no more grave, I guess, than the fact that {} is an
empty dict and not an empty set).

For Python beginners, it would probably even simplify the understanding
of slices (today, it is not trivial, I think, to understand that obj[:]
is exactly equivalent to obj[slice(None)] - but that ":" does not per
se mean anything).
Moreover, it would mimic "...", which is instead available also outside
of __getitem__ calls.

Would it be crazy to propose a PEP with this?

A milder form would be to allow ":" to be used only inside __getitem__
calls, but also nested: I think however this would be more confusing
and probably more difficult to implement.

Thoughts?

Pietro

From me at pietrobattiston.it  Thu Jul 19 11:41:40 2018
From: me at pietrobattiston.it (Pietro Battiston)
Date: Thu, 19 Jul 2018 17:41:40 +0200
Subject: [Pandas-dev] Pandas Sprint Recap
In-Reply-To:
References: <1531501924.11738.110.camel@pietrobattiston.it>
 <1531780555.15070.13.camel@pietrobattiston.it>
 <1531787017.15070.24.camel@pietrobattiston.it>
Message-ID: <1532014900.3286.50.camel@pietrobattiston.it>

By the way...

On Mon 16/07/2018 at 18.14 -0700, Stephan Hoyer wrote:
> [...]
> 2. The indexed pandas.Series and pandas.DataFrame isn't the right
> abstraction for many tasks. A simpler, index free DataFrame would be
> a better data model for many tasks.

Is there an issue open already for this? Otherwise, I think we could
create it, at least to have a reference target for decoupling index
from storage.

Pietro
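NumPy already ships a small version of the "slice literal" idea this
thread circles around: np.s_ (an IndexExpression) is an object whose
__getitem__ simply hands the key back, so subscript syntax can be
captured as plain objects.

import numpy as np

print(np.s_[::-1])        # slice(None, None, -1)
print(np.s_[:, 0, ::-1])  # (slice(None, None, None), 0, slice(None, None, -1))

arr = np.arange(6).reshape(2, 3)
print(arr[np.s_[:, ::-1]])  # identical to arr[:, ::-1]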
From shoyer at gmail.com  Thu Jul 19 12:33:41 2018
From: shoyer at gmail.com (Stephan Hoyer)
Date: Thu, 19 Jul 2018 09:33:41 -0700
Subject: [Pandas-dev] Colon available everywhere
In-Reply-To: <1532013461.3286.45.camel@pietrobattiston.it>
References: <1531782675.15070.18.camel@pietrobattiston.it>
 <09F05CAB-B491-416B-AC63-486C9AD132CD@icloud.com>
 <1531814373.15070.27.camel@pietrobattiston.it>
 <1531897282.3286.6.camel@pietrobattiston.it>
 <1532013461.3286.45.camel@pietrobattiston.it>
Message-ID:

I'm pretty sure this has been proposed before on Python-ideas.
Definitely search through the archives first.

Another option I liked that involved no changes to Python syntax would
be to make indexing the built-in slice class return a slice object,
e.g., slice[:5] -> slice(None, 5, None). But if I recall correctly that
had been shot down, too.

On Thu, Jul 19, 2018 at 8:17 AM Pietro Battiston wrote:
> [...]
From cbartak at gmail.com  Thu Jul 19 14:31:46 2018
From: cbartak at gmail.com (Chris Bartak)
Date: Thu, 19 Jul 2018 13:31:46 -0500
Subject: [Pandas-dev] Colon available everywhere
In-Reply-To:
References: <1531782675.15070.18.camel@pietrobattiston.it>
 <09F05CAB-B491-416B-AC63-486C9AD132CD@icloud.com>
 <1531814373.15070.27.camel@pietrobattiston.it>
 <1531897282.3286.6.camel@pietrobattiston.it>
 <1532013461.3286.45.camel@pietrobattiston.it>
Message-ID:

I think this has also been discussed in forums too, but another syntax
possibility would be expanding what is accepted inside __getitem__.
Ignoring backwards compat for a second (can of worms how
`*args`/existing tuple key behavior would interact), one could envision
something roughly like this, which could also solve the named indexer
problem (proposed syntax, not valid Python today):

class A:
    def __getitem__(self, *args, **kwargs): print(args, kwargs)
a = A()

a[1, 2]
# (1, 2), {}

a[1, 2, b=3]
# (1, 2), {'b': 3}

a[1, 2, (:, 2), c=3]
# (1, 2, (slice(None), 2)), {'c': 3}

On Thu, Jul 19, 2018 at 11:34 AM Stephan Hoyer wrote:
> [...]
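For context on why this needs a language change: today everything
between the brackets reaches __getitem__ as a single positional key,
and keyword arguments there are a syntax error. A short demonstration:

class Probe:
    def __getitem__(self, key):
        # the whole subscript arrives as one object
        return key

p = Probe()
print(p[1, 2])       # (1, 2) -- a plain tuple
print(p[1, :2])      # (1, slice(None, 2, None))
print(p[..., ::-1])  # (Ellipsis, slice(None, None, -1))
# p[1, b=3] is a SyntaxError today -- hence the PEP 472 discussion.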
From shoyer at gmail.com  Thu Jul 19 14:46:03 2018
From: shoyer at gmail.com (Stephan Hoyer)
Date: Thu, 19 Jul 2018 11:46:03 -0700
Subject: [Pandas-dev] Colon available everywhere
In-Reply-To:
References: <1531782675.15070.18.camel@pietrobattiston.it>
 <09F05CAB-B491-416B-AC63-486C9AD132CD@icloud.com>
 <1531814373.15070.27.camel@pietrobattiston.it>
 <1531897282.3286.6.camel@pietrobattiston.it>
 <1532013461.3286.45.camel@pietrobattiston.it>
Message-ID:

Yes, I'd love to see *args and **kwargs for __getitem__, but that's a
much bigger change. See also https://www.python.org/dev/peps/pep-0472/

On Thu, Jul 19, 2018 at 11:31 AM Chris Bartak wrote:
> [...]
From me at pietrobattiston.it  Thu Jul 19 16:27:08 2018
From: me at pietrobattiston.it (Pietro Battiston)
Date: Thu, 19 Jul 2018 22:27:08 +0200
Subject: [Pandas-dev] Colon available everywhere
In-Reply-To:
References: <1531782675.15070.18.camel@pietrobattiston.it>
 <09F05CAB-B491-416B-AC63-486C9AD132CD@icloud.com>
 <1531814373.15070.27.camel@pietrobattiston.it>
 <1531897282.3286.6.camel@pietrobattiston.it>
 <1532013461.3286.45.camel@pietrobattiston.it>
Message-ID: <1532032028.3286.52.camel@pietrobattiston.it>

On Thu 19/07/2018 at 09.33 -0700, Stephan Hoyer wrote:
> I'm pretty sure this has been proposed before on Python-ideas.
> Definitely search through the archives first.

The closest I found is
https://mail.python.org/pipermail/python-ideas/2015-June/034086.html
https://bugs.python.org/issue24379

proposing the less invasive (not requiring changes in the language),
but also less useful

slice.literal

as in

reverse = slice.literal[::-1]

By the way, if only slice was subclassable, we could do
class PowerSlice(slice):  # not actually allowed: slice cannot be subclassed
    def __getitem__(self, key):
        return key

W = PowerSlice()

so that both

df.loc[W, 'col']

and

df.loc[W[:], 'col']

would work. Unfortunately that is not the case (and anyway, this
solution would still be suboptimal).

Pietro

From shoyer at gmail.com  Thu Jul 19 16:35:20 2018
From: shoyer at gmail.com (Stephan Hoyer)
Date: Thu, 19 Jul 2018 13:35:20 -0700
Subject: [Pandas-dev] Colon available everywhere
In-Reply-To: <1532032028.3286.52.camel@pietrobattiston.it>
References: <1531782675.15070.18.camel@pietrobattiston.it>
 <09F05CAB-B491-416B-AC63-486C9AD132CD@icloud.com>
 <1531814373.15070.27.camel@pietrobattiston.it>
 <1531897282.3286.6.camel@pietrobattiston.it>
 <1532013461.3286.45.camel@pietrobattiston.it>
 <1532032028.3286.52.camel@pietrobattiston.it>
Message-ID:

I'd be pretty happy getting even operator.subscript into Python 3.8.
subscript[:, 0, ::-1] is *way* more readable than (slice(None), 0,
slice(None, None, -1)).

My sense is that the commentators on the Python bug don't work with
multi-dimensional arrays, so they don't appreciate how ubiquitous the
need for this is. It would be really nice to have a standard utility
for this, rather than needing to rely on the separate utilities in
pandas and NumPy.

On Thu, Jul 19, 2018 at 1:27 PM Pietro Battiston wrote:
> [...]

From me at pietrobattiston.it  Thu Jul 19 16:43:59 2018
From: me at pietrobattiston.it (Pietro Battiston)
Date: Thu, 19 Jul 2018 22:43:59 +0200
Subject: [Pandas-dev] Colon available everywhere
In-Reply-To:
References: <1531782675.15070.18.camel@pietrobattiston.it>
 <09F05CAB-B491-416B-AC63-486C9AD132CD@icloud.com>
 <1531814373.15070.27.camel@pietrobattiston.it>
 <1531897282.3286.6.camel@pietrobattiston.it>
 <1532013461.3286.45.camel@pietrobattiston.it>
 <1532032028.3286.52.camel@pietrobattiston.it>
Message-ID: <1532033039.3286.55.camel@pietrobattiston.it>

On Thu 19/07/2018 at 13.35 -0700, Stephan Hoyer wrote:
> I'd be pretty happy getting even operator.subscript into Python 3.8.
> subscript[:, 0, ::-1] is way more readable than (slice(None), 0,
> slice(None, None, -1)).

I totally agree... but isn't

(:, 0, ::-1)

even better?
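The operator.subscript (or slice.literal) behavior being discussed can
already be emulated in a few lines of current Python; this is
essentially what pd.IndexSlice and np.s_ do under the hood:

class _Subscript:
    def __getitem__(self, key):
        # hand the raw subscript back unchanged
        return key

subscript = _Subscript()
assert subscript[5:] == slice(5, None)
assert subscript[:, 0, ::-1] == (slice(None), 0, slice(None, None, -1))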
Pietro

From shoyer at gmail.com  Thu Jul 19 17:02:23 2018
From: shoyer at gmail.com (Stephan Hoyer)
Date: Thu, 19 Jul 2018 14:02:23 -0700
Subject: [Pandas-dev] Colon available everywhere
In-Reply-To: <1532033039.3286.55.camel@pietrobattiston.it>
References: <1531782675.15070.18.camel@pietrobattiston.it>
 <09F05CAB-B491-416B-AC63-486C9AD132CD@icloud.com>
 <1531814373.15070.27.camel@pietrobattiston.it>
 <1531897282.3286.6.camel@pietrobattiston.it>
 <1532013461.3286.45.camel@pietrobattiston.it>
 <1532032028.3286.52.camel@pietrobattiston.it>
 <1532033039.3286.55.camel@pietrobattiston.it>
Message-ID:

Sure -- but good luck persuading mainstream Python devs on that one!

On Thu, Jul 19, 2018 at 1:44 PM Pietro Battiston wrote:
> I totally agree... but isn't
>
> (:, 0, ::-1)
>
> even better?
>
> Pietro

From me at pietrobattiston.it  Thu Jul 19 17:33:58 2018
From: me at pietrobattiston.it (Pietro Battiston)
Date: Thu, 19 Jul 2018 23:33:58 +0200
Subject: [Pandas-dev] Colon available everywhere
In-Reply-To:
References: <1531782675.15070.18.camel@pietrobattiston.it>
 <09F05CAB-B491-416B-AC63-486C9AD132CD@icloud.com>
 <1531814373.15070.27.camel@pietrobattiston.it>
 <1531897282.3286.6.camel@pietrobattiston.it>
 <1532013461.3286.45.camel@pietrobattiston.it>
 <1532032028.3286.52.camel@pietrobattiston.it>
 <1532033039.3286.55.camel@pietrobattiston.it>
Message-ID: <1532036038.3286.59.camel@pietrobattiston.it>

On Thu 19/07/2018 at 14.02 -0700, Stephan Hoyer wrote:
> Sure -- but good luck persuading mainstream Python devs on that one!

I would never do this alone :-)

Unless I'm wrong, numpy devs were able to obtain the ellipsis, and the
operator "@".

The proposal on ":" can maybe be supported with more general arguments
(Python already has ":", and making it more widely available seems
natural), but it clearly only makes sense if a community is behind the
request.

Pietro