From tom.augspurger88 at gmail.com Fri Jul 6 08:40:22 2018
From: tom.augspurger88 at gmail.com (Tom Augspurger)
Date: Fri, 6 Jul 2018 07:40:22 -0500
Subject: [Pandas-dev] ANN: Pandas 0.23.2 Released

Hi all,

I'm happy to announce that pandas 0.23.2 has been released. This is a minor
bug-fix release in the 0.23.x series and includes some regression fixes, bug
fixes, and performance improvements. We recommend that all users upgrade to
this version. See the full whatsnew for a list of all the changes.

The release can be installed with conda from the default channel and
conda-forge:

    conda install pandas

Or via PyPI:

    python -m pip install --upgrade pandas

A total of 17 people contributed to this release. People with a "+" by their
names contributed a patch for the first time.

- David Krych
- Jacopo Rota +
- Jeff Reback
- Jeremy Schendel
- Joris Van den Bossche
- Kalyan Gokhale
- Matthew Roeschke
- Michael Odintsov +
- Ming Li
- Pietro Battiston
- Tom Augspurger
- Uddeshya Singh
- Vu Le +
- alimcmaster1 +
- david-liu-brattle-1 +
- gfyoung
- jbrockmendel

From tom.w.augspurger at gmail.com Sun Jul 8 17:26:58 2018
From: tom.w.augspurger at gmail.com (Tom Augspurger)
Date: Sun, 8 Jul 2018 16:26:58 -0500
Subject: [Pandas-dev] Pandas Sprint Recap

Hi all,

This week, a group of the pandas maintainers sat down in Austin to talk
through the status of the project, and its future direction.

I've posted a document on our wiki with a summary of the topics discussed.
https://github.com/pandas-dev/pandas/wiki/Pandas-Sprint-(July,-2018)

If people have questions or comments, feel free to post here and we'll
clarify that document.

Tom

From wesmckinn at gmail.com Mon Jul 9 14:25:47 2018
From: wesmckinn at gmail.com (Wes McKinney)
Date: Mon, 9 Jul 2018 14:25:47 -0400
Subject: [Pandas-dev] Pandas Sprint Recap

Thanks Tom! To all readers: please have a look. We are really interested in
the community's feedback on the upcoming roadmap and the future work the
team is contemplating

On Sun, Jul 8, 2018 at 5:26 PM, Tom Augspurger wrote:
> [...]

From shoyer at gmail.com Mon Jul 9 14:41:50 2018
From: shoyer at gmail.com (Stephan Hoyer)
Date: Mon, 9 Jul 2018 11:41:50 -0700
Subject: [Pandas-dev] Pandas Sprint Recap

Hi Wes and Tom,

I'm sorry I missed the sprint, but this looks like a great plan!

My main concern would be figuring out how to smoothly transition from
exposing the NumPy API directly to users to a separate abstraction layer.
The "mixed" semantics of dtype and values in pandas (sometimes matching
NumPy, sometimes not) is highly confusing, especially with incremental
changes for data types in new releases of pandas. There is still a lot of
code that relies on the internal representation of pandas.Series as a
wrapper over a NumPy array.
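To illustrate the kind of inconsistency I mean (a minimal sketch with a
made-up toy Series, not exhaustive):

    import numpy as np
    import pandas as pd

    # Some Series carry plain NumPy dtypes and ndarray values...
    s1 = pd.Series([1, 2, 3])
    isinstance(s1.dtype, np.dtype)  # True
    type(s1.values)                 # numpy.ndarray

    # ...while others carry pandas extension dtypes that NumPy knows
    # nothing about
    s2 = pd.Series(pd.Categorical(["a", "b", "a"]))
    isinstance(s2.dtype, np.dtype)  # False
    type(s2.values)                 # pandas.Categorical

Users have to know which regime they are in before touching .dtype or
.values.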
I wonder if it would be possible to do a hard break in 1.0 that switches
everything over to a pandas dtype system, even if it is currently just a
wrapper over NumPy. Or maybe that would be too large of a change to do right
now, and it would be enough to merely stabilize the set of numpy vs pandas
dtypes for the 1.x series.

Cheers,
Stephan

On Mon, Jul 9, 2018 at 11:26 AM Wes McKinney wrote:
> [...]

From jorisvandenbossche at gmail.com Mon Jul 9 15:20:51 2018
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Mon, 9 Jul 2018 14:20:51 -0500
Subject: [Pandas-dev] Pandas Sprint Recap

2018-07-09 13:41 GMT-05:00 Stephan Hoyer:

> My main concern would be figuring out how to smoothly transition from
> exposing the NumPy API directly to users to a separate abstraction layer.
> [...]
>
> I wonder if it would be possible to do a hard break in 1.0 that switches
> everything over to a pandas dtype system, even if it is currently just a
> wrapper over NumPy.

That's a good point, and I agree that the mixture of numpy dtypes and pandas
dtypes (which are not fully compatible with each other) is confusing. And in
the long term I also think we should probably have our own pandas dtype
system (one that hopefully interoperates better with other dtype systems by
that time), so that we have a consistent user experience within pandas.

But I personally don't think we should do that for 1.0. Of course we can
choose to keep on going with the 0.x releases for another few years, but
personally I would rather go for a 1.0 that basically is the current state
of pandas (with some clean-ups like removing Panel and deprecated stuff, but
no fundamental changes). Then we can discuss doing such a change for 2.0
regarding the dtypes (redesigning the internal BlockManager to something
simpler will also break some stuff). Designing our own type system and the
functionality around it will take some time.

> Or maybe that would be too large of a change to do right now, and it
> would be enough to merely stabilize the set of numpy vs pandas dtypes for
> the 1.x series.

I don't think we plan to have additional pandas dtypes for now (for 1.0),
except for:

- interval, period and datetimetz are already pandas dtypes, but will get
  more publicly exposed once we allow those to be stored in Series (now
  they only live in an Index; in a Series you see them as object-dtyped
  arrays)
- the integer dtype with NA support, which will only be an experimental
  opt-in feature

I personally think we should keep it at that for the 1.x series.
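(To spell out the Series-vs-Index discrepancy mentioned above -- a small
sketch; behavior as of 0.23:)

    import pandas as pd

    idx = pd.period_range("2018-01", periods=3, freq="M")
    idx.dtype          # period[M] -- a pandas extension dtype on the Index

    s = pd.Series(idx)
    s.dtype            # object -- the Series silently falls back to an
                       # object array of Period scalars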
Cheers,
Joris

From shoyer at gmail.com Mon Jul 9 19:47:05 2018
From: shoyer at gmail.com (Stephan Hoyer)
Date: Mon, 9 Jul 2018 16:47:05 -0700
Subject: [Pandas-dev] Pandas Sprint Recap

On Mon, Jul 9, 2018 at 12:21 PM Joris Van den Bossche wrote:
> I don't think we plan to have additional pandas dtypes for now (for 1.0)
> [...]
>
> I personally think we should keep it at that for the 1.x series.

I agree, this sounds good to me.

From william.ayd at icloud.com Mon Jul 9 20:18:26 2018
From: william.ayd at icloud.com (William Ayd)
Date: Mon, 09 Jul 2018 17:18:26 -0700
Subject: [Pandas-dev] GroupBy Overhaul Proposal

Hi All,

I've been thinking through what a redesigned GroupBy module could look like
in 1.0. The main problems I am trying to address are:

- The current module is relatively convoluted, making contribution and
  debugging challenging
- Behavior is sometimes non-obvious and buggy (see here, here and here as
  some examples) AND
- We violate the mantra of there being "only one obvious way to do things"

Along those lines, here were four things I thought could be of immense
value:
- Removal of the apply method
- Removal of the DataFrameGroupBy and SeriesGroupBy classes
- Explicit default column naming
- Removal of the axis argument

These are easier said than done and admittedly controversial. I've pieced
together my reasoning, and what I think the counter arguments could be, in
the attached document. I'd be curious to hear everyone's feedback.

- Will

-------------- next part --------------
A non-text attachment was scrubbed...
Name: GroupByOverhaul.pdf
Type: application/pdf
Size: 126160 bytes

From jorisvandenbossche at gmail.com Tue Jul 10 18:24:55 2018
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Tue, 10 Jul 2018 17:24:55 -0500
Subject: [Pandas-dev] Welcome Brock Mendel and Marc Garcia to the team

We are happy to announce that Brock Mendel (@jbrockmendel) and Marc Garcia
(@datapythonista) have been added to the pandas core team.

Amongst many other things, Brock has been working a lot on refactoring the
time series related code, and Marc has done amazing work on the
documentation.

Thanks both, and congratulations!

Joris

From wesmckinn at gmail.com Tue Jul 10 18:26:45 2018
From: wesmckinn at gmail.com (Wes McKinney)
Date: Tue, 10 Jul 2018 18:26:45 -0400
Subject: [Pandas-dev] Welcome Brock Mendel and Marc Garcia to the team

Thanks to both of you!

On Tue, Jul 10, 2018 at 6:24 PM, Joris Van den Bossche wrote:
> [...]

From tom.augspurger88 at gmail.com Thu Jul 12 15:06:52 2018
From: tom.augspurger88 at gmail.com (Tom Augspurger)
Date: Thu, 12 Jul 2018 14:06:52 -0500
Subject: [Pandas-dev] Pandas Sprint Recap

Updated the wiki page with an attempt to summarize Stephan's and Joris'
points:
https://github.com/pandas-dev/pandas/wiki/Pandas-Sprint-(July,-2018)/_compare/ac9aaf55348b8c62eb2ddb020d4be1dec6e7896b

On Mon, Jul 9, 2018 at 6:47 PM, Stephan Hoyer wrote:
> [...]
From me at pietrobattiston.it Fri Jul 13 13:12:04 2018
From: me at pietrobattiston.it (Pietro Battiston)
Date: Fri, 13 Jul 2018 19:12:04 +0200
Subject: [Pandas-dev] Pandas Sprint Recap

Hi Tom,

first, thanks to all those who participated in the sprint, and for the
recap.

Il giorno dom, 08/07/2018 alle 16.26 -0500, Tom Augspurger ha scritto:
> [...]
> I've posted a document on our wiki with a summary of the topics
> discussed.
> https://github.com/pandas-dev/pandas/wiki/Pandas-Sprint-(July,-2018)
>
> If people have questions or comments, feel free to post here and
> we'll clarify that document.

Something that scares me - but maybe because I'm missing something obvious -
is what exactly qualifies as "deprecation". Is it something which was once
presented as a distinct feature and is then disabled, or any general change
to what any API call performs (that is, anything requiring a deprecation
cycle)?

There are many bugs - in particular, in indexing code - which might
potentially break existing code when fixed. Some of them will have
non-trivial deprecation paths/detection strategies. The first ones that come
to my mind are #18631, #12827 and #9519. The last one, in particular,
implies changing the result of potentially tons of calls to .loc on a
non-unique index.

My view is that those (and many more, including several that will be found)
will be best fixed through a total rewrite of the indexing code (i.e., all
code in indexing.py, and some code in internals.py), which I assumed would
happen before 1.0, and which I certainly won't be able to do before 0.24.0
(September 2018).

I'm clearly not claiming that nobody else can do it (nor that the bugs can
necessarily only be fixed through a complete rewrite)... but since I did not
get any feedback on
https://github.com/pandas-dev/pandas/wiki/(Tentative)-rules-for-restructuring-indexing-code
... I assume that nobody is focusing/planning to focus on this in the near
future (or was it somehow discussed in the sprint?).

I perfectly understand the desire to stop postponing 1.0 to a vague future,
if it's just a matter of recognizing that pandas is worth using. But if it's
a statement/commitment about code robustness/quality, and relatedly about
API stability... then I think it is risky to leave the indexing API, and
more in general the core codebase (as opposed to important but more lateral
features such as new dtypes), out of the picture (e.g. out of #21894).

Cheers,

Pietro

From tom.augspurger88 at gmail.com Fri Jul 13 13:45:36 2018
From: tom.augspurger88 at gmail.com (Tom Augspurger)
Date: Fri, 13 Jul 2018 12:45:36 -0500
Subject: [Pandas-dev] Pandas Sprint Recap

Thanks Pietro,

We didn't discuss indexing much, beyond agreeing that there's work to be
done, and that fixing it was too large a task for 1.0.

As for whether an individual issue is a bug or a feature, we'll have to
continue using our judgement. I think we'll inevitably break users' code in
a 1.x release as we fix bugs.

We'll need to discuss workflows for these large changes (e.g. ripping out
the block manager) that will be API breaking, but may take some time to
land. Keeping a separate branch in sync is a pain, but may be the least
painful alternative.
One thing I want to reiterate: it's not going to take another 11 years to
reach pandas 2.0 :) Just because we don't solve indexing for 1.0 doesn't
mean we won't ever be able to fix it.

Tom

On Fri, Jul 13, 2018 at 12:12 PM, Pietro Battiston wrote:
> [...]

From wesmckinn at gmail.com Mon Jul 16 13:50:29 2018
From: wesmckinn at gmail.com (Wes McKinney)
Date: Mon, 16 Jul 2018 13:50:29 -0400
Subject: [Pandas-dev] Pandas Sprint Recap

> One thing I want to reiterate: it's not going to take another 11 years to
> reach pandas 2.0 :) Just because we don't solve indexing for 1.0 doesn't
> mean we won't ever be able to fix it.

One point on this that we discussed some in the sprint and during SciPy: to
undertake a major overhaul of pandas, at some point it may require a shift
to a "new codebase".
This could cohabit the same pandas-dev/pandas git repository, which can
serve as a monorepo for several Python package artifacts. This would make it
much easier to refactor out reusable components and share code. The test
suite could also be refactored to be able to run against "future-pandas" and
"pandas" (or whatever we want to call them).

I'm skeptical whether the kinds of significant / breaking changes we've
discussed the last 3 years can happen in an iterative / organic fashion
within the current pandas codebase. I'd like to avoid getting stuck in place
for a decade; if we haven't made much progress toward some of these major
changes by, say, the beginning of 2020 or 2021, we might want to take a step
back and evaluate our situation.

I spent some time at the sprint looking through pandas.core.internals,
pandas.core.generic, and some of the other low level pieces, and my feeling
is that it would be easier to start over.

All of this is made much more difficult by pandas's spartan funding
situation (Joris and Tom supported at ~50% time, rest of maintainers are
volunteers AFAIK).

In the meantime, personally my efforts will continue to be focused on
building portable, front end agnostic, reusable computational libraries for
in-memory computing on large tabular datasets (i.e.
https://www.slideshare.net/wesm/apache-arrow-crosslanguage-development-platform-for-inmemory-data-105427919).
I believe that bootstrapping a much larger community to work on these
problems will reduce our collective maintenance burden (though it is likely
to take a number of years for this to pay off).

- Wes

On Fri, Jul 13, 2018 at 1:45 PM, Tom Augspurger wrote:
> [...]
From me at pietrobattiston.it Mon Jul 16 18:35:55 2018
From: me at pietrobattiston.it (Pietro Battiston)
Date: Tue, 17 Jul 2018 00:35:55 +0200
Subject: [Pandas-dev] Pandas Sprint Recap

Il giorno lun, 16/07/2018 alle 13.50 -0400, Wes McKinney ha scritto:
> One point on this that we discussed some in the sprint and during
> SciPy: to undertake a major overhaul of pandas, at some point it may
> require a shift to a "new codebase".
> [...]
> I'm skeptical whether the kinds of significant / breaking changes
> we've discussed the last 3 years can happen in an iterative / organic
> fashion within the current pandas codebase.

Let me reverse the question: how much progress has/will have the pandas 2.0
codebase made in the meanwhile? :-)

Joking apart, it's not that if current pandas progresses slowly, then pandas
2.0 has any guarantee to progress more quickly.
There is a thing I never understood about pandas 2.0, and I understand even
less now that 1.0 gets closer: if you/we have clear plans to rewrite the
codebase, then why aren't we doing it now? Why are we wasting time on the
current one?! Why are we releasing a pandas 1.0 with its "illusion of
maturity"?!

Rewriting a large project takes effort but can be worth it; _planning_ a
(not so close) future rewrite seems to me just a sort of perversion.

I do respect the desire to improve the API under several aspects - and I see
this as the main reason for having something called pandas 2.0.

But I think this discussion of the API could and should be decoupled from
the idea to rewrite/reorganize the internals.

Indexing code and many internals badly need a rewrite, regardless of whether
we change the API, regardless of whether we call it "2.0", and regardless of
whether we change the entire codebase all at once or refactor bit by bit.
They firstly need it because there are 313 open bugs labeled "Indexing", and
some of them are very difficult to solve because the code is unnecessarily
complicated. But I think this rewrite is basically what is happening daily.

More generally, if our plan is to close, sooner or later, 2400+ bugs by
basically saying "pandas 1.0 is obsolete, long live pandas 2.0"... then we
are not doing a great service to our users in releasing pandas 1.0 as such.

> I spent some time at the sprint looking through pandas.core.internals,
> pandas.core.generic, and some of the other low level pieces, and my
> feeling is that it would be easier to start over.

I have a different feeling on what is easier, but I might very well be
wrong, or it might be a matter of personal taste (e.g. it is true that when
I reformat some code in current pandas I more often than not end up finding
bugs in some other place, but at least I can immediately test the code I am
writing, because that "other place" exists).

Something we all agree on is that, be it in a new or in the same codebase,
rewriting the internals takes time and effort. It requires that some of the
already few devs divert effort from improving the current codebase to
focusing on the new one. If we plan to do it, shouldn't we be doing it now?

(And if we don't do it now, isn't it because we don't really feel the urge
to do it?)

In any case, what your mail suggests to me is that we definitely need to
spend one (or more) dev talks looking through pandas.core.internals and
pandas.core.generic all together!

Cheers,

Pietro

From me at pietrobattiston.it Mon Jul 16 19:11:15 2018
From: me at pietrobattiston.it (Pietro Battiston)
Date: Tue, 17 Jul 2018 01:11:15 +0200
Subject: [Pandas-dev] GroupBy Overhaul Proposal

Hi Will,

there might be parts of your document I don't entirely understand, but I
definitely appreciate the desire to clean up the groupby module, and have
some comments on what I (think I) understood.

Il giorno lun, 09/07/2018 alle 17.18 -0700, William Ayd via Pandas-dev ha
scritto:
> Hi All,
>
> I've been thinking through what a redesigned GroupBy module could look
> like in 1.0. The main problems I am trying to address are:
> [...]
>
> Along those lines, here were four things I thought could be of immense
> value:
> - Removal of the apply method

I would not worry too much about the fact that apply's performance on
user-provided functions is bad (as long as it's documented), or that sum()
returns different results from .apply(sum) (again, as long as it's
documented in sum()'s docstring).

What I think we should definitely avoid is _any_ case of

    apply(a_func)

giving different results from

    apply(lambda x: a_func(x))

... that is, inference can be made on the result of a function, but not on
the function itself. However, this is an issue not specifically related to
.apply() - see #17035.
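(In other words -- a quick sketch of the invariant, with a made-up toy
frame:)

    import pandas as pd

    df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 3]})
    g = df.groupby("key")["val"]

    # These two must always agree; dispatching on the *identity* of the
    # function (e.g. special-casing the builtin `sum`) rather than on what
    # it returns would break the invariant, since the lambda hides the
    # identity.
    assert g.apply(sum).equals(g.apply(lambda x: sum(x)))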
So my (not very informed) opinion is that we could just simplify .apply() a
lot, reducing it to a few simple rules/cases on the kind of output returned
by the function, to be clearly documented, without suppressing it.

By the way, assuming we keep apply(), I really think it shouldn't be too
hard to avoid evaluating the first chunk twice.

Apart from this, I was assuming that apply() covered some cases that no
other aggregation method covers (e.g. when func returns a DataFrame of a
different shape than the original chunk)... but I might be wrong.

> - Removal of the DataFrameGroupBy and SeriesGroupBy classes

I miss the technical details, but I don't think we should force the output
of a DataFrame.groupby()[col].anything() to be a DataFrame; and most
importantly, force the "func" in

    DataFrame.groupby()[col].apply(func)

to accept DataFrame chunks. However there might be a lot of scope for code
simplification by having the above case _implemented_ as a DataFrameGroupBy
(or just code in .groupby()), of which we then extract the column.

> - Explicit default column naming

I understand the concern about the sum of A being called A, which is bad,
but I would never want "Sum of A" to appear in my DataFrame. I think this is
the typical task to be solved through a MultiIndex, consistently with
.agg().

> - Removal of the axis argument

I never used it indeed... but if it's really just a matter of transpose ->
operate -> transpose, couldn't we just do this under the hood (and maybe
warn the user in the docs about the performance/dtypes mess)?

Pietro

From wesmckinn at gmail.com Mon Jul 16 19:17:05 2018
From: wesmckinn at gmail.com (Wes McKinney)
Date: Mon, 16 Jul 2018 19:17:05 -0400
Subject: [Pandas-dev] Pandas Sprint Recap

hi Pietro,

On Mon, Jul 16, 2018 at 6:35 PM, Pietro Battiston wrote:
> Il giorno lun, 16/07/2018 alle 13.50 -0400, Wes McKinney ha scritto:
> [...]
> There is a thing I never understood about pandas 2.0, and I understand
> even less now that 1.0 gets closer: if you/we have clear plans to
> rewrite the codebase, then why aren't we doing it now? Why are we
> wasting time on the current one?!
> [...]
> Rewriting a large project takes effort but can be worth it; _planning_
> a (not so close) future rewrite seems to me just a sort of perversion.

So when you say "if you/we have clear plans to rewrite the codebase, then
why aren't we doing it now", why do you presume that I am not doing that
right now?

When I initiated conversations around improved internals for pandas and
projects like pandas in late 2015, I had just signed on a large group of
people to start the Apache Arrow project, and that's been about 90% of where
I've invested my time since then. My idea with Arrow has always been to
build a stronger memory management and computational underpinning for a
next-generation pandas-type library.

One of the "original sins" of pandas is that we own the full stack: data
structures, IO, deserialization and serialization, computation/algorithms,
visualization, and front end UI. We are (more or less) completely on our
own. What I am proposing is to share the burden of developing the low-level
stuff with a vastly larger group of developers, at least 10x as large as we
currently have in pandas. The wheels are already well in motion for this to
happen. I don't see any way to do this without basing the work on top of
open standards developed with a community that extends beyond the walls of
Python. Maybe I'm going about it wrong, but I've invested 3 years of my life
in this at this point, and it's looking like a 7-10 year effort.

This is all to say: if the pandas community doesn't agree with my approach
to this problem, I'm not going to twist anyone's arm. Either we agree or we
go our separate ways; we are all volunteers after all. Unfortunately there
are still factions within the Python data world that do not collaborate with
each other very actively; I'm not sure what to do about that.

We're all getting a lot older; if it turns out that the Python / pandas
community doesn't want to leap forward in the ways that we've discussed
(where this "leap forward" requires a certain amount of funding and
activation energy) after, say, 20 years since the inception of the project,
then, as they say, that will be the story of us for the history books. At
some point the world could in all likelihood move on from us to something
else.

> I do respect the desire to improve the API under several aspects - and
> I see this as the main reason for having something called pandas 2.0.

Frankly, I would rather have a new project name, but retain affiliation with
the pandas community in spirit and governance.
> But I think this discussion of the API could and should be decoupled
> from the idea to rewrite/reorganize the internals.
> [...]
> More generally, if our plan is to close, sooner or later, 2400+ bugs by
> basically saying "pandas 1.0 is obsolete, long live pandas 2.0"... then
> we are not doing a great service to our users in releasing pandas 1.0
> as such.

I don't think pandas as it exists now will ever be obsolete, at least not on
a 10 year horizon. At some point I think we should close off the core to
anything new, and restrict changes to either bug fixes or deprecations /
removals. New functionality should come in the form of add-on libraries that
build off the core API.

In a way, what I would like to see is something more like what the R
community has -- data frames are "built into the language", and so many
libraries can work on the same data and be assured of interop. We are
already sort of doing this with pandas + statsmodels, sklearn, etc., but I
would argue it needs to be taken even further.

For the record, I am never going to argue that pandas should not be
maintained or that the user base should be abandoned. However, I question
whether the current core maintainers have a duty to be tethered to the issue
backlog for the rest of their lives. Perhaps maintenance could be taken up
by a for-profit company at some point? I do know that it is difficult to
impossible to innovate and build new software while simultaneously keeping
up with a bugfix/maintenance grind.

> Something we all agree on is that, be it in a new or in the same
> codebase, rewriting the internals takes time and effort. It requires
> that some of the already few devs divert effort from improving the
> current codebase to focusing on the new one. If we plan to do it,
> shouldn't we be doing it now?
>
> (And if we don't do it now, isn't it because we don't really feel the
> urge to do it?)

See above...

- Wes

From me at pietrobattiston.it Mon Jul 16 20:23:37 2018
From: me at pietrobattiston.it (Pietro Battiston)
Date: Tue, 17 Jul 2018 02:23:37 +0200
Subject: [Pandas-dev] Pandas Sprint Recap

Hi Wes,

thanks for the extensive reply.
But sorry - it's probably because I missed the sprint, but I really can't
follow you. Do you have any pointers to better understand the future pandas
(alternative) you have in mind? I know about Arrow, but I see it as a future
potentiality for pandas, not as an alternative, or even the germ of one (and
clearly not in the sense of "it's not powerful enough", but of "it has a
different scope"). Even less do I understand why pandas (or a "pandas-like
library") should change name, if we are mostly talking about
internals/implementation issues (rather than about API/features). Compared
to this, the decision to rewrite the codebase or not is admittedly minor...

I see a vision in your email, and certainly many political/community aspects
I must be missing... but I still mostly miss the technical details
supporting this vision, and apparently
https://pandas-dev.github.io/pandas2 won't help me. Again, talking all
together about what makes you think that the current codebase needs a
complete rewrite would be great. Hope we can do this in one of the next devs
calls.

In any case,

Il giorno lun, 16/07/2018 alle 19.17 -0400, Wes McKinney ha scritto:
> [...]
> For the record, I am never going to argue that pandas should not be
> maintained or that the user base should be abandoned. However, I
> question whether the current core maintainers have a duty to be
> tethered to the issue backlog for the rest of their lives. Perhaps
> maintenance could be taken up by a for-profit company at some point?

The idea that I should sooner or later pay to use (a working version of) the
code I'm helping to write is even more depressing, to me, than the idea that
such effort will go partly wasted in a rewrite.

I'm personally "tethered" to a software which changed the way I work every
day, and to which I occasionally try to contribute back. The "backlog" is
not just a pile of dirt: it signals that (net of some possible better
triaging) there are things to fix in the software. I see any change, and
even a rewrite, as good basically if and only if it allows us to reduce this
"backlog".

Your answer to the question "are we wasting time on pandas?" is basically
"I'm not, you are". I wonder whether it was discussed in these terms at the
sprint!

Cheers,

Pietro

From wesmckinn at gmail.com Mon Jul 16 21:04:21 2018
From: wesmckinn at gmail.com (Wes McKinney)
Date: Mon, 16 Jul 2018 21:04:21 -0400
Subject: [Pandas-dev] Pandas Sprint Recap

hi Pietro,

On Mon, Jul 16, 2018 at 8:23 PM, Pietro Battiston wrote:
> thanks for the extensive reply. But sorry - it's probably because I
> missed the sprint, but I really can't follow you. Do you have any
> pointers to better understand the future pandas (alternative) you have
> in mind?
> [...]
> I see a vision in your email, and certainly many political/community
> aspects I must be missing... but I still mostly miss the technical
> details supporting this vision.
Well, here is a document from more than 2 years ago now:
https://pandas-dev.github.io/pandas2/goals.html

The way I would summarize the big picture goals are:

* Simpler, more predictable and precise memory management
* Ability to work with memory-mapped, on-disk data (this part is essential)
* Substantially less memory use for non-numeric data
* More civilized copy-on-write semantics
* Improved interoperability with the rest of the world (being able to reuse
  libraries and algorithms for analytics more gracefully)

I have been working very hard to present a sound, working, non-hand-wavy
solution to these low-level problems. I am a mathematician by training, and
so I am allergic to hand-wavy solutions or "designs" lacking in rigor in the
fine details.

I wrote this blog post addressing some of these topics and more:
http://wesmckinney.com/blog/apache-arrow-pandas-internals/

I have spent a great deal of energy in blog posts, slide decks, etc. laying
out the technical details about how this can work and why it is a sound
approach. I am not sure what more I can do other than to hope that those of
like mind and inclination to work on systems engineering follow along.

Given the above requirements, I don't see a way forward that does not
involve, at minimum, scrapping pandas.core.internals.
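(To make just the non-numeric memory point above concrete -- an illustrative
sketch; exact numbers depend on platform and library versions:)

    import pandas as pd
    import pyarrow as pa

    words = ["pandas", "arrow"] * 500000

    # pandas object dtype: one heap-allocated PyObject per value
    s = pd.Series(words)
    s.memory_usage(deep=True)   # reported in the tens of megabytes

    # Arrow string array: contiguous offset + data buffers, no per-value
    # Python objects
    arr = pa.array(words)
    arr.nbytes                  # roughly an order of magnitude smaller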
> Your answer to the question "are we wasting time on pandas?" is
> basically "I'm not, you are". I wonder whether it was discussed in
> these terms at the sprint!

Whoa, I never said this, and I do not believe anyone is wasting their time.
Maintaining/supporting pandas in its current state is a valid way to spend
your time, but my concern is that the feelings of obligation toward keeping
the status quo afloat may stop the community from making progress on
fundamental issues in performance and scalability. Realistically we need to
find a way to do both sustainably (though it's arguable whether development
now is sustainable).

-W

From shoyer at gmail.com Mon Jul 16 21:14:46 2018
From: shoyer at gmail.com (Stephan Hoyer)
Date: Mon, 16 Jul 2018 18:14:46 -0700
Subject: [Pandas-dev] Pandas Sprint Recap

On Mon, Jul 16, 2018 at 7:23 PM Pietro Battiston wrote:
> Even less do I understand why pandas (or a "pandas-like library")
> should change name, if we are mostly talking about
> internals/implementation issues (rather than about API/features).
> Compared to this, the decision to rewrite the codebase or not is
> admittedly minor...

Indeed, I suspect that a big part of the reason for suggesting a new project
name rather than pandas2 is that pandas also needs a major overhaul of
API/features, in addition to new internals. There are several main issues:

1. Several core features in pandas (notably dtypes and indexing) are
   difficult to use correctly in their current state, and need a major
   overhaul/simplification of their functionality.
2. The indexed pandas.Series and pandas.DataFrame aren't the right
   abstraction for many tasks. A simpler, index-free DataFrame would be a
   better data model for many of them. For tasks that really need axis
   labels, a tool like xarray might be more appropriate.
3. Despite its flaws, pandas is extremely useful, so it has grown a large
   number of features/contributions. It would be difficult to reimplement
   all of these features immediately on top of a new implementation.

Given infinite manpower, all these things could be changed incrementally and
in a backwards compatible manner on top of current pandas. But the result
would look very different from the pandas we know today. Of course, we are
vastly under-resourced, so it will take quite a long time to get to a better
place. I don't think it would serve either users or developers well to make
such major changes in an incremental way over the course of multiple years.

For these reasons, I agree with Wes that it wouldn't make sense to call the
hypothetical Python library he is working towards pandas, at least in the
sense that you use it by writing "import pandas". At best, we should write
"import pandas2". Or perhaps, as Wes suggests, it would more appropriately
be given a new name to indicate its new design/scope.

Best,
Stephan

From william.ayd at icloud.com Mon Jul 16 22:45:24 2018
From: william.ayd at icloud.com (William Ayd)
Date: Mon, 16 Jul 2018 19:45:24 -0700
Subject: [Pandas-dev] GroupBy Overhaul Proposal

Thanks Pietro for your feedback - very much appreciated!

> I would not worry too much about the fact that apply's performance on
> user-provided functions is bad (as long as it's documented), or that
> sum() returns different results from .apply(sum) (again, as long as
> it's documented in sum()'s docstring).

Perfectly fine to ignore performance for now, but I disagree on the second
point you make. To a core developer it makes perfect sense that .sum() and
.apply(sum) may return different results, but I don't think that is as
apparent to newcomers or just casual users of pandas.
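(One concrete case where the two diverge today -- a small sketch with toy
data:)

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1.0, np.nan, 2.0]})
    g = df.groupby("key")["val"]

    g.sum()       # a: 1.0, b: 2.0 -- the groupby method skips NaN
    g.apply(sum)  # a: NaN, b: 2.0 -- a plain sum propagates NaN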
In fact I'd worry about a newcomer thinking "why bother with other methods
when I can just send everything on through apply?" Even if I'm overly
concerned about that, I don't think there's a simple explanation of when
these should differ, which is again why I think it's a mistake to offer two
very similar but actually slightly different ways of going about that
calculation.

> So my (not very informed) opinion is that we could just simplify
> .apply() a lot, reducing it to a few simple rules/cases on the kind of
> output returned by the function, to be clearly documented, without
> suppressing it.

Totally agree here. I think there's a lot of overlap between .apply and
other methods which obfuscates the need for the former. One example was
cited above, but another thing to consider is its overlap with .agg. IMO the
"cleanest" use of apply is sending it a function which reduces to a scalar,
but in that case you could arguably just use .agg. The other use cases would
cover Series, DataFrame, collections, etc., and it kind of "just works" with
those, but I think it's impossible to make guarantees about how to properly
piece those types of objects back together. DataFrames of differing
dimensions can easily create sparse objects (which may or may not be the
intention), and for things like collections I'd question how we will walk
the tightrope of expectations if / when we get some Extension Arrays in
place that support those as first class objects in pandas.

> I miss the technical details, but I don't think we should force the
> output of a DataFrame.groupby()[col].anything() to be a DataFrame...
>
> ...However there might be a lot of scope for code simplification by
> having the above case _implemented_ as a DataFrameGroupBy (or just
> code in .groupby()), of which we then extract the column.

Yea that's a valid point. The suggestion here may be extreme, but with your
last statement there I think we are aligned at a high level on the intention
and how it could simplify the code.

> I understand the concern about the sum of A being called A, which is
> bad, but I would never want "Sum of A" to appear in my DataFrame. I
> think this is the typical task to be solved through a MultiIndex,
> consistently with .agg().

The problem with that is we have a variety of issues that are trying to work
around the MultiIndex columns being returned. #18366 is probably the main
issue (with 15 upvotes!), but you'll also see this loosely manifested in
#20241, #19978 and potentially quite a few more.

What is the apprehension with something like "Sum of A"? I'm not tied to
that naming per se, but it at least mimics Excel and therefore isn't that
farfetched of a solution. From an end user perspective I can see the big
gripe that we kind of force a MultiIndex column on them when they often
don't have that to begin with, and it just adds more complexity and method
chaining to their pipeline. Something like "Sum of A" (or whatever else
really) could maintain the original dimensions of the columns being used
while also being a solution that might work across all of the various
aggregation / transformation methods and acceptable arguments.

> I never used it indeed... but if it's really just a matter of
> transpose -> operate -> transpose, couldn't we just do this under the
> hood (and maybe warn the user in the docs about the performance/dtypes
> mess)?

That could be an option as well. Curious to hear what others think.
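(For reference, the under-the-hood equivalence being suggested -- a rough
sketch on a made-up frame with duplicate column labels:)

    import pandas as pd

    df = pd.DataFrame([[1, 2, 3, 4]], columns=["a", "a", "b", "b"])

    by_axis = df.groupby(level=0, axis=1).sum()
    by_transpose = df.T.groupby(level=0).sum().T

    # the axis=1 path is just transpose -> operate -> transpose
    assert by_axis.equals(by_transpose)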
- Will
From me at pietrobattiston.it Tue Jul 17 03:59:33 2018
From: me at pietrobattiston.it (Pietro Battiston)
Date: Tue, 17 Jul 2018 09:59:33 +0200
Subject: [Pandas-dev] GroupBy Overhaul Proposal
In-Reply-To: <09F05CAB-B491-416B-AC63-486C9AD132CD@icloud.com>
References: <1531782675.15070.18.camel@pietrobattiston.it> <09F05CAB-B491-416B-AC63-486C9AD132CD@icloud.com>
Message-ID: <1531814373.15070.27.camel@pietrobattiston.it>

Quick reply on a couple of points.

Il giorno lun, 16/07/2018 alle 19.45 -0700, William Ayd ha scritto:
> [...] Even if I'm overly concerned about that, I don't think there's a simple explanation of when these should differ, which is again why I think it's a mistake to offer two very similar but actually slightly different ways of going about that calculation.

In fact, my preference for keeping apply is pretty weak as long as there are alternatives that cover each of its use cases. But again, I'm not sure this is true.

>> I understand the concern about the sum of A being called A, which is bad, but I would never want "Sum of A" to appear in my DataFrame. I think this is the typical task to be solved through a MultiIndex, consistently with .agg().
>
> The problem with that is we have a variety of issues that are trying to work around the MultiIndex columns being returned. #18366 is probably the main issue (with 15 upvotes!),

Unless I'm wrong, #18366 is orthogonal to what we are discussing: unnamed lambdas would remain unnamed lambdas. (And the obvious solution to my eyes is to use named methods instead.)

> but you'll also see this loosely manifested in #20241, #19978 and potentially quite a few more.

I think we do need a better ability to do in-line renaming of MultiIndexed DataFrames, regardless of whether they come from groupby().

> What is the apprehension with something like "Sum of A"? I'm not tied to that naming per se, but it at least mimics Excel and therefore isn't that far-fetched of a solution.

Problems I have with "Sum of A":
- if, after creating all my columns, I want to e.g. select all columns that contain sums, I need to do some sort of "df[[col if col.startswith("Sum of")]]". Compare to "df.loc[:, ('Sum',)]"
- it would be the only case in pandas in which we decide how to call a column on behalf of the user
- ... and this unexpected behavior is introduced to solve a relatively specific case of aggregation (1 column -> 1 scalar)
- if one wants to allow the user to name the columns according to her taste, it's pretty simple to introduce an argument which takes a string to be .format()ted with the name of the column (or even of the method), e.g. name="Sum of {}"
- ... although it is actually pretty simple to just do df.columns = "Sum of " + df.columns
- agg already returns MultiIndexes (when passed multiple functions)
- we would be following Excel as an example :-D

By the way, despite some related issues, I still think tuples can be first class citizens of flat indexes. So if one doesn't like MultiIndexes, or they do not fit one's needs, ("sum", "A") can well be a label in a regular index.
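To make the comparison concrete on a toy frame (nothing here is new API; the flattening at the end is just a list comprehension):

import pandas as pd

df = pd.DataFrame({"g": ["x", "x", "y"], "A": [1, 2, 3], "B": [4.0, 5.0, 6.0]})
res = df.groupby("g").agg(["sum", "min"])  # columns: MultiIndex [("A", "sum"), ...]

# select all the sums through the MultiIndex, no string matching needed:
sums = res.loc[:, (slice(None), "sum")]

# and flattening to user-chosen names afterwards is trivial:
res.columns = ["{} of {}".format(func, col) for col, func in res.columns]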
Pietro

From me at pietrobattiston.it Tue Jul 17 04:29:13 2018
From: me at pietrobattiston.it (Pietro Battiston)
Date: Tue, 17 Jul 2018 10:29:13 +0200
Subject: [Pandas-dev] Pandas Sprint Recap
In-Reply-To: 
References: <1531501924.11738.110.camel@pietrobattiston.it> <1531780555.15070.13.camel@pietrobattiston.it> <1531787017.15070.24.camel@pietrobattiston.it>
Message-ID: <1531816153.15070.29.camel@pietrobattiston.it>

Il giorno lun, 16/07/2018 alle 21.04 -0400, Wes McKinney ha scritto:
> hi Pietro,
>
> On Mon, Jul 16, 2018 at 8:23 PM, Pietro Battiston wrote:
>> Hi Wes,
>>
>> thanks for the extensive reply. But sorry, it's probably that I missed the sprint, but I really can't follow you. Do you have any pointers to better understand the future pandas (alternative) you have in mind? I know about Arrow, but I see it as a future potentiality for pandas, not as an alternative, or even the germ of it (and clearly not in the sense of "it's not powerful enough", but of "it has different scope"). Even less do I understand why pandas (or a "pandas-like library") should change name, if we are mostly talking about internals/implementation issues (rather than about API/features). Compared to this, the decision to rewrite the codebase or not is admittedly minor...
>>
>> I see a vision in your email, and certainly many political/community aspects I must be missing... but I still mostly miss the technical details supporting this vision, and apparently https://pandas-dev.github.io/pandas2 won't help me. Again, talking all together about what makes you think that the current codebase needs a complete rewrite would be great. Hope we can do this in one of the next devs calls.
>
> Well, here is a document from more than 2 years ago now:
>
> https://pandas-dev.github.io/pandas2/goals.html

Yes, that's what I was referring to...

> The way I would summarize the big picture goals are:
>
> * Simpler, more predictable and precise memory management
> * Ability to work with memory-mapped, on-disk data (this part is essential)
> * Substantially less memory use for non-numeric data
> * More civilized copy-on-write semantics
> * Improved interoperability with the rest of the world (being able to reuse libraries, algorithms for analytics more gracefully)

This is all very important stuff, but my rough guess is that it concerns no more than 25% of the pandas codebase, and no more than 15% of open bugs.

> I have been working very hard to present a sound, working, non-hand-wavy solution to these low-level problems. I am a mathematician by training, and so I am allergic to hand-wavy solutions or "designs" lacking in rigor in the fine details.
>
> I wrote this blog post addressing some of these topics and more:
> http://wesmckinney.com/blog/apache-arrow-pandas-internals/

I happened to have read this too, but my comment above more or less applies unchanged, with the exception of point 10, "Eager evaluation model, no query planning", which is (probably) incompatible with the current pandas codebase/API. That's in principle Dask's job. Now, I don't know Dask well enough to judge whether it's up to the task, but the criticisms in your "Addendum" are again mainly about pandas, not about how Dask extends pandas. And they are again about the memory management/IO, which again is where I thought we thought Arrow would improve pandas.

> I have spent a great deal of energy in blog posts, slide decks, etc.
> laying out the technical details about how it can work and why it is a sound approach. I am not sure what more I can do other than to hope that those of like mind and inclination to work on systems engineering follow along.

This conversation (which is the first one in which I see Arrow as an alternative, not a complement, to pandas) didn't start with blog posts, slide decks etc. (which I had partly read), it started with you saying "I spent some time at the sprint looking through pandas.core.internals, pandas.core.generic". I would really love to know/talk about that.

> Given the above requirements, I don't see a way forward that does not involve at minimum scrapping pandas.core.internals.

I'm not making claims on the amount of lines of code (or methods/features) we should save or drop. I'm just claiming that if we want to change "import pandas" to something else different enough to justify a change of name,
- it would be mostly because of changes in the API, not of implementation
- not having human forces enough to solve current pandas issues is not a valid reason (actually the opposite)
- it would be nice to have a clearer roadmap (again: the pandas2 docs are not clear at all about this) and discussion about the future of pandas (as in "current name and user requirements/expectations", not "current codebase")

> [...]
>> Your answer to the question "are we wasting time on pandas?" is basically "I'm not, you are". I wonder whether it was discussed in these terms at the sprint!
>
> Whoa, I never said this and I do not believe anyone is wasting their time.

I am a mathematician by training, give me the premises and I will naturally look for logical consequences ;-)

Pietro

From me at pietrobattiston.it Tue Jul 17 05:00:58 2018
From: me at pietrobattiston.it (Pietro Battiston)
Date: Tue, 17 Jul 2018 11:00:58 +0200
Subject: [Pandas-dev] Pandas Sprint Recap
In-Reply-To: 
References: <1531501924.11738.110.camel@pietrobattiston.it> <1531780555.15070.13.camel@pietrobattiston.it> <1531787017.15070.24.camel@pietrobattiston.it>
Message-ID: <1531818058.15070.31.camel@pietrobattiston.it>

Hi Stephan,

I appreciate that your email focuses on specific API changes (again, the only reason, in my view, to justify a pandas 2, or even a change of name)

Il giorno lun, 16/07/2018 alle 18.14 -0700, Stephan Hoyer ha scritto:
> On Mon, Jul 16, 2018 at 7:23 PM Pietro Battiston wrote:
> [...]
> 2. The indexed pandas.Series and pandas.DataFrame isn't the right abstraction for many tasks. A simpler, index-free DataFrame would be a better data model for many tasks. For tasks that really need axis labels, a tool like xarray might be more appropriate.

... but this makes me smile.

First, because labels/indexes are in my experience the main reason why people come to pandas (another important reason is having multiple dtypes in a single data structure, but numpy structured arrays also do this).

Second, because supporting a DataFrame with no index would be pretty easy in the current codebase/API (e.g. "index=False"). I know it would break some code, but it would be wrong code anyway (that is, code that doesn't decouple indexes from data storage).

Third, because now that the default index is RangeIndex(n) (which a user is free not to rely on in any way), and as long as broken code is fixed (see above), a DataFrame with no index wouldn't really be "simpler".
Fourth, because you cite xarray as an alternative... but unless I'm wrong, labels are now optional in xarray (precisely the path I suggest we could take).

More in general, in my view, asking users to choose between multiple dtypes and indexes would bring the state of data manipulation in Python backwards by several years (and probably behind the state of data manipulation in R).

> 3. Despite its flaws, pandas is extremely useful, so it has grown a large number of features/contributions. It would be difficult to reimplement all of these features immediately on top of a new implementation.
>
> Given infinite manpower, all these things could be changed incrementally and in a backwards compatible manner on top of current pandas. But the result would look very different from the pandas we know today. Of course, we are vastly under-resourced, so it will take quite a long time to get to a better place. I don't think it would serve either users or developers well to make such major changes in an incremental way over the course of multiple years.
>
> For these reasons, I agree with Wes that it wouldn't make sense to call the hypothetical Python library he is working towards pandas, at least in the sense that you use it by writing "import pandas". At best, we should write "import pandas2". Or perhaps, as Wes suggests, it would more appropriately be given a new name to indicate its new design/scope.

I understand the need to change the import name not to break people's code. But again, I fail to see a new "scope". I don't see an analysis of which (share) of the current pandas problems (=issues) would be solved.

Add to this the urge of rewriting the code base (while at the same time I'm having trouble getting any comments on _how_ to restructure - or structure, in a potential rewrite - an important part of it [1]), and I can't help but fear that we are trying very hard to reinvent the wheel.

Pietro

[1] https://github.com/pandas-dev/pandas/wiki/(Tentative)-rules-for-restructuring-indexing-code

From shoyer at gmail.com Tue Jul 17 18:28:56 2018
From: shoyer at gmail.com (Stephan Hoyer)
Date: Tue, 17 Jul 2018 15:28:56 -0700
Subject: [Pandas-dev] Pandas Sprint Recap
In-Reply-To: <1531818058.15070.31.camel@pietrobattiston.it>
References: <1531501924.11738.110.camel@pietrobattiston.it> <1531780555.15070.13.camel@pietrobattiston.it> <1531787017.15070.24.camel@pietrobattiston.it> <1531818058.15070.31.camel@pietrobattiston.it>
Message-ID: 

On Tue, Jul 17, 2018 at 2:01 AM Pietro Battiston wrote:

> First, because labels/indexes are in my experience the main reason why people come to pandas (another important reason is having multiple dtypes in a single data structure, but numpy structured arrays also do this).

Certainly the functionality of indexes is valuable (especially for some use-cases), but I don't think the particular way we expose them is optimal. In my experience, the need to call reset_index() or assign directly to .index or .columns is a frequent source of annoyance.

> Second because supporting a DataFrame with no index would be pretty easy in the current codebase/API (e.g. "index=False"). I know it would break some code, but it would be wrong code anyway (that is, code that doesn't decouple indexes from data storage).
>
> Third, because now that the default index is RangeIndex(n) (which a user is free not to rely on in any way), and as long as broken code is fixed (see above), a DataFrame with no index wouldn't really be "simpler".
> It would mostly amount to deciding whether to show the index or not when printing to screen/doing IO.

Sure, you *could* fix all this on top of the current pandas data model. But it would be quite a challenging effort, and the full pandas data model would remain quite complex.

The current pandas data model looks something like this:

DataFrame:
- values: BlockManager wrapping 1d and/or 2d NumPy arrays
- index: Index
- columns: Index

The data model I'd like to work with in the future for most use-cases involving tabular data is something closer to:

DataFrame:
- data: OrderedDict[str, Array]
- indexes: OrderedDict[str, Index]

Conveniently, this looks very similar to the data model of Arrow or R. Optional indexes would provide fast reverse lookup for some subset of dataframe columns.
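A toy sketch just to make the shape of this concrete (SimpleFrame and its method set are made up, and plain NumPy arrays stand in for whatever Array type this would use):

from collections import OrderedDict
import numpy as np

class SimpleFrame:
    # data: column name -> 1d array; indexes: column name -> lookup structure
    def __init__(self, data, indexes=None):
        self.data = OrderedDict(data)
        self.indexes = OrderedDict(indexes or {})

    def __getitem__(self, name):
        return self.data[name]

df = SimpleFrame({"id": np.array([10, 20]), "x": np.array([1.5, 2.5])})
df["x"]  # array([1.5, 2.5])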
This pretty obviously could not support everything pandas can do today. For example, you couldn't have a hierarchical index for column names. But in my experience, you're better off working with "tidy data" anyways, as popularized in R's tidyverse.

> But again, I fail to see a new "scope". I don't see an analysis of which (share) of the current pandas problems (=issues) would be solved.

One way in which we have reduced pandas' scope recently is the proposed deprecation of Panel.

This is an example of focusing pandas on tabular data rather than N-dimensional arrays.

From mrocklin at gmail.com Tue Jul 17 18:46:45 2018
From: mrocklin at gmail.com (Matthew Rocklin)
Date: Tue, 17 Jul 2018 18:46:45 -0400
Subject: [Pandas-dev] Pandas Sprint Recap
In-Reply-To: 
References: <1531501924.11738.110.camel@pietrobattiston.it> <1531780555.15070.13.camel@pietrobattiston.it> <1531787017.15070.24.camel@pietrobattiston.it> <1531818058.15070.31.camel@pietrobattiston.it>
Message-ID: 

Has Pandas ever done a user survey?

I would be curious to know the answer to the question "do you make heavy use of the Pandas index" among users, and how that correlates with different domain/industry.

On Tue, Jul 17, 2018 at 6:29 PM Stephan Hoyer wrote:

>> First, because labels/indexes are in my experience the main reason why people come to pandas (another important reason is having multiple dtypes in a single data structure, but numpy structured arrays also do this).
>
> Certainly the functionality of indexes is valuable (especially for some use-cases), but I don't think the particular way we expose them is optimal. In my experience, the need to call reset_index() or assign directly to .index or .columns is a frequent source of annoyance.
>
>> Second because supporting a DataFrame with no index would be pretty easy in the current codebase/API (e.g. "index=False"). I know it would break some code, but it would be wrong code anyway (that is, code that doesn't decouple indexes from data storage).
>>
>> Third, because now that the default index is RangeIndex(n) (which a user is free not to rely on in any way), and as long as broken code is fixed (see above), a DataFrame with no index wouldn't really be "simpler". It would mostly amount to deciding whether to show the index or not when printing to screen/doing IO.
>
> Sure, you *could* fix all this on top of the current pandas data model. But it would be quite a challenging effort, and the full pandas data model would remain quite complex.
>
> The current pandas data model looks something like this:
>
> DataFrame:
> - values: BlockManager wrapping 1d and/or 2d NumPy arrays
> - index: Index
> - columns: Index
>
> The data model I'd like to work with in the future for most use-cases involving tabular data is something closer to:
>
> DataFrame:
> - data: OrderedDict[str, Array]
> - indexes: OrderedDict[str, Index]
>
> Conveniently, this looks very similar to the data model of Arrow or R. Optional indexes would provide fast reverse lookup for some subset of dataframe columns.
>
> This pretty obviously could not support everything pandas can do today. For example, you couldn't have a hierarchical index for column names. But in my experience, you're better off working with "tidy data" anyways, as popularized in R's tidyverse.
>
>> But again, I fail to see a new "scope". I don't see an analysis of which (share) of the current pandas problems (=issues) would be solved.
>
> One way in which we have reduced pandas' scope recently is the proposed deprecation of Panel.
>
> This is an example of focusing pandas on tabular data rather than N-dimensional arrays.

From william.ayd at icloud.com Tue Jul 17 19:10:47 2018
From: william.ayd at icloud.com (William Ayd)
Date: Tue, 17 Jul 2018 16:10:47 -0700
Subject: [Pandas-dev] GroupBy Overhaul Proposal
In-Reply-To: <1531814373.15070.27.camel@pietrobattiston.it>
References: <1531782675.15070.18.camel@pietrobattiston.it> <09F05CAB-B491-416B-AC63-486C9AD132CD@icloud.com> <1531814373.15070.27.camel@pietrobattiston.it>
Message-ID: 

> In fact, my preference for keeping apply is pretty weak as long as there are alternatives that cover each of its use cases. But again, I'm not sure this is true.

Just to clarify my position:

1. .apply() + UDF reducing to a scalar should be replaceable with .agg() + same UDF (even though there are differences today?)
2. .apply() + UDF returning Series / DataFrame / collection doesn't have anything else to cover it

But with #2 above I think it's dangerous to assume that .apply can always do the "right thing" with those types of inputs. We don't make any assertions about the indexing / labeling of returned Series and DataFrames. As far as collections are concerned I'm not sure if there will be a clear answer on how to handle those assuming we start getting EAs that add first-class support for those.

> Unless I'm wrong, #18366 is orthogonal to what we are discussing: unnamed lambdas would remain unnamed lambdas. (And the obvious solution to my eyes is to use named methods instead.)

I don't think this is orthogonal. Your concern is valid on lambdas and I don't know what the solution there is (perhaps some kind of keyword argument) but without getting tripped up on that I don't think it's immediately apparent that the returned object for a DataFrame with columns 'a', 'b', 'c' will have a single level of columns when called as follows:

- df.groupby('a').agg(sum)
- df.groupby('a').agg({'b': sum, 'c': min})

Yet the following will yield a MultiIndex column:

- df.groupby('a').agg([sum])
- df.groupby('a').agg({'b': [sum], 'c': min})
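Spelled out on a toy frame (purely illustrative):

import pandas as pd

df = pd.DataFrame({"a": ["x", "x", "y"], "b": [1, 2, 3], "c": [4, 5, 6]})

df.groupby("a").agg(sum).columns    # flat: Index(['b', 'c'])
df.groupby("a").agg([sum]).columns  # MultiIndex: [('b', 'sum'), ('c', 'sum')]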
If you reduce the returned columns to "sum of b" and "min of c" you can ensure that the returned columns have the same number of levels regardless of call signature, AND have the added bonus of not obfuscating what type of aggregation was performed in the former two examples.

Of course the end user may ultimately decide that they don't like those labels at all and completely override them, but that effort becomes much easier if they can make guarantees around the number of levels of the returned object (especially if it's just one!).

> - if, after creating all my columns, I want to e.g. select all columns that contain sums, I need to do some sort of "df[[col if col.startswith("Sum of")]]". Compare to "df.loc[:, ('Sum',)]"

Unless I am mistaken you would have to do something like "df.groupby('a').agg([sum]).loc[:, slice(None, 'sum')]" to get that to work. I don't think that syntax really is that clean and it starts taking us down the path of advanced indexing for what may start off to the end user as a very simple aggregation exercise.

> - it would be the only case in pandas in which we decide how to call a column on behalf of the user

Well we have to do something to reduce ambiguity... I think a consistent naming convention and dimension for the columns across all invocations is strongly preferable to inserting a column level some of the time.

> - if one wants to allow the user to name the columns according to her taste, it's pretty simple to introduce an argument which takes a string to be .format()ted with the name of the column (or even of the method), e.g. name="Sum of {}"

Agreed. In my head I feel like this defaults to something like f"{fname} of {colname}" but gives the user potentially the option to override. By default keep the same number of levels as what is being passed in, though maybe None as an argument reverts to the old style behavior of simply inserting a new column index level.

> By the way, despite some related issues, I still think tuples can be first class citizens of flat indexes. So if one doesn't like MultiIndexes, or they do not fit one's needs, ("sum", "A") can well be a label in a regular index.

You know better than I do here, but again I don't think it makes for a good user experience to convert columns with one level into multiple levels after a GroupBy operation regardless of how you could subsequently access those values.

William Ayd
william.ayd at icloud.com

From me at pietrobattiston.it Wed Jul 18 03:01:22 2018
From: me at pietrobattiston.it (Pietro Battiston)
Date: Wed, 18 Jul 2018 09:01:22 +0200
Subject: [Pandas-dev] GroupBy Overhaul Proposal
In-Reply-To: 
References: <1531782675.15070.18.camel@pietrobattiston.it> <09F05CAB-B491-416B-AC63-486C9AD132CD@icloud.com> <1531814373.15070.27.camel@pietrobattiston.it>
Message-ID: <1531897282.3286.6.camel@pietrobattiston.it>

Il giorno mar, 17/07/2018 alle 16.10 -0700, William Ayd ha scritto:
>>> In fact, my preference for keeping apply is pretty weak as long as there are alternatives that cover each of its use cases. But again, I'm not sure this is true.
>>
>> Just to clarify my position:
>>
>> 1. .apply() + UDF reducing to a scalar should be replaceable with .agg() + same UDF (even though there are differences today?)
>> 2. .apply() + UDF returning Series / DataFrame / collection doesn't have anything else to cover it

.transform() at least covers the case in which the shape of the chunk is unchanged.
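A toy illustration of the three cases, with a frame having a group column "g" and a value column "v":

import pandas as pd

df = pd.DataFrame({"g": ["x", "x", "y"], "v": [1.0, 4.0, 9.0]})

df.groupby("g")["v"].agg("sum")                         # one scalar per group: agg
df.groupby("g")["v"].transform(lambda s: s - s.mean())  # same shape as input: transform
df.groupby("g")["v"].apply(lambda s: s.nlargest(1))     # group-dependent shape: only apply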
> But with #2 above I think it's dangerous to assume that .apply can always do the "right thing" with those types of inputs. We don't make any assertions about the indexing / labeling of returned Series and DataFrames.

There is a simple way to stop throwing magic at users, and it is to clearly document which cases .apply() covers (and which should be covered by .agg() or transform()), reflecting the actual guesswork taking place in the code.

By the way, my understanding (without having looked at the code) is that

UDF returns Series -> concat in a new Series
UDF returns DataFrame -> concat in a new DataFrame

and the guesswork mostly concerns understanding whether the new index is the same as the old. Am I missing anything relevant?

Now, I would be all for suppressing a complicated function by replacing it with simpler ways to do the same thing. But for instance I would like the following to still work with groupby().something():

def remove_group_outliers(group):
    outliers = ...  # code to identify them
    return group[~group.index.isin(outliers)]

... and I currently don't see any way but .apply().

> As far as collections are concerned I'm not sure if there will be a clear answer on how to handle those assuming we start getting EAs that add first-class support for those.

Do you have any pointer/example? I'm missing the relation between collections and .apply().

>> Unless I'm wrong, #18366 is orthogonal to what we are discussing: unnamed lambdas would remain unnamed lambdas. (And the obvious solution to my eyes is to use named methods instead.)
>
> I don't think this is orthogonal. Your concern is valid on lambdas and I don't know what the solution there is (perhaps some kind of keyword argument) but without getting tripped up on that I don't think it's immediately apparent that the returned object for a DataFrame with columns 'a', 'b', 'c' will have a single level of columns when called as follows:
>
> - df.groupby('a').agg(sum)
> - df.groupby('a').agg({'b': sum, 'c': min})
>
> Yet the following will yield a MultiIndex column:
>
> - df.groupby('a').agg([sum])
> - df.groupby('a').agg({'b': [sum], 'c': min})

The rule is not very complicated either (if correctly documented), but anyway, the inconsistency would disappear by just having the first two examples also return a MultiIndex.

... and maybe provide the users a very simple way to flatten MultiIndexes (see below).

> If you reduce the returned columns to "sum of b" and "min of c" you can ensure that the returned columns have the same number of levels regardless of call signature, AND have the added bonus of not obfuscating what type of aggregation was performed in the former two examples.

Both can be solved through a MI, or through an Index(dtype=object) containing tuples.

> Of course the end user may ultimately decide that they don't like those labels at all and completely override them, but that effort becomes much easier if they can make guarantees around the number of levels of the returned object

I agree on this

> (especially if it's just one!).

... not on that. MI (or tuples) -> arbitrary strings is much simpler/cleaner to do than arbitrary strings -> MI (or tuples)

>> - if, after creating all my columns, I want to e.g. select all columns that contain sums, I need to do some sort of "df[[col if col.startswith("Sum of")]]". Compare to "df.loc[:, ('Sum',)]"
>
> Unless I am mistaken you would have to do something like "df.groupby('a').agg([sum]).loc[:, slice(None, 'sum')]" to get that to work.
Yeah, I had swapped the levels, it is

df.groupby('a').agg([sum]).loc[:, (slice(None), 'sum')]

> I don't think that syntax really is that clean

In my code I always start by defining

WE = slice(None)  # WhatEver

and we could advertise this as a way to make the syntax shorter, but regardless of that, it definitely is cleaner than any string manipulation.

> and it starts taking us down the path of advanced indexing for what may start off to the end user as a very simple aggregation exercise.

On this I agree with you. I'm all for providing
- a MultiIndex.flatten() method which allows me to do res.columns = res.columns.flatten("{} of {}".format)
- a simple way to do the above in-line (which is already being discussed, regardless of groupby)

> [...]
>> - it would be the only case in pandas in which we decide how to call a column on behalf of the user
>
> Well we have to do something to reduce ambiguity... I think a consistent naming convention and dimension for the columns across all invocations is strongly preferable to inserting a column level some of the time.

Again, I agree on this.

>> - if one wants to allow the user to name the columns according to her taste, it's pretty simple to introduce an argument which takes a string to be .format()ted with the name of the column (or even of the method), e.g. name="Sum of {}"
>
> Agreed. In my head I feel like this defaults to something like f"{fname} of {colname}" but gives the user potentially the option to override. By default keep the same number of levels as what is being passed in, though maybe None as an argument reverts to the old style behavior of simply inserting a new column index level.

Agree on everything but the default, again, because it is arbitrary

>> By the way, despite some related issues, I still think tuples can be first class citizens of flat indexes. So if one doesn't like MultiIndexes, or they do not fit one's needs, ("sum", "A") can well be a label in a regular index.
>
> You know better than I do here, but again I don't think it makes for a good user experience to convert columns with one level into multiple levels after a GroupBy operation regardless of how you could subsequently access those values.

Notice that I'm not talking about a MultiIndex, but about a flat index. But it is an inferior solution, given the API we already expose, to the MultiIndex.

Pietro

From me at pietrobattiston.it Wed Jul 18 03:23:35 2018
From: me at pietrobattiston.it (Pietro Battiston)
Date: Wed, 18 Jul 2018 09:23:35 +0200
Subject: [Pandas-dev] Pandas Sprint Recap
In-Reply-To: 
References: <1531501924.11738.110.camel@pietrobattiston.it> <1531780555.15070.13.camel@pietrobattiston.it> <1531787017.15070.24.camel@pietrobattiston.it> <1531818058.15070.31.camel@pietrobattiston.it>
Message-ID: <1531898615.3286.8.camel@pietrobattiston.it>

Il giorno mar, 17/07/2018 alle 15.28 -0700, Stephan Hoyer ha scritto:
> On Tue, Jul 17, 2018 at 2:01 AM Pietro Battiston wrote:
>> First, because labels/indexes are in my experience the main reason why people come to pandas (another important reason is having multiple dtypes in a single data structure, but numpy structured arrays also do this).
>
> Certainly the functionality of indexes is valuable (especially for some use-cases), but I don't think the particular way we expose them is optimal.
> In my experience, the need to call reset_index() or assign directly to .index or .columns is a frequent source of annoyance.

I agree if you're referring to the impossibility to do this in-line... but that's not extremely difficult to solve.

>> Second because supporting a DataFrame with no index would be pretty easy in the current codebase/API (e.g. "index=False"). I know it would break some code, but it would be wrong code anyway (that is, code that doesn't decouple indexes from data storage).
>>
>> Third, because now that the default index is RangeIndex(n) (which a user is free not to rely on in any way), and as long as broken code is fixed (see above), a DataFrame with no index wouldn't really be "simpler". It would mostly amount to deciding whether to show the index or not when printing to screen/doing IO.
>
> Sure, you *could* fix all this on top of the current pandas data model. But it would be quite a challenging effort, and the full pandas data model would remain quite complex.
>
> The current pandas data model looks something like this:
>
> DataFrame:
> - values: BlockManager wrapping 1d and/or 2d NumPy arrays
> - index: Index
> - columns: Index
>
> The data model I'd like to work with in the future for most use-cases involving tabular data is something closer to:
>
> DataFrame:
> - data: OrderedDict[str, Array]
> - indexes: OrderedDict[str, Index]
>
> Conveniently, this looks very similar to the data model of Arrow or R. Optional indexes would provide fast reverse lookup for some subset of dataframe columns.

How is this different (API-wise) from a list of (same length, I assume) Series?

> This pretty obviously could not support everything pandas can do today. For example, you couldn't have a hierarchical index for column names. But in my experience, you're better off working with "tidy data" anyways, as popularized in R's tidyverse.

We have deprecated Panel (and I think it was the right choice) because you can always tell (and I've often told) users "work with MultiIndex and stack/unstack, that's efficient and much better and easier to understand".

... do you instead see all the stack/unstack machinery as just useless?! Do you have a feeling of what the user base (present and potential) thinks about this? (Do you think it matters in some way?)

>> But again, I fail to see a new "scope". I don't see an analysis of which (share) of the current pandas problems (=issues) would be solved.
>
> One way in which we have reduced pandas' scope recently is the proposed deprecation of Panel.
>
> This is an example of focusing pandas on tabular data rather than N-dimensional arrays.

See above, we have lost N-dimensional arrays (with N=3 or 4) but luckily not the concept of N-features data. I can't even think of retrieving data with pandaSDMX without MultiIndex columns, let alone manipulate it in any way.

We are constantly comparing pandas to R, but while so far I have always implicitly thought we were proud of the agile manipulation abilities that pandas has and R frames don't, only now I seem to understand we are envious, for some weird reason, of their lack of features. I understand we could be envious that they have a cleaner codebase... but since it's GPL covered, we actually have no idea :-D

And more seriously, we have talked very little (also in the sprint, from what I understand) of the possibility to improve/clean the internals code.
I think we agree on the fact that having data stored in a BlockManager based on as many arrays as there are dtypes does not per se really qualify as rocket science. But then, even if we decided that we are not good programmers enough to implement this cleanly, even the solution of storing stuff as single arrays but leaving the API (e.g. .columns as Index) as it is is superior to the DataFrame being a mere collection of (same length) Series.

Pietro

From shoyer at gmail.com Wed Jul 18 12:49:37 2018
From: shoyer at gmail.com (Stephan Hoyer)
Date: Wed, 18 Jul 2018 09:49:37 -0700
Subject: [Pandas-dev] Pandas Sprint Recap
In-Reply-To: 
References: <1531501924.11738.110.camel@pietrobattiston.it> <1531780555.15070.13.camel@pietrobattiston.it> <1531787017.15070.24.camel@pietrobattiston.it> <1531818058.15070.31.camel@pietrobattiston.it>
Message-ID: 

On Tue, Jul 17, 2018 at 3:47 PM Matthew Rocklin wrote:

> Has Pandas ever done a user survey?
>
> I would be curious to know the answer to the question "do you make heavy use of the Pandas index" among users, and how that correlates with different domain/industry.

This is a great question. I don't think we've ever done this sort of research.

My suspicion is that most of the time users ignore the index, and find the way it is used heavily in pandas more annoying than helpful. But certainly there are some use-cases for which automatic alignment with an index is fantastic.

From shoyer at gmail.com Wed Jul 18 13:16:17 2018
From: shoyer at gmail.com (Stephan Hoyer)
Date: Wed, 18 Jul 2018 10:16:17 -0700
Subject: [Pandas-dev] Pandas Sprint Recap
In-Reply-To: <1531898615.3286.8.camel@pietrobattiston.it>
References: <1531501924.11738.110.camel@pietrobattiston.it> <1531780555.15070.13.camel@pietrobattiston.it> <1531787017.15070.24.camel@pietrobattiston.it> <1531818058.15070.31.camel@pietrobattiston.it> <1531898615.3286.8.camel@pietrobattiston.it>
Message-ID: 

On Wed, Jul 18, 2018 at 12:23 AM Pietro Battiston wrote:

>> The data model I'd like to work with in the future for most use-cases involving tabular data is something closer to:
>>
>> DataFrame:
>> - data: OrderedDict[str, Array]
>> - indexes: OrderedDict[str, Index]
>>
>> Conveniently, this looks very similar to the data model of Arrow or R. Optional indexes would provide fast reverse lookup for some subset of dataframe columns.
>
> How is this different (API-wise) from a list of (same length, I assume) Series?

It does sound very similar to me -- the DataFrame just provides a nice way to do collective operations.

> See above, we have lost N-dimensional arrays (with N=3 or 4) but luckily not the concept of N-features data. I can't even think of retrieving data with pandaSDMX without MultiIndex columns, let alone manipulate it in any way.
> ...
> But then, even if we decided that we are not good programmers enough to implement this cleanly, even the solution of storing stuff as single arrays but leaving the API (e.g. .columns as Index) as it is is superior to the DataFrame being a mere collection of (same length) Series.

To be entirely clear, I'm only speaking for myself -- not Wes or the entire pandas development team. I wasn't even at the sprint!

I certainly find stacking/unstacking useful, but it isn't the only way to manipulate multi-dimensional tabular data. I do think R's tidyverse shows an alternative viable path.
Without having used it extensively, it appears to be more consistent and easier to use than pandas.

For multi-dimensional data analysis, these days I generally prefer to use xarray (disclaimer: my project) instead of a pandas.MultiIndex. I find it more satisfying to have indexed N-D arrays (in an xarray.Dataset) rather than indexed 2D dataframes. The way that pandas.DataFrame uses an Index for both row and column labels makes it in some ways similar to the fixed 2D numpy.matrix, which personally I find less useful. It also makes all pandas operations more complex to implement than those on index-free "simple" dataframes.

These are certainly not mutually exclusive options -- there is room for packages that provide all of these data models (simple dataframes like R, indexed dataframes like pandas, N-D labeled arrays like xarray, and even N-D arrays without labels like NumPy). I do hope that one day all of them can share the same foundation -- that would have major benefits for the ecosystem.

From wesmckinn at gmail.com Wed Jul 18 13:30:01 2018
From: wesmckinn at gmail.com (Wes McKinney)
Date: Wed, 18 Jul 2018 13:30:01 -0400
Subject: [Pandas-dev] Pandas Sprint Recap
In-Reply-To: 
References: <1531501924.11738.110.camel@pietrobattiston.it> <1531780555.15070.13.camel@pietrobattiston.it> <1531787017.15070.24.camel@pietrobattiston.it> <1531818058.15070.31.camel@pietrobattiston.it> <1531898615.3286.8.camel@pietrobattiston.it>
Message-ID: 

> These are certainly not mutually exclusive options -- there is room for packages that provide all of these data models (simple dataframes like R, indexed dataframes like pandas, N-D labeled arrays like xarray, and even N-D arrays without labels like NumPy). I do hope that one day all of them can share the same foundation -- that would have major benefits for the ecosystem.

On this subject -- one of the objectives of the next years is to enable dplyr / tidyverse expressions to run atop Arrow-based data frames (in addition to R's native data frames). While this will require some type coercion in some cases (for strings, non-numeric data) the net benefits in terms of SIMD / parallelization / out-of-core computing should be well worth it.

The tidyverse developers have created much cleaner boundaries between the expression / API semantics and the implementation details than we have, and this has all happened in the last 5 years.

On Wed, Jul 18, 2018 at 1:16 PM, Stephan Hoyer wrote:
> On Wed, Jul 18, 2018 at 12:23 AM Pietro Battiston wrote:
>>> The data model I'd like to work with in the future for most use-cases involving tabular data is something closer to:
>>>
>>> DataFrame:
>>> - data: OrderedDict[str, Array]
>>> - indexes: OrderedDict[str, Index]
>>>
>>> Conveniently, this looks very similar to the data model of Arrow or R. Optional indexes would provide fast reverse lookup for some subset of dataframe columns.
>>
>> How is this different (API-wise) from a list of (same length, I assume) Series?
>
> It does sound very similar to me -- the DataFrame just provides a nice way to do collective operations.
>
>> See above, we have lost N-dimensional arrays (with N=3 or 4) but luckily not the concept of N-features data. I can't even think of retrieving data with pandaSDMX without MultiIndex columns, let alone manipulate it in any way.
>> ...
>> But then, even if we decided that we are not good programmers enough to implement this cleanly, even the solution of storing stuff as single arrays but leaving the API (e.g. .columns as Index) as it is is superior to the DataFrame being a mere collection of (same length) Series.
>
> To be entirely clear, I'm only speaking for myself -- not Wes or the entire pandas development team. I wasn't even at the sprint!
>
> I certainly find stacking/unstacking useful, but it isn't the only way to manipulate multi-dimensional tabular data. I do think R's tidyverse shows an alternative viable path. Without having used it extensively, it appears to be more consistent and easier to use than pandas.
>
> For multi-dimensional data analysis, these days I generally prefer to use xarray (disclaimer: my project) instead of a pandas.MultiIndex. I find it more satisfying to have indexed N-D arrays (in an xarray.Dataset) rather than indexed 2D dataframes. The way that pandas.DataFrame uses an Index for both row and column labels makes it in some ways similar to the fixed 2D numpy.matrix, which personally I find less useful. It also makes all pandas operations more complex to implement than those on index-free "simple" dataframes.
>
> These are certainly not mutually exclusive options -- there is room for packages that provide all of these data models (simple dataframes like R, indexed dataframes like pandas, N-D labeled arrays like xarray, and even N-D arrays without labels like NumPy). I do hope that one day all of them can share the same foundation -- that would have major benefits for the ecosystem.

From me at pietrobattiston.it Wed Jul 18 13:48:46 2018
From: me at pietrobattiston.it (Pietro Battiston)
Date: Wed, 18 Jul 2018 19:48:46 +0200
Subject: [Pandas-dev] Pandas Sprint Recap
In-Reply-To: 
References: <1531501924.11738.110.camel@pietrobattiston.it> <1531780555.15070.13.camel@pietrobattiston.it> <1531787017.15070.24.camel@pietrobattiston.it> <1531818058.15070.31.camel@pietrobattiston.it> <1531898615.3286.8.camel@pietrobattiston.it>
Message-ID: <1531936126.3286.17.camel@pietrobattiston.it>

Il giorno mer, 18/07/2018 alle 13.30 -0400, Wes McKinney ha scritto:
>> These are certainly not mutually exclusive options -- there is room for packages that provide all of these data models (simple dataframes like R, indexed dataframes like pandas, N-D labeled arrays like xarray, and even N-D arrays without labels like NumPy). I do hope that one day all of them can share the same foundation -- that would have major benefits for the ecosystem.
>
> On this subject -- one of the objectives of the next years is to enable dplyr / tidyverse expressions to run atop Arrow-based data frames (in addition to R's native data frames).

You mean Arrow-based R data frames, right? Or are you thinking about a sort of cross-language dplyr?

(My understanding is that in Python we can't go much closer to dplyr than with .pipe(), see the sketch below, but consistent terminology helps users, and certainly dplyr developers have done great work on this)
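For instance, something like the following (clean_names and add_ratio are made-up user functions, just to show the chaining):

import pandas as pd

def clean_names(df):
    # dplyr-ish "verb": normalize column names
    return df.rename(columns=str.lower)

def add_ratio(df):
    # dplyr-ish "mutate": derive a new column
    return df.assign(ratio=df.a / df.b)

df = pd.DataFrame({"A": [1.0, 2.0], "B": [4.0, 5.0]})
result = df.pipe(clean_names).pipe(add_ratio)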
Pietro

From wesmckinn at gmail.com Wed Jul 18 13:56:31 2018
From: wesmckinn at gmail.com (Wes McKinney)
Date: Wed, 18 Jul 2018 13:56:31 -0400
Subject: [Pandas-dev] Pandas Sprint Recap
In-Reply-To: <1531936126.3286.17.camel@pietrobattiston.it>
References: <1531501924.11738.110.camel@pietrobattiston.it> <1531780555.15070.13.camel@pietrobattiston.it> <1531787017.15070.24.camel@pietrobattiston.it> <1531818058.15070.31.camel@pietrobattiston.it> <1531898615.3286.8.camel@pietrobattiston.it> <1531936126.3286.17.camel@pietrobattiston.it>
Message-ID: 

> You mean Arrow-based R data frames, right? Or are you thinking about a sort of cross-language dplyr?

Precisely a cross-language computational system (I've been talking about this publicly for well over 3 years now, e.g. here in April 2015 https://www.slideshare.net/wesm/dataframes-the-good-bad-and-ugly). Same implementation (in C/C++/LLVM), different front end. dplyr already has interfaces to SQL, for example.

On Wed, Jul 18, 2018 at 1:52 PM, Brock Mendel wrote:
> ... and in conclusion, everyone at the sprint learned a valuable lesson about the power of friendship.
>
> The discussion so far seems to point towards making a push towards loose coupling. Regardless of release names or numbers, and regardless of exactly what happens with core.internals, there will be a need for something like the current `pandas.io`, `pandas.plotting`, `khash`, `tslibs`, etc. The more of this we can make independent of core.internals, the more room we'll have to customize/experiment with options discussed above.
>
> On Wed, Jul 18, 2018 at 10:48 AM, Pietro Battiston wrote:
>> Il giorno mer, 18/07/2018 alle 13.30 -0400, Wes McKinney ha scritto:
>>>> These are certainly not mutually exclusive options -- there is room for packages that provide all of these data models (simple dataframes like R, indexed dataframes like pandas, N-D labeled arrays like xarray, and even N-D arrays without labels like NumPy). I do hope that one day all of them can share the same foundation -- that would have major benefits for the ecosystem.
>>>
>>> On this subject -- one of the objectives of the next years is to enable dplyr / tidyverse expressions to run atop Arrow-based data frames (in addition to R's native data frames).
>>
>> You mean Arrow-based R data frames, right? Or are you thinking about a sort of cross-language dplyr?
>>
>> (My understanding is that in Python we can't go much closer to dplyr than with .pipe(), but consistent terminology helps users, and certainly dplyr developers have done great work on this)
>>
>> Pietro

From me at pietrobattiston.it Wed Jul 18 14:02:44 2018
From: me at pietrobattiston.it (Pietro Battiston)
Date: Wed, 18 Jul 2018 20:02:44 +0200
Subject: [Pandas-dev] Pandas Sprint Recap
In-Reply-To: 
References: <1531501924.11738.110.camel@pietrobattiston.it> <1531780555.15070.13.camel@pietrobattiston.it> <1531787017.15070.24.camel@pietrobattiston.it> <1531818058.15070.31.camel@pietrobattiston.it> <1531898615.3286.8.camel@pietrobattiston.it>
Message-ID: <1531936964.3286.19.camel@pietrobattiston.it>

Il giorno mer, 18/07/2018 alle 10.16 -0700, Stephan Hoyer ha scritto:
> [...]
> I certainly find stacking/unstacking useful, but it isn't the only way to manipulate multi-dimensional tabular data. I do think R's tidyverse shows an alternative viable path. Without having used it extensively, it appears to be more consistent and easier to use than pandas.

Probably, but (from the limited knowledge I gathered in the last months) it just doesn't seem as powerful, by far.

> For multi-dimensional data analysis, these days I generally prefer to use xarray (disclaimer: my project) instead of a pandas.MultiIndex. I find it more satisfying to have indexed N-D arrays (in an xarray.Dataset) rather than indexed 2D dataframes.

I certainly find xarray the closest thing to a pandas replacement; the main tradeoff is the single dtype and the waste of memory if dimensions are not aligned, right?

> The way that pandas.DataFrame uses an Index for both row and column labels makes it in some ways similar to the fixed 2D numpy.matrix, which personally I find less useful.

Tastes are tastes, but it is a fact that a 2D DataFrame + MultiIndex offers possibilities that only nD numpy arrays + a lot of manual effort would rival. (That's actually how I used to work before knowing pandas: I would build my own indexes and use them to access numpy arrays)

Please correct me if I'm wrong, but I think that even a Series with MultiIndex is quite close, in terms of manipulation abilities (that is, discarding e.g. efficiency, and cleanness of the API) to an xarray.Dataset.

Pietro

From me at pietrobattiston.it Wed Jul 18 14:08:20 2018
From: me at pietrobattiston.it (Pietro Battiston)
Date: Wed, 18 Jul 2018 20:08:20 +0200
Subject: [Pandas-dev] Pandas Sprint Recap
In-Reply-To: 
References: <1531501924.11738.110.camel@pietrobattiston.it> <1531780555.15070.13.camel@pietrobattiston.it> <1531787017.15070.24.camel@pietrobattiston.it> <1531818058.15070.31.camel@pietrobattiston.it> <1531898615.3286.8.camel@pietrobattiston.it> <1531936126.3286.17.camel@pietrobattiston.it>
Message-ID: <1531937300.3286.21.camel@pietrobattiston.it>

Il giorno mer, 18/07/2018 alle 13.56 -0400, Wes McKinney ha scritto:
>> You mean Arrow-based R data frames, right? Or are you thinking about a sort of cross-language dplyr?
>
> Precisely a cross-language computational system (I've been talking about this publicly for well over 3 years now, e.g. here in April 2015 https://www.slideshare.net/wesm/dataframes-the-good-bad-and-ugly). Same implementation (in C/C++/LLVM), different front end. dplyr already has interfaces to SQL, for example.

_Through R_, right?

Pietro

From wesmckinn at gmail.com Wed Jul 18 14:45:28 2018
From: wesmckinn at gmail.com (Wes McKinney)
Date: Wed, 18 Jul 2018 14:45:28 -0400
Subject: [Pandas-dev] Pandas Sprint Recap
In-Reply-To: <1531937300.3286.21.camel@pietrobattiston.it>
References: <1531501924.11738.110.camel@pietrobattiston.it> <1531780555.15070.13.camel@pietrobattiston.it> <1531787017.15070.24.camel@pietrobattiston.it> <1531818058.15070.31.camel@pietrobattiston.it> <1531898615.3286.8.camel@pietrobattiston.it> <1531936126.3286.17.camel@pietrobattiston.it> <1531937300.3286.21.camel@pietrobattiston.it>
Message-ID: 

On Wed, Jul 18, 2018, 2:08 PM Pietro Battiston wrote:
> Il giorno mer, 18/07/2018 alle 13.56 -0400, Wes McKinney ha scritto:
>>> You mean Arrow-based R data frames, right? Or are you thinking about a sort of cross-language dplyr?
>>
>> Precisely a cross-language computational system (I've been talking about this publicly for well over 3 years now, e.g. here in April 2015 https://www.slideshare.net/wesm/dataframes-the-good-bad-and-ugly). Same implementation (in C/C++/LLVM), different front end. dplyr already has interfaces to SQL, for example.
>
> _Through R_, right?

Replying-all this time

Well, dplyr is an R package. My point was that it was not designed around R-specific semantics per se. This is explained in the slide deck I linked.

> Pietro

From me at pietrobattiston.it Wed Jul 18 15:05:15 2018
From: me at pietrobattiston.it (Pietro Battiston)
Date: Wed, 18 Jul 2018 21:05:15 +0200
Subject: [Pandas-dev] Pandas Sprint Recap
In-Reply-To: 
References: <1531501924.11738.110.camel@pietrobattiston.it> <1531780555.15070.13.camel@pietrobattiston.it> <1531787017.15070.24.camel@pietrobattiston.it> <1531818058.15070.31.camel@pietrobattiston.it> <1531898615.3286.8.camel@pietrobattiston.it> <1531936126.3286.17.camel@pietrobattiston.it> <1531937300.3286.21.camel@pietrobattiston.it>
Message-ID: <1531940715.3286.24.camel@pietrobattiston.it>

Il giorno mer, 18/07/2018 alle 14.45 -0400, Wes McKinney ha scritto:
> On Wed, Jul 18, 2018, 2:08 PM Pietro Battiston wrote:
>> Il giorno mer, 18/07/2018 alle 13.56 -0400, Wes McKinney ha scritto:
>>>> You mean Arrow-based R data frames, right? Or are you thinking about a sort of cross-language dplyr?
>>>
>>> Precisely a cross-language computational system (I've been talking about this publicly for well over 3 years now, e.g. here in April 2015 https://www.slideshare.net/wesm/dataframes-the-good-bad-and-ugly). Same implementation (in C/C++/LLVM), different front end. dplyr already has interfaces to SQL, for example.
>>
>> _Through R_, right?
>
> Replying-all this time
>
> Well, dplyr is an R package. My point was that it was not designed around R-specific semantics per se. This is explained in the slide deck I linked.

I think clearly distinguishing API problems/solutions from implementation problems/solutions can only help this discussion (and I don't just mean this thread).

Your slides describe a nice plan for what concerns solutions. But my limited understanding is that the dplyr _syntax_ is more innovative than the dplyr _semantics_, from which pandas doesn't have that much to learn. Then, sure, a shared codebase is cool.

Pietro

From jbrockmendel at gmail.com Wed Jul 18 13:52:18 2018
From: jbrockmendel at gmail.com (Brock Mendel)
Date: Wed, 18 Jul 2018 10:52:18 -0700
Subject: [Pandas-dev] Pandas Sprint Recap
In-Reply-To: <1531936126.3286.17.camel@pietrobattiston.it>
References: <1531501924.11738.110.camel@pietrobattiston.it> <1531780555.15070.13.camel@pietrobattiston.it> <1531787017.15070.24.camel@pietrobattiston.it> <1531818058.15070.31.camel@pietrobattiston.it> <1531898615.3286.8.camel@pietrobattiston.it> <1531936126.3286.17.camel@pietrobattiston.it>
Message-ID: 

... and in conclusion, everyone at the sprint learned a valuable lesson about the power of friendship.

The discussion so far seems to point towards making a push towards loose coupling. Regardless of release names or numbers, and regardless of exactly what happens with core.internals, there will be a need for something like the current `pandas.io`, `pandas.plotting`, `khash`, `tslibs`, etc.
The more of this we can make independent of core.internals, the more room we'll have to customize/experiment with options discussed above.

On Wed, Jul 18, 2018 at 10:48 AM, Pietro Battiston wrote:
> Il giorno mer, 18/07/2018 alle 13.30 -0400, Wes McKinney ha scritto:
>>> These are certainly not mutually exclusive options -- there is room for packages that provide all of these data models (simple dataframes like R, indexed dataframes like pandas, N-D labeled arrays like xarray, and even N-D arrays without labels like NumPy). I do hope that one day all of them can share the same foundation -- that would have major benefits for the ecosystem.
>>
>> On this subject -- one of the objectives of the next years is to enable dplyr / tidyverse expressions to run atop Arrow-based data frames (in addition to R's native data frames).
>
> You mean Arrow-based R data frames, right? Or are you thinking about a sort of cross-language dplyr?
>
> (My understanding is that in Python we can't go much closer to dplyr than with .pipe(), but consistent terminology helps users, and certainly dplyr developers have done great work on this)
>
> Pietro

From irv at princeton.com Wed Jul 18 15:05:42 2018
From: irv at princeton.com (Irv Lustig)
Date: Wed, 18 Jul 2018 15:05:42 -0400
Subject: [Pandas-dev] Pandas Sprint Recap
Message-ID: 

> Stephan Hoyer wrote:
>
> On Tue, Jul 17, 2018 at 3:47 PM Matthew Rocklin wrote:
>> Has Pandas ever done a user survey?
>>
>> I would be curious to know the answer to the question "do you make heavy use of the Pandas index" among users, and how that correlates with different domain/industry.
>
> This is a great question. I don't think we've ever done this sort of research.
>
> My suspicion is that most of the time users ignore the index, and find the way it is used heavily in pandas more annoying than helpful. But certainly there are some use-cases for which automatic alignment with an index is fantastic.

For our team, we heavily use the MultiIndex capability for rows (but not columns). Our main use of pandas is to read in data from disparate data sources, and do data wrangling to reshape the data. We do lots of joins/merges of different DataFrames, and placing the keys in a MultiIndex makes it easier to track the join operations.

From our perspective, the MultiIndex on rows is akin to the primary keys of a data table. As we explore data, being able to slice the data along various dimensions is quite valuable (see the small example below). It is also quite natural that a groupby() operation returns a Series or DataFrame with a MultiIndex.

What I find a bit frustrating is the lack of symmetry in the API between dealing with the names of a MultiIndex and the names of a column. It's why I created this pull request (https://github.com/pandas-dev/pandas/pull/20046 [ENH: Allow rename_axis to specify index and columns arguments]) and opened this issue https://github.com/pandas-dev/pandas/issues/20421 [API: Allow MultiIndex.rename() to accept a dict as an argument]
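For example, the kind of slicing I mean, on a toy frame with made-up keys:

import pandas as pd

df = pd.DataFrame({
    "region": ["east", "east", "west"],
    "product": ["widget", "gadget", "widget"],
    "sales": [10, 20, 30],
}).set_index(["region", "product"]).sort_index()

# select one product across all regions, via the row MultiIndex:
df.loc[pd.IndexSlice[:, "widget"], :]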
From wesmckinn at gmail.com  Wed Jul 18 22:26:23 2018
From: wesmckinn at gmail.com (Wes McKinney)
Date: Wed, 18 Jul 2018 22:26:23 -0400
Subject: [Pandas-dev] Pandas Sprint Recap
In-Reply-To:
References:
Message-ID:

Just as an aside on the pandas internals discussion, want to draw
attention to two projects in development right now with the objective
of providing faster/more scalable pandas-type operations:

https://github.com/h2oai/datatable
https://github.com/maartenbreddels/vaex

Both utilize memory-mapped data representations of their own devising
(vs. using an open standard of some kind).

- Wes

On Wed, Jul 18, 2018 at 3:05 PM, Irv Lustig wrote:
> [...]
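For readers unfamiliar with the technique: a memory-mapped
representation keeps the data in a file and lets the operating system
page it in on demand, instead of loading everything into RAM. A toy
sketch with NumPy -- the file name is invented, and the two projects
above use their own formats rather than np.memmap:

import numpy as np

# Create a file-backed array; only touched pages are ever resident.
mm = np.memmap("demo.dat", dtype="float64", mode="w+", shape=(1000,))
mm[:] = np.arange(1000)
mm.flush()

# Reopening maps the same bytes read-only, with no upfront copy.
ro = np.memmap("demo.dat", dtype="float64", mode="r", shape=(1000,))
print(ro[:5])  # [0. 1. 2. 3. 4.]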
From jorisvandenbossche at gmail.com  Thu Jul 19 02:40:05 2018
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Thu, 19 Jul 2018 01:40:05 -0500
Subject: [Pandas-dev] GroupBy Overhaul Proposal
In-Reply-To: <1531897282.3286.6.camel@pietrobattiston.it>
References: <1531782675.15070.18.camel@pietrobattiston.it>
 <09F05CAB-B491-416B-AC63-486C9AD132CD@icloud.com>
 <1531814373.15070.27.camel@pietrobattiston.it>
 <1531897282.3286.6.camel@pietrobattiston.it>
Message-ID:

Will, thanks for starting this!
(After the sprint I was also thinking about the need to refactor the
groupby code :-))

Lots of discussion has happened, and it will need some time to digest,
but I already quickly want to react on the 'apply' discussion. IMO,
apply should basically be syntactic sugar for the following:

keys = []
results = []
for name, group in df.groupby(key):
    res = func(group)
    results.append(res)
    keys.append(name)
pd.concat(results, keys=keys)

(much simplified of course, as when the result for each group is a
Series and not a DataFrame, the default concat is not what we want)

And I personally think it is useful having something like the above as
a general apply method for UDFs in groupby. It is certainly true that
the current apply implementation has inconsistencies and magical
behaviours, but I think we can deprecate those instead of deprecating
the full method. See https://github.com/pandas-dev/pandas/issues/13056
for some comments about this (e.g. on deprecating the magical
'transform' behaviour).

Apart from that, it still is a fact that a user who doesn't know all
the details will quickly turn to apply (rather than to agg), just
because of its name, and then suffer e.g. bad performance. I am not
directly sure how to solve this. We could maybe warn in certain
obvious cases (like apply(np.sum))? Although warnings can also become
annoying.

Joris
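A runnable version of the sugar sketched above, with made-up data: for
a UDF that reduces each group to a scalar, agg and apply land in the
same place, but agg states the intent (and takes the explicit path)
while apply goes through the generic split-and-concat loop.

import pandas as pd

df = pd.DataFrame({"a": ["x", "x", "y"], "b": [1, 2, 3]})

def total(group):
    # a scalar-reducing UDF: receives each group as a Series
    return group.sum()

# agg: the explicit spelling for scalar-reducing UDFs
print(df.groupby("a")["b"].agg(total))    # x -> 3, y -> 3

# apply: reaches the same result via the generic concat loop
print(df.groupby("a")["b"].apply(total))  # x -> 3, y -> 3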
2018-07-18 2:01 GMT-05:00 Pietro Battiston :

> On Tue 17/07/2018 at 16.10 -0700, William Ayd wrote:
> > > In fact, my preference for keeping apply is pretty weak as long as
> > > there are alternatives that cover each of its use cases. But again,
> > > I'm not sure this is true.
> >
> > Just to clarify my position:
> >
> > 1. .apply() + UDF reducing to a scalar should be replaceable
> > with .agg() + same UDF (even though there are differences today?)
> > 2. .apply() + UDF returning Series / DataFrame / collection
> > doesn't have anything else to cover it
>
> .transform() at least covers the case in which the shape of the chunk
> is unchanged.
>
> > But with #2 above I think its dangerous to assume that .apply can
> > always do the "right thing" with those types of inputs. We don't make
> > any assertions about the indexing / labeling of returned Series and
> > DataFrames.
>
> There is a simple way to stop throwing magic at users, and it is to
> clearly document which cases .apply() covers (and which should be
> covered by .agg() or .transform()), reflecting the actual guesswork
> taking place in the code.
> By the way, my understanding (without having looked at the code) is
> that
>
> UDF returns Series -> concat in a new Series
> UDF returns DataFrame -> concat in a new DataFrame
>
> and the guesswork mostly concerns understanding whether the new index
> is the same as the old. Am I missing anything relevant?
>
> Now, I would be all for suppressing a complicated function by
> replacing it with simpler ways to do the same thing. But for instance
> I would like the following to still work with groupby().something():
>
> def remove_group_outliers(group):
>     outliers = ...  # code to identify the outlier labels
>     return group[~group.index.isin(outliers)]
>
> ... and I currently don't see any way but .apply().
>
> > As far as collections are concerned I'm not sure if there will be a
> > clear answer on how to handle those assuming we start getting EAs
> > that add first-class support for those.
>
> Do you have any pointer/example? I'm missing the relation between
> collections and .apply().
>
> > > Unless I'm wrong, #18366 is orthogonal to what we are discussing:
> > > unnamed lambdas would remain unnamed lambdas.
> > > (And the obvious solution to my eyes is to use named methods
> > > instead)
> >
> > I don't think this is orthogonal. Your concern is valid on lambdas
> > and I don't know what the solution there is (perhaps some kind of
> > keyword argument) but without getting tripped up on that I don't
> > think its immediately apparent that the returned object for a
> > DataFrame with columns 'a', 'b', 'c' will have a single column level
> > when called as follows:
> >
> > - df.groupby('a').agg(sum)
> > - df.groupby('a').agg({'b': sum, 'c': min})
> >
> > Yet the following will yield a MultiIndex column:
> >
> > - df.groupby('a').agg([sum])
> > - df.groupby('a').agg({'b': [sum], 'c': min})
>
> The rule is not very complicated either (if correctly documented), but
> anyway, the inconsistency would disappear by just having the first two
> examples also return a MultiIndex.
>
> ... and maybe provide the users a very simple way to flatten
> MultiIndexes (see below).
>
> > If you reduce the returned columns to "'sum' of 'b'" and "'min' of
> > 'c'" you can ensure that the returned columns have the same number
> > of levels regardless of call signature,
> > AND have the added bonus of not obfuscating what type of aggregation
> > was performed with the former two examples.
>
> Both can be solved through a MI, or through an Index(dtype=object)
> containing tuples.
>
> > Of course the end user may ultimately decide that they don't like
> > those labels at all and completely override them, but that effort
> > becomes much easier if they can make guarantees around the number of
> > levels of the returned object
>
> I agree on this
>
> > (especially if it's just one!).
>
> ... not on that.
>
> MI (or tuples) -> arbitrary strings
>
> is much simpler/cleaner to do than
>
> arbitrary strings -> MI (or tuples)
>
> > > - if, after creating all my columns, I want to e.g. select all
> > > columns that contain sums, I need to do some sort of "df[[col if
> > > col.startswith("Sum of")]]". Compare to "df.loc[:, ('Sum',)]"
> >
> > Unless I am mistaken you would have to do something like
> > "df.groupby('a').agg([sum]).loc[:, slice(None, 'sum')]" to get that
> > to work.
>
> Yeah, I had swapped the levels, it is
>
> df.groupby('a').agg([sum]).loc[:, (slice(None), 'sum')]
>
> > I don't think that syntax really is that clean
>
> In my code I always start by defining
>
> WE = slice(None)  # WhatEver
>
> and we could advertise this as a way to make the syntax shorter, but
> regardless of that, it definitely is cleaner than any string
> manipulation.
>
> > and it starts taking us down the path of advanced indexing for what
> > may start off to the end user as a very simple aggregation exercise.
>
> On this I agree with you. I'm all for providing
>
> - a MultiIndex.flatten() method which allows me to do
>   res.columns = res.columns.flatten("{} of {}".format)
>
> - a simple way to do the above in-line (which is already being
>   discussed, regardless of groupby)
>
> [...]
>
> > > - it would be the only case in pandas in which we decide how to
> > > call a column on behalf of the user
> >
> > Well we have to do something to reduce ambiguity; I think a
> > consistent naming convention and dimension for the columns across
> > all invocations is strongly preferable to inserting a column level
> > some of the time.
>
> Again, I agree on this.
> > > - if one wants to allow the user to name the columns according to
> > > her taste, it's pretty simple to introduce an argument which takes
> > > a string to be .format()ted with the name of the column (or even
> > > of the method), e.g. name="Sum of {}"
> >
> > Agreed. In my head I feel like this defaults to something like
> > f"{fname} of {colname}" but gives the user potentially the option to
> > override. By default keep the same number of levels as what is being
> > passed in, though maybe None as an argument reverts to the old style
> > behavior of simply inserting a new column index level.
>
> Agree on everything but the default, again, because it is arbitrary
>
> > > By the way, despite some related issues, I still think tuples can
> > > be first class citizens of flat indexes. So if one doesn't like
> > > MultiIndexes, or they do not fit one's needs, ("sum", "A") can
> > > well be a label in a regular index.
> >
> > You know better than I do here, but again I don't think it makes for
> > a good user experience to convert columns with one level into
> > multiple levels after a GroupBy operation regardless of how you
> > could subsequently access those values.
>
> Notice that I'm not talking about a MultiIndex, but about a flat
> index. But it is an inferior solution, given the API we already
> expose, to the MultiIndex.
>
> Pietro
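Both halves of this exchange are expressible today, for what it's
worth (the data below is hypothetical): pd.IndexSlice spells the
slice(None) selection more readably, and a list comprehension stands
in for the proposed MultiIndex.flatten().

import pandas as pd

df = pd.DataFrame({"a": ["x", "x", "y"], "b": [1, 2, 3],
                   "c": [4.0, 5.0, 6.0]})
res = df.groupby("a").agg(["sum", "min"])  # MultiIndex columns: (col, agg)

# Select all the 'sum' columns without writing slice(None) by hand:
sums = res.loc[:, pd.IndexSlice[:, "sum"]]

# Today's workaround for the flatten() idea: "<agg> of <col>" strings
res.columns = ["{} of {}".format(agg, col) for col, agg in res.columns]
print(res.columns.tolist())  # ['sum of b', 'min of b', 'sum of c', 'min of c']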
From me at pietrobattiston.it  Thu Jul 19 11:17:41 2018
From: me at pietrobattiston.it (Pietro Battiston)
Date: Thu, 19 Jul 2018 17:17:41 +0200
Subject: [Pandas-dev] Colon available everywhere
In-Reply-To: <1531897282.3286.6.camel@pietrobattiston.it>
References: <1531782675.15070.18.camel@pietrobattiston.it>
 <09F05CAB-B491-416B-AC63-486C9AD132CD@icloud.com>
 <1531814373.15070.27.camel@pietrobattiston.it>
 <1531897282.3286.6.camel@pietrobattiston.it>
Message-ID: <1532013461.3286.45.camel@pietrobattiston.it>

On Wed 18/07/2018 at 09.01 +0200, Pietro Battiston wrote:
> On Tue 17/07/2018 at 16.10 -0700, William Ayd wrote:
> > > - if, after creating all my columns, I want to e.g. select all
> > > columns that contain sums, I need to do some sort of "df[[col if
> > > col.startswith("Sum of")]]". Compare to "df.loc[:, ('Sum',)]"
> >
> > Unless I am mistaken you would have to do something like
> > "df.groupby('a').agg([sum]).loc[:, slice(None, 'sum')]" to get that
> > to work.
>
> Yeah, I had swapped the levels, it is
>
> df.groupby('a').agg([sum]).loc[:, (slice(None), 'sum')]
>
> > I don't think that syntax really is that clean
>
> In my code I always start by defining
>
> WE = slice(None)  # WhatEver
>
> and we could advertise this as a way to make the syntax shorter, but
> regardless of that, it definitely is cleaner than any string
> manipulation.

Related to this, I'm curious about some opinion from pandas devs on an
idea which I think would simplify our users' life (and by that, I don't
only mean current users of the current pandas API) at (almost) no cost.

The colon in Python is used for:

1) logical blocks:
   if True:

2) separating args and body of a lambda:
   lambda x: x**2

3) assignment expressions (since 3.8):
   if (a := True):

4) separating key and value in a dict:
   {1: 'a'}

5) defining slices:
   a_series.loc['2018-06-01':'2018-07-03']

The last example is entirely indistinguishable from
a_series.loc[slice('2018-06-01', '2018-07-03')]
... but unfortunately, it only works inside __getitem__ calls.

My idea is: there is no obvious reason why it should be so, that is,
why

'2018-06-01':'2018-07-03'

couldn't just be parsed as slice('2018-06-01', '2018-07-03').

The alternative uses 1)-4) of the colon imply that some precaution must
be taken, but:

1) should not create ambiguity, as the ":" is always matched with a
control flow statement

2) should not create ambiguity, as the ":" is always matched with the
"lambda" keyword

3) should not create ambiguity, as the ":" is always present close to
"=", while the "slice interpretation" of ":" would never appear (unless
nested) in the left part of an assignment

4) is the only potentially problematic case, as
{2 : 3}
could be interpreted as
{slice(2, 3)}
but is currently interpreted as
dict([(2, 3)])

However, the solution could be to just prioritize the current
interpretation, and use
{(2 : 3)}
to force the second.

If this proposal was implemented,

df.loc[:, (slice(None), 'sum')]

would finally just become

df.loc[:, (:, 'sum')]

at the cost of a minimal ambiguity (in the case shown above), which is
easy to solve (and no more grave, I guess, than the fact that {} is an
empty dict and not an empty set).

For Python beginners, it would probably even simplify the understanding
of slices (today, it is not trivial, I think, to understand that obj[:]
is exactly equivalent to obj[slice(None)] - but that ":" does not per
se mean anything).
Moreover, it would mimic "...", which is instead available also outside
of __getitem__ calls.

Would it be crazy to propose a PEP with this?

A milder form would be to allow ":" to be used only inside __getitem__
calls, but also nested: I think however this would be more confusing
and probably more difficult to implement.

Thoughts?

Pietro

From me at pietrobattiston.it  Thu Jul 19 11:41:40 2018
From: me at pietrobattiston.it (Pietro Battiston)
Date: Thu, 19 Jul 2018 17:41:40 +0200
Subject: [Pandas-dev] Pandas Sprint Recap
In-Reply-To:
References: <1531501924.11738.110.camel@pietrobattiston.it>
 <1531780555.15070.13.camel@pietrobattiston.it>
 <1531787017.15070.24.camel@pietrobattiston.it>
Message-ID: <1532014900.3286.50.camel@pietrobattiston.it>

By the way...

On Mon 16/07/2018 at 18.14 -0700, Stephan Hoyer wrote:
> [...]
> 2. The indexed pandas.Series and pandas.DataFrame isn't the right
> abstraction for many tasks. A simpler, index free DataFrame would be
> a better data model for many tasks.

Is there an issue open already for this? Otherwise, I think we could
create it, at least to have a reference target for decoupling index
from storage.

Pietro
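NumPy already ships a small version of the "slice literal" idea this
thread circles around: np.s_ (an IndexExpression) is an object whose
__getitem__ simply hands the key back, so subscript syntax can be
captured as plain objects.

import numpy as np

print(np.s_[::-1])        # slice(None, None, -1)
print(np.s_[:, 0, ::-1])  # (slice(None, None, None), 0, slice(None, None, -1))

arr = np.arange(6).reshape(2, 3)
print(arr[np.s_[:, ::-1]])  # identical to arr[:, ::-1]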
From shoyer at gmail.com  Thu Jul 19 12:33:41 2018
From: shoyer at gmail.com (Stephan Hoyer)
Date: Thu, 19 Jul 2018 09:33:41 -0700
Subject: [Pandas-dev] Colon available everywhere
In-Reply-To: <1532013461.3286.45.camel@pietrobattiston.it>
References: <1531782675.15070.18.camel@pietrobattiston.it>
 <09F05CAB-B491-416B-AC63-486C9AD132CD@icloud.com>
 <1531814373.15070.27.camel@pietrobattiston.it>
 <1531897282.3286.6.camel@pietrobattiston.it>
 <1532013461.3286.45.camel@pietrobattiston.it>
Message-ID:

I'm pretty sure this has been proposed before on Python-ideas.
Definitely search through the archives first.

Another option I liked that involved no changes to Python syntax would
be to make indexing the built-in slice class return a slice object,
e.g., slice[:5] -> slice(None, 5, None). But if I recall correctly that
had been shot down, too.

On Thu, Jul 19, 2018 at 8:17 AM Pietro Battiston wrote:
> [...]
From cbartak at gmail.com  Thu Jul 19 14:31:46 2018
From: cbartak at gmail.com (Chris Bartak)
Date: Thu, 19 Jul 2018 13:31:46 -0500
Subject: [Pandas-dev] Colon available everywhere
In-Reply-To:
References: <1531782675.15070.18.camel@pietrobattiston.it>
 <09F05CAB-B491-416B-AC63-486C9AD132CD@icloud.com>
 <1531814373.15070.27.camel@pietrobattiston.it>
 <1531897282.3286.6.camel@pietrobattiston.it>
 <1532013461.3286.45.camel@pietrobattiston.it>
Message-ID:

I think this has also been discussed in forums too, but another syntax
possibility would be expanding what is accepted inside __getitem__.
Ignoring backwards compat for a second (can of worms how
`*args`/existing tuple key behavior would interact), one could envision
something roughly like this, which could also solve the named indexer
problem (proposed syntax, not valid Python today):

class A:
    def __getitem__(self, *args, **kwargs): print(args, kwargs)
a = A()

a[1, 2]
# (1, 2), {}

a[1, 2, b=3]
# (1, 2), {'b': 3}

a[1, 2, (:, 2), c=3]
# (1, 2, (slice(None), 2)), {'c': 3}

On Thu, Jul 19, 2018 at 11:34 AM Stephan Hoyer wrote:
> [...]
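For context on why this needs a language change: today everything
between the brackets reaches __getitem__ as a single positional key,
and keyword arguments there are a syntax error. A short demonstration:

class Probe:
    def __getitem__(self, key):
        # the whole subscript arrives as one object
        return key

p = Probe()
print(p[1, 2])       # (1, 2) -- a plain tuple
print(p[1, :2])      # (1, slice(None, 2, None))
print(p[..., ::-1])  # (Ellipsis, slice(None, None, -1))
# p[1, b=3] is a SyntaxError today -- hence the PEP 472 discussion.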
From shoyer at gmail.com  Thu Jul 19 14:46:03 2018
From: shoyer at gmail.com (Stephan Hoyer)
Date: Thu, 19 Jul 2018 11:46:03 -0700
Subject: [Pandas-dev] Colon available everywhere
In-Reply-To:
References: <1531782675.15070.18.camel@pietrobattiston.it>
 <09F05CAB-B491-416B-AC63-486C9AD132CD@icloud.com>
 <1531814373.15070.27.camel@pietrobattiston.it>
 <1531897282.3286.6.camel@pietrobattiston.it>
 <1532013461.3286.45.camel@pietrobattiston.it>
Message-ID:

Yes, I'd love to see *args and **kwargs for __getitem__, but that's a
much bigger change. See also https://www.python.org/dev/peps/pep-0472/

On Thu, Jul 19, 2018 at 11:31 AM Chris Bartak wrote:
> [...]
From me at pietrobattiston.it  Thu Jul 19 16:27:08 2018
From: me at pietrobattiston.it (Pietro Battiston)
Date: Thu, 19 Jul 2018 22:27:08 +0200
Subject: [Pandas-dev] Colon available everywhere
In-Reply-To:
References: <1531782675.15070.18.camel@pietrobattiston.it>
 <09F05CAB-B491-416B-AC63-486C9AD132CD@icloud.com>
 <1531814373.15070.27.camel@pietrobattiston.it>
 <1531897282.3286.6.camel@pietrobattiston.it>
 <1532013461.3286.45.camel@pietrobattiston.it>
Message-ID: <1532032028.3286.52.camel@pietrobattiston.it>

On Thu 19/07/2018 at 09.33 -0700, Stephan Hoyer wrote:
> I'm pretty sure this has been proposed before on Python-ideas.
> Definitely search through the archives first.

The closest I found is
https://mail.python.org/pipermail/python-ideas/2015-June/034086.html
https://bugs.python.org/issue24379

proposing the less invasive (not requiring changes in the language),
but also less useful

slice.literal

as in

reverse = slice.literal[::-1]

By the way, if only slice was subclassable, we could do
class PowerSlice(slice):  # not actually allowed: slice cannot be subclassed
    def __getitem__(self, key):
        return key

W = PowerSlice()

so that both

df.loc[W, 'col']

and

df.loc[W[:], 'col']

would work. Unfortunately that is not the case (and anyway, this
solution would still be suboptimal).

Pietro

From shoyer at gmail.com  Thu Jul 19 16:35:20 2018
From: shoyer at gmail.com (Stephan Hoyer)
Date: Thu, 19 Jul 2018 13:35:20 -0700
Subject: [Pandas-dev] Colon available everywhere
In-Reply-To: <1532032028.3286.52.camel@pietrobattiston.it>
References: <1531782675.15070.18.camel@pietrobattiston.it>
 <09F05CAB-B491-416B-AC63-486C9AD132CD@icloud.com>
 <1531814373.15070.27.camel@pietrobattiston.it>
 <1531897282.3286.6.camel@pietrobattiston.it>
 <1532013461.3286.45.camel@pietrobattiston.it>
 <1532032028.3286.52.camel@pietrobattiston.it>
Message-ID:

I'd be pretty happy getting even operator.subscript into Python 3.8.
subscript[:, 0, ::-1] is *way* more readable than (slice(None), 0,
slice(None, None, -1)).

My sense is that the commentators on the Python bug don't work with
multi-dimensional arrays, so they don't appreciate how ubiquitous the
need for this is. It would be really nice to have a standard utility
for this, rather than needing to rely on the separate utilities in
pandas and NumPy.

On Thu, Jul 19, 2018 at 1:27 PM Pietro Battiston wrote:
> [...]

From me at pietrobattiston.it  Thu Jul 19 16:43:59 2018
From: me at pietrobattiston.it (Pietro Battiston)
Date: Thu, 19 Jul 2018 22:43:59 +0200
Subject: [Pandas-dev] Colon available everywhere
In-Reply-To:
References: <1531782675.15070.18.camel@pietrobattiston.it>
 <09F05CAB-B491-416B-AC63-486C9AD132CD@icloud.com>
 <1531814373.15070.27.camel@pietrobattiston.it>
 <1531897282.3286.6.camel@pietrobattiston.it>
 <1532013461.3286.45.camel@pietrobattiston.it>
 <1532032028.3286.52.camel@pietrobattiston.it>
Message-ID: <1532033039.3286.55.camel@pietrobattiston.it>

On Thu 19/07/2018 at 13.35 -0700, Stephan Hoyer wrote:
> I'd be pretty happy getting even operator.subscript into Python 3.8.
> subscript[:, 0, ::-1] is way more readable than (slice(None), 0,
> slice(None, None, -1)).

I totally agree... but isn't

(:, 0, ::-1)

even better?
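The operator.subscript (or slice.literal) behavior being discussed can
already be emulated in a few lines of current Python; this is
essentially what pd.IndexSlice and np.s_ do under the hood:

class _Subscript:
    def __getitem__(self, key):
        # hand the raw subscript back unchanged
        return key

subscript = _Subscript()
assert subscript[5:] == slice(5, None)
assert subscript[:, 0, ::-1] == (slice(None), 0, slice(None, None, -1))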
Pietro

From shoyer at gmail.com  Thu Jul 19 17:02:23 2018
From: shoyer at gmail.com (Stephan Hoyer)
Date: Thu, 19 Jul 2018 14:02:23 -0700
Subject: [Pandas-dev] Colon available everywhere
In-Reply-To: <1532033039.3286.55.camel@pietrobattiston.it>
References: <1531782675.15070.18.camel@pietrobattiston.it>
 <09F05CAB-B491-416B-AC63-486C9AD132CD@icloud.com>
 <1531814373.15070.27.camel@pietrobattiston.it>
 <1531897282.3286.6.camel@pietrobattiston.it>
 <1532013461.3286.45.camel@pietrobattiston.it>
 <1532032028.3286.52.camel@pietrobattiston.it>
 <1532033039.3286.55.camel@pietrobattiston.it>
Message-ID:

Sure -- but good luck persuading mainstream Python devs on that one!

On Thu, Jul 19, 2018 at 1:44 PM Pietro Battiston wrote:
> I totally agree... but isn't
>
> (:, 0, ::-1)
>
> even better?
>
> Pietro

From me at pietrobattiston.it  Thu Jul 19 17:33:58 2018
From: me at pietrobattiston.it (Pietro Battiston)
Date: Thu, 19 Jul 2018 23:33:58 +0200
Subject: [Pandas-dev] Colon available everywhere
In-Reply-To:
References: <1531782675.15070.18.camel@pietrobattiston.it>
 <09F05CAB-B491-416B-AC63-486C9AD132CD@icloud.com>
 <1531814373.15070.27.camel@pietrobattiston.it>
 <1531897282.3286.6.camel@pietrobattiston.it>
 <1532013461.3286.45.camel@pietrobattiston.it>
 <1532032028.3286.52.camel@pietrobattiston.it>
 <1532033039.3286.55.camel@pietrobattiston.it>
Message-ID: <1532036038.3286.59.camel@pietrobattiston.it>

On Thu 19/07/2018 at 14.02 -0700, Stephan Hoyer wrote:
> Sure -- but good luck persuading mainstream Python devs on that one!

I would never do this alone :-)

Unless I'm wrong, numpy devs were able to obtain the ellipsis, and the
operator "@".

The proposal on ":" can maybe be supported with more general arguments
(Python already has ":", and making it more widely available seems
natural), but it clearly only makes sense if a community is behind the
request.

Pietro