From wesmckinn at gmail.com Tue Jul 26 16:51:15 2016
From: wesmckinn at gmail.com (Wes McKinney)
Date: Tue, 26 Jul 2016 13:51:15 -0700
Subject: [Pandas-dev] Proposal: pandas 1.0 and ideas for pandas 2.0 / future
Message-ID:

hi folks,

As a continuation of ongoing discussions on GitHub and on the mailing
list around deprecations and future innovation and internal reworkings
of pandas, I had a couple of ideas to share that I am looking for
feedback on.

As for pandas 0.19.x today, I would like to propose that we
consider releasing the project as pandas 1.0 in the next major release
or the one after. The Python community does have a penchant for
"eternal betas", but after all the hard work of the core developers
and community over the last 5 years, I think we can safely consider
making a stable 1.X production release.

If we do decide to release pandas 1.0, I also propose that we strongly
consider making 1.X an LTS / Long Term Support branch where we can
continue to make releases, but bug fixes and documentation
improvements only. Or, we can add new features, but on an extremely
conservative basis. This might require some changes to development
process, so looking for feedback on this.

If we commit to this path, I would suggest that we start a pandas-2.0
integration branch where we can begin more seriously planning and
executing on

- Cleanup and removal of years' worth of accumulated cruft / legacy code
- Removal of deprecated features
- Series and DataFrame internals revamp.

I had hoped that 2016 would offer me more time to work on the
internals revamp, but between my day job and the 2nd ed of "Python for
Data Analysis" that turned out to be a little too ambitious. I have
been almost continuously thinking about how to go about this though,
and it might be good to figure out a process where we can start
documenting and coming up with a more granular development roadmap for
this.
Part of this will be carefully documenting any APIs we change or
unit tests we break along the way.

We would want to give ample time for heavy pandas users to run their
3rd-party code based on pandas 2.0-dev to give feedback on whether our
assumptions about the impact of changes hold for real production code.
As a concrete example: integer and boolean Series would be able to
accommodate missing data without implicitly casting to float or object
NumPy dtype respectively. Since many users will have inserted
workarounds / data massaging code because of such rough edges, this
may cause code breakage or simply redundancy in some cases. As another
example: we should probably remove the .ix indexing attribute
altogether. I'm sure many users are still using .ix, but it would be
worthwhile to go through such code and decide whether it's really .loc
or .iloc.

My hope would be (being a deadline-motivated person) that we could see
a pandas 2.0 alpha release sometime mid- or 2nd half 2017, with a
target beta / pre-production QA release in early 2018 or thereabouts.
Part of this would be creating a 1.0 to 2.0 migration guide for users.

My biggest concern with pandas in recent years is how not to be held
back by strict backwards compatibility and still be able to innovate
and stay relevant into the 2020s.

For pandas 2.0 some of the most important issues I've been thinking about
are:

- Logical type abstraction layer / decoupling. pandas-only data types
(Categorical, DatetimeTZ, Period, etc.) will become equal citizens as
compared with data types mapping 1-1 on NumPy numeric dtypes

- Decoupling physical storage to permit non-NumPy data structures inside
Series

- Removal of BlockManager and 2D block consolidation in DataFrame, in
favor of a native C++ internal table (vector-of-arrays) data structure

- Consistent NA semantics across all data types

- Significantly improved handling of string/UTF8 data (performance,
memory use -- elimination of PyObject boxes).
From the above 2 items,
we could even make all string arrays internally categorical (with the
option to explicitly cast to categorical) -- in the database world
this is often called dictionary encoding.

- Refactor of most Cython algorithms into C++11/14 templates

- Copy-on-write for Series and DataFrame

- Removal of Panel, ndim > 3 data structures

- Analytical expression VM (for example -- things like
df[boolean_arr].groupby(...).agg(...) could be evaluated by a small
Numexpr-like VM, not dissimilar to R's dplyr library, with
significantly improved memory use and maybe performance too)

There's a lot to unpack here, but let me know what everyone thinks
about these things. The "pandas 2.0" / internals revamp discussion we
can tackle in a separate thread or perhaps in a GitHub repo or
design folder in the pandas codebase.

Thanks,
Wes

From shoyer at gmail.com Tue Jul 26 17:13:11 2016
From: shoyer at gmail.com (Stephan Hoyer)
Date: Tue, 26 Jul 2016 14:13:11 -0700
Subject: [Pandas-dev] Proposal: pandas 1.0 and ideas for pandas 2.0 / future
In-Reply-To:
References:
Message-ID:

I know I expressed concerns about cross-compatibility with the rest of the
SciPy ecosystem before (especially xarray), but this plan sounds very solid
to me. Flexible data types in N-dimensional arrays are important for other
use cases, but also not really a problem for pandas.

A separate 2.0 release will let us make the major breaking changes to the
pandas data model necessary for it to work well in the long term. There are
a few other API warts that we will be able to clean up this way (detailed in
github.com/pydata/pandas/issues/10000), indexing on DataFrames being the
most obvious one.
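
[Editorial illustration] The dictionary encoding mentioned in the
proposal's string-handling item can be sketched in a few lines of pure
Python. This is only a rough illustration of the idea, not how pandas
stores strings or categoricals internally; dictionary_encode and
dictionary_decode are hypothetical helper names, and using -1 as the
code for a missing value is an assumption made here (it mirrors the
sentinel that pandas Categorical codes use).

```python
# Illustrative sketch only: dictionary-encode a list of strings.
# `dictionary_encode` / `dictionary_decode` are hypothetical helpers,
# not pandas APIs.

def dictionary_encode(values):
    """Return (categories, codes); a missing value (None) gets code -1."""
    categories = []   # unique strings, in order of first appearance
    positions = {}    # string -> index into `categories`
    codes = []        # one small integer per input value
    for value in values:
        if value is None:
            codes.append(-1)
            continue
        if value not in positions:
            positions[value] = len(categories)
            categories.append(value)
        codes.append(positions[value])
    return categories, codes


def dictionary_decode(categories, codes):
    """Rebuild the original list from the categories and integer codes."""
    return [None if code == -1 else categories[code] for code in codes]


cities = ["NYC", "SF", "NYC", None, "SF", "NYC"]
categories, codes = dictionary_encode(cities)
print(categories)  # ['NYC', 'SF']
print(codes)       # [0, 1, 0, -1, 1, 0]
```

Each distinct string is stored once and every row holds only a small
integer, which is where the memory saving for repetitive string
columns would come from.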
_______________________________________________
Pandas-dev mailing list
Pandas-dev at python.org
https://mail.python.org/mailman/listinfo/pandas-dev

From jeffreback at gmail.com Wed Jul 27 06:04:47 2016
From: jeffreback at gmail.com (Jeff Reback)
Date: Wed, 27 Jul 2016 06:04:47 -0400
Subject: [Pandas-dev] Proposal: pandas 1.0 and ideas for pandas 2.0 / future
In-Reply-To:
References:
Message-ID:

I applaud the vision and ambition for the roadmap of the future of pandas.

However, the resources are lacking for many of these changes. Currently
pandas is just barely keeping up with the (recently increased) user flow
of pull-requests, not to mention the issue reports. These are all great
indicators of community use and exercising the edge cases.

A roadmap is an excellent start, but the resource question needs to be
front and center.

The current process *could* evolve into LTS. In 0.19.0, lots of progress
towards removing older code (and of course deprecating things) is
happening. An aggressive push of this into 0.20.0 will go a long way
towards de facto establishing 1.0 / LTS. (and maybe that's what we simply
call 0.20.0).

I would agree we could simply release 1.0 / LTS without adding any 'new'
features (like fixed getitem indexing and such).

I would like to see 2.0 with a user facing API that is a drop-in
replacement (though allowing for some breaking changes that are NOT
back-compat, e.g. getitem indexing). I think it would be acceptable to
break the back-end API (meaning to numpy) though.

For the resource question, as I have mentioned off-list, I will format
this roadmap in order for pandas to support a fund-raising effort to garner
resources for these changes.

Jeff
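
[Editorial illustration] For readers following the indexing points in
this thread (.ix versus .loc / .iloc in the original proposal), the
difficulty is easiest to see when the index labels are themselves
integers: "row 0" can then mean the row labelled 0 or the first row.
The pure-Python sketch below only mimics the two lookup styles;
get_by_label and get_by_position are hypothetical helper names, not
pandas APIs.

```python
# Hypothetical helpers, not pandas APIs: they only mimic the two
# lookup styles that .loc (labels) and .iloc (positions) keep apart.

labels = [2, 0, 1]         # integer index labels, deliberately out of order
values = ["a", "b", "c"]   # values[i] is the value stored at position i


def get_by_label(labels, values, label):
    """Label-based lookup, in the spirit of .loc."""
    return values[labels.index(label)]


def get_by_position(values, position):
    """Position-based lookup, in the spirit of .iloc."""
    return values[position]


print(get_by_label(labels, values, 0))  # b  (the row whose label is 0)
print(get_by_position(values, 0))       # a  (the first row)
```

An accessor that accepts both meanings has to guess which one the
caller wanted, which is why code using it needs a case-by-case decision
when migrating.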

From jorisvandenbossche at gmail.com Wed Jul 27 19:51:21 2016
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Thu, 28 Jul 2016 01:51:21 +0200
Subject: [Pandas-dev] Proposal: pandas 1.0 and ideas for pandas 2.0 / future
In-Reply-To:
References:
Message-ID:

Wes, thanks for your mail!

I like the idea of first releasing a pandas 1.0 before the 'big refactor'.
We know for sure that this will take a while to stabilize (even with a lot
of resources), and I think the idea was to provide a kind of LTS release.
In that regard, it is just clearer to name this pandas 1.x than 0.19.x.

Maybe we can start a separate thread to discuss this 1.0, as there are
of course some questions to discuss:
- do we first release 0.19 (we didn't specifically discuss this, but I
think the rough idea was to have a release candidate sometime in August),
or do we directly aim at 1.0?
- are there certain changes we want to make before 1.0 that are
feasible in the short term?
- are there some of the current ideas of deprecations that we should
exclude/include for this release? (e.g. I think deprecating PanelND (as just
landed in master) is good, but the idea of deprecating Panel should rather
wait until 2.0?)
- ...

How exactly to tackle those bug fix releases / LTS branch is also
something that can be discussed, but I would not worry too much about that
(there are enough examples of other projects doing something similar, we
just have to search for a process that suits us).

What I think is a more important issue with this process is the
community of contributors. If we effectively have a period of about
two years (before a final 2.0 release) where for the current (1.0) version
only certain bug fixes are considered, while it is still difficult to
contribute to the new version, we would maybe have to say no
to many of the PRs or enhancement ideas. Such a situation could hinder the
process of community contributions and participation.
And there are currently a lot of contributions. As Jeff also said, the
current active contributors are barely keeping up with managing all issues
and pull requests. I have worked more on pandas over the last few weeks
(thanks to Continuum), and indeed I spent most of my time answering issues
and reviewing PRs, and hardly had any time to do much coding myself. But of
course this is also a choice that I currently make. And I (we) could also
make the choice to focus more on pandas 1.0/2.0 related issues, or try to
steer some of the active contributors to that.

I also have some concerns about the compatibility with the rest of the
ecosystem, but at the same time I think it is clear that there should be
some kind of refactor, and it is in the further elaboration of the roadmap
that such concerns can be addressed.

Joris

From gfyoung17 at gmail.com Wed Jul 27 23:39:04 2016
From: gfyoung17 at gmail.com (G Young)
Date: Wed, 27 Jul 2016 23:39:04 -0400
Subject: [Pandas-dev] Proposal: pandas 1.0 and ideas for pandas 2.0 / future
In-Reply-To:
References:
Message-ID:

1) I would be in favour of releasing 0.19.0, in part because we already
pushed back and in the end skipped 0.18.2. I think these plans are better
reserved for the release after this one, both to give more time to map
this out and to push out the changes that have already been made in
preparation for this release.

2) In terms of organisation, I wonder if we might be better served
*reorganising* the way in which PRs are reviewed during the time period
between one release and the next, instead of having these parallel tracks
of development, in light of the concern brought up by @jorisvandenbossche.
Perhaps rather than just reviewing PRs as they come in, specify which
types of PRs should be submitted during certain periods of time. For
example, a large chunk of the period could be devoted to accepting
enhancements / new features, after which the remaining time before a
release could be devoted to just organisation / refactoring /
deprecations / what have you (maybe include bug fixes too). That way we
could have a *contiguous* block of time to focus on stabilising and
tidying up the release.
It would also allow for the refactoring to take place (perhaps
incrementally) without the concern of being destabilised by a new feature.
For this to work, this would have to be clearly stated in the contributing
docs as well as circulated in emails to pandas-dev AND other related
groups so that people know what's going on in terms of the development
cycle.
Since many users will have inserted >>>> workarounds / data massaging code because of such rough edges, this >>>> may cause code breakage or simply redundancy in some cases. As another >>>> example: we should probably remove the .ix indexing attribute >>>> altogether. I'm sure many users are still using .ix, but it would be >>>> worthwhile to go through such code and decide whether it's really .loc >>>> or .iloc. >>>> >>>> My hope would be (being a deadline-motivated person) that we could see >>>> a pandas 2.0 alpha release sometime mid- or 2nd half 2017, with a >>>> target beta / pre-production QA release in early 2018 or thereabouts. >>>> Part of this would be creating a 1.0 to 2.0 migration guide for users. >>>> >>>> My biggest concern with pandas in recent years is how not to be held >>>> back by strict backwards compatibility and still be able to innovate >>>> and stay relevant into the 2020s. >>>> >>>> For pandas 2.0 some of the most important issues I've been thinking >>>> about are: >>>> >>>> - Logical type abstraction layer / decoupling. pandas-only data types >>>> (Categorical, DatetimeTZ, Period, etc.) will become equal citizens as >>>> compared with data types mapping 1-1 on NumPy numeric dtypes >>>> >>>> - Decoupling physical storage to permit non-NumPy data structures >>>> inside Series >>>> >>>> - Removal of BlockManager and 2D block consolidation in DataFrame, in >>>> favor of a native C++ internal table (vector-of-arrays) data structure >>>> >>>> - Consistent NA semantics across all data types >>>> >>>> - Significantly improved handling of string/UTF8 data (performance, >>>> memory use -- elimination of PyObject boxes). From the above 2 items, >>>> we could even make all string arrays internally categorical (with the >>>> option to explicitly cast to categorical) -- in the database world >>>> this is often called dictionary encoding. 
>>>> >>>> - Refactor of most Cython algorithms into C++11/14 templates >>>> >>>> - Copy-on-write for Series and DataFrame >>>> >>>> - Removal of Panel, ndim > 3 data structures >>>> >>>> - Analytical expression VM (for example -- things like >>>> df[boolean_arr].groupby(...).agg(...) could be evaluated by a small >>>> Numexpr-like VM, not dissimilar to R's dplyr library, with >>>> significantly improved memory use and maybe performance too) >>>> >>>> There's a lot to unpack here, but let me know what everyone thinks >>>> about these things. The "pandas 2.0" / internals revamp discussion we >>>> can tackle in a separate thread or in perhaps in a GitHub repo or >>>> design folder in the pandas codebase. >>>> >>>> Thanks, >>>> Wes >>>> _______________________________________________ >>>> Pandas-dev mailing list >>>> Pandas-dev at python.org >>>> https://mail.python.org/mailman/listinfo/pandas-dev >>>> >>> >>> >>> _______________________________________________ >>> Pandas-dev mailing list >>> Pandas-dev at python.org >>> https://mail.python.org/mailman/listinfo/pandas-dev >>> >>> >> >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> >> > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wesmckinn at gmail.com Thu Jul 28 18:15:57 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Thu, 28 Jul 2016 15:15:57 -0700 Subject: [Pandas-dev] Proposal: pandas 1.0 and ideas for pandas 2.0 / future In-Reply-To: References: Message-ID: OK, let me try to collect some of the feedback and give my thoughts 1) 0.19 and 0.20: I think we should push to release 0.19.0 soon and then plan what we want to add/change/deprecate for 1.0 which might otherwise have been 1.0. 
I would rather not delay 0.19.0: we already pushed back 0.18.2, and
there are some significant new patches (asof_merge and variable rolling
windows) that it would be good to get into production before we declare
a stable 1.0.

2) We will need to raise a significant amount of money for pandas (I
estimate in the ballpark of US $300-500K -- better to have too much
than too little) to be able to pursue the pandas 2.0 plan
wholeheartedly. I would like to dedicate a minimum of 5-10 hours per
week to it in 2017 but this will not be sufficient to do everything (I
am also a human being, and have a day job). It would be better to
collaborate with one or two good freelance developers (with proven
experience in C++ and Python) who are spending at least 50% of their
time on pandas next year. I am going to start spending some time on
design documentation so that we can start resolving some of the design
questions and tradeoffs (not all of these decisions will be easy).
We'll work on this offline and look to start soliciting funding (if
anyone with the ability to write checks is reading, feel free to
contact me offline).

3) I agree we will need to come up with a development process that
facilitates invasive modification of pandas internals while also
supporting production users of pandas 1.X. Cherry-picking bug fixes
into the pandas 2.x branch will grow increasingly complicated; we need
to factor this into our process (for example: we might collect all the
unit tests for bug fixes -- assuming they rely on definitely stable
behavior -- into a "to fix" folder so that we can return and adapt the
bug fixes once the 2.x branch is getting more stable). To have
developers both maintaining 1.x and trying to drive forward the 2.x
branch at the same time does not seem realistic -- we should talk to
the IPython/Jupyter devs to understand how they handled this through
their long-lived IPython 1.0 branch IIRC (see
http://ipython.org/news.html#ipython-1-0).

4) My goal, which I think we're all aligned on, would be for pandas
2.0 to be a drop-in replacement for 90-95% of normal pandas use. Many
power users will have embraced some of the idiosyncrasies of pandas's
implementation details, but I think some of the changes (e.g. missing
data consistency, copy-on-write / improved semantics around memory
ownership and views) will be welcomed. We should clearly document (in
a dedicated "pandas's internal relationship with NumPy" document) and
maintain very tight contracts around what kinds of zero-copy NumPy
interoperability are supported -- it is not clear to me, for example,
that arrays of Python string/unicode objects are a NumPy use case that
is especially important to preserve, but most numeric data use cases
are. This will also be helpful for power users to understand the
nuances and how things are going to stay the same or change (for
example: boolean and integer arrays with NAs will probably not be
zero-copyable to NumPy arrays).

We should maybe start side threads about each of these items. Just
deciding what we want to deprecate or do in 0.20 aka 1.0 is a large
enough task.

Thanks all
Wes
>>>>> >>>>> Thanks, >>>>> Wes >>>>> _______________________________________________ >>>>> Pandas-dev mailing list >>>>> Pandas-dev at python.org >>>>> https://mail.python.org/mailman/listinfo/pandas-dev >>>> >>>> >>>> >>>> _______________________________________________ >>>> Pandas-dev mailing list >>>> Pandas-dev at python.org >>>> https://mail.python.org/mailman/listinfo/pandas-dev >>>> >>> >>> >>> _______________________________________________ >>> Pandas-dev mailing list >>> Pandas-dev at python.org >>> https://mail.python.org/mailman/listinfo/pandas-dev >>> >> >> >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > From jorisvandenbossche at gmail.com Sun Jul 31 18:03:06 2016 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Mon, 1 Aug 2016 00:03:06 +0200 Subject: [Pandas-dev] Proposal: pandas 1.0 and ideas for pandas 2.0 / future In-Reply-To: References: Message-ID: Regarding 1), I agree it is a good idea to push a 0.19.0 release soonish, en we can then discuss what we further want to do (or not to do) for the 1.0 release. I am on holidays the coming week and a half, but afterwards I will also focus on getting 0.19.0 out. A release candidate in the last week of August is maybe a good deadline? Joris 2016-07-29 0:15 GMT+02:00 Wes McKinney : > OK, let me try to collect some of the feedback and give my thoughts > > 1) 0.19 and 0.20: I think we should push to release 0.19.0 soon and > then plan what we want to add/change/deprecate for 1.0 which might > otherwise have been 1.0. 
I think delaying 0.19.0 since we already > pushed back 0.18.2, and there are some significant new patches > (asof_merge and variable rolling windows), it would be good to get > this into production before we declare a stable 1.0. > > 2) We will need to raise a significant amount of money for pandas (I > estimate in the ballpark of US $300-500K -- better to have too much > than too little) to be able to pursue the pandas 2.0 plan > wholeheartedly. I would like to dedicate a minimum 5-10 hours per week > to it in 2017 but this will not be sufficient to do everything (I am > also a human being, and have a day job). It would be better to > collaborate with one or two good freelance developers (with proven > experience in C++ and Python) who are spending at least 50% of their > time on pandas next year. I am going to start spending some time on > design documentation so that we can start resolving some of the design > questions and tradeoffs (not all of these decisions will be easy). > We'll work on this offline and look to start soliciting funding (if > anyone with the ability to write checks is reading, feel free to > contact me offline). > > 3) I agree we will need to come up with a development process that > facilitates both an invasive modification of pandas internals while > also supporting production users of pandas 1.X. Cherry-picking bug > fixes into the pandas 2.x branch will grow increasingly complicated; > we need to factor this into our process (for example: we might collect > all the unit tests for bug fixes -- assuming they rely on definitely > stable behavior -- into a "to fix" folder so that we can return and > adapt the bug fixes once the 2.x branch is getting more stable). 
To > have developers both maintaining 1.x and trying to drive forward the > 2.x branch at the same time does not seem realistic -- we should talk > to the IPython/Jupyter devs to understand how they handled this > through their long-lived IPython 1.0 branch IIRC (see > http://ipython.org/news.html#ipython-1-0). > > 4) My goal, which I think we're all aligned on, would be for pandas > 2.0 to be a drop-in replacement for 90-95% of normal pandas use. Many > power users will have embraced some of the idiosyncrasies of pandas's > implementation details, but I think some of the changes (e.g. missing > data consistency, copy-on-write / improved semantics around memory > ownership and views) will be welcomed. We should clearly document (in > a dedicated "pandas's internal relationship with NumPy" document) and > maintain very tight contracts around what kinds of zero-copy NumPy > interoperability are supported -- it is not clear to me for example > that arrays of Python string/unicode objects are a NumPy use case that > is especially important to preserve, but most numeric data use cases > are. This will also be helpful for power users to understand the > nuances and how things are going to stay the same or change (for > example: boolean and integer arrays with NAs will probably not be > zero-copyable to NumPy arrays). > > We should maybe start side threads about each of these items. Just > deciding what we want to deprecate or do in 0.20 aka 1.0 is a large > enough task. > > Thanks all > Wes > > On Wed, Jul 27, 2016 at 8:39 PM, G Young wrote: > > 1) I would be in favour of releasing 0.19.0 in part because we already > > pushed back and actually forgone 0.18.2. I think these plans are better > > served for the release after this one to give more time to map this but > also > > to push out the changes that have already been made in preparation for > this > > release. 
> > > > 2) In terms of organisation, I wonder if we might be better served > > reorganising the way in which PR's are reviewed during the time period > > between one release and the next instead of having these parallel tracks > of > > development in light of the concern brought up by @jorisvanenbossche. > > Perhaps rather than just reviewing PR's as they come in, specify which > types > > of PR's should be submitted during certain periods of time. > > > > For example, a large chunk of the period could be devoted to accepting > > enhancements / new features after which the remaining time before a > release > > could be devoted to just organisation / refactoring / deprecations / what > > have you (maybe include bug fixes too). That way we could have a > contiguous > > block of time to focus on stabilising and tidying up the release. It > would > > also allow for the refactoring to take place (perhaps incrementally) > without > > the concern of being destabilised by a new feature. > > > > For this to work, this would have to be clearly stated in the > contributing > > docs as well as circulated in emails to pandas-dev AND other related > groups > > so that way people know what's going on in terms of the development > cycle. > > > > > > > > On Wed, Jul 27, 2016 at 7:51 PM, Joris Van den Bossche > > wrote: > >> > >> Wes, thanks for your mail! > >> > >> I like the idea of first releasing a pandas 1.0 before the 'big > refactor'. > >> We for sure know that this will take a while to stabilize (even with a > lot > >> of resources), and I think the idea was to provide a kind of LTS > release. In > >> that regard, it is just clearer to name this pandas 1.x then 0.19.x. 
> >> > >> Maybe we can start a separate thread to discuss on this 1.0, as there > are > >> of course some questions to discuss: > >> - do we first release 0.19 (we didn't specifically discuss this, but I > >> think the rough idea was to have somewhere in august a release > candidate), > >> or do we directly aim at 1.0? > >> - are there some certain changes we want to do before 1.0 that are > >> feasible in the short term? > >> - are there some of the current ideas of deprecations that we should > >> exclude/include for this release? (eg I think deprecating PanelND (as > just > >> landed in master) is good, but the idea of deprecating Panel should > rather > >> wait until 2.0?) > >> - ... > >> > >> How exactly to tackle those bug fix releases / LTS branch, is also > >> something that can be discussed, but I would not worry too much about > that > >> (there are enough examples of other projects to do something similar, we > >> just have to search for a process that suits us). > >> > >> What I think a more important issue or problem with this process is the > >> community of contributors. If we would effectively have a period of > about > >> two years (before a final 2.0 release) where for the current (1.0) > version > >> only certain bug-fixes are considered, but on the other hand it is still > >> difficult to contribute to the new version. We would maybe have to say > no to > >> many of the PRs or enhancement ideas. Such a situation could hinder the > >> process of community contributions and participation. > >> And there are currently a lot of contributions. As Jeff also said, the > >> current active contributors are barely keeping up with managing all > issues > >> and pull requests. I have worked the last few weeks more on pandas > (thanks > >> to Continuum), and indeed I spent most of my time answering issues and > >> reviewing PRs, and hardly have any time to do much coding myself. But of > >> course this is also a choice that I currently make. 
And I (we) could > also > >> make the choice to focus more on pandas 1.0/2.0 related issues, or try > to > >> steer some of the active contributors to that. > >> > >> I also have some concerns about the compatibility with the rest of the > >> ecosystem, but at the same time it is clear I think that there should be > >> some kind of refactor, and it is in the further elaboration of the > roadmap > >> that such concerns can be addressed. > >> > >> Joris > >> > >> > >> > >> 2016-07-27 12:04 GMT+02:00 Jeff Reback : > >>> > >>> I applaud the vision and ambition for the roadmap of the future of > >>> pandas. > >>> > >>> However, the resources are lacking for much of these changes. Currently > >>> pandas is just barely keeping up with the (recently increased) user > flow > >>> of pull-requests, not to mention the issue reports. These are all great > >>> indicators > >>> of community use and exercising the edge cases. > >>> > >>> A roadmap is an excellent start, but the resource question needs to be > >>> front and center. > >>> > >>> The current process *could* evolve into LTS. In 0.19.0, lots of > progress > >>> towards removing > >>> older code (and of course deprecating things) is happening. An > aggressive > >>> push of this into > >>> 0.20.0 will go a long ways towards de-facto establishing 1.0 / LTS. > (and > >>> maybe that's what we simply > >>> call 0.20.0). > >>> > >>> I would agree we could simply release 1.0 / LTS without adding any > 'new' > >>> features (like fixed getitem indexing > >>> and such). > >>> > >>> I would like to see 2.0 with a user facing API that is a drop-in > >>> replacement (though allowing for some breaking changes that are NOT > >>> back-compat, e.g. getitem indexing). I think it would be acceptable to > break > >>> the back-end API (meaning to numpy) though. 
> >>>
> >>> For the resource question, as I have mentioned off-list, I will format
> >>> this roadmap in order for pandas to support a fund-raising effort to garner
> >>> resources for these changes.
> >>>
> >>> Jeff
> >>>
> >>> On Tue, Jul 26, 2016 at 5:13 PM, Stephan Hoyer wrote:
> >>>>
> >>>> I know I expressed concerns about cross-compatibility with the rest of
> >>>> the SciPy ecosystem before (especially xarray), but this plan sounds very
> >>>> solid to me. Flexible data types in N-dimensional arrays are important for
> >>>> other use cases, but also not really a problem for pandas.
> >>>>
> >>>> A separate 2.0 release will let us make the major breaking changes to
> >>>> the pandas data model necessary for it to work well in the long term. There
> >>>> are a few other API warts that will be able to clean up this way (detailed
> >>>> in github.com/pydata/pandas/issues/10000), indexing on DataFrames being the
> >>>> most obvious one.
> >>>>
> >>>> On Tue, Jul 26, 2016 at 1:51 PM, Wes McKinney wrote:
> >>>>>
> >>>>> hi folks,
> >>>>>
> >>>>> As a continuation of ongoing discussions on GitHub and on the mailing
> >>>>> list around deprecations and future innovation and internal reworkings
> >>>>> of pandas, I had a couple of ideas to share that I am looking for
> >>>>> feedback on.
> >>>>>
> >>>>> As far as pandas 0.19.x today, I would like to propose that we
> >>>>> consider releasing the project as pandas 1.0 in the next major release
> >>>>> or the one after. The Python community does have a penchant for
> >>>>> "eternal betas", but after all the hard work of the core developers
> >>>>> and community over the last 5 years, I think we can safely consider
> >>>>> making a stable 1.X production release.
> >>>>>
> >>>>> If we do decide to release pandas 1.0, I also propose that we strongly
> >>>>> consider making 1.X an LTS / Long Term Support branch where we can
> >>>>> continue to make releases, but bug fixes and documentation
> >>>>> improvements only. Or, we can add new features, but on an extremely
> >>>>> conservative basis. This might require some changes to development
> >>>>> process, so looking for feedback on this.
> >>>>>
> >>>>> If we commit to this path, I would suggest that we start a pandas-2.0
> >>>>> integration branch where we can begin more seriously planning and
> >>>>> executing on
> >>>>>
> >>>>> - Cleanup and removal of years' worth of accumulated cruft / legacy
> >>>>> code
> >>>>> - Removal of deprecated features
> >>>>> - Series and DataFrame internals revamp.
> >>>>>
> >>>>> I had hoped that 2016 would offer me more time to work on the
> >>>>> internals revamp, but between my day job and the 2nd ed of "Python for
> >>>>> Data Analysis" that turned out to be a little too ambitious. I have
> >>>>> been almost continuously thinking about how to go about this though,
> >>>>> and it might be good to figure out a process where we can start
> >>>>> documenting and coming up with a more granular development roadmap for
> >>>>> this. Part of this will be carefully documenting any APIs we change or
> >>>>> unit tests we break along the way.
> >>>>>
> >>>>> We would want to give ample time for heavy pandas users to run their
> >>>>> 3rd-party code based on pandas 2.0-dev to give feedback on whether our
> >>>>> assumptions about the impact of changes affect real production code.
> >>>>> As a concrete example: integer and boolean Series would be able to
> >>>>> accommodate missing data without implicitly casting to float or object
> >>>>> NumPy dtype respectively.
> >>>>> Since many users will have inserted
> >>>>> workarounds / data massaging code because of such rough edges, this
> >>>>> may cause code breakage or simply redundancy in some cases. As another
> >>>>> example: we should probably remove the .ix indexing attribute
> >>>>> altogether. I'm sure many users are still using .ix, but it would be
> >>>>> worthwhile to go through such code and decide whether it's really .loc
> >>>>> or .iloc.
> >>>>>
> >>>>> My hope would be (being a deadline-motivated person) that we could see
> >>>>> a pandas 2.0 alpha release sometime mid- or 2nd half 2017, with a
> >>>>> target beta / pre-production QA release in early 2018 or thereabouts.
> >>>>> Part of this would be creating a 1.0 to 2.0 migration guide for users.
> >>>>>
> >>>>> My biggest concern with pandas in recent years is how not to be held
> >>>>> back by strict backwards compatibility and still be able to innovate
> >>>>> and stay relevant into the 2020s.
> >>>>>
> >>>>> For pandas 2.0 some of the most important issues I've been thinking
> >>>>> about are:
> >>>>>
> >>>>> - Logical type abstraction layer / decoupling. pandas-only data types
> >>>>> (Categorical, DatetimeTZ, Period, etc.) will become equal citizens as
> >>>>> compared with data types mapping 1-1 on NumPy numeric dtypes
> >>>>>
> >>>>> - Decoupling physical storage to permit non-NumPy data structures
> >>>>> inside Series
> >>>>>
> >>>>> - Removal of BlockManager and 2D block consolidation in DataFrame, in
> >>>>> favor of a native C++ internal table (vector-of-arrays) data structure
> >>>>>
> >>>>> - Consistent NA semantics across all data types
> >>>>>
> >>>>> - Significantly improved handling of string/UTF8 data (performance,
> >>>>> memory use -- elimination of PyObject boxes).
> >>>>> From the above 2 items,
> >>>>> we could even make all string arrays internally categorical (with the
> >>>>> option to explicitly cast to categorical) -- in the database world
> >>>>> this is often called dictionary encoding.
> >>>>>
> >>>>> - Refactor of most Cython algorithms into C++11/14 templates
> >>>>>
> >>>>> - Copy-on-write for Series and DataFrame
> >>>>>
> >>>>> - Removal of Panel, ndim > 3 data structures
> >>>>>
> >>>>> - Analytical expression VM (for example -- things like
> >>>>> df[boolean_arr].groupby(...).agg(...) could be evaluated by a small
> >>>>> Numexpr-like VM, not dissimilar to R's dplyr library, with
> >>>>> significantly improved memory use and maybe performance too)
> >>>>>
> >>>>> There's a lot to unpack here, but let me know what everyone thinks
> >>>>> about these things. The "pandas 2.0" / internals revamp discussion we
> >>>>> can tackle in a separate thread or in perhaps in a GitHub repo or
> >>>>> design folder in the pandas codebase.
> >>>>>
> >>>>> Thanks,
> >>>>> Wes
> >>>>> _______________________________________________
> >>>>> Pandas-dev mailing list
> >>>>> Pandas-dev at python.org
> >>>>> https://mail.python.org/mailman/listinfo/pandas-dev