From wesmckinn at gmail.com Mon Aug 1 17:01:58 2016
From: wesmckinn at gmail.com (Wes McKinney)
Date: Mon, 1 Aug 2016 14:01:58 -0700
Subject: [Pandas-dev] Proposal: pandas 1.0 and ideas for pandas 2.0 / future
In-Reply-To:
References:
Message-ID:

hey Andy -- that makes sense to me. What I'm hoping to do this month
is scope out a more granular plan for the specific things (problems
and their possible solutions with lists of pros/cons of various
approaches) we want to accomplish in a pandas 2.x effort and make sure
we all agree (up to 70-80% of the big picture items). If we're going
to raise a significant amount of money we owe it to the donors to
explain how the money will be directed, and we won't want to be
dealing with a lot of uncertainty about the roadmap once we have
engaged FTEs beginning to help with moving things forward.

- Wes

On Mon, Aug 1, 2016 at 1:54 PM, Andy Ray Terrel wrote:
> Crazy thought.
>
> Perhaps y'all could put together a road map and resources you will need to
> get it done (as in money for FTEs). I would like to see NumFOCUS try to push
> our sponsors to fund more FTEs for projects like this. If we have a road map
> in hand it makes the conversations much more tangible.
>
> -- Andy
>
> On Sun, Jul 31, 2016 at 5:03 PM, Joris Van den Bossche
> wrote:
>>
>> Regarding 1), I agree it is a good idea to push a 0.19.0 release soonish,
>> and we can then discuss what we further want to do (or not to do) for the 1.0
>> release. I am on holidays the coming week and a half, but afterwards I will
>> also focus on getting 0.19.0 out. A release candidate in the last week of
>> August is maybe a good deadline?
>>
>> Joris
>>
>> 2016-07-29 0:15 GMT+02:00 Wes McKinney :
>>>
>>> OK, let me try to collect some of the feedback and give my thoughts
>>>
>>> 1) 0.19 and 0.20: I think we should push to release 0.19.0 soon and
>>> then plan what we want to add/change/deprecate for 1.0 which might
>>> otherwise have been 1.0.
>>> I think we should avoid delaying 0.19.0: we already
>>> pushed back 0.18.2, and there are some significant new patches
>>> (merge_asof and variable rolling windows) that it would be good to get
>>> into production before we declare a stable 1.0.
>>>
>>> 2) We will need to raise a significant amount of money for pandas (I
>>> estimate in the ballpark of US $300-500K -- better to have too much
>>> than too little) to be able to pursue the pandas 2.0 plan
>>> wholeheartedly. I would like to dedicate a minimum of 5-10 hours per week
>>> to it in 2017 but this will not be sufficient to do everything (I am
>>> also a human being, and have a day job). It would be better to
>>> collaborate with one or two good freelance developers (with proven
>>> experience in C++ and Python) who are spending at least 50% of their
>>> time on pandas next year. I am going to start spending some time on
>>> design documentation so that we can start resolving some of the design
>>> questions and tradeoffs (not all of these decisions will be easy).
>>> We'll work on this offline and look to start soliciting funding (if
>>> anyone with the ability to write checks is reading, feel free to
>>> contact me offline).
>>>
>>> 3) I agree we will need to come up with a development process that
>>> facilitates an invasive modification of pandas internals while
>>> also supporting production users of pandas 1.X. Cherry-picking bug
>>> fixes into the pandas 2.x branch will grow increasingly complicated;
>>> we need to factor this into our process (for example: we might collect
>>> all the unit tests for bug fixes -- assuming they rely on definitely
>>> stable behavior -- into a "to fix" folder so that we can return and
>>> adapt the bug fixes once the 2.x branch is getting more stable).
>>> To have developers both maintaining 1.x and trying to drive forward the
>>> 2.x branch at the same time does not seem realistic -- we should talk
>>> to the IPython/Jupyter devs to understand how they handled this
>>> through their long-lived IPython 1.0 branch IIRC (see
>>> http://ipython.org/news.html#ipython-1-0).
>>>
>>> 4) My goal, which I think we're all aligned on, would be for pandas
>>> 2.0 to be a drop-in replacement for 90-95% of normal pandas use. Many
>>> power users will have embraced some of the idiosyncrasies of pandas's
>>> implementation details, but I think some of the changes (e.g. missing
>>> data consistency, copy-on-write / improved semantics around memory
>>> ownership and views) will be welcomed. We should clearly document (in
>>> a dedicated "pandas's internal relationship with NumPy" document) and
>>> maintain very tight contracts around what kinds of zero-copy NumPy
>>> interoperability are supported -- it is not clear to me for example
>>> that arrays of Python string/unicode objects are a NumPy use case that
>>> is especially important to preserve, but most numeric data use cases
>>> are. This will also be helpful for power users to understand the
>>> nuances and how things are going to stay the same or change (for
>>> example: boolean and integer arrays with NAs will probably not be
>>> zero-copyable to NumPy arrays).
>>>
>>> We should maybe start side threads about each of these items. Just
>>> deciding what we want to deprecate or do in 0.20 aka 1.0 is a large
>>> enough task.
>>>
>>> Thanks all
>>> Wes
>>>
>>> On Wed, Jul 27, 2016 at 8:39 PM, G Young wrote:
>>> > 1) I would be in favour of releasing 0.19.0 in part because we already
>>> > pushed back and actually forgone 0.18.2. I think these plans are better
>>> > served for the release after this one to give more time to map this but also
>>> > to push out the changes that have already been made in preparation for this
>>> > release.
>>> >
>>> > 2) In terms of organisation, I wonder if we might be better served
>>> > reorganising the way in which PRs are reviewed during the time period
>>> > between one release and the next instead of having these parallel tracks of
>>> > development in light of the concern brought up by @jorisvanenbossche.
>>> > Perhaps rather than just reviewing PRs as they come in, specify which types
>>> > of PRs should be submitted during certain periods of time.
>>> >
>>> > For example, a large chunk of the period could be devoted to accepting
>>> > enhancements / new features after which the remaining time before a release
>>> > could be devoted to just organisation / refactoring / deprecations / what
>>> > have you (maybe include bug fixes too). That way we could have a contiguous
>>> > block of time to focus on stabilising and tidying up the release. It would
>>> > also allow for the refactoring to take place (perhaps incrementally) without
>>> > the concern of being destabilised by a new feature.
>>> >
>>> > For this to work, this would have to be clearly stated in the contributing
>>> > docs as well as circulated in emails to pandas-dev AND other related groups
>>> > so that way people know what's going on in terms of the development cycle.
>>> >
>>> > On Wed, Jul 27, 2016 at 7:51 PM, Joris Van den Bossche
>>> > wrote:
>>> >>
>>> >> Wes, thanks for your mail!
>>> >>
>>> >> I like the idea of first releasing a pandas 1.0 before the 'big refactor'.
>>> >> We for sure know that this will take a while to stabilize (even with a lot
>>> >> of resources), and I think the idea was to provide a kind of LTS release. In
>>> >> that regard, it is just clearer to name this pandas 1.x than 0.19.x.
>>> >>
>>> >> Maybe we can start a separate thread to discuss this 1.0, as there are
>>> >> of course some questions to discuss:
>>> >> - do we first release 0.19 (we didn't specifically discuss this, but I
>>> >> think the rough idea was to have a release candidate sometime in August),
>>> >> or do we directly aim at 1.0?
>>> >> - are there certain changes we want to make before 1.0 that are
>>> >> feasible in the short term?
>>> >> - are there some of the current ideas of deprecations that we should
>>> >> exclude/include for this release? (e.g. I think deprecating PanelND (as just
>>> >> landed in master) is good, but the idea of deprecating Panel should rather
>>> >> wait until 2.0?)
>>> >> - ...
>>> >>
>>> >> How exactly to tackle those bug fix releases / LTS branch is also
>>> >> something that can be discussed, but I would not worry too much about that
>>> >> (there are enough examples of other projects doing something similar, we
>>> >> just have to search for a process that suits us).
>>> >>
>>> >> What I think is a more important issue or problem with this process is the
>>> >> community of contributors. If we would effectively have a period of about
>>> >> two years (before a final 2.0 release) where for the current (1.0) version
>>> >> only certain bug-fixes are considered, while on the other hand it is still
>>> >> difficult to contribute to the new version, we would maybe have to say no to
>>> >> many of the PRs or enhancement ideas. Such a situation could hinder the
>>> >> process of community contributions and participation.
>>> >> And there are currently a lot of contributions. As Jeff also said, the
>>> >> current active contributors are barely keeping up with managing all issues
>>> >> and pull requests.
>>> >> I have worked the last few weeks more on pandas (thanks
>>> >> to Continuum), and indeed I spent most of my time answering issues and
>>> >> reviewing PRs, and hardly have any time to do much coding myself. But of
>>> >> course this is also a choice that I currently make. And I (we) could also
>>> >> make the choice to focus more on pandas 1.0/2.0 related issues, or try to
>>> >> steer some of the active contributors to that.
>>> >>
>>> >> I also have some concerns about the compatibility with the rest of the
>>> >> ecosystem, but at the same time it is clear I think that there should be
>>> >> some kind of refactor, and it is in the further elaboration of the roadmap
>>> >> that such concerns can be addressed.
>>> >>
>>> >> Joris
>>> >>
>>> >> 2016-07-27 12:04 GMT+02:00 Jeff Reback :
>>> >>>
>>> >>> I applaud the vision and ambition for the roadmap of the future of
>>> >>> pandas.
>>> >>>
>>> >>> However, the resources are lacking for many of these changes. Currently
>>> >>> pandas is just barely keeping up with the (recently increased) user flow
>>> >>> of pull-requests, not to mention the issue reports. These are all great
>>> >>> indicators of community use and exercising the edge cases.
>>> >>>
>>> >>> A roadmap is an excellent start, but the resource question needs to be
>>> >>> front and center.
>>> >>>
>>> >>> The current process *could* evolve into LTS. In 0.19.0, lots of progress
>>> >>> towards removing older code (and of course deprecating things) is
>>> >>> happening. An aggressive push of this into
>>> >>> 0.20.0 will go a long way towards de-facto establishing 1.0 / LTS (and
>>> >>> maybe that's what we simply call 0.20.0).
>>> >>>
>>> >>> I would agree we could simply release 1.0 / LTS without adding any 'new'
>>> >>> features (like fixed getitem indexing and such).
>>> >>>
>>> >>> I would like to see 2.0 with a user facing API that is a drop-in
>>> >>> replacement (though allowing for some breaking changes that are NOT
>>> >>> back-compat, e.g. getitem indexing). I think it would be acceptable to break
>>> >>> the back-end API (meaning to numpy) though.
>>> >>>
>>> >>> For the resource question, as I have mentioned off-list, I will format
>>> >>> this roadmap in order for pandas to support a fund-raising effort to garner
>>> >>> resources for these changes.
>>> >>>
>>> >>> Jeff
>>> >>>
>>> >>> On Tue, Jul 26, 2016 at 5:13 PM, Stephan Hoyer
>>> >>> wrote:
>>> >>>>
>>> >>>> I know I expressed concerns about cross-compatibility with the rest of
>>> >>>> the SciPy ecosystem before (especially xarray), but this plan sounds very
>>> >>>> solid to me. Flexible data types in N-dimensional arrays are important for
>>> >>>> other use cases, but also not really a problem for pandas.
>>> >>>>
>>> >>>> A separate 2.0 release will let us make the major breaking changes to
>>> >>>> the pandas data model necessary for it to work well in the long term. There
>>> >>>> are a few other API warts that we will be able to clean up this way (detailed
>>> >>>> in github.com/pydata/pandas/issues/10000), indexing on DataFrames being the
>>> >>>> most obvious one.
>>> >>>>
>>> >>>> On Tue, Jul 26, 2016 at 1:51 PM, Wes McKinney
>>> >>>> wrote:
>>> >>>>>
>>> >>>>> hi folks,
>>> >>>>>
>>> >>>>> As a continuation of ongoing discussions on GitHub and on the mailing
>>> >>>>> list around deprecations and future innovation and internal reworkings
>>> >>>>> of pandas, I had a couple of ideas to share that I am looking for
>>> >>>>> feedback on.
>>> >>>>>
>>> >>>>> As far as pandas 0.19.x today, I would like to propose that we
>>> >>>>> consider releasing the project as pandas 1.0 in the next major release
>>> >>>>> or the one after.
>>> >>>>> The Python community does have a penchant for
>>> >>>>> "eternal betas", but after all the hard work of the core developers
>>> >>>>> and community over the last 5 years, I think we can safely consider
>>> >>>>> making a stable 1.X production release.
>>> >>>>>
>>> >>>>> If we do decide to release pandas 1.0, I also propose that we strongly
>>> >>>>> consider making 1.X an LTS / Long Term Support branch where we can
>>> >>>>> continue to make releases, but bug fixes and documentation
>>> >>>>> improvements only. Or, we can add new features, but on an extremely
>>> >>>>> conservative basis. This might require some changes to development
>>> >>>>> process, so looking for feedback on this.
>>> >>>>>
>>> >>>>> If we commit to this path, I would suggest that we start a pandas-2.0
>>> >>>>> integration branch where we can begin more seriously planning and
>>> >>>>> executing on
>>> >>>>>
>>> >>>>> - Cleanup and removal of years' worth of accumulated cruft / legacy code
>>> >>>>> - Removal of deprecated features
>>> >>>>> - Series and DataFrame internals revamp.
>>> >>>>>
>>> >>>>> I had hoped that 2016 would offer me more time to work on the
>>> >>>>> internals revamp, but between my day job and the 2nd ed of "Python for
>>> >>>>> Data Analysis" that turned out to be a little too ambitious. I have
>>> >>>>> been almost continuously thinking about how to go about this though,
>>> >>>>> and it might be good to figure out a process where we can start
>>> >>>>> documenting and coming up with a more granular development roadmap for
>>> >>>>> this. Part of this will be carefully documenting any APIs we change or
>>> >>>>> unit tests we break along the way.
>>> >>>>>
>>> >>>>> We would want to give ample time for heavy pandas users to run their
>>> >>>>> 3rd-party code based on pandas 2.0-dev to give feedback on whether our
>>> >>>>> assumptions about the impact of changes affect real production code.
>>> >>>>> As a concrete example: integer and boolean Series would be able to
>>> >>>>> accommodate missing data without implicitly casting to float or object
>>> >>>>> NumPy dtype respectively. Since many users will have inserted
>>> >>>>> workarounds / data massaging code because of such rough edges, this
>>> >>>>> may cause code breakage or simply redundancy in some cases. As another
>>> >>>>> example: we should probably remove the .ix indexing attribute
>>> >>>>> altogether. I'm sure many users are still using .ix, but it would be
>>> >>>>> worthwhile to go through such code and decide whether it's really .loc
>>> >>>>> or .iloc.
>>> >>>>>
>>> >>>>> My hope would be (being a deadline-motivated person) that we could see
>>> >>>>> a pandas 2.0 alpha release sometime mid- or 2nd half 2017, with a
>>> >>>>> target beta / pre-production QA release in early 2018 or thereabouts.
>>> >>>>> Part of this would be creating a 1.0 to 2.0 migration guide for users.
>>> >>>>>
>>> >>>>> My biggest concern with pandas in recent years is how not to be held
>>> >>>>> back by strict backwards compatibility and still be able to innovate
>>> >>>>> and stay relevant into the 2020s.
>>> >>>>>
>>> >>>>> For pandas 2.0 some of the most important issues I've been thinking
>>> >>>>> about are:
>>> >>>>>
>>> >>>>> - Logical type abstraction layer / decoupling. pandas-only data types
>>> >>>>> (Categorical, DatetimeTZ, Period, etc.) will become equal citizens as
>>> >>>>> compared with data types mapping 1-1 on NumPy numeric dtypes
>>> >>>>>
>>> >>>>> - Decoupling physical storage to permit non-NumPy data structures
>>> >>>>> inside Series
>>> >>>>>
>>> >>>>> - Removal of BlockManager and 2D block consolidation in DataFrame, in
>>> >>>>> favor of a native C++ internal table (vector-of-arrays) data structure
>>> >>>>>
>>> >>>>> - Consistent NA semantics across all data types
>>> >>>>>
>>> >>>>> - Significantly improved handling of string/UTF8 data (performance,
>>> >>>>> memory use -- elimination of PyObject boxes). From the above 2 items,
>>> >>>>> we could even make all string arrays internally categorical (with the
>>> >>>>> option to explicitly cast to categorical) -- in the database world
>>> >>>>> this is often called dictionary encoding.
>>> >>>>>
>>> >>>>> - Refactor of most Cython algorithms into C++11/14 templates
>>> >>>>>
>>> >>>>> - Copy-on-write for Series and DataFrame
>>> >>>>>
>>> >>>>> - Removal of Panel, ndim > 3 data structures
>>> >>>>>
>>> >>>>> - Analytical expression VM (for example -- things like
>>> >>>>> df[boolean_arr].groupby(...).agg(...) could be evaluated by a small
>>> >>>>> Numexpr-like VM, not dissimilar to R's dplyr library, with
>>> >>>>> significantly improved memory use and maybe performance too)
>>> >>>>>
>>> >>>>> There's a lot to unpack here, but let me know what everyone thinks
>>> >>>>> about these things. The "pandas 2.0" / internals revamp discussion we
>>> >>>>> can tackle in a separate thread or perhaps in a GitHub repo or
>>> >>>>> design folder in the pandas codebase.
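
The dictionary encoding mentioned in the string-handling item above can be sketched in plain Python. This is a toy illustration only, not pandas code and not the design proposed in the thread: the helper names dictionary_encode and dictionary_decode are hypothetical, and using -1 as the code for a missing value is an assumption made for this example.

```python
from typing import Dict, List, Optional, Tuple


def dictionary_encode(values: List[Optional[str]]) -> Tuple[List[int], List[str]]:
    """Split a string column into integer codes plus a dictionary of
    distinct strings. Hypothetical helper; missing values (None) are
    given the code -1, which is an assumption of this sketch."""
    dictionary: List[str] = []       # each distinct string, stored once
    positions: Dict[str, int] = {}   # string -> index into `dictionary`
    codes: List[int] = []
    for value in values:
        if value is None:
            codes.append(-1)
            continue
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        codes.append(positions[value])
    return codes, dictionary


def dictionary_decode(codes: List[int], dictionary: List[str]) -> List[Optional[str]]:
    """Rebuild the original column from codes and dictionary."""
    return [None if code < 0 else dictionary[code] for code in codes]


column = ["NYC", "SF", "NYC", None, "SF", "NYC"]
codes, dictionary = dictionary_encode(column)
```

Each distinct string is stored once and the column itself becomes a list of small integers, which is where the memory savings come from when values repeat.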
>>> >>>>>
>>> >>>>> Thanks,
>>> >>>>> Wes
>>> >>>>> _______________________________________________
>>> >>>>> Pandas-dev mailing list
>>> >>>>> Pandas-dev at python.org
>>> >>>>> https://mail.python.org/mailman/listinfo/pandas-dev

From wesmckinn at gmail.com Mon Aug 1 17:11:11 2016
From: wesmckinn at gmail.com (Wes McKinney)
Date: Mon, 1 Aug 2016 14:11:11 -0700
Subject: [Pandas-dev] Proposal: pandas 1.0 and ideas for pandas 2.0 / future
In-Reply-To:
References:
Message-ID:

Masaaki -- on your point re: accepting new features into the 1.x
branch. The main issue is how we can keep a pandas 2.0 branch (which
will be unstable for the first 3-6 months of its life) relatively in
sync with 1.x until the 2.0 branch stabilizes. The worst case scenario
is that you have to do double the amount of work for each pull request
(essentially: independent patches to 1.x and 2.x), but if it could be
reduced to 1.5x as much work then perhaps that's OK. Even
"forward-porting" bug fixes will be a challenge.
We shouldn't allow these things to halt progress on advancing the
library internals to a more sustainable / future-proof place. Our
problem is not unlike the Python language moratorium instituted in
2009: https://www.python.org/dev/peps/pep-3003/.

- Wes

On Mon, Aug 1, 2016 at 2:01 PM, Wes McKinney wrote:
> hey Andy -- that makes sense to me. What I'm hoping to do this month
> is scope out a more granular plan for the specific things (problems
> and their possible solutions with lists of pros/cons of various
> approaches) we want to accomplish in a pandas 2.x effort and make sure
> we all agree (up to 70-80% of the big picture items). If we're going
> to raise a significant amount of money we owe it to the donors to
> explain how the money will be directed, and we won't want to be
> dealing with a lot of uncertainty about the roadmap once we have
> engaged FTEs beginning to help with moving things forward.
>
> - Wes
>
> On Mon, Aug 1, 2016 at 1:54 PM, Andy Ray Terrel wrote:
>> Crazy thought.
>>
>> Perhaps y'all could put together a road map and resources you will need to
>> get it done (as in money for FTEs). I would like to see NumFOCUS try to push
>> our sponsors to fund more FTEs for projects like this. If we have a road map
>> in hand it makes the conversations much more tangible.
>>
>> -- Andy
>>
>> On Sun, Jul 31, 2016 at 5:03 PM, Joris Van den Bossche
>> wrote:
>>>
>>> Regarding 1), I agree it is a good idea to push a 0.19.0 release soonish,
>>> and we can then discuss what we further want to do (or not to do) for the 1.0
>>> release. I am on holidays the coming week and a half, but afterwards I will
>>> also focus on getting 0.19.0 out. A release candidate in the last week of
>>> August is maybe a good deadline?
>>>
>>> Joris
>>>
>>> 2016-07-29 0:15 GMT+02:00 Wes McKinney :
>>>>
>>>> OK, let me try to collect some of the feedback and give my thoughts
>>>>
>>>> 1) 0.19 and 0.20: I think we should push to release 0.19.0 soon and
>>>> then plan what we want to add/change/deprecate for 1.0 which might
>>>> otherwise have been 1.0. I think we should avoid delaying 0.19.0: we already
>>>> pushed back 0.18.2, and there are some significant new patches
>>>> (merge_asof and variable rolling windows) that it would be good to get
>>>> into production before we declare a stable 1.0.
>>>>
>>>> 2) We will need to raise a significant amount of money for pandas (I
>>>> estimate in the ballpark of US $300-500K -- better to have too much
>>>> than too little) to be able to pursue the pandas 2.0 plan
>>>> wholeheartedly. I would like to dedicate a minimum of 5-10 hours per week
>>>> to it in 2017 but this will not be sufficient to do everything (I am
>>>> also a human being, and have a day job). It would be better to
>>>> collaborate with one or two good freelance developers (with proven
>>>> experience in C++ and Python) who are spending at least 50% of their
>>>> time on pandas next year. I am going to start spending some time on
>>>> design documentation so that we can start resolving some of the design
>>>> questions and tradeoffs (not all of these decisions will be easy).
>>>> We'll work on this offline and look to start soliciting funding (if
>>>> anyone with the ability to write checks is reading, feel free to
>>>> contact me offline).
>>>>
>>>> 3) I agree we will need to come up with a development process that
>>>> facilitates an invasive modification of pandas internals while
>>>> also supporting production users of pandas 1.X.
>>>> Cherry-picking bug
>>>> fixes into the pandas 2.x branch will grow increasingly complicated;
>>>> we need to factor this into our process (for example: we might collect
>>>> all the unit tests for bug fixes -- assuming they rely on definitely
>>>> stable behavior -- into a "to fix" folder so that we can return and
>>>> adapt the bug fixes once the 2.x branch is getting more stable). To
>>>> have developers both maintaining 1.x and trying to drive forward the
>>>> 2.x branch at the same time does not seem realistic -- we should talk
>>>> to the IPython/Jupyter devs to understand how they handled this
>>>> through their long-lived IPython 1.0 branch IIRC (see
>>>> http://ipython.org/news.html#ipython-1-0).
>>>>
>>>> 4) My goal, which I think we're all aligned on, would be for pandas
>>>> 2.0 to be a drop-in replacement for 90-95% of normal pandas use. Many
>>>> power users will have embraced some of the idiosyncrasies of pandas's
>>>> implementation details, but I think some of the changes (e.g. missing
>>>> data consistency, copy-on-write / improved semantics around memory
>>>> ownership and views) will be welcomed. We should clearly document (in
>>>> a dedicated "pandas's internal relationship with NumPy" document) and
>>>> maintain very tight contracts around what kinds of zero-copy NumPy
>>>> interoperability are supported -- it is not clear to me for example
>>>> that arrays of Python string/unicode objects are a NumPy use case that
>>>> is especially important to preserve, but most numeric data use cases
>>>> are. This will also be helpful for power users to understand the
>>>> nuances and how things are going to stay the same or change (for
>>>> example: boolean and integer arrays with NAs will probably not be
>>>> zero-copyable to NumPy arrays).
>>>>
>>>> We should maybe start side threads about each of these items. Just
>>>> deciding what we want to deprecate or do in 0.20 aka 1.0 is a large
>>>> enough task.
>>>>
>>>> Thanks all
>>>> Wes
>>>>
>>>> On Wed, Jul 27, 2016 at 8:39 PM, G Young wrote:
>>>> > 1) I would be in favour of releasing 0.19.0 in part because we already
>>>> > pushed back and actually forgone 0.18.2. I think these plans are better
>>>> > served for the release after this one to give more time to map this but also
>>>> > to push out the changes that have already been made in preparation for this
>>>> > release.
>>>> >
>>>> > 2) In terms of organisation, I wonder if we might be better served
>>>> > reorganising the way in which PRs are reviewed during the time period
>>>> > between one release and the next instead of having these parallel tracks of
>>>> > development in light of the concern brought up by @jorisvanenbossche.
>>>> > Perhaps rather than just reviewing PRs as they come in, specify which types
>>>> > of PRs should be submitted during certain periods of time.
>>>> >
>>>> > For example, a large chunk of the period could be devoted to accepting
>>>> > enhancements / new features after which the remaining time before a release
>>>> > could be devoted to just organisation / refactoring / deprecations / what
>>>> > have you (maybe include bug fixes too). That way we could have a contiguous
>>>> > block of time to focus on stabilising and tidying up the release. It would
>>>> > also allow for the refactoring to take place (perhaps incrementally) without
>>>> > the concern of being destabilised by a new feature.
>>>> >
>>>> > For this to work, this would have to be clearly stated in the contributing
>>>> > docs as well as circulated in emails to pandas-dev AND other related groups
>>>> > so that way people know what's going on in terms of the development cycle.
>>>> >
>>>> > On Wed, Jul 27, 2016 at 7:51 PM, Joris Van den Bossche
>>>> > wrote:
>>>> >>
>>>> >> Wes, thanks for your mail!
>>>> >>
>>>> >> I like the idea of first releasing a pandas 1.0 before the 'big refactor'.
>>>> >> We for sure know that this will take a while to stabilize (even with a lot
>>>> >> of resources), and I think the idea was to provide a kind of LTS release. In
>>>> >> that regard, it is just clearer to name this pandas 1.x than 0.19.x.
>>>> >>
>>>> >> Maybe we can start a separate thread to discuss this 1.0, as there are
>>>> >> of course some questions to discuss:
>>>> >> - do we first release 0.19 (we didn't specifically discuss this, but I
>>>> >> think the rough idea was to have a release candidate sometime in August),
>>>> >> or do we directly aim at 1.0?
>>>> >> - are there certain changes we want to make before 1.0 that are
>>>> >> feasible in the short term?
>>>> >> - are there some of the current ideas of deprecations that we should
>>>> >> exclude/include for this release? (e.g. I think deprecating PanelND (as just
>>>> >> landed in master) is good, but the idea of deprecating Panel should rather
>>>> >> wait until 2.0?)
>>>> >> - ...
>>>> >>
>>>> >> How exactly to tackle those bug fix releases / LTS branch is also
>>>> >> something that can be discussed, but I would not worry too much about that
>>>> >> (there are enough examples of other projects doing something similar, we
>>>> >> just have to search for a process that suits us).
>>>> >>
>>>> >> What I think is a more important issue or problem with this process is the
>>>> >> community of contributors. If we would effectively have a period of about
>>>> >> two years (before a final 2.0 release) where for the current (1.0) version
>>>> >> only certain bug-fixes are considered, while on the other hand it is still
>>>> >> difficult to contribute to the new version, we would maybe have to say no to
>>>> >> many of the PRs or enhancement ideas.
>>>> >> Such a situation could hinder the
>>>> >> process of community contributions and participation.
>>>> >> And there are currently a lot of contributions. As Jeff also said, the
>>>> >> current active contributors are barely keeping up with managing all issues
>>>> >> and pull requests. I have worked the last few weeks more on pandas (thanks
>>>> >> to Continuum), and indeed I spent most of my time answering issues and
>>>> >> reviewing PRs, and hardly have any time to do much coding myself. But of
>>>> >> course this is also a choice that I currently make. And I (we) could also
>>>> >> make the choice to focus more on pandas 1.0/2.0 related issues, or try to
>>>> >> steer some of the active contributors to that.
>>>> >>
>>>> >> I also have some concerns about the compatibility with the rest of the
>>>> >> ecosystem, but at the same time it is clear I think that there should be
>>>> >> some kind of refactor, and it is in the further elaboration of the roadmap
>>>> >> that such concerns can be addressed.
>>>> >>
>>>> >> Joris
>>>> >>
>>>> >> 2016-07-27 12:04 GMT+02:00 Jeff Reback :
>>>> >>>
>>>> >>> I applaud the vision and ambition for the roadmap of the future of
>>>> >>> pandas.
>>>> >>>
>>>> >>> However, the resources are lacking for many of these changes. Currently
>>>> >>> pandas is just barely keeping up with the (recently increased) user flow
>>>> >>> of pull-requests, not to mention the issue reports. These are all great
>>>> >>> indicators of community use and exercising the edge cases.
>>>> >>>
>>>> >>> A roadmap is an excellent start, but the resource question needs to be
>>>> >>> front and center.
>>>> >>>
>>>> >>> The current process *could* evolve into LTS. In 0.19.0, lots of progress
>>>> >>> towards removing older code (and of course deprecating things) is
>>>> >>> happening.
>>>> >>> An aggressive push of this into
>>>> >>> 0.20.0 will go a long way towards de-facto establishing 1.0 / LTS (and
>>>> >>> maybe that's what we simply call 0.20.0).
>>>> >>>
>>>> >>> I would agree we could simply release 1.0 / LTS without adding any 'new'
>>>> >>> features (like fixed getitem indexing and such).
>>>> >>>
>>>> >>> I would like to see 2.0 with a user facing API that is a drop-in
>>>> >>> replacement (though allowing for some breaking changes that are NOT
>>>> >>> back-compat, e.g. getitem indexing). I think it would be acceptable to break
>>>> >>> the back-end API (meaning to numpy) though.
>>>> >>>
>>>> >>> For the resource question, as I have mentioned off-list, I will format
>>>> >>> this roadmap in order for pandas to support a fund-raising effort to garner
>>>> >>> resources for these changes.
>>>> >>>
>>>> >>> Jeff
>>>> >>>
>>>> >>> On Tue, Jul 26, 2016 at 5:13 PM, Stephan Hoyer
>>>> >>> wrote:
>>>> >>>>
>>>> >>>> I know I expressed concerns about cross-compatibility with the rest of
>>>> >>>> the SciPy ecosystem before (especially xarray), but this plan sounds very
>>>> >>>> solid to me. Flexible data types in N-dimensional arrays are important for
>>>> >>>> other use cases, but also not really a problem for pandas.
>>>> >>>>
>>>> >>>> A separate 2.0 release will let us make the major breaking changes to
>>>> >>>> the pandas data model necessary for it to work well in the long term. There
>>>> >>>> are a few other API warts that we will be able to clean up this way (detailed
>>>> >>>> in github.com/pydata/pandas/issues/10000), indexing on DataFrames being the
>>>> >>>> most obvious one.
>>>> >>>>
>>>> >>>> On Tue, Jul 26, 2016 at 1:51 PM, Wes McKinney
>>>> >>>> wrote:
>>>> >>>>>
>>>> >>>>> hi folks,
>>>> >>>>>
>>>> >>>>> As a continuation of ongoing discussions on GitHub and on the mailing
>>>> >>>>> list around deprecations and future innovation and internal reworkings
>>>> >>>>> of pandas, I had a couple of ideas to share that I am looking for
>>>> >>>>> feedback on.
>>>> >>>>>
>>>> >>>>> As far as pandas 0.19.x today, I would like to propose that we
>>>> >>>>> consider releasing the project as pandas 1.0 in the next major release
>>>> >>>>> or the one after. The Python community does have a penchant for
>>>> >>>>> "eternal betas", but after all the hard work of the core developers
>>>> >>>>> and community over the last 5 years, I think we can safely consider
>>>> >>>>> making a stable 1.X production release.
>>>> >>>>>
>>>> >>>>> If we do decide to release pandas 1.0, I also propose that we strongly
>>>> >>>>> consider making 1.X an LTS / Long Term Support branch where we can
>>>> >>>>> continue to make releases, but bug fixes and documentation
>>>> >>>>> improvements only. Or, we can add new features, but on an extremely
>>>> >>>>> conservative basis. This might require some changes to development
>>>> >>>>> process, so looking for feedback on this.
>>>> >>>>>
>>>> >>>>> If we commit to this path, I would suggest that we start a pandas-2.0
>>>> >>>>> integration branch where we can begin more seriously planning and
>>>> >>>>> executing on
>>>> >>>>>
>>>> >>>>> - Cleanup and removal of years' worth of accumulated cruft / legacy code
>>>> >>>>> - Removal of deprecated features
>>>> >>>>> - Series and DataFrame internals revamp.
>>>> >>>>>
>>>> >>>>> I had hoped that 2016 would offer me more time to work on the
>>>> >>>>> internals revamp, but between my day job and the 2nd ed of "Python for
>>>> >>>>> Data Analysis" that turned out to be a little too ambitious.
I have >>>> >>>>> been almost continuously thinking about how to go about this >>>> >>>>> though, >>>> >>>>> and it might be good to figure out a process where we can start >>>> >>>>> documenting and coming up with a more granular development roadmap >>>> >>>>> for >>>> >>>>> this. Part of this will be carefully documenting any APIs we change >>>> >>>>> or >>>> >>>>> unit tests we break along the way. >>>> >>>>> >>>> >>>>> We would want to give ample time for heavy pandas users to run >>>> >>>>> their >>>> >>>>> 3rd-party code based on pandas 2.0-dev to give feedback on whether >>>> >>>>> our >>>> >>>>> assumptions about the impact of changes affect real production >>>> >>>>> code. >>>> >>>>> As a concrete example: integer and boolean Series would be able to >>>> >>>>> accommodate missing data without implicitly casting to float or >>>> >>>>> object >>>> >>>>> NumPy dtype respectively. Since many users will have inserted >>>> >>>>> workarounds / data massaging code because of such rough edges, this >>>> >>>>> may cause code breakage or simply redundancy in some cases. As >>>> >>>>> another >>>> >>>>> example: we should probably remove the .ix indexing attribute >>>> >>>>> altogether. I'm sure many users are still using .ix, but it would >>>> >>>>> be >>>> >>>>> worthwhile to go through such code and decide whether it's really >>>> >>>>> .loc >>>> >>>>> or .iloc. >>>> >>>>> >>>> >>>>> My hope would be (being a deadline-motivated person) that we could >>>> >>>>> see >>>> >>>>> a pandas 2.0 alpha release sometime mid- or 2nd half 2017, with a >>>> >>>>> target beta / pre-production QA release in early 2018 or >>>> >>>>> thereabouts. >>>> >>>>> Part of this would be creating a 1.0 to 2.0 migration guide for >>>> >>>>> users. >>>> >>>>> >>>> >>>>> My biggest concern with pandas in recent years is how not to be >>>> >>>>> held >>>> >>>>> back by strict backwards compatibility and still be able to >>>> >>>>> innovate >>>> >>>>> and stay relevant into the 2020s. 
>>>> >>>>> >>>> >>>>> For pandas 2.0 some of the most important issues I've been thinking >>>> >>>>> about are: >>>> >>>>> >>>> >>>>> - Logical type abstraction layer / decoupling. pandas-only data >>>> >>>>> types >>>> >>>>> (Categorical, DatetimeTZ, Period, etc.) will become equal citizens >>>> >>>>> as >>>> >>>>> compared with data types mapping 1-1 on NumPy numeric dtypes >>>> >>>>> >>>> >>>>> - Decoupling physical storage to permit non-NumPy data structures >>>> >>>>> inside Series >>>> >>>>> >>>> >>>>> - Removal of BlockManager and 2D block consolidation in DataFrame, >>>> >>>>> in >>>> >>>>> favor of a native C++ internal table (vector-of-arrays) data >>>> >>>>> structure >>>> >>>>> >>>> >>>>> - Consistent NA semantics across all data types >>>> >>>>> >>>> >>>>> - Significantly improved handling of string/UTF8 data (performance, >>>> >>>>> memory use -- elimination of PyObject boxes). From the above 2 >>>> >>>>> items, >>>> >>>>> we could even make all string arrays internally categorical (with >>>> >>>>> the >>>> >>>>> option to explicitly cast to categorical) -- in the database world >>>> >>>>> this is often called dictionary encoding. >>>> >>>>> >>>> >>>>> - Refactor of most Cython algorithms into C++11/14 templates >>>> >>>>> >>>> >>>>> - Copy-on-write for Series and DataFrame >>>> >>>>> >>>> >>>>> - Removal of Panel, ndim > 3 data structures >>>> >>>>> >>>> >>>>> - Analytical expression VM (for example -- things like >>>> >>>>> df[boolean_arr].groupby(...).agg(...) could be evaluated by a small >>>> >>>>> Numexpr-like VM, not dissimilar to R's dplyr library, with >>>> >>>>> significantly improved memory use and maybe performance too) >>>> >>>>> >>>> >>>>> There's a lot to unpack here, but let me know what everyone thinks >>>> >>>>> about these things. The "pandas 2.0" / internals revamp discussion >>>> >>>>> we >>>> >>>>> can tackle in a separate thread or in perhaps in a GitHub repo or >>>> >>>>> design folder in the pandas codebase. 
>>>> >>>>>
>>>> >>>>> Thanks,
>>>> >>>>> Wes
>>>> >>>>> _______________________________________________
>>>> >>>>> Pandas-dev mailing list
>>>> >>>>> Pandas-dev at python.org
>>>> >>>>> https://mail.python.org/mailman/listinfo/pandas-dev

From andy.terrel at gmail.com Mon Aug 1 16:54:37 2016
From: andy.terrel at gmail.com (Andy Ray Terrel)
Date: Mon, 1 Aug 2016 15:54:37 -0500
Subject: [Pandas-dev] Proposal: pandas 1.0 and ideas for pandas 2.0 / future

Crazy thought.

Perhaps y'all could put together a road map and the resources you will need to get it done (as in money for FTEs). I would like to see NumFOCUS try to push our sponsors to fund more FTEs for projects like this. If we have a road map in hand it makes the conversations much more tangible.
-- Andy

On Sun, Jul 31, 2016 at 5:03 PM, Joris Van den Bossche <jorisvandenbossche at gmail.com> wrote:

> Regarding 1), I agree it is a good idea to push a 0.19.0 release soonish,
> and we can then discuss what we further want to do (or not to do) for the
> 1.0 release. I am on holidays the coming week and a half, but afterwards I
> will also focus on getting 0.19.0 out. A release candidate in the last week
> of August is maybe a good deadline?
>
> Joris
>
> 2016-07-29 0:15 GMT+02:00 Wes McKinney:
>
>> OK, let me try to collect some of the feedback and give my thoughts
>>
>> 1) 0.19 and 0.20: I think we should push to release 0.19.0 soon and
>> then plan what we want to add/change/deprecate for 1.0, which might
>> otherwise have been 0.20. I would rather not delay 0.19.0: we already
>> pushed back 0.18.2, and there are some significant new patches
>> (merge_asof and variable rolling windows) that it would be good to get
>> into production before we declare a stable 1.0.
>>
>> 2) We will need to raise a significant amount of money for pandas (I
>> estimate in the ballpark of US $300-500K -- better to have too much
>> than too little) to be able to pursue the pandas 2.0 plan
>> wholeheartedly. I would like to dedicate a minimum of 5-10 hours per
>> week to it in 2017 but this will not be sufficient to do everything (I
>> am also a human being, and have a day job). It would be better to
>> collaborate with one or two good freelance developers (with proven
>> experience in C++ and Python) who are spending at least 50% of their
>> time on pandas next year. I am going to start spending some time on
>> design documentation so that we can start resolving some of the design
>> questions and tradeoffs (not all of these decisions will be easy).
>> We'll work on this offline and look to start soliciting funding (if
>> anyone with the ability to write checks is reading, feel free to
>> contact me offline).
>> >> 3) I agree we will need to come up with a development process that >> facilitates both an invasive modification of pandas internals while >> also supporting production users of pandas 1.X. Cherry-picking bug >> fixes into the pandas 2.x branch will grow increasingly complicated; >> we need to factor this into our process (for example: we might collect >> all the unit tests for bug fixes -- assuming they rely on definitely >> stable behavior -- into a "to fix" folder so that we can return and >> adapt the bug fixes once the 2.x branch is getting more stable). To >> have developers both maintaining 1.x and trying to drive forward the >> 2.x branch at the same time does not seem realistic -- we should talk >> to the IPython/Jupyter devs to understand how they handled this >> through their long-lived IPython 1.0 branch IIRC (see >> http://ipython.org/news.html#ipython-1-0). >> >> 4) My goal, which I think we're all aligned on, would be for pandas >> 2.0 to be a drop-in replacement for 90-95% of normal pandas use. Many >> power users will have embraced some of the idiosyncrasies of pandas's >> implementation details, but I think some of the changes (e.g. missing >> data consistency, copy-on-write / improved semantics around memory >> ownership and views) will be welcomed. We should clearly document (in >> a dedicated "pandas's internal relationship with NumPy" document) and >> maintain very tight contracts around what kinds of zero-copy NumPy >> interoperability are supported -- it is not clear to me for example >> that arrays of Python string/unicode objects are a NumPy use case that >> is especially important to preserve, but most numeric data use cases >> are. This will also be helpful for power users to understand the >> nuances and how things are going to stay the same or change (for >> example: boolean and integer arrays with NAs will probably not be >> zero-copyable to NumPy arrays). >> >> We should maybe start side threads about each of these items. 
Just >> deciding what we want to deprecate or do in 0.20 aka 1.0 is a large >> enough task. >> >> Thanks all >> Wes >> >> On Wed, Jul 27, 2016 at 8:39 PM, G Young wrote: >> > 1) I would be in favour of releasing 0.19.0 in part because we already >> > pushed back and actually forgone 0.18.2. I think these plans are better >> > served for the release after this one to give more time to map this but >> also >> > to push out the changes that have already been made in preparation for >> this >> > release. >> > >> > 2) In terms of organisation, I wonder if we might be better served >> > reorganising the way in which PR's are reviewed during the time period >> > between one release and the next instead of having these parallel >> tracks of >> > development in light of the concern brought up by @jorisvanenbossche. >> > Perhaps rather than just reviewing PR's as they come in, specify which >> types >> > of PR's should be submitted during certain periods of time. >> > >> > For example, a large chunk of the period could be devoted to accepting >> > enhancements / new features after which the remaining time before a >> release >> > could be devoted to just organisation / refactoring / deprecations / >> what >> > have you (maybe include bug fixes too). That way we could have a >> contiguous >> > block of time to focus on stabilising and tidying up the release. It >> would >> > also allow for the refactoring to take place (perhaps incrementally) >> without >> > the concern of being destabilised by a new feature. >> > >> > For this to work, this would have to be clearly stated in the >> contributing >> > docs as well as circulated in emails to pandas-dev AND other related >> groups >> > so that way people know what's going on in terms of the development >> cycle. >> > >> > >> > >> > On Wed, Jul 27, 2016 at 7:51 PM, Joris Van den Bossche >> > wrote: >> >> >> >> Wes, thanks for your mail! >> >> >> >> I like the idea of first releasing a pandas 1.0 before the 'big >> refactor'. 
>> >> We for sure know that this will take a while to stabilize (even with a
>> >> lot of resources), and I think the idea was to provide a kind of LTS
>> >> release. In that regard, it is just clearer to name this pandas 1.x
>> >> than 0.19.x.
>> >>
>> >> Maybe we can start a separate thread to discuss this 1.0, as there are
>> >> of course some questions to settle:
>> >> - do we first release 0.19 (we didn't specifically discuss this, but I
>> >> think the rough idea was to have a release candidate sometime in
>> >> August), or do we directly aim at 1.0?
>> >> - are there certain changes we want to make before 1.0 that are
>> >> feasible in the short term?
>> >> - are there some of the current deprecation ideas that we should
>> >> exclude/include for this release? (e.g. I think deprecating PanelND (as
>> >> just landed in master) is good, but the idea of deprecating Panel
>> >> should rather wait until 2.0?)
>> >> - ...
>> >>
>> >> How exactly to tackle those bug fix releases / LTS branch is also
>> >> something that can be discussed, but I would not worry too much about
>> >> that (there are enough examples of other projects doing something
>> >> similar, we just have to search for a process that suits us).
>> >>
>> >> What I think is a more important issue or problem with this process is
>> >> the community of contributors. If we would effectively have a period
>> >> of about two years (before a final 2.0 release) where for the current
>> >> (1.0) version only certain bug-fixes are considered, while on the other
>> >> hand it is still difficult to contribute to the new version, we would
>> >> maybe have to say no to many of the PRs or enhancement ideas. Such a
>> >> situation could hinder the process of community contributions and
>> >> participation.
>> >> And there are currently a lot of contributions.
As Jeff also said, the >> >> current active contributors are barely keeping up with managing all >> issues >> >> and pull requests. I have worked the last few weeks more on pandas >> (thanks >> >> to Continuum), and indeed I spent most of my time answering issues and >> >> reviewing PRs, and hardly have any time to do much coding myself. But >> of >> >> course this is also a choice that I currently make. And I (we) could >> also >> >> make the choice to focus more on pandas 1.0/2.0 related issues, or try >> to >> >> steer some of the active contributors to that. >> >> >> >> I also have some concerns about the compatibility with the rest of the >> >> ecosystem, but at the same time it is clear I think that there should >> be >> >> some kind of refactor, and it is in the further elaboration of the >> roadmap >> >> that such concerns can be addressed. >> >> >> >> Joris >> >> >> >> >> >> >> >> 2016-07-27 12:04 GMT+02:00 Jeff Reback : >> >>> >> >>> I applaud the vision and ambition for the roadmap of the future of >> >>> pandas. >> >>> >> >>> However, the resources are lacking for much of these changes. >> Currently >> >>> pandas is just barely keeping up with the (recently increased) user >> flow >> >>> of pull-requests, not to mention the issue reports. These are all >> great >> >>> indicators >> >>> of community use and exercising the edge cases. >> >>> >> >>> A roadmap is an excellent start, but the resource question needs to be >> >>> front and center. >> >>> >> >>> The current process *could* evolve into LTS. In 0.19.0, lots of >> progress >> >>> towards removing >> >>> older code (and of course deprecating things) is happening. An >> aggressive >> >>> push of this into >> >>> 0.20.0 will go a long ways towards de-facto establishing 1.0 / LTS. >> (and >> >>> maybe that's what we simply >> >>> call 0.20.0). >> >>> >> >>> I would agree we could simply release 1.0 / LTS without adding any >> 'new' >> >>> features (like fixed getitem indexing >> >>> and such). 
>> >>> >> >>> I would like to see 2.0 with a user facing API that is a drop-in >> >>> replacement (though allowing for some breaking changes that are NOT >> >>> back-compat, e.g. getitem indexing). I think it would be acceptable >> to break >> >>> the back-end API (meaning to numpy) though. >> >>> >> >>> For the resource question, as I have mentioned off-list, I will format >> >>> this roadmap in order for pandas to support a fund-raising effort to >> garner >> >>> resources for these changes. >> >>> >> >>> Jeff >> >>> >> >>> On Tue, Jul 26, 2016 at 5:13 PM, Stephan Hoyer >> wrote: >> >>>> >> >>>> I know I expressed concerns about cross-compatibility with the rest >> of >> >>>> the SciPy ecosystem before (especially xarray), but this plan sounds >> very >> >>>> solid to me. Flexible data types in N-dimensional arrays are >> important for >> >>>> other use cases, but also not really a problem for pandas. >> >>>> >> >>>> A separate 2.0 release will let us make the major breaking changes to >> >>>> the pandas data model necessary for it to work well in the long >> term. There >> >>>> are a few other API warts that will be able to clean up this way >> (detailed >> >>>> in github.com/pydata/pandas/issues/10000), indexing on DataFrames >> being the >> >>>> most obvious one. >> >>>> >> >>>> On Tue, Jul 26, 2016 at 1:51 PM, Wes McKinney >> >>>> wrote: >> >>>>> >> >>>>> hi folks, >> >>>>> >> >>>>> As a continuation of ongoing discussions on GitHub and on the >> mailing >> >>>>> list around deprecations and future innovation and internal >> reworkings >> >>>>> of pandas, I had a couple of ideas to share that I am looking for >> >>>>> feedback on. >> >>>>> >> >>>>> As far as pandas 0.19.x today, I would like to propose that we >> >>>>> consider releasing the project as pandas 1.0 in the next major >> release >> >>>>> or the one after. 
The Python community does have a penchant for >> >>>>> "eternal betas", but after all the hard work of the core developers >> >>>>> and community over the last 5 years, I think we can safely consider >> >>>>> making a stable 1.X production release. >> >>>>> >> >>>>> If we do decide to release pandas 1.0, I also propose that we >> strongly >> >>>>> consider making 1.X an LTS / Long Term Support branch where we can >> >>>>> continue to make releases, but bug fixes and documentation >> >>>>> improvements only. Or, we can add new features, but on an extremely >> >>>>> conservative basis. This might require some changes to development >> >>>>> process, so looking for feedback on this. >> >>>>> >> >>>>> If we commit to this path, I would suggest that we start a >> pandas-2.0 >> >>>>> integration branch where we can begin more seriously planning and >> >>>>> executing on >> >>>>> >> >>>>> - Cleanup and removal of years' worth of accumulated cruft / legacy >> >>>>> code >> >>>>> - Removal of deprecated features >> >>>>> - Series and DataFrame internals revamp. >> >>>>> >> >>>>> I had hoped that 2016 would offer me more time to work on the >> >>>>> internals revamp, but between my day job and the 2nd ed of "Python >> for >> >>>>> Data Analysis" that turned out to be a little too ambitious. I have >> >>>>> been almost continuously thinking about how to go about this though, >> >>>>> and it might be good to figure out a process where we can start >> >>>>> documenting and coming up with a more granular development roadmap >> for >> >>>>> this. Part of this will be carefully documenting any APIs we change >> or >> >>>>> unit tests we break along the way. >> >>>>> >> >>>>> We would want to give ample time for heavy pandas users to run their >> >>>>> 3rd-party code based on pandas 2.0-dev to give feedback on whether >> our >> >>>>> assumptions about the impact of changes affect real production code. 
>> >>>>> As a concrete example: integer and boolean Series would be able to >> >>>>> accommodate missing data without implicitly casting to float or >> object >> >>>>> NumPy dtype respectively. Since many users will have inserted >> >>>>> workarounds / data massaging code because of such rough edges, this >> >>>>> may cause code breakage or simply redundancy in some cases. As >> another >> >>>>> example: we should probably remove the .ix indexing attribute >> >>>>> altogether. I'm sure many users are still using .ix, but it would be >> >>>>> worthwhile to go through such code and decide whether it's really >> .loc >> >>>>> or .iloc. >> >>>>> >> >>>>> My hope would be (being a deadline-motivated person) that we could >> see >> >>>>> a pandas 2.0 alpha release sometime mid- or 2nd half 2017, with a >> >>>>> target beta / pre-production QA release in early 2018 or >> thereabouts. >> >>>>> Part of this would be creating a 1.0 to 2.0 migration guide for >> users. >> >>>>> >> >>>>> My biggest concern with pandas in recent years is how not to be held >> >>>>> back by strict backwards compatibility and still be able to innovate >> >>>>> and stay relevant into the 2020s. >> >>>>> >> >>>>> For pandas 2.0 some of the most important issues I've been thinking >> >>>>> about are: >> >>>>> >> >>>>> - Logical type abstraction layer / decoupling. pandas-only data >> types >> >>>>> (Categorical, DatetimeTZ, Period, etc.) 
will become equal citizens >> as >> >>>>> compared with data types mapping 1-1 on NumPy numeric dtypes >> >>>>> >> >>>>> - Decoupling physical storage to permit non-NumPy data structures >> >>>>> inside Series >> >>>>> >> >>>>> - Removal of BlockManager and 2D block consolidation in DataFrame, >> in >> >>>>> favor of a native C++ internal table (vector-of-arrays) data >> structure >> >>>>> >> >>>>> - Consistent NA semantics across all data types >> >>>>> >> >>>>> - Significantly improved handling of string/UTF8 data (performance, >> >>>>> memory use -- elimination of PyObject boxes). From the above 2 >> items, >> >>>>> we could even make all string arrays internally categorical (with >> the >> >>>>> option to explicitly cast to categorical) -- in the database world >> >>>>> this is often called dictionary encoding. >> >>>>> >> >>>>> - Refactor of most Cython algorithms into C++11/14 templates >> >>>>> >> >>>>> - Copy-on-write for Series and DataFrame >> >>>>> >> >>>>> - Removal of Panel, ndim > 3 data structures >> >>>>> >> >>>>> - Analytical expression VM (for example -- things like >> >>>>> df[boolean_arr].groupby(...).agg(...) could be evaluated by a small >> >>>>> Numexpr-like VM, not dissimilar to R's dplyr library, with >> >>>>> significantly improved memory use and maybe performance too) >> >>>>> >> >>>>> There's a lot to unpack here, but let me know what everyone thinks >> >>>>> about these things. The "pandas 2.0" / internals revamp discussion >> we >> >>>>> can tackle in a separate thread or in perhaps in a GitHub repo or >> >>>>> design folder in the pandas codebase. 
>> >>>>>
>> >>>>> Thanks,
>> >>>>> Wes

From wesmckinn at gmail.com Thu Aug 11 12:06:55 2016
From: wesmckinn at gmail.com (Wes McKinney)
Date: Thu, 11 Aug 2016 09:06:55 -0700
Subject: [Pandas-dev] Proposal: pandas 1.0 and ideas for pandas 2.0 / future

ICYMI, we have a discussion going about some of the ideas referenced here (and in discussions earlier this year) for making changes to pandas's internals:

https://github.com/pydata/pandas/pull/13944

There is also the discussion around what we may call "pandas 1.0", possibly (if we reach consensus about it) a stable maintenance release similar to the way that IPython / Jupyter approached its internal rearchitecture:

https://github.com/pydata/pandas/issues/10000

Interested developers and users of pandas are highly encouraged to get involved in these discussions and contribute their perspectives, even if they don't plan to help do the actual coding work.

cheers
Wes

On Mon, Aug 1, 2016 at 2:11 PM, Wes McKinney wrote:
> Masaaki -- on your point re: accepting new features into the 1.x
> branch: the main issue is how we can keep a pandas 2.0 branch (which
> will be unstable for the first 3-6 months of its life) relatively in
> sync with 1.x until the 2.0 branch stabilizes.
>
> The worst case scenario is that you have to do double the amount of
> work for each pull request (essentially: independent patches to 1.x
> and 2.x), but if it could be reduced to 1.5x as much work then perhaps
> that's OK. Even "forward-porting" bug fixes will be a challenge. We
> shouldn't allow these things to halt progress on advancing the library
> internals to a more sustainable / future-proof place.
>
> Our problem is not unlike the Python language moratorium instituted in
> 2009: https://www.python.org/dev/peps/pep-3003/.
>
> - Wes
>
> On Mon, Aug 1, 2016 at 2:01 PM, Wes McKinney wrote:
>> hey Andy -- that makes sense to me.
What I'm hoping to do this month >> is scope out a more granular plan for the specific things (problems >> and their possible solutions with lists of pros/cons of various >> approaches) we want to accomplish in a pandas 2.x effort and make sure >> we all agree (up to 70-80% of the big picture items). If we're going >> to raise a significant amount of money we owe it to the donors to >> explain how the money will be directed, and we won't want to be >> dealing with a lot of uncertainty about the roadmap once we have >> engaged FTEs beginning to help with moving things forward. >> >> - Wes >> >> On Mon, Aug 1, 2016 at 1:54 PM, Andy Ray Terrel wrote: >>> Crazy thought. >>> >>> Perhaps ya'll could put together a road map and resources you will need to >>> get it done (as in money for FTEs). I would like to see NumFOCUS try to push >>> our sponsors to fund more FTEs for projects like this. If we have a road map >>> in hand it makes the conversations much more tangible. >>> >>> -- Andy >>> >>> On Sun, Jul 31, 2016 at 5:03 PM, Joris Van den Bossche >>> wrote: >>>> >>>> Regarding 1), I agree it is a good idea to push a 0.19.0 release soonish, >>>> en we can then discuss what we further want to do (or not to do) for the 1.0 >>>> release. I am on holidays the coming week and a half, but afterwards I will >>>> also focus on getting 0.19.0 out. A release candidate in the last week of >>>> August is maybe a good deadline? >>>> >>>> Joris >>>> >>>> 2016-07-29 0:15 GMT+02:00 Wes McKinney : >>>>> >>>>> OK, let me try to collect some of the feedback and give my thoughts >>>>> >>>>> 1) 0.19 and 0.20: I think we should push to release 0.19.0 soon and >>>>> then plan what we want to add/change/deprecate for 1.0 which might >>>>> otherwise have been 1.0. 
I think delaying 0.19.0 since we already >>>>> pushed back 0.18.2, and there are some significant new patches >>>>> (asof_merge and variable rolling windows), it would be good to get >>>>> this into production before we declare a stable 1.0. >>>>> >>>>> 2) We will need to raise a significant amount of money for pandas (I >>>>> estimate in the ballpark of US $300-500K -- better to have too much >>>>> than too little) to be able to pursue the pandas 2.0 plan >>>>> wholeheartedly. I would like to dedicate a minimum 5-10 hours per week >>>>> to it in 2017 but this will not be sufficient to do everything (I am >>>>> also a human being, and have a day job). It would be better to >>>>> collaborate with one or two good freelance developers (with proven >>>>> experience in C++ and Python) who are spending at least 50% of their >>>>> time on pandas next year. I am going to start spending some time on >>>>> design documentation so that we can start resolving some of the design >>>>> questions and tradeoffs (not all of these decisions will be easy). >>>>> We'll work on this offline and look to start soliciting funding (if >>>>> anyone with the ability to write checks is reading, feel free to >>>>> contact me offline). >>>>> >>>>> 3) I agree we will need to come up with a development process that >>>>> facilitates both an invasive modification of pandas internals while >>>>> also supporting production users of pandas 1.X. Cherry-picking bug >>>>> fixes into the pandas 2.x branch will grow increasingly complicated; >>>>> we need to factor this into our process (for example: we might collect >>>>> all the unit tests for bug fixes -- assuming they rely on definitely >>>>> stable behavior -- into a "to fix" folder so that we can return and >>>>> adapt the bug fixes once the 2.x branch is getting more stable). 
To >>>>> have developers both maintaining 1.x and trying to drive forward the >>>>> 2.x branch at the same time does not seem realistic -- we should talk >>>>> to the IPython/Jupyter devs to understand how they handled this >>>>> through their long-lived IPython 1.0 branch IIRC (see >>>>> http://ipython.org/news.html#ipython-1-0). >>>>> >>>>> 4) My goal, which I think we're all aligned on, would be for pandas >>>>> 2.0 to be a drop-in replacement for 90-95% of normal pandas use. Many >>>>> power users will have embraced some of the idiosyncrasies of pandas's >>>>> implementation details, but I think some of the changes (e.g. missing >>>>> data consistency, copy-on-write / improved semantics around memory >>>>> ownership and views) will be welcomed. We should clearly document (in >>>>> a dedicated "pandas's internal relationship with NumPy" document) and >>>>> maintain very tight contracts around what kinds of zero-copy NumPy >>>>> interoperability are supported -- it is not clear to me for example >>>>> that arrays of Python string/unicode objects are a NumPy use case that >>>>> is especially important to preserve, but most numeric data use cases >>>>> are. This will also be helpful for power users to understand the >>>>> nuances and how things are going to stay the same or change (for >>>>> example: boolean and integer arrays with NAs will probably not be >>>>> zero-copyable to NumPy arrays). >>>>> >>>>> We should maybe start side threads about each of these items. Just >>>>> deciding what we want to deprecate or do in 0.20 aka 1.0 is a large >>>>> enough task. >>>>> >>>>> Thanks all >>>>> Wes >>>>> >>>>> On Wed, Jul 27, 2016 at 8:39 PM, G Young wrote: >>>>> > 1) I would be in favour of releasing 0.19.0 in part because we already >>>>> > pushed back and actually forgone 0.18.2. 
I think these plans are >>>>> > better >>>>> > served for the release after this one to give more time to map this but >>>>> > also >>>>> > to push out the changes that have already been made in preparation for >>>>> > this >>>>> > release. >>>>> > >>>>> > 2) In terms of organisation, I wonder if we might be better served >>>>> > reorganising the way in which PR's are reviewed during the time period >>>>> > between one release and the next instead of having these parallel >>>>> > tracks of >>>>> > development in light of the concern brought up by @jorisvanenbossche. >>>>> > Perhaps rather than just reviewing PR's as they come in, specify which >>>>> > types >>>>> > of PR's should be submitted during certain periods of time. >>>>> > >>>>> > For example, a large chunk of the period could be devoted to accepting >>>>> > enhancements / new features after which the remaining time before a >>>>> > release >>>>> > could be devoted to just organisation / refactoring / deprecations / >>>>> > what >>>>> > have you (maybe include bug fixes too). That way we could have a >>>>> > contiguous >>>>> > block of time to focus on stabilising and tidying up the release. It >>>>> > would >>>>> > also allow for the refactoring to take place (perhaps incrementally) >>>>> > without >>>>> > the concern of being destabilised by a new feature. >>>>> > >>>>> > For this to work, this would have to be clearly stated in the >>>>> > contributing >>>>> > docs as well as circulated in emails to pandas-dev AND other related >>>>> > groups >>>>> > so that way people know what's going on in terms of the development >>>>> > cycle. >>>>> > >>>>> > >>>>> > >>>>> > On Wed, Jul 27, 2016 at 7:51 PM, Joris Van den Bossche >>>>> > wrote: >>>>> >> >>>>> >> Wes, thanks for your mail! >>>>> >> >>>>> >> I like the idea of first releasing a pandas 1.0 before the 'big >>>>> >> refactor'. 
>>>>> >> We for sure know that this will take a while to stabilize (even with a >>>>> >> lot >>>>> >> of resources), and I think the idea was to provide a kind of LTS >>>>> >> release. In >>>>> >> that regard, it is just clearer to name this pandas 1.x than 0.19.x. >>>>> >> >>>>> >> Maybe we can start a separate thread to discuss this 1.0, as there >>>>> >> are >>>>> >> of course some questions to discuss: >>>>> >> - do we first release 0.19 (we didn't specifically discuss this, but I >>>>> >> think the rough idea was to have a release candidate sometime in August), >>>>> >> or do we directly aim at 1.0? >>>>> >> - are there certain changes we want to make before 1.0 that are >>>>> >> feasible in the short term? >>>>> >> - are there some of the current deprecation ideas that we should >>>>> >> exclude/include for this release? (e.g. I think deprecating PanelND (as >>>>> >> just >>>>> >> landed in master) is good, but the idea of deprecating Panel should >>>>> >> rather >>>>> >> wait until 2.0?) >>>>> >> - ... >>>>> >> >>>>> >> How exactly to tackle those bug fix releases / LTS branch is also >>>>> >> something that can be discussed, but I would not worry too much about >>>>> >> that >>>>> >> (there are enough examples of other projects doing something similar, >>>>> >> we >>>>> >> just have to search for a process that suits us). >>>>> >> >>>>> >> What I think is a more important issue or problem with this process is >>>>> >> the >>>>> >> community of contributors. We would effectively have a period of >>>>> >> about >>>>> >> two years (before a final 2.0 release) where for the current (1.0) >>>>> >> version >>>>> >> only certain bug-fixes are considered, while on the other hand it is >>>>> >> still >>>>> >> difficult to contribute to the new version. We would maybe have to say >>>>> >> no to >>>>> >> many of the PRs or enhancement ideas. Such a situation could hinder >>>>> >> the >>>>> >> process of community contributions and participation.
>>>>> >> And there are currently a lot of contributions. As Jeff also said, the >>>>> >> current active contributors are barely keeping up with managing all >>>>> >> issues >>>>> >> and pull requests. I have worked the last few weeks more on pandas >>>>> >> (thanks >>>>> >> to Continuum), and indeed I spent most of my time answering issues and >>>>> >> reviewing PRs, and hardly have any time to do much coding myself. But >>>>> >> of >>>>> >> course this is also a choice that I currently make. And I (we) could >>>>> >> also >>>>> >> make the choice to focus more on pandas 1.0/2.0 related issues, or try >>>>> >> to >>>>> >> steer some of the active contributors to that. >>>>> >> >>>>> >> I also have some concerns about the compatibility with the rest of the >>>>> >> ecosystem, but at the same time it is clear I think that there should >>>>> >> be >>>>> >> some kind of refactor, and it is in the further elaboration of the >>>>> >> roadmap >>>>> >> that such concerns can be addressed. >>>>> >> >>>>> >> Joris >>>>> >> >>>>> >> >>>>> >> >>>>> >> 2016-07-27 12:04 GMT+02:00 Jeff Reback : >>>>> >>> >>>>> >>> I applaud the vision and ambition for the roadmap of the future of >>>>> >>> pandas. >>>>> >>> >>>>> >>> However, the resources are lacking for much of these changes. >>>>> >>> Currently >>>>> >>> pandas is just barely keeping up with the (recently increased) user >>>>> >>> flow >>>>> >>> of pull-requests, not to mention the issue reports. These are all >>>>> >>> great >>>>> >>> indicators >>>>> >>> of community use and exercising the edge cases. >>>>> >>> >>>>> >>> A roadmap is an excellent start, but the resource question needs to >>>>> >>> be >>>>> >>> front and center. >>>>> >>> >>>>> >>> The current process *could* evolve into LTS. In 0.19.0, lots of >>>>> >>> progress >>>>> >>> towards removing >>>>> >>> older code (and of course deprecating things) is happening. 
An >>>>> >>> aggressive >>>>> >>> push of this into >>>>> >>> 0.20.0 will go a long way towards de-facto establishing 1.0 / LTS >>>>> >>> (and >>>>> >>> maybe that's what we simply >>>>> >>> call 0.20.0). >>>>> >>> >>>>> >>> I would agree we could simply release 1.0 / LTS without adding any >>>>> >>> 'new' >>>>> >>> features (like fixed getitem indexing >>>>> >>> and such). >>>>> >>> >>>>> >>> I would like to see 2.0 with a user-facing API that is a drop-in >>>>> >>> replacement (though allowing for some breaking changes that are NOT >>>>> >>> back-compat, e.g. getitem indexing). I think it would be acceptable >>>>> >>> to break >>>>> >>> the back-end API (meaning to numpy) though. >>>>> >>> >>>>> >>> For the resource question, as I have mentioned off-list, I will >>>>> >>> format >>>>> >>> this roadmap in order for pandas to support a fund-raising effort to >>>>> >>> garner >>>>> >>> resources for these changes. >>>>> >>> >>>>> >>> Jeff >>>>> >>> >>>>> >>> On Tue, Jul 26, 2016 at 5:13 PM, Stephan Hoyer >>>>> >>> wrote: >>>>> >>>> >>>>> >>>> I know I expressed concerns about cross-compatibility with the rest >>>>> >>>> of >>>>> >>>> the SciPy ecosystem before (especially xarray), but this plan sounds >>>>> >>>> very >>>>> >>>> solid to me. Flexible data types in N-dimensional arrays are >>>>> >>>> important for >>>>> >>>> other use cases, but also not really a problem for pandas. >>>>> >>>> >>>>> >>>> A separate 2.0 release will let us make the major breaking changes >>>>> >>>> to >>>>> >>>> the pandas data model necessary for it to work well in the long >>>>> >>>> term. There >>>>> >>>> are a few other API warts that we will be able to clean up this way >>>>> >>>> (detailed >>>>> >>>> in github.com/pydata/pandas/issues/10000), indexing on DataFrames >>>>> >>>> being the >>>>> >>>> most obvious one.
>>>>> >>>> >>>>> >>>> On Tue, Jul 26, 2016 at 1:51 PM, Wes McKinney >>>>> >>>> wrote: >>>>> >>>>> >>>>> >>>>> hi folks, >>>>> >>>>> >>>>> >>>>> As a continuation of ongoing discussions on GitHub and on the >>>>> >>>>> mailing >>>>> >>>>> list around deprecations and future innovation and internal >>>>> >>>>> reworkings >>>>> >>>>> of pandas, I had a couple of ideas to share that I am looking for >>>>> >>>>> feedback on. >>>>> >>>>> >>>>> >>>>> As far as pandas 0.19.x today, I would like to propose that we >>>>> >>>>> consider releasing the project as pandas 1.0 in the next major >>>>> >>>>> release >>>>> >>>>> or the one after. The Python community does have a penchant for >>>>> >>>>> "eternal betas", but after all the hard work of the core developers >>>>> >>>>> and community over the last 5 years, I think we can safely consider >>>>> >>>>> making a stable 1.X production release. >>>>> >>>>> >>>>> >>>>> If we do decide to release pandas 1.0, I also propose that we >>>>> >>>>> strongly >>>>> >>>>> consider making 1.X an LTS / Long Term Support branch where we can >>>>> >>>>> continue to make releases, but bug fixes and documentation >>>>> >>>>> improvements only. Or, we can add new features, but on an extremely >>>>> >>>>> conservative basis. This might require some changes to development >>>>> >>>>> process, so looking for feedback on this. >>>>> >>>>> >>>>> >>>>> If we commit to this path, I would suggest that we start a >>>>> >>>>> pandas-2.0 >>>>> >>>>> integration branch where we can begin more seriously planning and >>>>> >>>>> executing on >>>>> >>>>> >>>>> >>>>> - Cleanup and removal of years' worth of accumulated cruft / legacy >>>>> >>>>> code >>>>> >>>>> - Removal of deprecated features >>>>> >>>>> - Series and DataFrame internals revamp. 
>>>>> >>>>> >>>>> >>>>> I had hoped that 2016 would offer me more time to work on the >>>>> >>>>> internals revamp, but between my day job and the 2nd ed of "Python >>>>> >>>>> for >>>>> >>>>> Data Analysis" that turned out to be a little too ambitious. I have >>>>> >>>>> been almost continuously thinking about how to go about this >>>>> >>>>> though, >>>>> >>>>> and it might be good to figure out a process where we can start >>>>> >>>>> documenting and coming up with a more granular development roadmap >>>>> >>>>> for >>>>> >>>>> this. Part of this will be carefully documenting any APIs we change >>>>> >>>>> or >>>>> >>>>> unit tests we break along the way. >>>>> >>>>> >>>>> >>>>> We would want to give ample time for heavy pandas users to run >>>>> >>>>> their >>>>> >>>>> 3rd-party code based on pandas 2.0-dev to give feedback on whether >>>>> >>>>> our >>>>> >>>>> assumptions about the impact of changes affect real production >>>>> >>>>> code. >>>>> >>>>> As a concrete example: integer and boolean Series would be able to >>>>> >>>>> accommodate missing data without implicitly casting to float or >>>>> >>>>> object >>>>> >>>>> NumPy dtype respectively. Since many users will have inserted >>>>> >>>>> workarounds / data massaging code because of such rough edges, this >>>>> >>>>> may cause code breakage or simply redundancy in some cases. As >>>>> >>>>> another >>>>> >>>>> example: we should probably remove the .ix indexing attribute >>>>> >>>>> altogether. I'm sure many users are still using .ix, but it would >>>>> >>>>> be >>>>> >>>>> worthwhile to go through such code and decide whether it's really >>>>> >>>>> .loc >>>>> >>>>> or .iloc. >>>>> >>>>> >>>>> >>>>> My hope would be (being a deadline-motivated person) that we could >>>>> >>>>> see >>>>> >>>>> a pandas 2.0 alpha release sometime mid- or 2nd half 2017, with a >>>>> >>>>> target beta / pre-production QA release in early 2018 or >>>>> >>>>> thereabouts. 
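The integer example above (missing data without an implicit cast to float) can be sketched in plain Python, using only the standard library rather than pandas or NumPy. The first part shows why the float cast is a rough edge: float64 cannot represent every 64-bit integer exactly. The second part shows one possible alternative, a value array plus a separate validity mask. The `MaskedInt64` class and its methods are hypothetical names made up for this illustration; the thread does not say how pandas 2.0 would actually store missing integers.

```python
from array import array
import math

# Rough edge: marking a missing integer with NaN forces the whole
# column to float64, which cannot hold every 64-bit integer exactly.
big = 2**53 + 1
as_float_column = [float(big), math.nan]  # int column with one NA, cast to float
lost_precision = int(as_float_column[0]) != big


class MaskedInt64:
    """Hypothetical int64 column: values stay integers, and a separate
    validity mask records which slots are missing."""

    def __init__(self, values):
        # Missing slots store a placeholder 0; the mask marks them invalid.
        self.data = array("q", (0 if v is None else v for v in values))
        self.valid = [v is not None for v in values]

    def __getitem__(self, i):
        return self.data[i] if self.valid[i] else None

    def sum(self):
        # Skip missing slots; the result is still an exact Python int.
        return sum(v for v, ok in zip(self.data, self.valid) if ok)


col = MaskedInt64([big, None, 3])
print(lost_precision)            # did the float round-trip change the value?
print(col[0] == big, col[1], col.sum())
```

User code that today checks for NaN or casts back to int after a float round-trip is the kind of workaround that would become redundant, or break, under such a change.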
>>>>> >>>>> Part of this would be creating a 1.0 to 2.0 migration guide for >>>>> >>>>> users. >>>>> >>>>> >>>>> >>>>> My biggest concern with pandas in recent years is how not to be >>>>> >>>>> held >>>>> >>>>> back by strict backwards compatibility and still be able to >>>>> >>>>> innovate >>>>> >>>>> and stay relevant into the 2020s. >>>>> >>>>> >>>>> >>>>> For pandas 2.0 some of the most important issues I've been thinking >>>>> >>>>> about are: >>>>> >>>>> >>>>> >>>>> - Logical type abstraction layer / decoupling. pandas-only data >>>>> >>>>> types >>>>> >>>>> (Categorical, DatetimeTZ, Period, etc.) will become equal citizens >>>>> >>>>> as >>>>> >>>>> compared with data types mapping 1-1 on NumPy numeric dtypes >>>>> >>>>> >>>>> >>>>> - Decoupling physical storage to permit non-NumPy data structures >>>>> >>>>> inside Series >>>>> >>>>> >>>>> >>>>> - Removal of BlockManager and 2D block consolidation in DataFrame, >>>>> >>>>> in >>>>> >>>>> favor of a native C++ internal table (vector-of-arrays) data >>>>> >>>>> structure >>>>> >>>>> >>>>> >>>>> - Consistent NA semantics across all data types >>>>> >>>>> >>>>> >>>>> - Significantly improved handling of string/UTF8 data (performance, >>>>> >>>>> memory use -- elimination of PyObject boxes). From the above 2 >>>>> >>>>> items, >>>>> >>>>> we could even make all string arrays internally categorical (with >>>>> >>>>> the >>>>> >>>>> option to explicitly cast to categorical) -- in the database world >>>>> >>>>> this is often called dictionary encoding. >>>>> >>>>> >>>>> >>>>> - Refactor of most Cython algorithms into C++11/14 templates >>>>> >>>>> >>>>> >>>>> - Copy-on-write for Series and DataFrame >>>>> >>>>> >>>>> >>>>> - Removal of Panel, ndim > 3 data structures >>>>> >>>>> >>>>> >>>>> - Analytical expression VM (for example -- things like >>>>> >>>>> df[boolean_arr].groupby(...).agg(...) 
could be evaluated by a small >>>>> >>>>> Numexpr-like VM, not dissimilar to R's dplyr library, with >>>>> >>>>> significantly improved memory use and maybe performance too) >>>>> >>>>> >>>>> >>>>> There's a lot to unpack here, but let me know what everyone thinks >>>>> >>>>> about these things. The "pandas 2.0" / internals revamp discussion >>>>> >>>>> we >>>>> >>>>> can tackle in a separate thread or perhaps in a GitHub repo or >>>>> >>>>> design folder in the pandas codebase. >>>>> >>>>> >>>>> >>>>> Thanks, >>>>> >>>>> Wes >>>>> >>>>> _______________________________________________ >>>>> >>>>> Pandas-dev mailing list >>>>> >>>>> Pandas-dev at python.org >>>>> >>>>> https://mail.python.org/mailman/listinfo/pandas-dev
From wesmckinn at gmail.com Tue Aug 23 14:40:14 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Tue, 23 Aug 2016 11:40:14 -0700 Subject: [Pandas-dev] Our own GitHub organization?
Message-ID: We've occasionally discussed moving pandas and associated repos to a dedicated GitHub organization. Some arguments for moving to our own org: - More clear what repositories are part of the "pandas" umbrella (we can potentially formalize this in the pandas-governance repo) - Dedicated capacity from CI services - Easier for us to more clearly develop our own open source project branding independent from PyData (which has increasingly primarily become a conference / meetup brand) While I haven't had any success contacting the owner of github.com/pandas, if we can pick a suitable org name we might consider it. GitHub's route forwarding (including git remotes) makes org changes pretty painless these days Thoughts? - Wes From shoyer at gmail.com Tue Aug 23 14:49:58 2016 From: shoyer at gmail.com (Stephan Hoyer) Date: Tue, 23 Aug 2016 11:49:58 -0700 Subject: [Pandas-dev] Our own GitHub organization? In-Reply-To: References: Message-ID: Did you have any luck going through GitHub's process for reclaiming an unused name? You don't necessarily need to contact the account owner for this. https://help.github.com/articles/name-squatting-policy/ I'm +1 for switching to a dedicated pandas org. GitHub's redirects do make this quite smooth. The main reason I switched xarray to pydata (from the separate xray org) is because I didn't think I would be successful claiming xarray, which appears to be in active use. On Tue, Aug 23, 2016 at 11:40 AM, Wes McKinney wrote: > We've occasionally discussed moving pandas and associated repos to a > dedicated GitHub organization. 
From wesmckinn at gmail.com Tue Aug 23 15:06:58 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Tue, 23 Aug 2016 12:06:58 -0700 Subject: [Pandas-dev] Our own GitHub organization? In-Reply-To: References: Message-ID: According to GitHub, the pandas account is showing activity that is not publicly visible. I've contacted the user twice in an effort to start a dialog but GitHub is very strict about protecting users' privacy. We could do something like @pandas-org for the time being, and hope that at some point we are able to contact the @pandas user (or they become inactive). - Wes On Tue, Aug 23, 2016 at 11:49 AM, Stephan Hoyer wrote: > Did you have any luck going through GitHub's process for reclaiming an > unused name? You don't necessarily need to contact the account owner for > this. > https://help.github.com/articles/name-squatting-policy/ > > I'm +1 for switching to a dedicated pandas org. GitHub's redirects do make > this quite smooth.
From andy.terrel at gmail.com Tue Aug 23 15:24:48 2016 From: andy.terrel at gmail.com (Andy Ray Terrel) Date: Tue, 23 Aug 2016 14:24:48 -0500 Subject: [Pandas-dev] Our own GitHub organization? In-Reply-To: References: Message-ID: +1 I've never liked the way the repos are all mixed up with all the other pydata repos. I mean it's okay and isn't a huge problem but it's just clutter IMHO. On Tue, Aug 23, 2016 at 2:06 PM, Wes McKinney wrote: > According to GitHub, the pandas account is showing activity that is > not publicly visible. I've contacted the user twice in an effort to > start a dialog but GitHub is very strict about protecting users' > privacy.
From shoyer at gmail.com Tue Aug 23 15:25:36 2016 From: shoyer at gmail.com (Stephan Hoyer) Date: Tue, 23 Aug 2016 12:25:36 -0700 Subject: [Pandas-dev] Our own GitHub organization? In-Reply-To: References: Message-ID: Too bad about the "pandas" GitHub name. Still, if you want to go for this, I say you should go ahead. My sense (have not checked actual data here) is that all other projects than pandas add a very minimal amount of CI burden, though. On Tue, Aug 23, 2016 at 12:06 PM, Wes McKinney wrote: > According to GitHub, the pandas account is showing activity that is > not publicly visible. I've contacted the user twice in an effort to > start a dialog but GitHub is very strict about protecting users' > privacy. > > We could do something like @pandas-org for the time being, and hope > that at some point we are able to contact the @pandas user (or they > become inactive). > > - Wes
From wesmckinn at gmail.com Tue Aug 23 17:42:34 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Tue, 23 Aug 2016 14:42:34 -0700 Subject: [Pandas-dev] Our own GitHub organization? In-Reply-To: References: Message-ID: I've parked github.com/pandas-dev for the time being. Interested to see what others think about the migration On Tue, Aug 23, 2016 at 12:25 PM, Stephan Hoyer wrote: > Too bad about the "pandas" GitHub name. Still, if you want to go for this, I > say you should go ahead. > > My sense (have not checked actual data here) is that all other projects than > pandas add a very minimal amount of CI burden, though.
From wesmckinn at gmail.com Wed Aug 24 11:29:31 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Wed, 24 Aug 2016 08:29:31 -0700 Subject: [Pandas-dev] Proposal: pandas 1.0 and ideas for pandas 2.0 / future In-Reply-To: References: Message-ID: Just created the repo https://github.com/pydata/pandas-design to house the design documents and discussion (possibly temporarily -- we may want to move the docs back to the main pandas repo after the process is near completion). I think this will help more people engage with the process (as they can watch this repo and only get notifications for the design discussion, rather than subscribing to the entire pandas issue/PR firehose). If you'd like to participate, definitely Watch the repo! thanks Wes On Thu, Aug 11, 2016 at 9:06 AM, Wes McKinney wrote: > ICYMI, we have a discussion going about some of the ideas referenced > here (and in discussions earlier this year) for making changes to > pandas's internals: > > https://github.com/pydata/pandas/pull/13944 > > There is also the discussion around what we may call "pandas 1.0", > possibly (if we reach consensus about it) a stable maintenance release > similar to the way that IPython / Jupyter approached its internal > rearchitecture: > > https://github.com/pydata/pandas/issues/10000 > > Interested developers and users of pandas are highly encouraged to get > involved in these discussions and contribute their perspectives, even > if you don't plan to help do the actual coding work. > > cheers > Wes