[Pandas-dev] Proposal: pandas 1.0 and ideas for pandas 2.0 / future

Joris Van den Bossche jorisvandenbossche at gmail.com
Sun Jul 31 18:03:06 EDT 2016


Regarding 1), I agree it is a good idea to push a 0.19.0 release soonish,
and we can then discuss what else we want to do (or not do) for the
1.0 release. I am on holiday for the coming week and a half, but afterwards
I will also focus on getting 0.19.0 out. A release candidate in the last
week of August is maybe a good deadline?

Joris

2016-07-29 0:15 GMT+02:00 Wes McKinney <wesmckinn at gmail.com>:

> OK, let me try to collect some of the feedback and give my thoughts
>
> 1) 0.19 and 0.20:  I think we should push to release 0.19.0 soon and
> then plan what we want to add/change/deprecate for 1.0, which might
> otherwise have been 0.20. I would rather not delay 0.19.0: we already
> pushed back 0.18.2, and there are some significant new patches
> (merge_asof and variable rolling windows) that it would be good to get
> into production before we declare a stable 1.0.
>
> 2) We will need to raise a significant amount of money for pandas (I
> estimate in the ballpark of US $300-500K -- better to have too much
> than too little) to be able to pursue the pandas 2.0 plan
> wholeheartedly. I would like to dedicate a minimum 5-10 hours per week
> to it in 2017 but this will not be sufficient to do everything (I am
> also a human being, and have a day job). It would be better to
> collaborate with one or two good freelance developers (with proven
> experience in C++ and Python) who are spending at least 50% of their
> time on pandas next year. I am going to start spending some time on
> design documentation so that we can start resolving some of the design
> questions and tradeoffs (not all of these decisions will be easy).
> We'll work on this offline and look to start soliciting funding (if
> anyone with the ability to write checks is reading, feel free to
> contact me offline).
>
> 3) I agree we will need to come up with a development process that
> facilitates an invasive modification of pandas internals while
> also supporting production users of pandas 1.X. Cherry-picking bug
> fixes into the pandas 2.x branch will grow increasingly complicated;
> we need to factor this into our process (for example: we might collect
> all the unit tests for bug fixes -- assuming they rely on definitely
> stable behavior -- into a "to fix" folder so that we can return and
> adapt the bug fixes once the 2.x branch is getting more stable). To
> have developers both maintaining 1.x and trying to drive forward the
> 2.x branch at the same time does not seem realistic -- we should talk
> to the IPython/Jupyter devs to understand how they handled this
> through their long-lived IPython 1.0 branch IIRC (see
> http://ipython.org/news.html#ipython-1-0).
>
> 4) My goal, which I think we're all aligned on, would be for pandas
> 2.0 to be a drop-in replacement for 90-95% of normal pandas use. Many
> power users will have embraced some of the idiosyncrasies of pandas's
> implementation details, but I think some of the changes (e.g. missing
> data consistency, copy-on-write / improved semantics around memory
> ownership and views) will be welcomed. We should clearly document (in
> a dedicated "pandas's internal relationship with NumPy" document) and
> maintain very tight contracts around what kinds of zero-copy NumPy
> interoperability are supported -- it is not clear to me for example
> that arrays of Python string/unicode objects are a NumPy use case that
> is especially important to preserve, but most numeric data use cases
> are. This will also be helpful for power users to understand the
> nuances and how things are going to stay the same or change (for
> example: boolean and integer arrays with NAs will probably not be
> zero-copyable to NumPy arrays).
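
To make the zero-copy point in item 4 concrete, here is a minimal sketch. It assumes a hypothetical "values plus boolean mask" layout for integer data with NAs; the thread does not specify the pandas 2.0 representation, so the `mask` array and the helper `to_numpy_with_nan` are illustrative names only. Plain numeric data can be wrapped by NumPy without copying, while the masked case has to be cast to float and filled with NaN, which allocates a new buffer.

```python
import numpy as np

# Plain numeric data: NumPy can wrap the existing buffer without copying.
values = np.array([10, 20, 30], dtype="int64")
view = np.asarray(values)
print(np.shares_memory(values, view))  # True

# Hypothetical NA-aware layout (an assumption, not the pandas 2.0 design):
# the integer values plus a boolean mask marking which entries are missing.
mask = np.array([False, True, False])


def to_numpy_with_nan(values, mask):
    """Convert masked integers to a float array with NaN for missing entries."""
    out = values.astype("float64")  # astype copies by default
    out[mask] = np.nan
    return out


converted = to_numpy_with_nan(values, mask)
print(np.shares_memory(values, converted))  # False
```

The second result is a separate float64 array, so writes to it do not reach the original integer buffer.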
>
> We should maybe start side threads about each of these items. Just
> deciding what we want to deprecate or do in 0.20 aka 1.0 is a large
> enough task.
>
> Thanks all
> Wes
>
> On Wed, Jul 27, 2016 at 8:39 PM, G Young <gfyoung17 at gmail.com> wrote:
> > 1) I would be in favour of releasing 0.19.0 in part because we already
> > pushed back and actually forgone 0.18.2.  I think these plans are better
> > served for the release after this one to give more time to map this but
> > also to push out the changes that have already been made in preparation
> > for this release.
> >
> > 2) In terms of organisation, I wonder if we might be better served
> > reorganising the way in which PR's are reviewed during the time period
> > between one release and the next instead of having these parallel tracks
> > of development in light of the concern brought up by
> > @jorisvandenbossche. Perhaps rather than just reviewing PR's as they
> > come in, specify which types of PR's should be submitted during certain
> > periods of time.
> >
> > For example, a large chunk of the period could be devoted to accepting
> > enhancements / new features after which the remaining time before a
> > release could be devoted to just organisation / refactoring /
> > deprecations / what have you (maybe include bug fixes too).  That way we
> > could have a contiguous block of time to focus on stabilising and
> > tidying up the release.  It would also allow for the refactoring to take
> > place (perhaps incrementally) without the concern of being destabilised
> > by a new feature.
> >
> > For this to work, this would have to be clearly stated in the
> > contributing docs as well as circulated in emails to pandas-dev AND
> > other related groups so that way people know what's going on in terms of
> > the development cycle.
> >
> >
> >
> > On Wed, Jul 27, 2016 at 7:51 PM, Joris Van den Bossche
> > <jorisvandenbossche at gmail.com> wrote:
> >>
> >> Wes, thanks for your mail!
> >>
> >> I like the idea of first releasing a pandas 1.0 before the 'big
> >> refactor'. We for sure know that this will take a while to stabilize
> >> (even with a lot of resources), and I think the idea was to provide a
> >> kind of LTS release. In that regard, it is just clearer to name this
> >> pandas 1.x than 0.19.x.
> >>
> >> Maybe we can start a separate thread to discuss this 1.0, as there are
> >> of course some questions to discuss:
> >> - do we first release 0.19 (we didn't specifically discuss this, but I
> >> think the rough idea was to have a release candidate somewhere in
> >> August), or do we directly aim at 1.0?
> >> - are there certain changes we want to make before 1.0 that are
> >> feasible in the short term?
> >> - are there some of the current ideas of deprecations that we should
> >> exclude/include for this release? (eg I think deprecating PanelND (as
> >> just landed in master) is good, but the idea of deprecating Panel
> >> should rather wait until 2.0?)
> >> - ...
> >>
> >> How exactly to tackle those bug fix releases / LTS branch, is also
> >> something that can be discussed, but I would not worry too much about
> >> that (there are enough examples of other projects to do something
> >> similar, we just have to search for a process that suits us).
> >>
> >> What I think is a more important issue or problem with this process is
> >> the community of contributors. If we effectively have a period of about
> >> two years (before a final 2.0 release) where for the current (1.0)
> >> version only certain bug fixes are considered, while on the other hand
> >> it is still difficult to contribute to the new version, we would maybe
> >> have to say no to many of the PRs or enhancement ideas. Such a
> >> situation could hinder the process of community contributions and
> >> participation.
> >> And there are currently a lot of contributions. As Jeff also said, the
> >> current active contributors are barely keeping up with managing all
> >> issues and pull requests. I have worked more on pandas the last few
> >> weeks (thanks to Continuum), and indeed I spent most of my time
> >> answering issues and reviewing PRs, and hardly had any time to do much
> >> coding myself. But of course this is also a choice that I currently
> >> make. And I (we) could also make the choice to focus more on pandas
> >> 1.0/2.0 related issues, or try to steer some of the active contributors
> >> to that.
> >>
> >> I also have some concerns about the compatibility with the rest of the
> >> ecosystem, but at the same time it is clear I think that there should be
> >> some kind of refactor, and it is in the further elaboration of the
> >> roadmap that such concerns can be addressed.
> >>
> >> Joris
> >>
> >>
> >>
> >> 2016-07-27 12:04 GMT+02:00 Jeff Reback <jeffreback at gmail.com>:
> >>>
> >>> I applaud the vision and ambition for the roadmap of the future of
> >>> pandas.
> >>>
> >>> However, the resources are lacking for much of these changes. Currently
> >>> pandas is just barely keeping up with the (recently increased) user
> >>> flow of pull-requests, not to mention the issue reports. These are all
> >>> great indicators of community use and exercising the edge cases.
> >>>
> >>> A roadmap is an excellent start, but the resource question needs to be
> >>> front and center.
> >>>
> >>> The current process *could* evolve into LTS. In 0.19.0, lots of
> >>> progress towards removing older code (and of course deprecating things)
> >>> is happening. An aggressive push of this into 0.20.0 will go a long
> >>> ways towards de-facto establishing 1.0 / LTS (and maybe that's what we
> >>> simply call 0.20.0).
> >>>
> >>> I would agree we could simply release 1.0 / LTS without adding any
> >>> 'new' features (like fixed getitem indexing and such).
> >>>
> >>> I would like to see 2.0 with a user facing API that is a drop-in
> >>> replacement (though allowing for some breaking changes that are NOT
> >>> back-compat, e.g. getitem indexing). I think it would be acceptable to
> >>> break the back-end API (meaning to numpy) though.
> >>>
> >>> For the resource question, as I have mentioned off-list, I will format
> >>> this roadmap in order for pandas to support a fund-raising effort to
> >>> garner resources for these changes.
> >>>
> >>> Jeff
> >>>
> >>> On Tue, Jul 26, 2016 at 5:13 PM, Stephan Hoyer <shoyer at gmail.com>
> >>> wrote:
> >>>>
> >>>> I know I expressed concerns about cross-compatibility with the rest of
> >>>> the SciPy ecosystem before (especially xarray), but this plan sounds
> >>>> very solid to me. Flexible data types in N-dimensional arrays are
> >>>> important for other use cases, but also not really a problem for
> >>>> pandas.
> >>>>
> >>>> A separate 2.0 release will let us make the major breaking changes to
> >>>> the pandas data model necessary for it to work well in the long term.
> >>>> There are a few other API warts that we will be able to clean up this
> >>>> way (detailed in github.com/pydata/pandas/issues/10000), indexing on
> >>>> DataFrames being the most obvious one.
> >>>>
> >>>> On Tue, Jul 26, 2016 at 1:51 PM, Wes McKinney <wesmckinn at gmail.com>
> >>>> wrote:
> >>>>>
> >>>>> hi folks,
> >>>>>
> >>>>> As a continuation of ongoing discussions on GitHub and on the mailing
> >>>>> list around deprecations and future innovation and internal
> >>>>> reworkings of pandas, I had a couple of ideas to share that I am
> >>>>> looking for feedback on.
> >>>>>
> >>>>> As far as pandas 0.19.x today, I would like to propose that we
> >>>>> consider releasing the project as pandas 1.0 in the next major
> >>>>> release or the one after. The Python community does have a penchant
> >>>>> for "eternal betas", but after all the hard work of the core
> >>>>> developers and community over the last 5 years, I think we can safely
> >>>>> consider making a stable 1.X production release.
> >>>>>
> >>>>> If we do decide to release pandas 1.0, I also propose that we
> >>>>> strongly consider making 1.X an LTS / Long Term Support branch where
> >>>>> we can continue to make releases, but bug fixes and documentation
> >>>>> improvements only. Or, we can add new features, but on an extremely
> >>>>> conservative basis. This might require some changes to development
> >>>>> process, so looking for feedback on this.
> >>>>>
> >>>>> If we commit to this path, I would suggest that we start a pandas-2.0
> >>>>> integration branch where we can begin more seriously planning and
> >>>>> executing on
> >>>>>
> >>>>> - Cleanup and removal of years' worth of accumulated cruft / legacy
> >>>>> code
> >>>>> - Removal of deprecated features
> >>>>> - Series and DataFrame internals revamp.
> >>>>>
> >>>>> I had hoped that 2016 would offer me more time to work on the
> >>>>> internals revamp, but between my day job and the 2nd ed of "Python
> >>>>> for Data Analysis" that turned out to be a little too ambitious. I
> >>>>> have been almost continuously thinking about how to go about this
> >>>>> though, and it might be good to figure out a process where we can
> >>>>> start documenting and coming up with a more granular development
> >>>>> roadmap for this. Part of this will be carefully documenting any APIs
> >>>>> we change or unit tests we break along the way.
> >>>>>
> >>>>> We would want to give ample time for heavy pandas users to run their
> >>>>> 3rd-party code based on pandas 2.0-dev to give feedback on whether
> >>>>> our assumptions about the impact of changes affect real production
> >>>>> code. As a concrete example: integer and boolean Series would be able
> >>>>> to accommodate missing data without implicitly casting to float or
> >>>>> object NumPy dtype respectively. Since many users will have inserted
> >>>>> workarounds / data massaging code because of such rough edges, this
> >>>>> may cause code breakage or simply redundancy in some cases. As
> >>>>> another example: we should probably remove the .ix indexing attribute
> >>>>> altogether. I'm sure many users are still using .ix, but it would be
> >>>>> worthwhile to go through such code and decide whether it's really
> >>>>> .loc or .iloc.
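
The two examples in the paragraph above can be reproduced with a few lines. This is a small illustration rather than anything from the thread itself: the Series names are made up, and the dtypes shown are what default NumPy-backed Series construction gives. It also shows why `.loc` and `.iloc` can disagree on an integer index, which is the ambiguity `.ix` had to guess at.

```python
import pandas as pd

# Integer data with a missing value is stored as float64 ...
ints = pd.Series([1, 2, None])
# ... and boolean data with a missing value falls back to object dtype.
bools = pd.Series([True, False, None])

print(ints.dtype)   # float64
print(bools.dtype)  # object

# With an integer index, label-based and position-based lookups differ.
s = pd.Series(["a", "b", "c"], index=[2, 1, 0])
by_label = s.loc[0]      # label 0 is the last row
by_position = s.iloc[0]  # position 0 is the first row

print(by_label)     # c
print(by_position)  # a
```

Code that relies on the float or object fallback (for example, checking `dtype == float` to detect missing integers) is the kind of workaround that could break or become redundant.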
> >>>>>
> >>>>> My hope would be (being a deadline-motivated person) that we could
> >>>>> see a pandas 2.0 alpha release sometime mid- or 2nd half 2017, with a
> >>>>> target beta / pre-production QA release in early 2018 or thereabouts.
> >>>>> Part of this would be creating a 1.0 to 2.0 migration guide for
> >>>>> users.
> >>>>>
> >>>>> My biggest concern with pandas in recent years is how not to be held
> >>>>> back by strict backwards compatibility and still be able to innovate
> >>>>> and stay relevant into the 2020s.
> >>>>>
> >>>>> For pandas 2.0 some of the most important issues I've been thinking
> >>>>> about are:
> >>>>>
> >>>>> - Logical type abstraction layer / decoupling. pandas-only data types
> >>>>> (Categorical, DatetimeTZ, Period, etc.) will become equal citizens as
> >>>>> compared with data types mapping 1-1 on NumPy numeric dtypes
> >>>>>
> >>>>> - Decoupling physical storage to permit non-NumPy data structures
> >>>>> inside Series
> >>>>>
> >>>>> - Removal of BlockManager and 2D block consolidation in DataFrame, in
> >>>>> favor of a native C++ internal table (vector-of-arrays) data
> >>>>> structure
> >>>>>
> >>>>> - Consistent NA semantics across all data types
> >>>>>
> >>>>> - Significantly improved handling of string/UTF8 data (performance,
> >>>>> memory use -- elimination of PyObject boxes). From the above 2 items,
> >>>>> we could even make all string arrays internally categorical (with the
> >>>>> option to explicitly cast to categorical) -- in the database world
> >>>>> this is often called dictionary encoding.
> >>>>>
> >>>>> - Refactor of most Cython algorithms into C++11/14 templates
> >>>>>
> >>>>> - Copy-on-write for Series and DataFrame
> >>>>>
> >>>>> - Removal of Panel, ndim > 3 data structures
> >>>>>
> >>>>> - Analytical expression VM (for example -- things like
> >>>>> df[boolean_arr].groupby(...).agg(...) could be evaluated by a small
> >>>>> Numexpr-like VM, not dissimilar to R's dplyr library, with
> >>>>> significantly improved memory use and maybe performance too)
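
For readers unfamiliar with the "dictionary encoding" mentioned in the string-handling item above, here is a minimal pure-Python sketch. The function names and the use of -1 for missing values are choices made for this illustration, not a description of how pandas 2.0 would implement it: each distinct string is stored once, and the column itself becomes a list of small integer codes.

```python
def dictionary_encode(values):
    """Return (dictionary, codes) for a list of strings; None maps to code -1."""
    dictionary = []  # distinct values, in order of first appearance
    index_of = {}    # value -> position in `dictionary`
    codes = []
    for value in values:
        if value is None:
            codes.append(-1)  # sentinel code for a missing value
            continue
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Rebuild the original list from the dictionary and the integer codes."""
    return [None if code == -1 else dictionary[code] for code in codes]


cities = ["NYC", "SF", "NYC", None, "SF", "NYC"]
dictionary, codes = dictionary_encode(cities)
print(dictionary)  # ['NYC', 'SF']
print(codes)       # [0, 1, 0, -1, 1, 0]
```

With many repeated strings, the codes plus a small dictionary take far less memory than one boxed Python object per element, which is the motivation given in the list.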
> >>>>>
> >>>>> There's a lot to unpack here, but let me know what everyone thinks
> >>>>> about these things. The "pandas 2.0" / internals revamp discussion we
> >>>>> can tackle in a separate thread or perhaps in a GitHub repo or
> >>>>> design folder in the pandas codebase.
> >>>>>
> >>>>> Thanks,
> >>>>> Wes
> >>>>> _______________________________________________
> >>>>> Pandas-dev mailing list
> >>>>> Pandas-dev at python.org
> >>>>> https://mail.python.org/mailman/listinfo/pandas-dev