From wesmckinn at gmail.com  Tue Jul 26 16:51:15 2016
From: wesmckinn at gmail.com (Wes McKinney)
Date: Tue, 26 Jul 2016 13:51:15 -0700
Subject: [Pandas-dev] Proposal: pandas 1.0 and ideas for pandas 2.0 / future
Message-ID: <CAJPUwMCrE3Yp_5neugfBfDzz2QCjusZjnmsDjYiaO08938tErw@mail.gmail.com>

hi folks,

As a continuation of ongoing discussions on GitHub and on the mailing
list around deprecations and future innovation and internal reworkings
of pandas, I had a couple of ideas to share that I am looking for
feedback on.

As for pandas 0.19.x today, I would like to propose that we
consider releasing the project as pandas 1.0 in the next major release
or the one after. The Python community does have a penchant for
"eternal betas", but after all the hard work of the core developers
and community over the last 5 years, I think we can safely consider
making a stable 1.X production release.

If we do decide to release pandas 1.0, I also propose that we strongly
consider making 1.X an LTS / Long Term Support branch where we can
continue to make releases, but bug fixes and documentation
improvements only. Or, we can add new features, but on an extremely
conservative basis. This might require some changes to development
process, so looking for feedback on this.

If we commit to this path, I would suggest that we start a pandas-2.0
integration branch where we can begin more seriously planning and
executing on

- Cleanup and removal of years' worth of accumulated cruft / legacy code
- Removal of deprecated features
- Series and DataFrame internals revamp.

I had hoped that 2016 would offer me more time to work on the
internals revamp, but between my day job and the 2nd ed of "Python for
Data Analysis" that turned out to be a little too ambitious. I have
been almost continuously thinking about how to go about this though,
and it might be good to figure out a process where we can start
documenting and coming up with a more granular development roadmap for
this. Part of this will be carefully documenting any APIs we change or
unit tests we break along the way.

We would want to give ample time for heavy pandas users to run their
3rd-party code based on pandas 2.0-dev to give feedback on whether our
assumptions about the impact of changes affect real production code.
As a concrete example: integer and boolean Series would be able to
accommodate missing data without implicitly casting to float or object
NumPy dtype respectively. Since many users will have inserted
workarounds / data massaging code because of such rough edges, this
may cause code breakage or simply redundancy in some cases. As another
example: we should probably remove the .ix indexing attribute
altogether. I'm sure many users are still using .ix, but it would be
worthwhile to go through such code and decide whether it's really .loc
or .iloc.
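
To make the integer example concrete, here is a minimal sketch in plain
Python (no pandas or NumPy). It assumes one possible design, a value
buffer plus a separate validity mask; the class name MaskedIntColumn and
the helper float_cast are hypothetical and only illustrate the idea, not
an actual or planned pandas API.

```python
from typing import List, Optional


class MaskedIntColumn:
    """Hypothetical integer column: a value buffer plus a validity mask."""

    def __init__(self, items: List[Optional[int]]) -> None:
        # Missing slots hold a placeholder 0; the mask records which slots are real.
        self.values = [0 if x is None else x for x in items]
        self.valid = [x is not None for x in items]

    def sum(self) -> int:
        # Skip missing entries; the result remains an integer.
        return sum(v for v, ok in zip(self.values, self.valid) if ok)

    def to_list(self) -> List[Optional[int]]:
        # Rebuild the original view, with None marking missing entries.
        return [v if ok else None for v, ok in zip(self.values, self.valid)]


def float_cast(items: List[Optional[int]]) -> List[float]:
    """Mimic the float-with-NaN approach: every value becomes a float."""
    return [float("nan") if x is None else float(x) for x in items]


data = [1, None, 3]
col = MaskedIntColumn(data)
total = col.sum()          # integer arithmetic, missing value skipped
roundtrip = col.to_list()  # integers and None survive unchanged
as_float = float_cast(data)  # the whole column has changed type
```

User code that today converts such a column back to integers after
filling missing values is the kind of workaround that could become
redundant, or break, once the cast no longer happens.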

My hope would be (being a deadline-motivated person) that we could see
a pandas 2.0 alpha release sometime mid- or 2nd half 2017, with a
target beta / pre-production QA release in early 2018 or thereabouts.
Part of this would be creating a 1.0 to 2.0 migration guide for users.

My biggest concern with pandas in recent years is how not to be held
back by strict backwards compatibility and still be able to innovate
and stay relevant into the 2020s.

For pandas 2.0 some of the most important issues I've been thinking about are:

- Logical type abstraction layer / decoupling. pandas-only data types
(Categorical, DatetimeTZ, Period, etc.) will become equal citizens as
compared with data types mapping 1-1 on NumPy numeric dtypes

- Decoupling physical storage to permit non-NumPy data structures inside Series

- Removal of BlockManager and 2D block consolidation in DataFrame, in
favor of a native C++ internal table (vector-of-arrays) data structure

- Consistent NA semantics across all data types

- Significantly improved handling of string/UTF8 data (performance,
memory use -- elimination of PyObject boxes). From the above 2 items,
we could even make all string arrays internally categorical (with the
option to explicitly cast to categorical) -- in the database world
this is often called dictionary encoding.

- Refactor of most Cython algorithms into C++11/14 templates

- Copy-on-write for Series and DataFrame

- Removal of Panel, ndim > 3 data structures

- Analytical expression VM (for example -- things like
df[boolean_arr].groupby(...).agg(...) could be evaluated by a small
Numexpr-like VM, not dissimilar to R's dplyr library, with
significantly improved memory use and maybe performance too)
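
As an illustration of the dictionary encoding mentioned in the string
item above, here is a minimal sketch in plain Python. The function names
dictionary_encode and dictionary_decode are hypothetical, and a real
implementation would presumably use contiguous native buffers rather
than Python lists; this only shows the idea of storing each distinct
string once and referring to it by an integer code.

```python
from typing import Dict, List, Tuple


def dictionary_encode(strings: List[str]) -> Tuple[List[str], List[int]]:
    """Return (dictionary, codes); the dictionary keeps first-seen order."""
    index: Dict[str, int] = {}
    dictionary: List[str] = []
    codes: List[int] = []
    for s in strings:
        if s not in index:
            # First occurrence: assign the next free code to this string.
            index[s] = len(dictionary)
            dictionary.append(s)
        codes.append(index[s])
    return dictionary, codes


def dictionary_decode(dictionary: List[str], codes: List[int]) -> List[str]:
    """Rebuild the original sequence from the dictionary and the codes."""
    return [dictionary[c] for c in codes]


raw = ["nyc", "sf", "nyc", "nyc", "sf"]
dictionary, codes = dictionary_encode(raw)
restored = dictionary_decode(dictionary, codes)
```

With many repeated values, the per-row cost is one small integer instead
of one boxed string object, which is where the memory savings come from.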

There's a lot to unpack here, but let me know what everyone thinks
about these things. The "pandas 2.0" / internals revamp discussion we
can tackle in a separate thread or perhaps in a GitHub repo or
design folder in the pandas codebase.

Thanks,
Wes

From shoyer at gmail.com  Tue Jul 26 17:13:11 2016
From: shoyer at gmail.com (Stephan Hoyer)
Date: Tue, 26 Jul 2016 14:13:11 -0700
Subject: [Pandas-dev] Proposal: pandas 1.0 and ideas for pandas 2.0 /
 future
In-Reply-To: <CAJPUwMCrE3Yp_5neugfBfDzz2QCjusZjnmsDjYiaO08938tErw@mail.gmail.com>
References: <CAJPUwMCrE3Yp_5neugfBfDzz2QCjusZjnmsDjYiaO08938tErw@mail.gmail.com>
Message-ID: <CAEQ_TveHwf3s5gnmd17RdSk3GWwc4CPK77rOp-UkeKW_kQOCFQ@mail.gmail.com>

I know I expressed concerns about cross-compatibility with the rest of the
SciPy ecosystem before (especially xarray), but this plan sounds very solid
to me. Flexible data types in N-dimensional arrays are important for other
use cases, but also not really a problem for pandas.

A separate 2.0 release will let us make the major breaking changes to the
pandas data model necessary for it to work well in the long term. There are
a few other API warts that we will be able to clean up this way (detailed in
github.com/pydata/pandas/issues/10000), indexing on DataFrames being the
most obvious one.


From jeffreback at gmail.com  Wed Jul 27 06:04:47 2016
From: jeffreback at gmail.com (Jeff Reback)
Date: Wed, 27 Jul 2016 06:04:47 -0400
Subject: [Pandas-dev] Proposal: pandas 1.0 and ideas for pandas 2.0 /
 future
In-Reply-To: <CAEQ_TveHwf3s5gnmd17RdSk3GWwc4CPK77rOp-UkeKW_kQOCFQ@mail.gmail.com>
References: <CAJPUwMCrE3Yp_5neugfBfDzz2QCjusZjnmsDjYiaO08938tErw@mail.gmail.com>
 <CAEQ_TveHwf3s5gnmd17RdSk3GWwc4CPK77rOp-UkeKW_kQOCFQ@mail.gmail.com>
Message-ID: <CAHMnJKg2ish9DLxw87ctBOq_W+X8eGF_qwbXuC_pR0DUU5GZnQ@mail.gmail.com>

I applaud the vision and ambition for the roadmap of the future of pandas.

However, the resources are lacking for many of these changes. Currently
pandas is just barely keeping up with the (recently increased) user flow
of pull requests, not to mention the issue reports. These are all great
indicators of community use and exercising the edge cases.

A roadmap is an excellent start, but the resource question needs to be
front and center.

The current process *could* evolve into LTS. In 0.19.0, lots of progress
towards removing older code (and of course deprecating things) is
happening. An aggressive push of this into 0.20.0 will go a long way
towards de facto establishing 1.0 / LTS (and maybe that's what we simply
call 0.20.0).

I would agree we could simply release 1.0 / LTS without adding any 'new'
features (like fixed getitem indexing and such).

I would like to see 2.0 with a user facing API that is a drop-in
replacement (though allowing for some breaking changes that are NOT
back-compat, e.g. getitem indexing). I think it would be acceptable to
break the back-end API (meaning to numpy) though.

For the resource question, as I have mentioned off-list, I will format this
roadmap in order for pandas to support a fund-raising effort to garner
resources for these changes.

Jeff


From jorisvandenbossche at gmail.com  Wed Jul 27 19:51:21 2016
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Thu, 28 Jul 2016 01:51:21 +0200
Subject: [Pandas-dev] Proposal: pandas 1.0 and ideas for pandas 2.0 /
 future
In-Reply-To: <CAHMnJKg2ish9DLxw87ctBOq_W+X8eGF_qwbXuC_pR0DUU5GZnQ@mail.gmail.com>
References: <CAJPUwMCrE3Yp_5neugfBfDzz2QCjusZjnmsDjYiaO08938tErw@mail.gmail.com>
 <CAEQ_TveHwf3s5gnmd17RdSk3GWwc4CPK77rOp-UkeKW_kQOCFQ@mail.gmail.com>
 <CAHMnJKg2ish9DLxw87ctBOq_W+X8eGF_qwbXuC_pR0DUU5GZnQ@mail.gmail.com>
Message-ID: <CALQtMBZgt4CuKmbHRFua0Fh9nHvGTyDzXKMUdXEY_DWnTcZJ2w@mail.gmail.com>

Wes, thanks for your mail!

I like the idea of first releasing a pandas 1.0 before the 'big refactor'.
We for sure know that this will take a while to stabilize (even with a lot
of resources), and I think the idea was to provide a kind of LTS release.
In that regard, it is just clearer to name this pandas 1.x than 0.19.x.

Maybe we can start a separate thread to discuss this 1.0, as there are
of course some questions to discuss:
- do we first release 0.19 (we didn't specifically discuss this, but I
think the rough idea was to have a release candidate sometime in August),
or do we directly aim at 1.0?
- are there certain changes we want to make before 1.0 that are feasible
in the short term?
- are there some of the current ideas of deprecations that we should
exclude/include for this release? (e.g. I think deprecating PanelND (as just
landed in master) is good, but the idea of deprecating Panel should rather
wait until 2.0?)
- ...

How exactly to tackle those bug fix releases / LTS branch, is also
something that can be discussed, but I would not worry too much about that
(there are enough examples of other projects to do something similar, we
just have to search for a process that suits us).

What I think is a more important issue with this process is the
community of contributors. If we effectively have a period of about
two years (before a final 2.0 release) where only certain bug fixes are
considered for the current (1.0) version, while it is still difficult to
contribute to the new version, we would maybe have to say no to many of
the PRs or enhancement ideas. Such a situation could hinder the
process of community contributions and participation.
And there are currently a lot of contributions. As Jeff also said, the
current active contributors are barely keeping up with managing all issues
and pull requests. I have worked the last few weeks more on pandas (thanks
to Continuum), and indeed I spent most of my time answering issues and
reviewing PRs, and hardly have any time to do much coding myself. But of
course this is also a choice that I currently make. And I (we) could also
make the choice to focus more on pandas 1.0/2.0 related issues, or try to
steer some of the active contributors to that.

I also have some concerns about the compatibility with the rest of the
ecosystem, but at the same time it is clear I think that there should be
some kind of refactor, and it is in the further elaboration of the roadmap
that such concerns can be addressed.

Joris



From gfyoung17 at gmail.com  Wed Jul 27 23:39:04 2016
From: gfyoung17 at gmail.com (G Young)
Date: Wed, 27 Jul 2016 23:39:04 -0400
Subject: [Pandas-dev] Proposal: pandas 1.0 and ideas for pandas 2.0 /
 future
In-Reply-To: <CALQtMBZgt4CuKmbHRFua0Fh9nHvGTyDzXKMUdXEY_DWnTcZJ2w@mail.gmail.com>
References: <CAJPUwMCrE3Yp_5neugfBfDzz2QCjusZjnmsDjYiaO08938tErw@mail.gmail.com>
 <CAEQ_TveHwf3s5gnmd17RdSk3GWwc4CPK77rOp-UkeKW_kQOCFQ@mail.gmail.com>
 <CAHMnJKg2ish9DLxw87ctBOq_W+X8eGF_qwbXuC_pR0DUU5GZnQ@mail.gmail.com>
 <CALQtMBZgt4CuKmbHRFua0Fh9nHvGTyDzXKMUdXEY_DWnTcZJ2w@mail.gmail.com>
Message-ID: <CAJ1_J5gw4My4_DT2CiyT0xHp3+8GwAKce4=r6hOxVMFDpHveQw@mail.gmail.com>

1) I would be in favour of releasing 0.19.0, in part because we have
already pushed back and actually forgone 0.18.2.  I think these plans are
better reserved for the release after this one, both to give more time to
map this out and to push out the changes that have already been made in
preparation for this release.

2) In terms of organisation, I wonder if we might be better served by
*reorganising* the way in which PRs are reviewed during the time period
between one release and the next instead of having these parallel tracks of
development, in light of the concern brought up by @jorisvandenbossche.
Perhaps rather than just reviewing PRs as they come in, we could specify
which types of PRs should be submitted during certain periods of time.

For example, a large chunk of the period could be devoted to accepting
enhancements / new features after which the remaining time before a release
could be devoted to just organisation / refactoring / deprecations / what
have you (maybe include bug fixes too).  That way we could have a
*contiguous* block of time to focus on stabilising and tidying up the
release.  It would also allow for the refactoring to take place (perhaps
incrementally) without the concern of being destabilised by a new feature.

For this to work, it would have to be clearly stated in the contributing
docs as well as circulated in emails to pandas-dev AND other related groups
so that people know what's going on in terms of the development cycle.


On Wed, Jul 27, 2016 at 7:51 PM, Joris Van den Bossche <
jorisvandenbossche at gmail.com> wrote:

> Wes, thanks for your mail!
>
> I like the idea of first releasing a pandas 1.0 before the 'big refactor'.
> We for sure know that this will take a while to stabilize (even with a lot
> of resources), and I think the idea was to provide a kind of LTS release.
> In that regard, it is just clearer to name this pandas 1.x than 0.19.x.
>
> Maybe we can start a separate thread to discuss this 1.0, as there are
> of course some questions to discuss:
> - do we first release 0.19 (we didn't specifically discuss this, but I
> think the rough idea was to have a release candidate sometime in August),
> or do we directly aim at 1.0?
> - are there some certain changes we want to do before 1.0 that are
> feasible in the short term?
> - are there some of the current ideas of deprecations that we should
> exclude/include for this release? (eg I think deprecating PanelND (as just
> landed in master) is good, but the idea of deprecating Panel should rather
> wait until 2.0?)
> - ...
>
> How exactly to tackle those bug fix releases / the LTS branch is also
> something that can be discussed, but I would not worry too much about that
> (there are enough examples of other projects that do something similar; we
> just have to search for a process that suits us).
>
> What I think is a more important issue with this process is the
> community of contributors. We would effectively have a period of about
> two years (before a final 2.0 release) during which only certain bug fixes
> are considered for the current (1.0) version, while it is still
> difficult to contribute to the new version. We would maybe have to say no
> to many of the PRs or enhancement ideas. Such a situation could hinder
> community contributions and participation.
> And there are currently a lot of contributions. As Jeff also said, the
> current active contributors are barely keeping up with managing all issues
> and pull requests. I have worked the last few weeks more on pandas (thanks
> to Continuum), and indeed I spent most of my time answering issues and
> reviewing PRs, and hardly have any time to do much coding myself. But of
> course this is also a choice that I currently make. And I (we) could also
> make the choice to focus more on pandas 1.0/2.0 related issues, or try to
> steer some of the active contributors to that.
>
> I also have some concerns about the compatibility with the rest of the
> ecosystem, but at the same time it is clear I think that there should be
> some kind of refactor, and it is in the further elaboration of the roadmap
> that such concerns can be addressed.
>
> Joris
>
>
>
> 2016-07-27 12:04 GMT+02:00 Jeff Reback <jeffreback at gmail.com>:
>
>> I applaud the vision and ambition for the roadmap of the future of pandas.
>>
>> However, the resources are lacking for many of these changes. Currently
>> pandas is just barely keeping up with the (recently increased) flow of
>> user pull requests, not to mention the issue reports. These are all great
>> indicators of community use and of exercising the edge cases.
>>
>> A roadmap is an excellent start, but the resource question needs to be
>> front and center.
>>
>> The current process *could* evolve into LTS. In 0.19.0, lots of progress
>> towards removing older code (and of course deprecating things) is
>> happening. An aggressive push of this into 0.20.0 will go a long way
>> towards de facto establishing 1.0 / LTS (and maybe that's what we simply
>> call 0.20.0).
>>
>> I would agree we could simply release 1.0 / LTS without adding any 'new'
>> features (like fixed getitem indexing and such).
>>
>> I would like to see 2.0 with a user facing API that is a drop-in
>> replacement (though allowing for some breaking changes that are NOT
>> back-compat, e.g. getitem indexing). I think it would be acceptable to
>> break the back-end API (meaning to numpy) though.
>>
>> For the resource question, as I have mentioned off-list, I will format
>> this roadmap in order for pandas to support a fund-raising effort to garner
>> resources for these changes.
>>
>> Jeff
>>
>> On Tue, Jul 26, 2016 at 5:13 PM, Stephan Hoyer <shoyer at gmail.com> wrote:
>>
>>> I know I expressed concerns about cross-compatibility with the rest of
>>> the SciPy ecosystem before (especially xarray), but this plan sounds very
>>> solid to me. Flexible data types in N-dimensional arrays are important for
>>> other use cases, but also not really a problem for pandas.
>>>
>>> A separate 2.0 release will let us make the major breaking changes to
>>> the pandas data model necessary for it to work well in the long term. There
>>> are a few other API warts that we will be able to clean up this way (detailed
>>> in github.com/pydata/pandas/issues/10000), indexing on DataFrames being
>>> the most obvious one.
>>>
>>> On Tue, Jul 26, 2016 at 1:51 PM, Wes McKinney <wesmckinn at gmail.com>
>>> wrote:
>>>
>>>> hi folks,
>>>>
>>>> As a continuation of ongoing discussions on GitHub and on the mailing
>>>> list around deprecations and future innovation and internal reworkings
>>>> of pandas, I had a couple of ideas to share that I am looking for
>>>> feedback on.
>>>>
>>>> As far as pandas 0.19.x today, I would like to propose that we
>>>> consider releasing the project as pandas 1.0 in the next major release
>>>> or the one after. The Python community does have a penchant for
>>>> "eternal betas", but after all the hard work of the core developers
>>>> and community over the last 5 years, I think we can safely consider
>>>> making a stable 1.X production release.
>>>>
>>>> If we do decide to release pandas 1.0, I also propose that we strongly
>>>> consider making 1.X an LTS / Long Term Support branch where we can
>>>> continue to make releases, but bug fixes and documentation
>>>> improvements only. Or, we can add new features, but on an extremely
>>>> conservative basis. This might require some changes to development
>>>> process, so looking for feedback on this.
>>>>
>>>> If we commit to this path, I would suggest that we start a pandas-2.0
>>>> integration branch where we can begin more seriously planning and
>>>> executing on
>>>>
>>>> - Cleanup and removal of years' worth of accumulated cruft / legacy code
>>>> - Removal of deprecated features
>>>> - Series and DataFrame internals revamp.
>>>>
>>>> I had hoped that 2016 would offer me more time to work on the
>>>> internals revamp, but between my day job and the 2nd ed of "Python for
>>>> Data Analysis" that turned out to be a little too ambitious. I have
>>>> been almost continuously thinking about how to go about this though,
>>>> and it might be good to figure out a process where we can start
>>>> documenting and coming up with a more granular development roadmap for
>>>> this. Part of this will be carefully documenting any APIs we change or
>>>> unit tests we break along the way.
>>>>
>>>> We would want to give ample time for heavy pandas users to run their
>>>> 3rd-party code based on pandas 2.0-dev to give feedback on whether our
>>>> assumptions about the impact of changes affect real production code.
>>>> As a concrete example: integer and boolean Series would be able to
>>>> accommodate missing data without implicitly casting to float or object
>>>> NumPy dtype respectively. Since many users will have inserted
>>>> workarounds / data massaging code because of such rough edges, this
>>>> may cause code breakage or simply redundancy in some cases. As another
>>>> example: we should probably remove the .ix indexing attribute
>>>> altogether. I'm sure many users are still using .ix, but it would be
>>>> worthwhile to go through such code and decide whether it's really .loc
>>>> or .iloc.
>>>>
>>>> My hope would be (being a deadline-motivated person) that we could see
>>>> a pandas 2.0 alpha release sometime mid- or 2nd half 2017, with a
>>>> target beta / pre-production QA release in early 2018 or thereabouts.
>>>> Part of this would be creating a 1.0 to 2.0 migration guide for users.
>>>>
>>>> My biggest concern with pandas in recent years is how not to be held
>>>> back by strict backwards compatibility and still be able to innovate
>>>> and stay relevant into the 2020s.
>>>>
>>>> For pandas 2.0 some of the most important issues I've been thinking
>>>> about are:
>>>>
>>>> - Logical type abstraction layer / decoupling. pandas-only data types
>>>> (Categorical, DatetimeTZ, Period, etc.) will become equal citizens as
>>>> compared with data types mapping 1-1 on NumPy numeric dtypes
>>>>
>>>> - Decoupling physical storage to permit non-NumPy data structures
>>>> inside Series
>>>>
>>>> - Removal of BlockManager and 2D block consolidation in DataFrame, in
>>>> favor of a native C++ internal table (vector-of-arrays) data structure
>>>>
>>>> - Consistent NA semantics across all data types
>>>>
>>>> - Significantly improved handling of string/UTF8 data (performance,
>>>> memory use -- elimination of PyObject boxes). From the above 2 items,
>>>> we could even make all string arrays internally categorical (with the
>>>> option to explicitly cast to categorical) -- in the database world
>>>> this is often called dictionary encoding.
>>>>
>>>> - Refactor of most Cython algorithms into C++11/14 templates
>>>>
>>>> - Copy-on-write for Series and DataFrame
>>>>
>>>> - Removal of Panel, ndim > 3 data structures
>>>>
>>>> - Analytical expression VM (for example -- things like
>>>> df[boolean_arr].groupby(...).agg(...) could be evaluated by a small
>>>> Numexpr-like VM, not dissimilar to R's dplyr library, with
>>>> significantly improved memory use and maybe performance too)
>>>>
>>>> There's a lot to unpack here, but let me know what everyone thinks
>>>> about these things. The "pandas 2.0" / internals revamp discussion we
>>>> can tackle in a separate thread or perhaps in a GitHub repo or
>>>> design folder in the pandas codebase.
>>>>
>>>> Thanks,
>>>> Wes
>>>> _______________________________________________
>>>> Pandas-dev mailing list
>>>> Pandas-dev at python.org
>>>> https://mail.python.org/mailman/listinfo/pandas-dev

From wesmckinn at gmail.com  Thu Jul 28 18:15:57 2016
From: wesmckinn at gmail.com (Wes McKinney)
Date: Thu, 28 Jul 2016 15:15:57 -0700
Subject: [Pandas-dev] Proposal: pandas 1.0 and ideas for pandas 2.0 /
 future
In-Reply-To: <CAJ1_J5gw4My4_DT2CiyT0xHp3+8GwAKce4=r6hOxVMFDpHveQw@mail.gmail.com>
References: <CAJPUwMCrE3Yp_5neugfBfDzz2QCjusZjnmsDjYiaO08938tErw@mail.gmail.com>
 <CAEQ_TveHwf3s5gnmd17RdSk3GWwc4CPK77rOp-UkeKW_kQOCFQ@mail.gmail.com>
 <CAHMnJKg2ish9DLxw87ctBOq_W+X8eGF_qwbXuC_pR0DUU5GZnQ@mail.gmail.com>
 <CALQtMBZgt4CuKmbHRFua0Fh9nHvGTyDzXKMUdXEY_DWnTcZJ2w@mail.gmail.com>
 <CAJ1_J5gw4My4_DT2CiyT0xHp3+8GwAKce4=r6hOxVMFDpHveQw@mail.gmail.com>
Message-ID: <CAJPUwMC1hLsxUvMewKw68cjXdXjLMg3Hn1Qt4O7NGW8CffjJwQ@mail.gmail.com>

OK, let me try to collect some of the feedback and give my thoughts:

1) 0.19 and 0.20:  I think we should push to release 0.19.0 soon and
then plan what we want to add/change/deprecate for 1.0, which might
otherwise have been 0.20. I would rather not delay 0.19.0: we already
pushed back 0.18.2, and there are some significant new patches
(asof_merge and variable rolling windows) that it would be good to get
into production before we declare a stable 1.0.

2) We will need to raise a significant amount of money for pandas (I
estimate in the ballpark of US $300-500K -- better to have too much
than too little) to be able to pursue the pandas 2.0 plan
wholeheartedly. I would like to dedicate a minimum of 5-10 hours per week
to it in 2017, but this will not be sufficient to do everything (I am
also a human being, and have a day job). It would be better to
collaborate with one or two good freelance developers (with proven
experience in C++ and Python) who are spending at least 50% of their
time on pandas next year. I am going to start spending some time on
design documentation so that we can start resolving some of the design
questions and tradeoffs (not all of these decisions will be easy).
We'll work on this offline and look to start soliciting funding (if
anyone with the ability to write checks is reading, feel free to
contact me offline).

3) I agree we will need to come up with a development process that
facilitates an invasive modification of pandas internals while
also supporting production users of pandas 1.X. Cherry-picking bug
fixes into the pandas 2.x branch will grow increasingly complicated;
we need to factor this into our process (for example: we might collect
all the unit tests for bug fixes -- assuming they rely on definitely
stable behavior -- into a "to fix" folder so that we can return and
adapt the bug fixes once the 2.x branch is getting more stable). To
have developers both maintaining 1.x and trying to drive forward the
2.x branch at the same time does not seem realistic -- we should talk
to the IPython/Jupyter devs to understand how they handled this
through their long-lived IPython 1.0 branch IIRC (see
http://ipython.org/news.html#ipython-1-0).
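
A minimal sketch of the "to fix" idea above, not an agreed process: instead
of a separate folder, it uses the standard library's unittest.expectedFailure
marker to park a regression test whose fix has not been ported yet. The
function normalize_key_2x and the test names are hypothetical stand-ins, not
pandas code.

```python
import io
import unittest


def normalize_key_2x(key):
    """Hypothetical 2.x code path that has not yet received a 1.x bug fix."""
    return key.strip()  # the 1.x fix (also lower-casing the key) is not ported yet


class ParkedRegressionTests(unittest.TestCase):
    @unittest.expectedFailure
    def test_key_is_lowercased(self):
        # Regression test carried over unchanged from the 1.x bug fix.
        self.assertEqual(normalize_key_2x("  A "), "a")


# Run the parked tests quietly; an expected failure does not fail the run.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ParkedRegressionTests)
result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
print(len(result.expectedFailures), result.wasSuccessful())
```

Once the fix is adapted on the 2.x side, the parked test starts passing,
unittest reports it as an unexpected success, and the marker can be removed.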

4) My goal, which I think we're all aligned on, would be for pandas
2.0 to be a drop-in replacement for 90-95% of normal pandas use. Many
power users will have embraced some of the idiosyncrasies of pandas's
implementation details, but I think some of the changes (e.g. missing
data consistency, copy-on-write / improved semantics around memory
ownership and views) will be welcomed. We should clearly document (in
a dedicated "pandas's internal relationship with NumPy" document) and
maintain very tight contracts around what kinds of zero-copy NumPy
interoperability are supported -- it is not clear to me for example
that arrays of Python string/unicode objects are a NumPy use case that
is especially important to preserve, but most numeric data use cases
are. This will also be helpful for power users to understand the
nuances and how things are going to stay the same or change (for
example: boolean and integer arrays with NAs will probably not be
zero-copyable to NumPy arrays).
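
To make the last point concrete, here is a rough sketch in plain Python
(standard library only, not pandas or NumPy code) of an integer column stored
as raw values plus a validity mask. MaskedIntColumn and its methods are
hypothetical names used for illustration; the actual 2.0 design is not
decided in this thread.

```python
import math
from array import array


class MaskedIntColumn:
    """Hypothetical integer column: raw int64 values plus a validity mask."""

    def __init__(self, values):
        # None marks a missing value; its slot holds a placeholder 0.
        self.data = array("q", [0 if v is None else v for v in values])
        self.valid = [v is not None for v in values]

    def has_na(self):
        return not all(self.valid)

    def as_buffer(self):
        """Zero-copy view of the raw integers; only safe when nothing is missing."""
        if self.has_na():
            raise ValueError("missing values present; a converting copy is required")
        return memoryview(self.data)

    def to_float_copy(self):
        """Copying conversion: missing slots become NaN in a new float64 array."""
        return array(
            "d",
            [float(x) if ok else math.nan for x, ok in zip(self.data, self.valid)],
        )


# Without NAs the consumer can share memory with the column.
dense = MaskedIntColumn([1, 2, 3])
view = dense.as_buffer()
dense.data[0] = 10  # the write is visible through the view, so no copy was made

# With NAs the mask has to be folded into the values, which forces a copy.
sparse = MaskedIntColumn([1, None, 3])
converted = sparse.to_float_copy()
```

The design point is that the mask lives outside the value buffer, so a
consumer expecting a plain array cannot see which slots are missing without a
conversion step.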

We should maybe start side threads about each of these items. Just
deciding what we want to deprecate or do in 0.20 aka 1.0 is a large
enough task.

Thanks all
Wes


From jorisvandenbossche at gmail.com  Sun Jul 31 18:03:06 2016
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Mon, 1 Aug 2016 00:03:06 +0200
Subject: [Pandas-dev] Proposal: pandas 1.0 and ideas for pandas 2.0 /
 future
In-Reply-To: <CAJPUwMC1hLsxUvMewKw68cjXdXjLMg3Hn1Qt4O7NGW8CffjJwQ@mail.gmail.com>
References: <CAJPUwMCrE3Yp_5neugfBfDzz2QCjusZjnmsDjYiaO08938tErw@mail.gmail.com>
 <CAEQ_TveHwf3s5gnmd17RdSk3GWwc4CPK77rOp-UkeKW_kQOCFQ@mail.gmail.com>
 <CAHMnJKg2ish9DLxw87ctBOq_W+X8eGF_qwbXuC_pR0DUU5GZnQ@mail.gmail.com>
 <CALQtMBZgt4CuKmbHRFua0Fh9nHvGTyDzXKMUdXEY_DWnTcZJ2w@mail.gmail.com>
 <CAJ1_J5gw4My4_DT2CiyT0xHp3+8GwAKce4=r6hOxVMFDpHveQw@mail.gmail.com>
 <CAJPUwMC1hLsxUvMewKw68cjXdXjLMg3Hn1Qt4O7NGW8CffjJwQ@mail.gmail.com>
Message-ID: <CALQtMBZJCtquV2tnY+1khE3sNK5Lobv=HkVpcPK2MZuhUo1kQg@mail.gmail.com>

Regarding 1), I agree it is a good idea to push a 0.19.0 release soonish,
and we can then discuss what we further want to do (or not to do) for the
1.0 release. I am on holiday for the coming week and a half, but afterwards I
will also focus on getting 0.19.0 out. Maybe a release candidate in the last
week of August is a good deadline?

Joris

2016-07-29 0:15 GMT+02:00 Wes McKinney <wesmckinn at gmail.com>:

> OK, let me try to collect some of the feedback and give my thoughts
>
> 1) 0.19 and 0.20:  I think we should push to release 0.19.0 soon and
> then plan what we want to add/change/deprecate for 1.0, which might
> otherwise have been 0.20. I would rather not delay 0.19.0: we already
> pushed back 0.18.2, and there are some significant new patches
> (asof_merge and variable rolling windows) that it would be good to get
> into production before we declare a stable 1.0.
>
> 2) We will need to raise a significant amount of money for pandas (I
> estimate in the ballpark of US $300-500K -- better to have too much
> than too little) to be able to pursue the pandas 2.0 plan
> wholeheartedly. I would like to dedicate a minimum of 5-10 hours per week
> to it in 2017, but this will not be sufficient to do everything (I am
> also a human being, and have a day job). It would be better to
> collaborate with one or two good freelance developers (with proven
> experience in C++ and Python) who are spending at least 50% of their
> time on pandas next year. I am going to start spending some time on
> design documentation so that we can start resolving some of the design
> questions and tradeoffs (not all of these decisions will be easy).
> We'll work on this offline and look to start soliciting funding (if
> anyone with the ability to write checks is reading, feel free to
> contact me offline).
>
> 3) I agree we will need to come up with a development process that
> facilitates an invasive modification of pandas internals while
> also supporting production users of pandas 1.X. Cherry-picking bug
> fixes into the pandas 2.x branch will grow increasingly complicated;
> we need to factor this into our process (for example: we might collect
> all the unit tests for bug fixes -- assuming they rely on definitely
> stable behavior -- into a "to fix" folder so that we can return and
> adapt the bug fixes once the 2.x branch is getting more stable). To
> have developers both maintaining 1.x and trying to drive forward the
> 2.x branch at the same time does not seem realistic -- we should talk
> to the IPython/Jupyter devs to understand how they handled this
> through their long-lived IPython 1.0 branch IIRC (see
> http://ipython.org/news.html#ipython-1-0).
>
> 4) My goal, which I think we're all aligned on, would be for pandas
> 2.0 to be a drop-in replacement for 90-95% of normal pandas use. Many
> power users will have embraced some of the idiosyncrasies of pandas's
> implementation details, but I think some of the changes (e.g. missing
> data consistency, copy-on-write / improved semantics around memory
> ownership and views) will be welcomed. We should clearly document (in
> a dedicated "pandas's internal relationship with NumPy" document) and
> maintain very tight contracts around what kinds of zero-copy NumPy
> interoperability are supported -- it is not clear to me for example
> that arrays of Python string/unicode objects are a NumPy use case that
> is especially important to preserve, but most numeric data use cases
> are. This will also be helpful for power users to understand the
> nuances and how things are going to stay the same or change (for
> example: boolean and integer arrays with NAs will probably not be
> zero-copyable to NumPy arrays).
>
> We should maybe start side threads about each of these items. Just
> deciding what we want to deprecate or do in 0.20 aka 1.0 is a large
> enough task.
>
> Thanks all
> Wes
>
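A small illustration of the last point in item 4 above (integer arrays with NAs probably not being zero-copyable to NumPy). This is only a sketch under my own assumptions, not the planned pandas 2.0 design: the class name MaskedInt64 and its methods are hypothetical, and it uses the standard-library array module as a stand-in for a real typed buffer. The idea is that the values live in one buffer and the "is valid" flags in another, so the raw value buffer alone cannot be handed out as a plain array without losing the NAs, and converting to float to carry NaN needs a copy and can lose precision.

```python
from array import array
from typing import Iterable, List, Optional


class MaskedInt64:
    """Hypothetical integer column: raw int64 values plus a validity mask."""

    def __init__(self, items: Iterable[Optional[int]]) -> None:
        data = list(items)
        # Missing slots hold a placeholder 0; only the mask says they are NA.
        self.values = array("q", (0 if x is None else x for x in data))
        self.valid = bytearray(0 if x is None else 1 for x in data)

    def to_list(self) -> List[Optional[int]]:
        """Round-trip back to Python objects, restoring None for NA slots."""
        return [v if ok else None for v, ok in zip(self.values, self.valid)]

    def total(self) -> int:
        """Sum of the valid entries only; stays an exact integer."""
        return sum(v for v, ok in zip(self.values, self.valid) if ok)

    def to_float_list(self) -> List[float]:
        """Float view with NaN for NA: a new list (a copy), possibly lossy."""
        return [
            float(v) if ok else float("nan")
            for v, ok in zip(self.values, self.valid)
        ]


col = MaskedInt64([1, None, 2**53 + 1])
print(col.to_list())        # integers and None preserved exactly
print(col.total())          # exact integer arithmetic over valid slots
print(col.to_float_list())  # the large integer is rounded when cast to float
```

The float conversion is the behaviour users work around today; keeping a separate mask avoids it, at the cost of the buffer no longer being directly usable as an ordinary integer array.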
> On Wed, Jul 27, 2016 at 8:39 PM, G Young <gfyoung17 at gmail.com> wrote:
> > 1) I would be in favour of releasing 0.19.0 in part because we already
> > pushed back and actually forgone 0.18.2.  I think these plans are better
> > served for the release after this one to give more time to map this but
> > also to push out the changes that have already been made in preparation
> > for this release.
> >
> > 2) In terms of organisation, I wonder if we might be better served
> > reorganising the way in which PR's are reviewed during the time period
> > between one release and the next instead of having these parallel tracks
> > of development in light of the concern brought up by @jorisvandenbossche.
> > Perhaps rather than just reviewing PR's as they come in, specify which
> > types of PR's should be submitted during certain periods of time.
> >
> > For example, a large chunk of the period could be devoted to accepting
> > enhancements / new features after which the remaining time before a
> > release could be devoted to just organisation / refactoring / deprecations
> > / what have you (maybe include bug fixes too).  That way we could have a
> > contiguous block of time to focus on stabilising and tidying up the
> > release.  It would also allow for the refactoring to take place (perhaps
> > incrementally) without the concern of being destabilised by a new feature.
> >
> > For this to work, this would have to be clearly stated in the
> > contributing docs as well as circulated in emails to pandas-dev AND other
> > related groups so that way people know what's going on in terms of the
> > development cycle.
> >
> >
> >
> > On Wed, Jul 27, 2016 at 7:51 PM, Joris Van den Bossche
> > <jorisvandenbossche at gmail.com> wrote:
> >>
> >> Wes, thanks for your mail!
> >>
> >> I like the idea of first releasing a pandas 1.0 before the 'big
> >> refactor'. We for sure know that this will take a while to stabilize
> >> (even with a lot of resources), and I think the idea was to provide a
> >> kind of LTS release. In that regard, it is just clearer to name this
> >> pandas 1.x than 0.19.x.
> >>
> >> Maybe we can start a separate thread to discuss this 1.0, as there are
> >> of course some questions to discuss:
> >> - do we first release 0.19 (we didn't specifically discuss this, but I
> >> think the rough idea was to have a release candidate somewhere in
> >> August), or do we directly aim at 1.0?
> >> - are there some certain changes we want to do before 1.0 that are
> >> feasible in the short term?
> >> - are there some of the current ideas of deprecations that we should
> >> exclude/include for this release? (eg I think deprecating PanelND (as
> >> just landed in master) is good, but the idea of deprecating Panel should
> >> rather wait until 2.0?)
> >> - ...
> >>
> >> How exactly to tackle those bug fix releases / LTS branch, is also
> >> something that can be discussed, but I would not worry too much about
> >> that (there are enough examples of other projects doing something
> >> similar, we just have to search for a process that suits us).
> >>
> >> What I think is a more important issue or problem with this process is
> >> the community of contributors. If we would effectively have a period of
> >> about two years (before a final 2.0 release) where for the current (1.0)
> >> version only certain bug-fixes are considered, but on the other hand it
> >> is still difficult to contribute to the new version, we would maybe have
> >> to say no to many of the PRs or enhancement ideas. Such a situation could
> >> hinder the process of community contributions and participation.
> >> And there are currently a lot of contributions. As Jeff also said, the
> >> current active contributors are barely keeping up with managing all
> >> issues and pull requests. I have worked the last few weeks more on pandas
> >> (thanks to Continuum), and indeed I spent most of my time answering
> >> issues and reviewing PRs, and hardly have any time to do much coding
> >> myself. But of course this is also a choice that I currently make. And I
> >> (we) could also make the choice to focus more on pandas 1.0/2.0 related
> >> issues, or try to steer some of the active contributors to that.
> >>
> >> I also have some concerns about the compatibility with the rest of the
> >> ecosystem, but at the same time it is clear I think that there should be
> >> some kind of refactor, and it is in the further elaboration of the
> >> roadmap that such concerns can be addressed.
> >>
> >> Joris
> >>
> >>
> >>
> >> 2016-07-27 12:04 GMT+02:00 Jeff Reback <jeffreback at gmail.com>:
> >>>
> >>> I applaud the vision and ambition for the roadmap of the future of
> >>> pandas.
> >>>
> >>> However, the resources are lacking for much of these changes. Currently
> >>> pandas is just barely keeping up with the (recently increased) user flow
> >>> of pull-requests, not to mention the issue reports. These are all great
> >>> indicators of community use and exercising the edge cases.
> >>>
> >>> A roadmap is an excellent start, but the resource question needs to be
> >>> front and center.
> >>>
> >>> The current process *could* evolve into LTS. In 0.19.0, lots of progress
> >>> towards removing older code (and of course deprecating things) is
> >>> happening. An aggressive push of this into 0.20.0 will go a long ways
> >>> towards de-facto establishing 1.0 / LTS. (and maybe that's what we
> >>> simply call 0.20.0).
> >>>
> >>> I would agree we could simply release 1.0 / LTS without adding any 'new'
> >>> features (like fixed getitem indexing and such).
> >>>
> >>> I would like to see 2.0 with a user facing API that is a drop-in
> >>> replacement (though allowing for some breaking changes that are NOT
> >>> back-compat, e.g. getitem indexing). I think it would be acceptable to
> >>> break the back-end API (meaning to numpy) though.
> >>>
> >>> For the resource question, as I have mentioned off-list, I will format
> >>> this roadmap in order for pandas to support a fund-raising effort to
> >>> garner resources for these changes.
> >>>
> >>> Jeff
> >>>
> >>> On Tue, Jul 26, 2016 at 5:13 PM, Stephan Hoyer <shoyer at gmail.com> wrote:
> >>>>
> >>>> I know I expressed concerns about cross-compatibility with the rest of
> >>>> the SciPy ecosystem before (especially xarray), but this plan sounds
> >>>> very solid to me. Flexible data types in N-dimensional arrays are
> >>>> important for other use cases, but also not really a problem for pandas.
> >>>>
> >>>> A separate 2.0 release will let us make the major breaking changes to
> >>>> the pandas data model necessary for it to work well in the long term.
> >>>> There are a few other API warts that we will be able to clean up this
> >>>> way (detailed in github.com/pydata/pandas/issues/10000), indexing on
> >>>> DataFrames being the most obvious one.
> >>>>
> >>>> On Tue, Jul 26, 2016 at 1:51 PM, Wes McKinney <wesmckinn at gmail.com>
> >>>> wrote:
> >>>>>
> >>>>> hi folks,
> >>>>>
> >>>>> As a continuation of ongoing discussions on GitHub and on the mailing
> >>>>> list around deprecations and future innovation and internal reworkings
> >>>>> of pandas, I had a couple of ideas to share that I am looking for
> >>>>> feedback on.
> >>>>>
> >>>>> As far as pandas 0.19.x today, I would like to propose that we
> >>>>> consider releasing the project as pandas 1.0 in the next major release
> >>>>> or the one after. The Python community does have a penchant for
> >>>>> "eternal betas", but after all the hard work of the core developers
> >>>>> and community over the last 5 years, I think we can safely consider
> >>>>> making a stable 1.X production release.
> >>>>>
> >>>>> If we do decide to release pandas 1.0, I also propose that we strongly
> >>>>> consider making 1.X an LTS / Long Term Support branch where we can
> >>>>> continue to make releases, but bug fixes and documentation
> >>>>> improvements only. Or, we can add new features, but on an extremely
> >>>>> conservative basis. This might require some changes to development
> >>>>> process, so looking for feedback on this.
> >>>>>
> >>>>> If we commit to this path, I would suggest that we start a pandas-2.0
> >>>>> integration branch where we can begin more seriously planning and
> >>>>> executing on
> >>>>>
> >>>>> - Cleanup and removal of years' worth of accumulated cruft / legacy
> >>>>> code
> >>>>> - Removal of deprecated features
> >>>>> - Series and DataFrame internals revamp.
> >>>>>
> >>>>> I had hoped that 2016 would offer me more time to work on the
> >>>>> internals revamp, but between my day job and the 2nd ed of "Python for
> >>>>> Data Analysis" that turned out to be a little too ambitious. I have
> >>>>> been almost continuously thinking about how to go about this though,
> >>>>> and it might be good to figure out a process where we can start
> >>>>> documenting and coming up with a more granular development roadmap for
> >>>>> this. Part of this will be carefully documenting any APIs we change or
> >>>>> unit tests we break along the way.
> >>>>>
> >>>>> We would want to give ample time for heavy pandas users to run their
> >>>>> 3rd-party code based on pandas 2.0-dev to give feedback on whether our
> >>>>> assumptions about the impact of changes affect real production code.
> >>>>> As a concrete example: integer and boolean Series would be able to
> >>>>> accommodate missing data without implicitly casting to float or object
> >>>>> NumPy dtype respectively. Since many users will have inserted
> >>>>> workarounds / data massaging code because of such rough edges, this
> >>>>> may cause code breakage or simply redundancy in some cases. As another
> >>>>> example: we should probably remove the .ix indexing attribute
> >>>>> altogether. I'm sure many users are still using .ix, but it would be
> >>>>> worthwhile to go through such code and decide whether it's really .loc
> >>>>> or .iloc.
> >>>>>
> >>>>> My hope would be (being a deadline-motivated person) that we could see
> >>>>> a pandas 2.0 alpha release sometime mid- or 2nd half 2017, with a
> >>>>> target beta / pre-production QA release in early 2018 or thereabouts.
> >>>>> Part of this would be creating a 1.0 to 2.0 migration guide for users.
> >>>>>
> >>>>> My biggest concern with pandas in recent years is how not to be held
> >>>>> back by strict backwards compatibility and still be able to innovate
> >>>>> and stay relevant into the 2020s.
> >>>>>
> >>>>> For pandas 2.0 some of the most important issues I've been thinking
> >>>>> about are:
> >>>>>
> >>>>> - Logical type abstraction layer / decoupling. pandas-only data types
> >>>>> (Categorical, DatetimeTZ, Period, etc.) will become equal citizens as
> >>>>> compared with data types mapping 1-1 on NumPy numeric dtypes
> >>>>>
> >>>>> - Decoupling physical storage to permit non-NumPy data structures
> >>>>> inside Series
> >>>>>
> >>>>> - Removal of BlockManager and 2D block consolidation in DataFrame, in
> >>>>> favor of a native C++ internal table (vector-of-arrays) data structure
> >>>>>
> >>>>> - Consistent NA semantics across all data types
> >>>>>
> >>>>> - Significantly improved handling of string/UTF8 data (performance,
> >>>>> memory use -- elimination of PyObject boxes). From the above 2 items,
> >>>>> we could even make all string arrays internally categorical (with the
> >>>>> option to explicitly cast to categorical) -- in the database world
> >>>>> this is often called dictionary encoding.
> >>>>>
> >>>>> - Refactor of most Cython algorithms into C++11/14 templates
> >>>>>
> >>>>> - Copy-on-write for Series and DataFrame
> >>>>>
> >>>>> - Removal of Panel, ndim > 3 data structures
> >>>>>
> >>>>> - Analytical expression VM (for example -- things like
> >>>>> df[boolean_arr].groupby(...).agg(...) could be evaluated by a small
> >>>>> Numexpr-like VM, not dissimilar to R's dplyr library, with
> >>>>> significantly improved memory use and maybe performance too)
> >>>>>
> >>>>> There's a lot to unpack here, but let me know what everyone thinks
> >>>>> about these things. The "pandas 2.0" / internals revamp discussion we
> >>>>> can tackle in a separate thread or perhaps in a GitHub repo or
> >>>>> design folder in the pandas codebase.
> >>>>>
> >>>>> Thanks,
> >>>>> Wes
> >>>>> _______________________________________________
> >>>>> Pandas-dev mailing list
> >>>>> Pandas-dev at python.org
> >>>>> https://mail.python.org/mailman/listinfo/pandas-dev
>
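On the string-handling bullet in the quoted original message (making string arrays internally categorical, i.e. dictionary encoding), here is a minimal pure-Python sketch of what dictionary encoding means. It is an illustration only and says nothing about how pandas would implement it: the function names dictionary_encode and dictionary_decode, and the choice of -1 as the code for a missing value, are my own assumptions.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def dictionary_encode(
    values: Sequence[Optional[str]],
) -> Tuple[List[str], List[int]]:
    """Split values into (categories, codes).

    Each distinct string is stored once in `categories`, in order of first
    appearance; `codes` holds one small integer per input value.
    A missing value (None) is encoded as -1 (an assumption of this sketch).
    """
    categories: List[str] = []
    lookup: Dict[str, int] = {}
    codes: List[int] = []
    for value in values:
        if value is None:
            codes.append(-1)
            continue
        code = lookup.get(value)
        if code is None:
            # First time we see this string: assign the next code.
            code = len(categories)
            lookup[value] = code
            categories.append(value)
        codes.append(code)
    return categories, codes


def dictionary_decode(
    categories: Sequence[str], codes: Sequence[int]
) -> List[Optional[str]]:
    """Rebuild the original values from (categories, codes)."""
    return [None if code < 0 else categories[code] for code in codes]


raw = ["a", "b", "a", None, "c", "b", "a"]
categories, codes = dictionary_encode(raw)
print(categories)
print(codes)
print(dictionary_decode(categories, codes) == raw)
```

With many repeated strings, the per-row cost drops to one integer code, which is where the memory and performance gains mentioned in that bullet would come from.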