[Pandas-dev] Proposal: pandas 1.0 and ideas for pandas 2.0 / future

Stephan Hoyer shoyer at gmail.com
Tue Jul 26 17:13:11 EDT 2016


I know I expressed concerns about cross-compatibility with the rest of the
SciPy ecosystem before (especially xarray), but this plan sounds very solid
to me. Flexible data types in N-dimensional arrays are important for other
use cases, but also not really a problem for pandas.

A separate 2.0 release will let us make the major breaking changes to the
pandas data model necessary for it to work well in the long term. There
are a few other API warts that we will be able to clean up this way
(detailed in github.com/pydata/pandas/issues/10000), indexing on
DataFrames being the most obvious one.
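
For instance (an illustrative snippet, not necessarily the exact wart
tracked in that issue): plain [] indexing on a DataFrame dispatches to
columns for labels but to rows for slices and boolean masks, which is
exactly the kind of ambiguity a breaking release could clean up.

    import pandas as pd

    df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
    df['a']          # selects a *column* by label
    df[0:2]          # selects *rows* by positional slice
    df[df['a'] > 1]  # selects *rows* by boolean mask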

On Tue, Jul 26, 2016 at 1:51 PM, Wes McKinney <wesmckinn at gmail.com> wrote:

> hi folks,
>
> As a continuation of ongoing discussions on GitHub and on the mailing
> list around deprecations and future innovation and internal reworkings
> of pandas, I had a couple of ideas to share that I am looking for
> feedback on.
>
> As far as pandas 0.19.x today is concerned, I would like to propose
> that we consider releasing the project as pandas 1.0 in the next major
> release or the one after. The Python community does have a penchant for
> "eternal betas", but after all the hard work of the core developers
> and community over the last 5 years, I think we can safely consider
> making a stable 1.X production release.
>
> If we do decide to release pandas 1.0, I also propose that we strongly
> consider making 1.X an LTS / Long Term Support branch where we continue
> to make releases, but with bug fixes and documentation improvements
> only. Alternatively, we could add new features, but on an extremely
> conservative basis. This might require some changes to our development
> process, so I am looking for feedback on this.
>
> If we commit to this path, I would suggest that we start a pandas-2.0
> integration branch where we can begin more seriously planning and
> executing on
>
> - Cleanup and removal of years' worth of accumulated cruft / legacy code
> - Removal of deprecated features
> - Series and DataFrame internals revamp.
>
> I had hoped that 2016 would offer me more time to work on the
> internals revamp, but between my day job and the 2nd edition of
> "Python for Data Analysis", that turned out to be a little too
> ambitious. I have been thinking almost continuously about how to go
> about this, though, and it might be good to figure out a process for
> documenting it and putting together a more granular development
> roadmap. Part of that will be carefully documenting any APIs we change
> or unit tests we break along the way.
>
> We would want to give heavy pandas users ample time to run their
> third-party code against pandas 2.0-dev and give feedback on whether
> our assumptions about the impact of the changes hold up in real
> production code. As a concrete example: integer and boolean Series
> would be able to accommodate missing data without implicitly casting
> to the float or object NumPy dtypes, respectively. Since many users
> will have inserted workarounds / data-massaging code because of such
> rough edges, this may cause code breakage, or simply redundancy, in
> some cases. As another example: we should probably remove the .ix
> indexing attribute altogether. I'm sure many users are still using
> .ix, but it would be worthwhile to go through such code and decide
> whether each use is really .loc or .iloc.
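>
> For concreteness, a tiny sketch of both rough edges under today's
> (0.19-era) behavior -- illustration only, not 2.0 code:
>
>     import numpy as np
>     import pandas as pd
>
>     s = pd.Series([1, 2, 3])   # int64
>     s.loc[1] = np.nan          # silently upcasts the Series to float64
>
>     df = pd.DataFrame({'a': [10, 20, 30]}, index=[2, 0, 1])
>     df.ix[0]     # ambiguous: label 0 or position 0?
>     df.loc[0]    # unambiguously by label
>     df.iloc[0]   # unambiguously by position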
>
> My hope would be (being a deadline-motivated person) that we could see
> a pandas 2.0 alpha release sometime in mid-2017 or the second half of
> 2017, with a target beta / pre-production QA release in early 2018 or
> thereabouts. Part of this would be creating a 1.0-to-2.0 migration
> guide for users.
>
> My biggest concern with pandas in recent years is how to avoid being
> held back by strict backwards compatibility while still being able to
> innovate and stay relevant into the 2020s.
>
> For pandas 2.0 some of the most important issues I've been thinking about
> are:
>
> - Logical type abstraction layer / decoupling. pandas-only data types
> (Categorical, DatetimeTZ, Period, etc.) will become first-class
> citizens alongside data types that map 1-1 onto NumPy numeric dtypes
>
> - Decoupling physical storage to permit non-NumPy data structures inside
> Series
>
> - Removal of BlockManager and 2D block consolidation in DataFrame, in
> favor of a native C++ internal table (vector-of-arrays) data structure
>
> - Consistent NA semantics across all data types
>
> - Significantly improved handling of string/UTF8 data (performance,
> memory use -- elimination of PyObject boxes). Building on the previous
> two items, we could even make all string arrays internally categorical
> (with the option to explicitly cast to categorical) -- in the database
> world this is often called dictionary encoding (see the sketch after
> this list).
>
> - Refactor of most Cython algorithms into C++11/14 templates
>
> - Copy-on-write for Series and DataFrame
>
> - Removal of Panel and ndim > 3 data structures
>
> - Analytical expression VM (for example -- things like
> df[boolean_arr].groupby(...).agg(...) could be evaluated by a small
> Numexpr-like VM, not dissimilar to R's dplyr library, with
> significantly improved memory use and maybe performance too; see the
> sketch after this list)
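>
> To make the string-handling and deferred-evaluation points concrete,
> here is a rough sketch using today's APIs (assuming the 2.0 internals
> would do the dictionary encoding transparently rather than via an
> explicit cast):
>
>     import pandas as pd
>
>     # Dictionary encoding: Categorical stores small integer codes plus
>     # a table of unique values instead of one boxed PyObject per cell.
>     s = pd.Series(['nyc', 'sfo', 'lax', 'nyc', 'nyc'] * 100000)
>     c = s.astype('category')
>     s.memory_usage(deep=True)   # every element is a boxed Python str
>     c.memory_usage(deep=True)   # int8 codes + 3 unique strings
>
>     # Eager evaluation: each step below materializes a full
>     # intermediate; a small Numexpr/dplyr-style VM could fuse the
>     # filter, groupby, and aggregation into a single pass.
>     df = pd.DataFrame({'key': ['a', 'b'] * 50000,
>                        'val': range(100000)})
>     df[df['val'] > 1000].groupby('key').agg('sum')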
>
> There's a lot to unpack here, but let me know what everyone thinks
> about these things. The "pandas 2.0" / internals revamp discussion we
> can tackle in a separate thread, or perhaps in a GitHub repo or a
> design folder in the pandas codebase.
>
> Thanks,
> Wes
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev
>