[Pandas-dev] What could a pandas 2.0 look like?

Wed Feb 12 18:28:48 EST 2020

Thanks Joris.

This was discussed on the call today:
https://docs.google.com/document/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit?usp=sharing.
I'll try to summarize the discussion here.

On NA by default, there were a few concerns, none of which is likely a
blocker. Things like the memory overhead of masks can be improved by making
them optional (relatively easy) and possibly using a bitmask (probably
harder).
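
To put rough numbers on the mask overhead, here is a minimal sketch using
plain NumPy. It is not how pandas stores masks internally; the array size and
variable names are made up for illustration. It compares a one-byte-per-value
boolean mask with the same information packed into one bit per value.

```python
import numpy as np

n = 1_000_000  # hypothetical column length, chosen for illustration

values = np.arange(n, dtype="int64")
mask = np.zeros(n, dtype=bool)   # True marks a missing value
mask[::10] = True                # pretend every 10th value is missing

# Pack the boolean mask into bits: 8 mask entries per byte.
bitmask = np.packbits(mask, bitorder="little")

print(values.nbytes)   # 8 bytes per int64 value
print(mask.nbytes)     # 1 byte per value for the boolean mask
print(bitmask.nbytes)  # 1 bit per value for the packed mask

# Unpacking restores the original boolean mask.
restored = np.unpackbits(bitmask, count=n, bitorder="little").astype(bool)
```

With these sizes the boolean mask adds 1/8 on top of the int64 data, and the
packed version 1/64. The cost is that every operation touching the mask has to
do bit arithmetic, which is probably part of why the bitmask is the harder
option.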

I wondered if this was blocked by the BlockManager being written in Python.
This change would imply that block-wise ops would become column-wise, so we'll
have more overhead for some ops. Joris cited a bit of work he did to make
this not too bad, at least for tables that are not too wide.

I also wondered whether this would be inappropriate as long as NA lives in
pandas, rather than something that is understood by the entire scientific
python ecosystem. It's worth thinking about and seeing how the community
reacts to NA. Probably not a blocker.

This would also imply creating a nullable float dtype and making our
datelikes use NA rather than NaT too. That seemed to be generally OK, but
wasn't discussed too much.
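
For context, a small sketch of the current split between missing-value
indicators, assuming pandas 1.0 or later. The nullable integer and boolean
dtypes already use pd.NA, while the default float and datetime dtypes still
use NaN and NaT. The example data is made up.

```python
import numpy as np
import pandas as pd

# Nullable extension dtypes: missing values are pd.NA.
ints = pd.array([1, None, 3], dtype="Int64")
bools = pd.array([True, None], dtype="boolean")
print(ints[1] is pd.NA, bools[1] is pd.NA)

# Comparisons propagate NA instead of returning False.
eq = ints == 1
print(eq)

# Default dtypes: float uses NaN, datetimes use NaT.
floats = pd.Series([1.0, None])
dates = pd.Series(pd.to_datetime(["2020-02-12", None]))
print(np.isnan(floats.iloc[1]), dates.iloc[1] is pd.NaT)
```

NA by default would mean the last two cases behave like the first two.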

---

Other "2.0" topics included rethinking our dependencies. It's possible
Arrow could be added. Going nullable by default would make Arrow a pretty
attractive option for storing arrays. But we would need to consult with our
downstream dependencies (like xarray) and users about that.

Fixing __getitem__ was also discussed. That will need someone to write up a
detailed proposal covering the specific changes and possible deprecation
paths.

Tom

On Mon, Feb 10, 2020 at 11:43 AM Joris Van den Bossche <
jorisvandenbossche at gmail.com> wrote:

> pandas 1.0 is out, so time to start thinking about 2.0 ;)
>
> In principle, pandas 2.0 will just be one of the next releases when we
> decide we want to clean-up the deprecations / make a few changes that are
> hard to deprecate (following our new versioning policy).
> But nonetheless, I think it is still interesting to consider whether it
> can also be something more than that, with more specific goals in
> mind*.
>
> Last year I made the pd.NA proposal, which resulted in using that for the
> nullable integer, boolean and string dtypes. In the proposal, pd.NA was
> described as "can be used consistently across all data types". And for me,
> the aspirational end goal of this proposal is to *actually* have this for
> *all* dtypes, but we never really discussed this aspect explicitly.
>
> So, for me, a possible future pandas 2.0:
>
>    - Uses all "nullable dtypes" by default (i.e. dtypes that use pd.NA as
>    missing value indicator). That means we add a nullable version of all other
>    dtypes (as we now already did for int, boolean, string). End goal: a single
>    missing value indicator with the same behavior for all dtypes.
>    - If we add such nullable dtypes using the extension dtypes/array
>    mechanism (so it can first be opt-in in 1.X), this could "automatically"
>    lead to a simplification of the internals / Block Manager (another
>    aspirational goal that has been discussed before, but never became
>    concrete). Because in such a case (all extension dtypes), we would only be
>    using 1D blocks (simplifying the 1D / 2D thorny cases in internals). This
>    simplifies the memory model, consolidation, etc.
>
> Do you think this is a desirable goal? And realistic? Other aspirational
> goals?
>
> Best,
> Joris
>
> *Agreeing on goals doesn't mean it will happen, that's open source (or at
> least community-based open source). But I think it can still be useful to
> guide some efforts where possible or in trying to get traction for certain
> issues from contributors. And then we can still see if it gets done in 2.0,
> 3.0, 4.0 or never ;)