[Pandas-dev] What could a pandas 2.0 look like?

Tom Augspurger tom.augspurger88 at gmail.com
Fri Feb 14 16:02:19 EST 2020


> Replacing NaT with NA breaks arithmetic consistency

This means the result dtype of an operation between a Series and a scalar,
right? If so, it's worth deciding whether that's more valuable than consistent
behavior of missing values in arithmetic and comparison operations.
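
A rough sketch of the two behaviors being weighed, assuming pandas 1.0 or
later. The variable names are made up for illustration, and this is not meant
as a statement of how a future NA-based datetime dtype would behave:

```python
import pandas as pd

# Scalars: NaT follows NaN-like comparison semantics, pd.NA propagates.
nat_eq = pd.NaT == pd.NaT   # False
na_eq = pd.NA == pd.NA      # <NA>

# Comparisons: a datetime Series (NaT) versus a nullable integer Series (NA).
dates = pd.Series(pd.to_datetime(["2020-02-14", None]))
ints = pd.Series([1, None], dtype="Int64")
dates_eq = dates == dates   # plain bool dtype, False at the missing position
ints_eq = ints == ints      # nullable "boolean" dtype, <NA> at the missing position

# Arithmetic: NaT is datetime-like, so subtracting it from a datetime
# Series yields a timedelta result (all missing).
diff = dates - pd.NaT

print(dates_eq.tolist(), ints_eq.tolist(), diff.dtype)
```

The last line is the "result dtype" side of the trade-off: NaT carries enough
type information for the subtraction to produce a timedelta dtype. The two
comparison lines are the "consistency" side: the missing entry compares as
False in one case and as <NA> in the other.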

On Thu, Feb 13, 2020 at 4:23 PM Brock Mendel <jbrockmendel at gmail.com> wrote:

> > This would also imply creating a nullable float dtype and making our
> datelikes use NA rather than NaT too. That seemed to be generally OK, but
> wasn't discussed too much.
>
> My understanding of the discussion is that using a mask on top of
> datetimelike arrays would not _replace_ NaT, but supplement it with
> something semantically different.  Replacing NaT with NA breaks arithmetic
> consistency, as has been discussed ad nauseam.
>
> On Wed, Feb 12, 2020 at 3:29 PM Tom Augspurger <tom.augspurger88 at gmail.com>
> wrote:
>
>> Thanks Joris.
>>
>> This was discussed on the call today:
>> https://docs.google.com/document/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit?usp=sharing.
>> I'll try to summarize the discussion here.
>>
>> On NA by default, there were a few concerns, none of which is likely a
>> blocker. Things like the memory overhead of masks can be improved by making
>> them optional (relatively easy) and possibly using a bitmask (probably
>> harder).
>>
>> I wondered if this was blocked by the BlockManager being written in
>> Python. This change would imply that blockwise ops would become columnwise,
>> so we'll have more overhead for some ops. Joris cited a bit of work he did
>> to make this not too bad, at least for tables that are not too wide.
>>
>> I also wondered whether this would be inappropriate as long as NA lives
>> in pandas, rather than something that is understood by the entire
>> scientific python ecosystem. It's worth thinking about and seeing how the
>> community reacts to NA. Probably not a blocker.
>>
>> This would also imply creating a nullable float dtype and making our
>> datelikes use NA rather than NaT too. That seemed to be generally OK, but
>> wasn't discussed too much.
>>
>> ---
>>
>> Other "2.0" topics included rethinking our dependencies. It's possible
>> Arrow could be added. Going nullable by default would make Arrow a pretty
>> attractive option for storing arrays. But we would need to consult with our
>> downstream dependencies (like xarray) and users about that.
>>
>> Fixing __getitem__ was also discussed. That will take someone to write up
>> a detailed proposal about specific proposed changes and possible
>> deprecation paths.
>>
>> Tom
>>
>> On Mon, Feb 10, 2020 at 11:43 AM Joris Van den Bossche <
>> jorisvandenbossche at gmail.com> wrote:
>>
>>> pandas 1.0 is out, so time to start thinking about 2.0 ;)
>>>
>>> In principle, pandas 2.0 will just be one of the next releases when we
>>> decide we want to clean up the deprecations / make a few changes that are
>>> hard to deprecate (following our new versioning policy).
>>> But nonetheless, I think it is still worth asking whether it can be
>>> something more than that, with more specific goals in mind*.
>>>
>>> Last year I made the pd.NA proposal, which resulted in using that for
>>> the nullable integer, boolean and string dtypes. In the proposal, pd.NA was
>>> described as "can be used consistently across all data types". And for me,
>>> the aspirational end goal of this proposal is to *actually* have this
>>> for *all* dtypes, but we never really discussed this aspect explicitly.
>>>
>>> So, for me, a possible future pandas 2.0:
>>>
>>>    - Uses all "nullable dtypes" by default (i.e. dtypes that use pd.NA as
>>>    missing value indicator). That means we add a nullable version of all other
>>>    dtypes (as we now already did for int, boolean, string). End goal: a single
>>>    missing value indicator with the same behavior for all dtypes.
>>>    - If we add such nullable dtypes using the extension dtypes/array
>>>    mechanism (so it can first be opt-in in 1.X), this could "automatically"
>>>    lead to a simplification of the internals / Block Manager (another
>>>    aspirational goal that has been discussed before, but never became
>>>    concrete). Because in such a case (all extension dtypes), we would only be
>>>    using 1D blocks (simplifying the 1D / 2D thorny cases in internals). This
>>>    simplifies the memory model, consolidation, etc.
>>>
>>> Do you think this is a desirable goal? And realistic? Other aspirational
>>> goals?
>>>
>>> Best,
>>> Joris
>>>
>>> *Agreeing on goals doesn't mean it will happen, that's open source (or
>>> at least community-based open source). But I think it can still be useful
>>> to guide some efforts where possible or in trying to get traction for
>>> certain issues from contributors. And then we can still see if it gets done
>>> in 2.0, 3.0, 4.0 or never ;)
>>> _______________________________________________
>>> Pandas-dev mailing list
>>> Pandas-dev at python.org
>>> https://mail.python.org/mailman/listinfo/pandas-dev
>>>
>>
>