[Pandas-dev] Fwd: What could a pandas 2.0 look like?

Joris Van den Bossche jorisvandenbossche at gmail.com
Wed Feb 26 05:40:46 EST 2020


On Thu, 20 Feb 2020 at 00:52, Brock Mendel <jbrockmendel at gmail.com> wrote:

> Pivoting: Joris, on the call you mentioned a TimestampArray.  Can you
> expand on that a bit?
>

Basically what I mentioned before in this thread: a new ExtensionArray that
uses pd.NA as the missing value indicator instead of pd.NaT, and where the
NAs are potentially tracked in a mask (as done for the nullable integer
dtypes).
There is more to it (like allowing more resolutions? a single dtype for
tz-naive/tz-aware? ...), but I will try to open a separate, more in-depth
discussion about this shortly.
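
To make the mask idea concrete, here is a rough sketch. The first part uses
the existing nullable integer dtype; the `MaskedTimestamps` class is purely
hypothetical (it is not a pandas API, and its name, attributes and methods
are assumptions made for illustration). It only shows how a boolean mask
next to the datetime64 values lets missing entries come back as pd.NA:

```python
import numpy as np
import pandas as pd

# Existing behaviour: the nullable integer dtype reports missing values as pd.NA.
ints = pd.array([1, None, 3], dtype="Int64")
assert ints[1] is pd.NA


class MaskedTimestamps:
    """Hypothetical sketch, NOT a pandas API: datetime64 values plus a mask."""

    def __init__(self, values, mask):
        self.values = np.asarray(values, dtype="datetime64[ns]")
        self.mask = np.asarray(mask, dtype=bool)  # True marks a missing entry

    def __getitem__(self, i):
        # Missing entries surface as pd.NA instead of pd.NaT; the value stored
        # underneath a masked slot is ignored.
        if self.mask[i]:
            return pd.NA
        return pd.Timestamp(self.values[i])

    def isna(self):
        return self.mask.copy()


# The second value is a placeholder; the mask marks it as missing.
ts = MaskedTimestamps(["2020-02-26", "1970-01-01"], [False, True])
```

With a mask, the values buffer no longer needs a reserved "in-band" sentinel
to mark missing entries.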

But the NA vs NaT issue is somewhat similar to NA vs NaN, for which I
just opened an issue to discuss this in more detail:
https://github.com/pandas-dev/pandas/issues/32265
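
For reference, a small sketch of the comparison semantics under discussion,
using only public pandas and NumPy objects (the variable names are just for
illustration): NaN and NaT compare as unequal, while pd.NA propagates.

```python
import numpy as np
import pandas as pd

# NaN and NaT are "always unequal" in comparisons.
nan_eq = np.nan == np.nan
nat_eq = pd.NaT == pd.NaT

# pd.NA propagates: the result of the comparison is itself pd.NA.
na_eq = pd.NA == pd.NA

# The same holds element-wise for the nullable integer dtype.
cmp = pd.array([1, None], dtype="Int64") == 1
```

Here `nan_eq` and `nat_eq` are both false, `na_eq` is pd.NA, and the missing
element of `cmp` is pd.NA rather than False.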



>
> On Wed, Feb 19, 2020 at 2:55 PM Joris Van den Bossche <
> jorisvandenbossche at gmail.com> wrote:
>
>>
>>
>> On Tue, 18 Feb 2020 at 18:20, Tom Augspurger <tom.augspurger88 at gmail.com>
>> wrote:
>>
>>>
>>> On Mon, Feb 17, 2020 at 7:17 PM Brock Mendel <jbrockmendel at gmail.com>
>>> wrote:
>>>
>>>> > You have no problem with changing the behavior of NaT, or changing
>>>> > to use pd.NA?
>>>>
>>>> If/when we get to a point where we propagate NAs in all other
>>>> comparisons, I would have no problem with editing `NaT.__richcmp__` to
>>>> match that convention.
>>>>
>>>
>>> What are the advantages of a NaT with NA-like comparison semantics over
>>> using NA (or NA[datetime])?
>>>
>>> 1. Retain dtype in array - scalar ops with a scalar NA
>>> 2. ...
>>> 3. Less disruptive than changing to NA
>>>
>>> My ... could include things like `isinstance(NaT, Timestamp)` being true
>>> and `NaT.<attr>` for Timestamp attributes. But those don't strike me as
>>> necessarily good things. They seem sometimes useful and sometimes harmful.
>>>
>>> The downsides of changing NaT in comparison operations are
>>>
>>> 1. We're diverging from `np.NaT`. I don't know how problematic this
>>>    actually is.
>>> 2. It's a special case. Should users need to know that datelikes use
>>>    their own NA value because the underlying storage is able to store
>>>    them "in-band" rather than as a mask? My gut reaction is "no, users
>>>    shouldn't be exposed to this."
>>> 3. Changing NaT would leave just NaN with the "always unequal in
>>>    comparisons" behavior.
>>>
>>
>> Personally, I think changing the behaviour of NaT in pandas, and thus
>> deviating from the behaviour of the same value in numpy, is not a good
>> idea. For me, that seems more confusing than having a clearly distinct
>> value (pd.NA) that has the different behaviour.
>>
>>
>>>
>>> Thus far, I see three options going forward
>>>
>>> 1. Use NaN for floats, NaT for datelikes, NA for other.
>>>   1-a: Leave NaT with always unequal
>>>   1-b: Change NaT to have NA-like comparison behavior
>>> 2. Use NA everywhere (no NaN for float, no NaT for datelike)
>>> 3. Implement a typed `NA<T>`, where we have an `NA` per dtype.
>>>
>>> Option 3 I think solves the array - scalar op issue. It's more complex
>>> for users, though hopefully not too complex? My biggest worry is that it
>>> makes the implementation much more complex, though perhaps I'm being
>>> pessimistic.
>>>
>>> On balance, I'm not sure where I come down yet. Good news: we can take
>>> time to figure this out :)
>>>
>>
>> Thanks for the summary!
>> Personally, I don't like the first option *long term*, as it keeps
>> different missing values (e.g. NaN) with different behaviours as the
>> default for some dtypes, while I would like to see us move to a
>> consistent missing value indicator.
>> And I think we can take an approach similar to the one we more or less
>> settled on in the original discussion on pd.NA: let's start with a single
>> pd.NA, and we can see later if there is a need to make it typed.
>>
>> Joris
>>
>>
>