[Pandas-dev] What could a pandas 2.0 look like?

Tom Augspurger tom.augspurger88 at gmail.com
Mon Feb 17 12:55:23 EST 2020


Is NaT defined to be unequal in all comparisons, just like NaN? I think the
goal of propagating NA requires either using NA or changing the behavior of
NaT in comparisons to be like NA.
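
For reference, a minimal sketch of the scalar behavior in question, assuming
pandas >= 1.0 (where pd.NA is available). The variable names are illustrative
only:

```python
import pandas as pd

# NaT follows NaN-style semantics: == and ordering comparisons are False,
# and only != is True.
nat_eq = pd.NaT == pd.NaT
nat_ne = pd.NaT != pd.NaT
nat_lt = pd.NaT < pd.Timestamp("2020-02-17")

# NA propagates through comparisons instead of returning a bool.
na_eq = pd.NA == pd.NA
na_cmp = pd.NA == 1

# Elementwise on a datetime64 Series, the NaT slot compares as False,
# so the result is a plain two-valued bool Series.
s = pd.Series(pd.to_datetime(["2020-02-17", None]))
mask = s == pd.Timestamp("2020-02-17")

print(nat_eq, nat_ne, nat_lt)
print(na_eq, na_cmp)
print(mask.tolist())
```

With NaT the missing entry silently becomes False in the mask; propagating NA
there would need either NA itself or NaT comparisons that return NA.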

On Mon, Feb 17, 2020 at 11:50 AM Brock Mendel <jbrockmendel at gmail.com>
wrote:

> > I think consistently propagating NA in comparison operations is a
> worthwhile goal.
>
> That's an argument for having a three-valued bool-dtype, not for replacing
> all other NA-like values.
>
> On Mon, Feb 17, 2020 at 8:34 AM Tom Augspurger <tom.augspurger88 at gmail.com>
> wrote:
>
>> > 2) The "only one NA value is simpler" argument strikes me as a solution
>> in search of a problem.
>>
>> I don't think that's correct. I think consistently propagating NA in
>> comparison operations is a worthwhile goal.
>>
>> On Mon, Feb 17, 2020 at 10:25 AM Brock Mendel <jbrockmendel at gmail.com>
>> wrote:
>>
>>> > It's not fully clear to me what you want to say with this, so a more
>>> detailed clarification is welcome (I mean, I understand the sentence and
>>> remember the discussion, but don't fully understand the point being made in
>>> context, or in what direction you think more discussion is needed).
>>>
>>> I don't particularly think more discussion is needed, as this is a
>>> rehash of #28095, where this horse has already been beaten to death.
>>>
>>> As Tom noted here
>>> <https://github.com/pandas-dev/pandas/issues/28095#issuecomment-537501744>,
>>> using pd.NA in places where we currently use NaT breaks the usual identity
>>> (that we rely on A LOT)
>>>
>>> ```(array + array)[0].dtype <=> (array + array[0]).dtype```
>>>
>>> (Yes, this holds only imperfectly for NaT because NaT serves as both
>>> NotADatetime and NotATimedelta, and I'll refer you to the decomposing horse
>>> in #28095.)
>>>
>>> Also from #28095:
>>>
>>> ```Series[timedelta64] * pd.NaT``` unambiguously raises, but
>>> ```Series[timedelta64] * pd.NA``` could be timedelta64
>>>
>>> > Assume we introduce a new "nullable datetime" dtype that uses a mask
>>> to track NAs, and can still have NaT in the values. In practice, this still
>>> means that we "replace NaT with NA"
>>>
>>> This strikes me as contradictory.
>>>
>>> > So do you mean: "in my opinion, we should not do this" (what I just
>>> described above), because in practice that would mean breaking arithmetic
>>> consistency? Or that if we want to start using NA for datetimelike dtypes,
>>> you think "dtype-parametrized" NA values are necessary (so you can
>>> distinguish NA[datetime] and NA[timedelta] ?)
>>>
>>> I think:
>>>
>>> 1) pd.NA solves an _actual_ problem which is that we used to use np.nan
>>> in places (categorical, object) where np.nan was semantically misleading.
>>>    a) What these have in common is that they are in general
>>> non-arithmetic dtypes.
>>>    b) This is an improvement, and I'm glad you put in the effort to make
>>> it happen.
>>>    c) Trying to shoe-horn pd.NA into cases where it is semantically
>>> misleading based on the Highlander Principle is counter-productive.
>>>
>>> 2) The "only one NA value is simpler" argument strikes me as a solution
>>> in search of a problem.
>>>    a) All the more so if you want this to supplement np.nan/pd.NaT
>>> instead of replace them.
>>>    b) *the idea of replacing vs supplementing needs to be made much more
>>> explicit/clear*
>>>
>>> 3) The "dtype-parametrized" NA did come up in #28095, but I never
>>> advocated it.
>>>    a) I am open to separating out a NaTimedelta (xref #24983) from
>>> pd.NaT, and don't particularly care what it is called.
>>>
>>>
>>> On Mon, Feb 17, 2020 at 3:37 AM Joris Van den Bossche <
>>> jorisvandenbossche at gmail.com> wrote:
>>>
>>>> > This would also imply creating a nullable float dtype and making our
>>>>> datelikes use NA rather than NaT too. That seemed to be generally OK, but
>>>>> wasn't discussed too much.
>>>>>
>>>>> My understanding of the discussion is that using a mask on top of
>>>>> datetimelike arrays would not _replace_ NaT, but supplement it with
>>>>> something semantically different.
>>>>>
>>>>
>>>> Yes, if we treat it like NaN for floats (where NaN is a specific
>>>> float value in the data array, while NAs are tracked in the mask array),
>>>> then we can do something similar for datetimelike arrays. The same
>>>> discussions we are going to have for float dtypes, about how far to
>>>> distinguish NaN from NA and whether we need to provide options, will
>>>> also be relevant for datetimelike dtypes (but then for NaT and NA).
>>>>
>>>> But note that in practice, I *think* that the large majority of use
>>>> cases will mostly use NA and not NaT in the data (e.g. when reading from
>>>> files that have missing data).
>>>>
>>>>> Replacing NaT with NA breaks arithmetic consistency, as has been
>>>>> discussed ad nauseam.
>>>>>
>>>>
>>>> It's not fully clear to me what you want to say with this, so a more
>>>> detailed clarification is welcome (I mean, I understand the sentence and
>>>> remember the discussion, but don't fully understand the point being made in
>>>> context, or in what direction you think more discussion is needed).
>>>>
>>>> Assume we introduce a new "nullable datetime" dtype that uses a mask to
>>>> track NAs, and can still have NaT in the values. In practice, this still
>>>> means that we "replace NaT with NA" (because even though NaT is still
>>>> possible, I think you would mostly get NAs as mentioned above; eg reading a
>>>> file would now give NA instead of NaT).
>>>> So do you mean: "in my opinion, we should not do this" (what I just
>>>> described above), because in practice that would mean breaking arithmetic
>>>> consistency? Or that if we want to start using NA for datetimelike dtypes,
>>>> you think "dtype-parametrized" NA values are necessary (so you can
>>>> distinguish NA[datetime] and NA[timedelta] ?)
>>>>
>>>> Joris
>>>>
>>>>
>>>> _______________________________________________
>>>> Pandas-dev mailing list
>>>> Pandas-dev at python.org
>>>> https://mail.python.org/mailman/listinfo/pandas-dev
>>>>
>>>
>>