[Pandas-dev] What could a pandas 2.0 look like?

Tom Augspurger tom.augspurger88 at gmail.com
Mon Feb 17 13:06:27 EST 2020


On Mon, Feb 17, 2020 at 11:58 AM Brock Mendel <jbrockmendel at gmail.com>
wrote:

> > or changing the behavior of NaT in comparisons to be like NA.
>
> Pending the kinks being worked out of pd.NA, I have no problem with that.
>

You have no problem with changing the behavior of NaT, or changing to use
pd.NA?

Is changing the defined behavior of NaT even an option? Is it defined in a
spec like NaN, or did NumPy just choose that behavior?
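
For reference, a minimal sketch of the comparison behavior in question. It
assumes pandas >= 1.0 (for pd.NA) and a reasonably recent NumPy; the variable
names are only for illustration.

```python
import numpy as np
import pandas as pd

# NaN: IEEE 754 makes == always False and != always True.
nan_eq = np.nan == np.nan
nan_ne = np.nan != np.nan

# NaT currently follows the same convention as NaN.
nat_eq = pd.NaT == pd.NaT
nat_ne = pd.NaT != pd.NaT

# NA propagates: the comparison returns NA rather than a bool.
na_eq = pd.NA == pd.NA

# The same split shows up at the array level.
dates = pd.Series(pd.to_datetime(["2020-02-17", None]))
dt_result = dates == pd.Timestamp("2020-02-17")       # plain bool dtype
int_result = pd.array([1, None], dtype="Int64") == 1  # nullable boolean dtype

print(nan_eq, nat_eq, na_eq)
print(dt_result.tolist())
print(int_result)
```

The missing entry in the datetime comparison comes back as False, while the
missing entry in the Int64 comparison stays NA.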

Assuming NaT had NA-like behavior in comparisons, what are the remaining
arguments for keeping NaT?
Preserving dtypes in scalar - array ops? Anything else?
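
On the scalar - array point, a rough sketch of what I mean by preserving
dtypes, assuming pandas >= 1.0; `tds` is a made-up example Series and the
other names are illustrative as well.

```python
import pandas as pd

tds = pd.Series(pd.to_timedelta(["1 day", None]))

# array + array, then pick out the missing element
elementwise = (tds + tds)[1]

# array + that same missing scalar (tds[1] is NaT)
broadcast = tds + tds[1]

# NaT on its own is enough to infer a datetimelike dtype; NA is not.
nat_dtype = pd.Series([pd.NaT]).dtype
na_dtype = pd.Series([pd.NA]).dtype

print(repr(elementwise), broadcast.dtype)
print(nat_dtype, na_dtype)
```

With NaT the result stays timedelta64 either way, whereas a lone NA carries no
dtype information and falls back to object.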

> On Mon, Feb 17, 2020 at 9:55 AM Tom Augspurger <tom.augspurger88 at gmail.com>
> wrote:
>
>> Is NaT defined to be unequal in all comparisons, just like NaN? I think
>> the goal of propagating NA
>> requires either using NA or changing the behavior of NaT in comparisons
>> to be like NA.
>>
>> On Mon, Feb 17, 2020 at 11:50 AM Brock Mendel <jbrockmendel at gmail.com>
>> wrote:
>>
>>> > I think consistently propagating NA in comparison operations is a
>>> worthwhile goal.
>>>
>>> That's an argument for having a three-valued bool-dtype, not for
>>> replacing all other NA-like values.
>>>
>>> On Mon, Feb 17, 2020 at 8:34 AM Tom Augspurger <
>>> tom.augspurger88 at gmail.com> wrote:
>>>
>>>> > 2) The "only one NA value is simpler" argument strikes me as a
>>>> solution in search of a problem.
>>>>
>>>> I don't think that's correct. I think consistently propagating NA in
>>>> comparison operations is a worthwhile goal.
>>>>
>>>> On Mon, Feb 17, 2020 at 10:25 AM Brock Mendel <jbrockmendel at gmail.com>
>>>> wrote:
>>>>
>>>>> > It's not fully clear to me what you want to say with this, so a more
>>>>> detailed clarification is welcome (I mean, I understand the sentence and
>>>>> remember the discussion, but don't fully understand the point being made in
>>>>> context, or in what direction you think more discussion is needed).
>>>>>
>>>>> I don't particularly think more discussion is needed, as this is a
>>>>> rehash of #28095, where this horse has already been beaten to death.
>>>>>
>>>>> As Tom noted here
>>>>> <https://github.com/pandas-dev/pandas/issues/28095#issuecomment-537501744>,
>>>>> using pd.NA in places where we currently use NaT breaks the usual identity
>>>>> (that we rely on A LOT)
>>>>>
>>>>> ```(array + array)[0].dtype <=> (array + array[0]).dtype```
>>>>>
>>>>> (Yes, this holds only imperfectly for NaT because NaT serves as both
>>>>> NotADatetime and NotATimedelta, and I'll refer you to the decomposing horse
>>>>> in #28095.)
>>>>>
>>>>> Also from #28095:
>>>>>
>>>>> ```Series[timedelta64] * pd.NaT``` unambiguously raises, but
>>>>> ```Series[timedelta64] * pd.NA``` could be timedelta64
>>>>>
>>>>> > Assume we introduce a new "nullable datetime" dtype that uses a mask
>>>>> to track NAs, and can still have NaT in the values. In practice, this still
>>>>> means that we "replace NaT with NA"
>>>>>
>>>>> This strikes me as contradictory.
>>>>>
>>>>> > So do you mean: "in my opinion, we should not do this" (what I just
>>>>> described above), because in practice that would mean breaking arithmetic
>>>>> consistency? Or that if we want to start using NA for datetimelike dtypes,
>>>>> you think "dtype-parametrized" NA values are necessary (so you can
>>>>> distinguish NA[datetime] and NA[timedelta] ?)
>>>>>
>>>>> I think:
>>>>>
>>>>> 1) pd.NA solves an _actual_ problem which is that we used to use
>>>>> np.nan in places (categorical, object) where np.nan was semantically
>>>>> misleading.
>>>>>    a) What these have in common is that they are in general
>>>>> non-arithmetic dtypes.
>>>>>    b) This is an improvement, and I'm glad you put in the effort to
>>>>> make it happen.
>>>>>    c) Trying to shoe-horn pd.NA into cases where it is semantically
>>>>> misleading based on the Highlander Principle is counter-productive.
>>>>>
>>>>> 2) The "only one NA value is simpler" argument strikes me as a
>>>>> solution in search of a problem.
>>>>>    a) All the more so if you want this to supplement np.nan/pd.NaT
>>>>> instead of replace them.
>>>>>    b) *the idea of replacing vs supplementing needs to be made much
>>>>> more explicit/clear*
>>>>>
>>>>> 3) The "dtype-parametrized" NA did come up in #28095, but I never
>>>>> advocated it.
>>>>>    a) I am open to separating out a NaTimedelta (xref #24983) from
>>>>> pd.NaT, and don't particularly care what it is called.
>>>>>
>>>>>
>>>>> On Mon, Feb 17, 2020 at 3:37 AM Joris Van den Bossche <
>>>>> jorisvandenbossche at gmail.com> wrote:
>>>>>
>>>>>> > This would also imply creating a nullable float dtype and making
>>>>>>> our datelikes use NA rather than NaT too. That seemed to be generally OK,
>>>>>>> but wasn't discussed too much.
>>>>>>>
>>>>>>> My understanding of the discussion is that using a mask on top of
>>>>>>> datetimelike arrays would not _replace_ NaT, but supplement it with
>>>>>>> something semantically different.
>>>>>>>
>>>>>>
>>>>>> Yes, if we treat it like NaN for floats (where NaN is a specific
>>>>>> float value in the data array, while NAs are tracked in the mask array),
>>>>>> then we can do something similar for datetimelike arrays. The discussions
>>>>>> we are going to have for float dtypes (to what extent to distinguish NaN
>>>>>> from NA, and whether we need to provide options) will also be relevant
>>>>>> for datetimelike dtypes (but then for NaT and NA).
>>>>>>
>>>>>> But note that in practice, I *think* that the big majority of use
>>>>>> cases will mostly use NA and not NaT in the data (eg when reading from
>>>>>> files that have missing data).
>>>>>>
>>>>>>> Replacing NaT with NA breaks arithmetic consistency, as has been
>>>>>>> discussed ad nauseam.
>>>>>>>
>>>>>>
>>>>>> It's not fully clear to me what you want to say with this, so a more
>>>>>> detailed clarification is welcome (I mean, I understand the sentence and
>>>>>> remember the discussion, but don't fully understand the point being made in
>>>>>> context, or in what direction you think more discussion is needed).
>>>>>>
>>>>>> Assume we introduce a new "nullable datetime" dtype that uses a mask
>>>>>> to track NAs, and can still have NaT in the values. In practice, this still
>>>>>> means that we "replace NaT with NA" (because even though NaT is still
>>>>>> possible, I think you would mostly get NAs as mentioned above; eg reading a
>>>>>> file would now give NA instead of NaT).
>>>>>> So do you mean: "in my opinion, we should not do this" (what I just
>>>>>> described above), because in practice that would mean breaking arithmetic
>>>>>> consistency? Or that if we want to start using NA for datetimelike dtypes,
>>>>>> you think "dtype-parametrized" NA values are necessary (so you can
>>>>>> distinguish NA[datetime] and NA[timedelta] ?)
>>>>>>
>>>>>> Joris
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Pandas-dev mailing list
>>>>>> Pandas-dev at python.org
>>>>>> https://mail.python.org/mailman/listinfo/pandas-dev
>>>>>>
>>>>>
>>>>