[Pandas-dev] What could a pandas 2.0 look like?

Brock Mendel jbrockmendel at gmail.com
Mon Feb 17 12:58:28 EST 2020


> or changing the behavior of NaT in comparisons to be like NA.

Pending the kinks being worked out of pd.NA, I have no problem with that.
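
For reference, a minimal sketch of the difference under discussion, assuming pandas >= 1.0 (where pd.NA and the nullable "Int64"/"boolean" dtypes exist). The variable names are only illustrative, and this shows current behavior, not a proposal:

```python
import numpy as np
import pandas as pd

# NaT behaves like NaN in comparisons: == is False and != is True,
# so the "missing" status does not propagate into the result.
nat_eq = pd.NaT == pd.NaT
nat_ne = pd.NaT != pd.NaT
nan_eq = np.nan == np.nan

# pd.NA propagates instead: the comparison result is pd.NA itself.
na_eq = pd.NA == pd.NA

# The same difference at the array level.
dt = pd.Series([pd.Timestamp("2020-02-17"), pd.NaT])
dt_result = dt == pd.Timestamp("2020-02-17")    # plain bool dtype, NaT compares False

ints = pd.array([1, None], dtype="Int64")
int_result = ints == 1                           # nullable boolean dtype, NA is kept

print(dt_result.tolist(), dt_result.dtype)
print(int_result, int_result.dtype)
```

Making NaT "like NA" in comparisons would mean the datetime comparison yields a missing entry in a three-valued boolean result rather than False.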

On Mon, Feb 17, 2020 at 9:55 AM Tom Augspurger <tom.augspurger88 at gmail.com>
wrote:

> Is NaT defined to be unequal in all comparisons, just like NaN? I think
> the goal of propagating NA
> requires either using NA or changing the behavior of NaT in comparisons to
> be like NA.
>
> On Mon, Feb 17, 2020 at 11:50 AM Brock Mendel <jbrockmendel at gmail.com>
> wrote:
>
>> > I think consistently propagating NA in comparison operations is a
>> worthwhile goal.
>>
>> That's an argument for having a three-valued bool-dtype, not for
>> replacing all other NA-like values.
>>
>> On Mon, Feb 17, 2020 at 8:34 AM Tom Augspurger <
>> tom.augspurger88 at gmail.com> wrote:
>>
>>> > 2) The "only one NA value is simpler" argument strikes me as a
>>> solution in search of a problem.
>>>
>>> I don't think that's correct. I think consistently propagating NA in
>>> comparison operations is a worthwhile goal.
>>>
>>> On Mon, Feb 17, 2020 at 10:25 AM Brock Mendel <jbrockmendel at gmail.com>
>>> wrote:
>>>
>>>> > It's not fully clear to me what you want to say with this, so a more
>>>> detailed clarification is welcome (I mean, I understand the sentence and
>>>> remember the discussion, but don't fully understand the point being made in
>>>> context, or in what direction you think more discussion is needed).
>>>>
>>>> I don't particularly think more discussion is needed, as this is a
>>>> rehash of #28095, where this horse has already been beaten to death.
>>>>
>>>> As Tom noted here
>>>> <https://github.com/pandas-dev/pandas/issues/28095#issuecomment-537501744>,
>>>> using pd.NA in places where we currently use NaT breaks the usual identity
>>>> (that we rely on A LOT)
>>>>
>>>> ```(array + array)[0].dtype <=> (array + array[0]).dtype```
>>>>
>>>> (Yes, this holds only imperfectly for NaT because NaT serves as both
>>>> NotADatetime and NotATimedelta, and I'll refer you to the decomposing horse
>>>> in #28095.)
>>>>
>>>> Also from #28095:
>>>>
>>>> ```Series[timedelta64] * pd.NaT``` unambiguously raises, but
>>>> ```Series[timedelta64] * pd.NA``` could be timedelta64
>>>>
>>>> > Assume we introduce a new "nullable datetime" dtype that uses a mask
>>>> to track NAs, and can still have NaT in the values. In practice, this still
>>>> means that we "replace NaT with NA"
>>>>
>>>> This strikes me as contradictory.
>>>>
>>>> > So do you mean: "in my opinion, we should not do this" (what I just
>>>> described above), because in practice that would mean breaking arithmetic
>>>> consistency? Or that if we want to start using NA for datetimelike dtypes,
>>>> you think "dtype-parametrized" NA values are necessary (so you can
>>>> distinguish NA[datetime] and NA[timedelta] ?)
>>>>
>>>> I think:
>>>>
>>>> 1) pd.NA solves an _actual_ problem which is that we used to use np.nan
>>>> in places (categorical, object) where np.nan was semantically misleading.
>>>>    a) What these have in common is that they are in general
>>>> non-arithmetic dtypes.
>>>>    b) This is an improvement, and I'm glad you put in the effort to
>>>> make it happen.
>>>>    c) Trying to shoe-horn pd.NA into cases where it is semantically
>>>> misleading based on the Highlander Principle is counter-productive.
>>>>
>>>> 2) The "only one NA value is simpler" argument strikes me as a solution
>>>> in search of a problem.
>>>>    a) All the more so if you want this to supplement np.nan/pd.NaT
>>>> instead of replacing them.
>>>>    b) *the idea of replacing vs supplementing needs to be made much
>>>> more explicit/clear*
>>>>
>>>> 3) The "dtype-parametrized" NA did come up in #28095, but I never
>>>> advocated it.
>>>>    a) I am open to separating out a NaTimedelta (xref #24983) from
>>>> pd.NaT, and don't particularly care what it is called.
>>>>
>>>>
>>>> On Mon, Feb 17, 2020 at 3:37 AM Joris Van den Bossche <
>>>> jorisvandenbossche at gmail.com> wrote:
>>>>
>>>>>> > This would also imply creating a nullable float dtype and making our
>>>>>> datelikes use NA rather than NaT too. That seemed to be generally OK, but
>>>>>> wasn't discussed too much.
>>>>>>
>>>>>> My understanding of the discussion is that using a mask on top of
>>>>>> datetimelike arrays would not _replace_ NaT, but supplement it with
>>>>>> something semantically different.
>>>>>>
>>>>>
>>>>> Yes, if we treat it like NaN for floats (where NaN is a specific
>>>>> float value in the data array, while NAs are tracked in the mask array),
>>>>> then for datetimelike arrays we can do something similar. The
>>>>> discussions we are going to have for float dtypes, about to what extent
>>>>> to distinguish NaN from NA and whether we need to provide options, will
>>>>> also be relevant for datetimelike dtypes (but then for NaT and NA).
>>>>>
>>>>> But note that in practice, I *think* that the large majority of use
>>>>> cases will have NA rather than NaT in the data (e.g. when reading from
>>>>> files that have missing data).
>>>>>
>>>>>> Replacing NaT with NA breaks arithmetic consistency, as has been
>>>>>> discussed ad nauseam.
>>>>>>
>>>>>
>>>>> It's not fully clear to me what you want to say with this, so a more
>>>>> detailed clarification is welcome (I mean, I understand the sentence and
>>>>> remember the discussion, but don't fully understand the point being made in
>>>>> context, or in what direction you think more discussion is needed).
>>>>>
>>>>> Assume we introduce a new "nullable datetime" dtype that uses a mask
>>>>> to track NAs, and can still have NaT in the values. In practice, this still
>>>>> means that we "replace NaT with NA" (because even though NaT is still
>>>>> possible, I think you would mostly get NAs as mentioned above; eg reading a
>>>>> file would now give NA instead of NaT).
>>>>> So do you mean: "in my opinion, we should not do this" (what I just
>>>>> described above), because in practice that would mean breaking arithmetic
>>>>> consistency? Or that if we want to start using NA for datetimelike dtypes,
>>>>> you think "dtype-parametrized" NA values are necessary (so you can
>>>>> distinguish NA[datetime] and NA[timedelta] ?)
>>>>>
>>>>> Joris
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Pandas-dev mailing list
>>>>> Pandas-dev at python.org
>>>>> https://mail.python.org/mailman/listinfo/pandas-dev
>>>>>