[Pandas-dev] What could a pandas 2.0 look like?

Wed Feb 19 17:46:37 EST 2020

Some answers to previous mails first:

On Mon, 17 Feb 2020 at 17:34, Tom Augspurger <tom.augspurger88 at gmail.com>
wrote:

> > 2) The "only one NA value is simpler" argument strikes me as a solution
> in search of a problem.
>
> I don't think that's correct. I think consistently propagating NA in
> comparison operations is a worthwhile goal.
>
Having a single, consistent missing value indicator across all dtypes is *for
me* one of the main drivers that led me to make the pd.NA proposal.
From my personal experience (eg when teaching pandas to beginners), this is
an existing problem that complicates things, not one that is being invented.
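
To illustrate the inconsistency, here is a minimal sketch (variable names are
only for illustration) comparing how the existing sentinels and pd.NA behave
under `==`. It assumes pandas >= 1.0, where pd.NA and the nullable "Int64"
dtype are available.

```python
import numpy as np
import pandas as pd

# The classic sentinels silently compare as False ...
nan_eq = np.nan == 1                          # False
nat_eq = pd.NaT == pd.Timestamp("2020-01-01")  # False

# ... while pd.NA propagates: the answer is "unknown".
na_eq = pd.NA == 1                            # <NA>

# Same difference at the Series level.
float_cmp = pd.Series([1.0, np.nan]) == 1               # bool: [True, False]
masked_cmp = pd.Series([1, None], dtype="Int64") == 1   # boolean: [True, <NA>]
```

So whether a missing value propagates through a comparison currently depends
on the dtype it happens to be stored in.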

> On Mon, Feb 17, 2020 at 10:25 AM Brock Mendel <jbrockmendel at gmail.com>
> wrote:
>
>>
>> > Assume we introduce a new "nullable datetime" dtype that uses a mask to
>> track NAs, and can still have NaT in the values. In practice, this still
>> means that we "replace NaT with NA"
>>
>> This strikes me as contradictory.
>>
>
I tried to explain this in the next sentence from the original text:
"(because even though NaT is still possible, I think you would mostly get
NAs as mentioned above; eg reading a file would now give NA instead of
NaT)."
So assuming you have a masked-array approach for datetimes, then you can
have NaT as a valid datetime value in the values part, or NA due to the mask
part of the array. In such a case (but this is only an assumption about how
the extension array *could* work!), it's the NA that is the main missing value
indicator. So if you are creating such a masked datetime-like array with
missing values (eg from reading a file), you will get NAs as missing values
in this case, in contrast to NaTs right now. Hence, in practice we would
"replace NaT with NA", although you can still have NaT in the values.

Note I only started to explain this in response to your initial "using a
mask on top of datetimelike arrays would not _replace_ NaT, but supplement
it with something semantically different", but maybe I misunderstood your
initial comment.
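
To make the masked-array idea above more concrete, here is a rough sketch
using only NumPy. This is not an existing pandas API: the helper
`from_optional_strings` and the (values, mask) layout are hypothetical, and
only show how NaT-as-a-value and NA-from-the-mask could coexist.

```python
import numpy as np

def from_optional_strings(items):
    """Build a hypothetical (values, mask) pair; None means missing (NA)."""
    mask = np.array([item is None for item in items], dtype=bool)
    # Masked slots still need a placeholder in the values buffer;
    # what is stored there is irrelevant because the mask hides it.
    filled = ["1970-01-01" if item is None else item for item in items]
    values = np.array(filled, dtype="datetime64[ns]")
    return values, mask

# Simulates "reading a file": the missing entry ends up in the mask (NA),
# while an explicit NaT remains an ordinary value.
values, mask = from_optional_strings(["2020-02-17", "NaT", None])

is_na = mask                        # missing according to the mask
is_nat = np.isnat(values) & ~mask   # NaT stored as a value, not masked
```

In this sketch a user would mostly encounter NA (from the mask), and NaT
would only appear if it was put into the values explicitly.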

>
>> > So do you mean: "in my opinion, we should not do this" (what I just
>> described above), because in practice that would mean breaking arithmetic
>> consistency? Or that if we want to start using NA for datetimelike dtypes,
>> you think "dtype-parametrized" NA values are necessary (so you can
>> distinguish NA[datetime] and NA[timedelta] ?)
>>
>> I think:
>>
>> 1) pd.NA solves an _actual_ problem which is that we used to use np.nan
>> in places (categorical, object) where np.nan was semantically misleading.
>>    a) What these have in common is that they are in general
>> non-arithmetic dtypes.
>>    b) This is an improvement, and I'm glad you put in the effort to make
>> it happen.
>>    c) Trying to shoe-horn pd.NA into cases where it is semantically
>> misleading based on the Highlander Principle is counter-productive.
>>
>
With "semantically misleading", I suppose you mean that "Series[Timedelta]
+ pd.NA" could result both in timedelta or datetime64?
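
A small sketch of where that ambiguity comes from (variable names are only
for illustration): the result dtype of adding to a timedelta Series depends
on the type of the other operand, and an untyped NA does not carry that
information.

```python
import pandas as pd

td = pd.Series(pd.to_timedelta(["1 day", "2 days"]))

# timedelta + timedelta gives a timedelta64 result
plus_timedelta = td + pd.Timedelta("1 hour")

# timedelta + timestamp gives a datetime64 result
plus_timestamp = td + pd.Timestamp("2020-02-19")

print(plus_timedelta.dtype.kind)  # "m" (timedelta)
print(plus_timestamp.dtype.kind)  # "M" (datetime)
```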

Personally, I don't think this is a big problem (or at least I think a single
pd.NA brings bigger benefits), but this has indeed already been discussed.

>
>> 2) The "only one NA value is simpler" argument strikes me as a solution
>> in search of a problem.
>>    a) All the more so if you want this to supplement np.nan/pd.NaT
>> instead of replace them.
>>    b) *the idea of replacing vs supplementing needs to be made much more
>> explicit/clear*
>>
>>
I thought you were actually advocating for supplementing instead of
replacing in your first email ;) (but maybe you were rather trying to
summarize the discussion, and not giving an opinion?)
Anyway, I will open a GitHub issue for float dtypes to further discuss
this (I think the float case is the easier one to discuss, while the issues
are similar to datetime with NaT).