[Pandas-dev] ROADMAP proposal: Consistent missing value handling with new NA scalar

Joris Van den Bossche jorisvandenbossche at gmail.com
Thu Nov 14 15:44:53 EST 2019


Quick update on this: there has been discussion at
https://github.com/pandas-dev/pandas/issues/28095 and
https://github.com/pandas-dev/pandas/issues/28778/, and there is now also a
PR implementing such a pd.NA scalar missing value indicator:
https://github.com/pandas-dev/pandas/pull/29597
Feedback is still very welcome!

On Thu, 3 Oct 2019 at 22:32, Joris Van den Bossche <
jorisvandenbossche at gmail.com> wrote:

> Hi all,
>
> I would like to propose revisiting missing value handling in pandas.
> It's already being discussed on GitHub (
> https://github.com/pandas-dev/pandas/issues/28095), but I want to mention
> it on the mailing list as well for broader feedback.
> A more detailed proposal can be found here:
> https://hackmd.io/@jorisvandenbossche/Sk0wMeAmB, and discussion can be
> found at the above GitHub issue.
>
> A summary of the proposal is to introduce *a new NA value (singleton) for
> representing scalar missing values* (instead of np.nan) that can be used
> consistently across all data types. This could be achieved under the hood
> by using a mask-based approach to store the missing values on the
> array/series-level, but the main discussion here is about the user-facing
> API: the scalar NA value and the behaviour of NA in various operations.
>
> Motivation for this change:
>
>    - *Consistent user interface.*
>    Currently, the value you get back for a missing scalar (eg from scalar
>    access s[idx]) depends on the data type (np.nan for many, but pd.NaT
>    for datetime-likes). Some types support missing values, others don't. This
>    proposal would ensure you get back pd.NA regardless of the dtype.
>    - *No "mis-use" of the np.nan floating point value.*
>    The NaN value is a specific floating point value, and not necessarily
>    an indicator for missing values (although pandas has always used it that
>    way). And because we also use it for other dtypes, you get back a float
>    value for non-float dtypes, giving misleading dtype information.
>    - *A missing value that behaves accordingly.*
>    Our current behaviour of missing values is inherited from the np.nan
>    behaviour. Other languages that have an NA/NULL value that is distinguished
>    from NaN (e.g. Julia, SQL, R) typically have different behaviour in
>    comparison and logical operations. For example, comparison with NA could
>    give NA instead of False, and consequently we need to have a boolean dtype
>    with NA support. A new NA value opens up the possibility of having such
>    behaviour.
>    - An "NA" scalar *matches the terminology* that is used throughout
>    pandas in functions and argument names (isna, dropna, fillna, skipna,
>    …).
>
>
> See the proposal <https://hackmd.io/@jorisvandenbossche/Sk0wMeAmB> for
> more details.
>
> This of course has many consequences for the user API of pandas. Initially,
> it could therefore be introduced optionally (e.g. only in the new data types
> such as the nullable integer or string dtypes).
> And given those pervasive changes, many eyes on it are important. *So
> feedback on this idea would be greatly appreciated!*
>
> Joris
>
>
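
To make the comparison and logical-operation point in the quoted proposal a bit
more concrete, here is a minimal pure-Python sketch of how an NA singleton
could differ from NaN. Everything in it is an assumption for illustration: the
NAType class, the NA name, the SQL/R-style three-valued logic, and the choice
to raise on bool(NA) are hypothetical, and this is not the implementation in
the PR linked above.

```python
class NAType:
    """Hypothetical singleton standing in for a proposed NA scalar."""

    _instance = None

    def __new__(cls):
        # Always hand back the same object so that `x is NA` works.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __repr__(self):
        return "NA"

    # Comparisons propagate NA instead of returning False (assumed behaviour).
    def __eq__(self, other):
        return self

    def __ne__(self, other):
        return self

    # Defining __eq__ disables hashing by default; restore identity hashing.
    __hash__ = object.__hash__

    # Three-valued logic (assumption): the result is only known when the
    # other operand decides it on its own.
    def __and__(self, other):
        if other is False:
            return False
        if other is True or other is self:
            return self
        return NotImplemented

    __rand__ = __and__

    def __or__(self, other):
        if other is True:
            return True
        if other is False or other is self:
            return self
        return NotImplemented

    __ror__ = __or__

    def __bool__(self):
        # Assumed design choice: refuse to guess a truth value.
        raise TypeError("boolean value of NA is ambiguous")


NA = NAType()

nan = float("nan")
print(nan == nan)   # NaN compares unequal to itself: False
print(NA == 1)      # comparison propagates: NA
print(NA | True)    # True, whatever the missing value was
print(False & NA)   # False, whatever the missing value was
print(True & NA)    # unknown: NA
```

Because comparisons return NA rather than a plain bool in this sketch, the
result of an elementwise comparison can itself contain missing values, which is
why the proposal pairs the new scalar with a boolean dtype that supports NA.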