[Pandas-dev] ROADMAP proposal: Consistent missing value handling with new NA scalar

Tue Nov 19 13:00:20 EST 2019

On Tue, 2019-11-19 at 18:56 +0100, Joris Van den Bossche wrote:
> In case people are interested in this: we are having a dev chat
> (hangout) about this topic tomorrow at 18:20 UTC. Everyone is welcome
> to join!
> 

Hi,

I think I will listen in. Can you send the meeting details around?

Best,

Sebastian

> 
> On Thu, 14 Nov 2019 at 21:44, Joris Van den Bossche <
> jorisvandenbossche at gmail.com> wrote:
> > Quick update on this: there has been discussion at 
> > https://github.com/pandas-dev/pandas/issues/28095 and 
> > https://github.com/pandas-dev/pandas/issues/28778/, and there is
> > now also a PR implementing such a pd.NA scalar missing value
> > indicator: https://github.com/pandas-dev/pandas/pull/29597 
> > Feedback is still very welcome!
> > 
> > On Thu, 3 Oct 2019 at 22:32, Joris Van den Bossche <
> > jorisvandenbossche at gmail.com> wrote:
> > > Hi all,
> > > 
> > > I would like to propose a revisit of missing value handling in
> > > > pandas. It's already being discussed on GitHub (
> > > > https://github.com/pandas-dev/pandas/issues/28095), but I want to
> > > > mention this on the mailing list as well for broader feedback.
> > > > A more detailed proposal can be found here: 
> > > > https://hackmd.io/@jorisvandenbossche/Sk0wMeAmB, and discussion
> > > > can be found at the above GitHub issue.
> > > 
> > > A summary of the proposal is to introduce a new NA value
> > > (singleton) for representing scalar missing values (instead of
> > > np.nan) that can be used consistently across all data types. This
> > > could be achieved under the hood by using a mask-based approach
> > > to store the missing values on the array/series-level, but the
> > > main discussion here is about the user-facing API: the scalar NA
> > > > value and the behaviour of NA in various operations.
> > > 
> > > > Motivation for this change:
> > > > 
> > > > * Consistent user interface.
> > > >   Currently, the value you get back for a missing scalar (e.g. from
> > > >   scalar access s[idx]) depends on the data type (np.nan for many,
> > > >   but pd.NaT for datetime-likes). Some types support missing
> > > >   values, others don't. This proposal would ensure you get back
> > > >   pd.NA regardless of the dtype.
> > > > * No "mis-use" of the np.nan floating point value.
> > > >   The NaN value is a specific floating point value, and not
> > > >   necessarily an indicator for missing values (although pandas has
> > > >   always used it that way). And because we also use it for other
> > > >   dtypes, you get back a float value for non-float dtypes, giving
> > > >   misleading dtype information.
> > > > * A missing value that behaves accordingly.
> > > >   Our current behaviour of missing values is inherited from the
> > > >   np.nan behaviour. Other languages that have a NA/NULL value that
> > > >   is distinguished from NaN (e.g. Julia, SQL, R) typically have
> > > >   different behaviour in comparison and logical operations. For
> > > >   example, comparison with NA could give NA instead of False, and
> > > >   consequently we need to have a boolean dtype with NA support. A
> > > >   new NA value opens up the possibility of having such behaviour.
> > > > * An "NA" scalar matches the terminology that is used throughout
> > > >   pandas in functions and argument names (isna, dropna, fillna,
> > > >   skipna, …).
> > > 
> > > See the proposal for more details. 
> > > 
> > > > This of course has many consequences for the user API of pandas.
> > > > Initially, it could therefore be introduced optionally (e.g. only
> > > > in the new data types, such as the nullable integer or string dtypes).
> > > And given those pervasive changes, many eyes on it are important.
> > > So feedback on this idea would be greatly appreciated!
> > > 
> > > Joris
> > > 
> 