[Numpy-discussion] DType Roadmap/NEP Discussion

Thu Sep 19 13:51:15 EDT 2019

On Wed, 2019-09-18 at 21:33 -0700, Ralf Gommers wrote:
> Hi Sebastian,
> 
> 
> On Wed, Sep 18, 2019 at 4:35 PM Sebastian Berg <
> sebastian at sipsolutions.net> wrote:
> > Hi all,
> > 
> > To make some progress towards a decision, now that the broad
> > design has pretty much settled from my side, I am thinking about
> > setting up a meeting, and suggest Monday at 11am Pacific Time (I am
> > open to other times though).
> > 
> > My hope is to get everyone interested on board, so that we can make
> > an
> > informed decision about the general direction very soon. So just
> > reach
> > out, or discuss on the mailing list as well.
> > 
> > The current draft for an NEP is here:
> > https://hackmd.io/kxuh15QGSjueEKft5SaMug?both
> > 
> > There are some design goals that I would like to clear up. 
> 
> The design itself seems very sensible to me insofar as I understand
> it. After having read your document again, I think you're still
> missing the actual goals though. "structure of class layout" and
> "type hierarchy" are important, but they're not the goals. You're
> touching on the real goals in places, but it may be valuable to be
> much more explicit there.
> 

Good points, I will try to incorporate some of them. I had answers to a
few, but I do not think they are too helpful here and now. This reply
got a bit longer than expected, but it is more general...

There is a bit of a clash between long-term and mid-term goals. My goal
is to enable pretty much any conceivable long-term goal, but in the
mid/short term, that means:

1. Convince you (and me) that the proposed API can handle everything we
can think of now and can grow easily (e.g. optimization, new features).

2. Convince everyone that the current state is unacceptable enough that
any added maintenance burden (during the transition phase) is
acceptable. I personally think the maintenance will definitely get
better quickly, even if we reuse a lot of old code. The main issue is
the initial massive set of changes.

3. Convince everyone that any necessary ABI/API breakage is acceptable.
The DType breakage itself is very limited. Specific UFuncs may break
more, but only in hidden features whose only users, as far as I know,
are astropy (and they are OK with us breaking them). numba might also
be affected, but I think less so.

The main point right now is reorganizing everything from monolithic ->
operator based, which improves long-term maintainability and
extensibility. Dogfooding the new API ourselves serves the same purpose.

E.g. the AbstractDType hierarchy... it is something we could discuss. I
think it is right, since it replaces `dtype.kind` and makes for
powerful organization of dispatching in ufuncs. But, we could limit it
initially!
To give one example: Say ora creates many DTypes with different
datetime representations. ora could create an AbstractOraDType, so that
you can do easy isinstance checks. In particular, during ufunc dispatch
ora can use it to write a single function for figuring out promotion:
`OraDType1 + OraDType2 -> OraDType1 + OraDType2.astype(OraDType1)`.
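
A minimal sketch of that idea in plain Python, not the proposed NumPy
API: `AbstractOraDType`, `OraDType1`, `OraDType2` and `ora_promote` are
all hypothetical names, and the rule "the left operand's DType wins" is
an assumption made only to mirror the example above.

```python
class AbstractOraDType:
    """Hypothetical abstract base shared by all ora DTypes (not a real API)."""


class OraDType1(AbstractOraDType):
    """Hypothetical DType for one datetime representation."""


class OraDType2(AbstractOraDType):
    """Hypothetical DType for another datetime representation."""


def ora_promote(left, right):
    """One promotion rule covering every pair of ora DType classes.

    Assumption for this sketch: the left operand's DType wins and the
    right operand is cast to it. Returns (loop DType, casts needed).
    """
    if not (issubclass(left, AbstractOraDType)
            and issubclass(right, AbstractOraDType)):
        return NotImplemented  # defer: this pair is not purely ora
    casts = [] if right is left else [(right, left)]
    return left, casts


# A single function handles every ora pairing via the abstract base.
loop_dtype, casts = ora_promote(OraDType1, OraDType2)
```

The point of the abstract class is that `ora_promote` never has to
enumerate the concrete DTypes; adding an `OraDType3` would need no new
promotion code.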

I agree that this is probably missing: UFunc dispatch is a major reason
for the split of "common DType" (class) and "common dtype instance"
(e.g. of strings with different lengths) functionality. I think it is a
reasonable split in any case, but for dispatching the first is
sufficient, while the second is more naturally found after dispatching
(only after you know you have Unit * Unit can you reasonably figure
out the actual output `Unit("m*m")`).
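
The two steps can be sketched in plain Python. This is only an
illustration under stated assumptions: `Unit`, `common_dtype_class`,
`resolve_multiply`, `RESOLVERS` and `result_dtype` are hypothetical
names, not part of NumPy or of the draft NEP.

```python
class Unit:
    """Hypothetical parametric DType: the class is Unit, the instance holds the unit."""

    def __init__(self, unit):
        self.unit = unit

    def __repr__(self):
        return f'Unit("{self.unit}")'


def common_dtype_class(*dtypes):
    """Step 1 (dispatch): look only at the classes, never at the parameters."""
    classes = {type(dt) for dt in dtypes}
    if len(classes) != 1:
        raise TypeError("no common DType class in this sketch")
    return classes.pop()


def resolve_multiply(left, right):
    """Step 2 (after dispatch): compute the concrete output instance."""
    return Unit(f"{left.unit}*{right.unit}")


# Hypothetical registry: (operation name, DType class) -> instance resolver.
RESOLVERS = {("multiply", Unit): resolve_multiply}


def result_dtype(op, left, right):
    cls = common_dtype_class(left, right)   # enough to pick the loop
    return RESOLVERS[(op, cls)](left, right)  # only now inspect the instances


out = result_dtype("multiply", Unit("m"), Unit("m"))
```

Here the registry lookup needs nothing but the class `Unit`; the unit
strings are only consulted once the matching resolver is known.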

Best,

Sebastian

PS: The only real limitation that I see right now is that promotion is
not allowed to inspect array values. For example (probably not a very
good one), `int_arr.astype(Categorical)` would want to find
`Categorical(np.unique(int_arr))`.
I think not providing that is acceptable, because categorical can
provide its own function to find the actual categorical instance, or
implement a Categorical and a FrozenCategorical, so that the dtype
instance is mutable in that it can add new categories.
(For array coercion from a list of items, the issue is different, and
allowing such things can be provided or added later.)
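
A rough sketch of the "categorical provides its own function" option,
assuming a hypothetical `Categorical` class with hypothetical
`from_values` and `encode` helpers (neither exists in NumPy); only
`np.unique`, `np.searchsorted`, `np.clip` and `np.array_equal` are real
NumPy calls here.

```python
import numpy as np


class Categorical:
    """Hypothetical categorical dtype instance with a fixed set of categories."""

    def __init__(self, categories):
        self.categories = tuple(categories)

    @classmethod
    def from_values(cls, values):
        # The library's own helper inspects the values, so promotion
        # itself never has to look at array contents.
        return cls(np.unique(np.asarray(values)).tolist())

    def encode(self, values):
        """Map values to integer codes; assumes the categories are sorted."""
        values = np.asarray(values)
        cats = np.asarray(self.categories)
        codes = np.searchsorted(cats, values)
        # Guard against values that are not among the categories.
        in_range = np.clip(codes, 0, len(cats) - 1)
        if not np.array_equal(cats[in_range], values):
            raise ValueError("value not among the categories")
        return codes


int_arr = np.array([3, 1, 3, 2])
cat = Categorical.from_values(int_arr)  # user calls this explicitly
codes = cat.encode(int_arr)
```

The explicit `from_values` call stands in for what
`int_arr.astype(Categorical)` cannot do if promotion may not look at the
values.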

> Here are some example goals:
> 
> 1. Make creating new dtypes via the NumPy C API take >4x fewer lines
> of code on average (in practice: for rational/quaternion, hard to
> measure otherwise).
> 
> 2. Make it possible to create new dtypes with full functionality via
> the NumPy Python API. Performance may be up to 1-2 orders of
> magnitude worse than when creating the same dtype via the C API; the
> main purpose is to allow easier prototyping of new dtypes.
> 
> 3. Make the NumPy codebase more maintainable by removing special-
> casing of datetime dtypes in many places.
> 
> 4. Enable creation of a units library whose arrays *are* numpy arrays
> rather than a subclass or duck array. This will make such a library
> work much better with SciPy and other existing libraries that use
> np.asarray extensively.
> 
> 5. Hide currently exposed implementation details in the C API so
> long-term .... (you have this one, but it would be nice to work it
> out a little more - after all we recently considered reverting the
> deprecation for direct field access, so how important is this?)
> 
> 6. Improve casting behavior for external dtypes
> 
> 7. Make np.char behavior better <in ... ways> (you mention fixed
> length strings work poorly now, but not what would change)
> 
> 
> Listing non-goals would also be useful:
> 
> 1. Performance: no significant performance improvements are expected.
> We aim for no performance regressions.
> 
> 2. Introducing new dtypes into NumPy itself
> 
> 3. Pandas ExtensionArrays? You mention them, but does this dtype
> redesign help Pandas in any way or not?
> 
> 4. Changes to NumPy's current casting rules
> 
> 5. Allow creation of dtypes that don't fit the current NumPy model of
> what a dtype is (e.g. ref [1]), such as a variable-length string
> dtype.
> 
> 
> Many of those (and there can be more, this is just what came to mind
> now) can/should be a paragraph or section. In my experience
> describing these goals and requirements well takes about 15-30% of
> the length of the design description. Think of for example a Pandas
> or units library maintainer reading this: they should be able to stop
> reading at where you now have "Overview Graphic" and have a pretty
> clear high-level understanding of what this whole redesign will mean
> for them. Same for a NumPy maintainer who wants to get a sense of
> what the benefits and impacts will be: reading only (the expanded
> version of) your Abstract, Motivation and Scope, and Backwards
> Compatibility sections should be enough.
> 
> Here's a concrete question, that's the type of thing I'd like to
> understand without having to understand the whole design in detail:
> ```
> >>> import datetime
> >>> import numpy as np
> >>> import pandas as pd
> >>> dti = pd.to_datetime(['1/1/2018', np.datetime64('2018-01-01'),
> ...                       datetime.datetime(2018, 1, 1)])
> >>> dti.values
> array(['2018-01-01T00:00:00.000000000', '2018-01-01T00:00:00.000000000',
>        '2018-01-01T00:00:00.000000000'], dtype='datetime64[ns]')
> >>> dti.values.dtype
> dtype('<M8[ns]')
> >>> isinstance(dti.values.dtype, np.dtype)
> True
> >>> dti.dtype == dti.values.dtype      # okay, that's nice
> True
> 
> >>> start = pd.to_datetime('2015-02-24')
> >>> rng = pd.date_range(start, periods=3)
> >>> t = pd.Series(rng)
> >>> t_withzone = t.dt.tz_localize('UTC').dt.tz_convert('Asia/Kolkata')
> >>> t_withzone
> 0   2015-02-24 05:30:00+05:30
> 1   2015-02-25 05:30:00+05:30
> 2   2015-02-26 05:30:00+05:30
> dtype: datetime64[ns, Asia/Kolkata]
> >>> t_withzone.dtype
> datetime64[ns, Asia/Kolkata]
> >>> t_withzone.values.dtype
> dtype('<M8[ns]')
> >>> t_withzone.dtype == t_withzone.values.dtype    # could this be True in the future?
> False
> ```
> So can Pandas create timezone-aware numpy dtypes in the future if
> they want to, or would they still be better off rolling their own?
> 
> 
> Also one question/comment about the design content. When looking at
> the current external dtypes (e.g. [2]), a large part of the work of
> implementing a new dtype now deals with ufunc behavior. It's not
> clear from your document how that changes with the new design, can
> you add something about that?
>  
> Cheers,
> Ralf
> 
> [1] 
> http://scipy-lectures.org/advanced/advanced_numpy/index.html#the-descriptor
> [2] 
> https://github.com/moble/quaternion/blob/master/numpy_quaternion.c
> 
> 