[Numpy-discussion] NEP 41: Is there still need to discuss DTypes vs. Scalars (or DType classes)?

Sebastian Berg sebastian at sipsolutions.net
Mon Apr 20 16:37:00 EDT 2020


Hi all,

the week has passed, and the discussion has gone on for quite a bit
longer than that, so I assume that NEP 41 can effectively be accepted.

Even so, I will bring up one point again. If there is still a need
for discussion, I hope it happens in a timely manner, so that I can
go ahead with some of the changes proposed in NEP 41, and so that, in
the event of more concrete doubts/issues, only a few changes will
need to be undone. I would hate to revert a large amount of work
simply because an important point/issue is raised in two months
instead of two weeks.


This whole thing is fairly complex, so please do not hesitate to ask
for clarifications!
I am also very happy to do a video conference with anyone interested at
any time, or chat in private on Slack.
So just in case: I will be available around 11:00 PDT (18 UTC) this
Thursday on the NumPy Community Call zoom link [0].


As far as I am aware, there was only one discussion point (maybe
two; see point 2. below, which may be independent).
 
In my proposal the DType class (i.e. `type(np.dtype("float64"))`) is
the core concept and is different for every scalar type. It holds all
the information on how to deal with array elements.

This duplicates the scalar types to some degree: there would
(usually) be exactly one DType for each (NumPy) scalar type, possibly
exposed using:

    np.dtype[scalar_type]
    e.g. np.dtype[np.float64]

That does create a certain duality. For each scalar type/class, there
is a corresponding DType class. And in theory the scalar does not even
need to know that NumPy has a DType for it.

From a type-theoretical point of view this is also a bit strange: the
type of each array element is identical to the scalar type! But
although there is only one type, there are two distinct classes: one
for the scalar values, and one to explain them to NumPy and store
them in an array.
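
To make this concrete, here is a minimal sketch of the duality (the
names `Quantity` and `QuantityDType` are hypothetical, not part of
NumPy or of the proposal):

    class Quantity:
        # A plain Python scalar type; it knows nothing about NumPy.
        def __init__(self, value):
            self.value = float(value)

    class QuantityDType:
        # The corresponding DType class: it describes Quantity to
        # NumPy (storage size, item access, casting, ...).
        scalar_type = Quantity
        itemsize = 8

Here, the proposed `np.dtype[Quantity]` would resolve to
`QuantityDType`.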


I lean towards this DType-class design because:

1. I want to modify scalars as little as possible. I am not sure we
   will enable this initially, but the goal is that:

   * In principle you can create a DType for every Python type without
     touching the original Python scalar.
   * The scalar need not know about NumPy or DTypes, thus creating no
     new dependency (you can use the scalar without installing NumPy).

2. I somewhat like that DType classes have methods which receive a
   `self` instance argument and are handed the data by the array.

   * This means that a function such as
     `dtype.__get_array_item__(item_memory)` is implemented like a
     method:

     class DType:
         def __get_array_item__(self, item_memory):
             # Interpret the raw item memory and return a scalar
             # (assuming the DType knows its scalar type).
             return self.scalar_type(item_memory)

   * There is an alternative approach to this that I have not thought
     about much, though.
     `item_memory` really is much like a scalar instance (it holds the
     actual value), so you can argue that `item_memory` is `self` here,
     and the dtype instance is the type of `item_memory` (the self),
     i.e. making `__get_array_item__` live on the dtype instance (not
     on the class). The dtype thus is the type/class of the array
     element.
     This is beautiful, but in general you still need to pass the
     dtype instance itself: for example, strings cannot be interpreted
     without knowing their length (see the sketch after this list). In
     other words, the scalar `self` is actually the tuple
     `(item_memory, dtype)`, which I think is why at least I do not
     have a clear grasp here. [1]

3. There may be `dtypes` without a specific scalar type. An example
   is the current Pandas Categorical: the type of the scalars within
   a categorical array is arbitrary.
   I am not actually sure this is a theoretically tidy concept. E.g.
   Python uses `enum.Enum`, a class factory, for a similar purpose,
   and you have to use the `.value` attribute to get at the
   underlying value. But, desirable or not, supporting this would
   seem less straightforward if we design everything around the
   scalar type.
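
To make the string example in point 2. concrete, here is a minimal,
hypothetical sketch (not the proposed API) of why the raw item memory
alone is not enough to reconstruct the scalar:

    class StringDType:
        # Hypothetical DType whose instances carry state: the fixed
        # string length of the array elements.
        def __init__(self, length):
            self.length = length

        def __get_array_item__(self, item_memory):
            # Fixed-width, NUL-padded bytes cannot be turned back
            # into a Python string without knowing `self.length`; the
            # scalar is effectively the pair (item_memory, dtype).
            return item_memory[:self.length].decode("ascii").rstrip("\x00")

    dt = StringDType(8)
    dt.__get_array_item__(b"spam\x00\x00\x00\x00")  # -> "spam"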

The main downside to using DTypes as proposed in NEP 41 is, in my
opinion, what I mentioned first:
we must have a DType class for every scalar class, even though at
least most scalars (i.e. all NumPy scalars, except for the `object`
dtype) could easily be expanded to include all necessary information;
maybe they already include almost all of it.
In the NEP 41 framework the scalar could, in practice, be built from
the DType, which may seem a bit strange. In general, Scalar<->DType
will form a unit of a sort, and this means that somewhere we have to
map scalars to DTypes.
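
That mapping could, for example, take the form of a simple registry;
a hypothetical sketch (not the proposed implementation):

    # Maps scalar types to their DType classes; something like
    # `np.dtype[scalar_type]` could then be a lookup into it.
    _scalar_to_dtype = {}

    def register_dtype(dtype_class):
        _scalar_to_dtype[dtype_class.scalar_type] = dtype_class
        return dtype_class

With this, `np.dtype[np.float64]` would amount to
`_scalar_to_dtype[np.float64]`.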

So, in many ways, I actually do find the scalar version tidier
myself. But I also find "there is a DType class for every scalar
type/class" a straightforward user story, even if there will be
subtle differences between the DType and the scalar class/type.

Point 2. may be independent of the whole scalar story; I am
conflating them here because, to me, it applies more naturally in
that context.

Cheers,

Sebastian


[0] See the community meeting agenda document for the link: 
https://hackmd.io/76o-IxCjQX2mOXO_wwkcpg

[1] These are thoughts mainly from:
https://gist.github.com/eric-wieser/49c55bcab744b0e782f6c2740603180b#what-this-could-mean-for-dtypes
and a discussion on the pull request; I will not claim to represent
them entirely correctly, and especially not fully, here.