[Numpy-discussion] Moving forward with value based casting

Tue Jun 18 16:47:15 EDT 2019

Hi Hameer,

On Tue, 2019-06-18 at 04:28 +0200, Hameer Abbasi wrote:
> On Wed, 2019-06-12 at 12:55 -0500, Sebastian Berg wrote:
> > On Wed, 2019-06-05 at 15:41 -0500, Sebastian Berg wrote:
> > > Hi all,
> > > 
<snip>
> A type is safely castable to another if all of these numbers are
> exceeded or met.
> 
> This would give us a clean way for registering new numeric types,
> while
> also cleanly hooking into the type system, and solving the casting
> scenario. Of course, I'm not proposing we generate the loops for or
> provide all these types ourselves, but simply that we allow people to
> define dtypes using such a schema. I do worry that we're special-
> casing 
> numbers here, but it is "Num"Py, so I'm also not too worried.
> 
> This flexibility would, for example, allow us to easily define a
> bfloat16/bcomplex32 type with all the "can_cast" logic in place, even
> if people have to register their own casts or loops (and just to be
> clear, we error if they are not). It also makes it easy to define
> loops
> for int128 and so on if they come along.
> 
> The only open question left here is: What to do with a case like
> int64
> + uint64. And what I propose is we abandon purity for pragmatism here
> and tell ourselves that losing one sign bit is tolerable 90% of the
> time, and going to floating-point is probably worse. It's more of a
> range-versus-accuracy question, and I would argue that people using
> integers expect exactness. Of course, I doubt anyone is actually
> relying on the fact that adding two integers produces floating-point
> results, and it has been the cause of at least one bug, which
> highlights that integers can be used in places where floats cannot.
> [0]

TL;DR: I started a prototype for a possible approach on
casting/promotion and ufunc dispatching, any comments appreciated,
especially if I took a wrong turn! I will look into writing a bit more
about it more in an NEP style and not in code.

(About uint64 + int64 see below [0])

Thanks for the input, Hameer! Sorry for not following up here with some
of the thoughts that I had after writing that. You are right that the
bit counting is a way to handle this (and maybe keep it limited enough
that caching is possible). I had started prototyping a bit with the
thought that maybe caching for scalars is just not so important (just
type if you need the speed). But using this method, it might be fine.
Although, e.g. an int8NA type using -128 to signal an NA value would be
yet another issue with "binning" the values.

I have followed a bit of an older thought of mine right now and started
to do a python mock-up prototyping for that. It is probably still a lot
in flux, but the basic idea is to start of with using slots on a dtype
objects (I realize that during the meeting we had a slight tendency
towards a registration approach for casting, this seemed a bit simpler
to me though).

For the sake of discussion, I posted the start at: 
https://github.com/seberg/numpy_dtype_prototype

On the ufunc side, it just mocks up a simple UfuncImpl model (for
add/multiply, but only add semi supports units right now).

There is a mock up array object using a numpy array + new dtype (it
does support `array.astype(new_dtype)`.

I have basically considered dtypes to be instances of "DType". The
classes of dtypes could be thought of dtype categories (if you so will,
right now these categories are not clean with respect to ufunc
dispatching).

There are basically two different kinds of dtypes in this mock up:

1. Descriptor instances which are "finalized", i.e. they are proper
dtypes with an itemsize and can be attached to arrays.
2. Descriptor instances which are not "finalized". These could be just
abstract dtype. The simplest notion here are flexible dtypes such as
"S" which cannot be attached to an array, but we still call it a dtype.

This way 1. are basically more specific versions of 2. So both groups
have all the machinery for type casting/promotion/discovery, but only
the first group can actually describe real data.

So how to handle casting? In this mock-up, I opted for slots on the
DType object (the C-side would look similar, not sure how exactly), one
reason is that casting in general needs function calls in any case,
because we cannot decide if one string can be cast to another without
looking at its length, etc (or meters cast to seconds). For casting
purposes these are:

methods:

* get_cast_func_to(...), get_cast_func_from(...)
* (optionally: can_cast_to, can_cast_from)
* common_type(self, other)  # allow to override common type operation.
* default_type()  # for non-finalized types to ask them for a real type

classmethods:

* discover_type(python_scalar)  # falls back to discovering from class:
* discover_type_from_class(python_type)

Of course some other information needs to be attached/registered, e.g.
which python classes to associate with a certain dtype category.

To handle python integers, what I did now is that I discover them as
their own dtype (a class 2. one, not a real one). Which remembers the
value and knows how to cast to normal dtypes. It will do so very slowly
by trying all possibilities, but user provided dtypes can handle it for
better speed.

One small nice thing about this is that it should support dtypes such
as `np.blasable` to mean "float32" or "float64", they may not be super
fast, but it should work.

One side effect of this approach for the common type operation is that,
there is no way for a user int24 to say that uint16 + int16 -> int24
(or similar).

On the side of ufunc dispatching it is probably a bit hacky, but
basically the idea is that it is based on the "dtypes category"
(usually class). It could be extended later, but for starters makes the
dispatching very simple, since we ignore type hierarchy.

Best,

Sebastian

[0] One of the additional things with uint64+int64 is that it jumps
kinds/category (whatever you want to call it) between ints and floats
right now.
There is a more general problem with that. Our casting logic is not
strictly ordered according to these categories (int < float), i.e.  a
large int will cause a float16 to upcast to float64.
This is problematic because it is the reason why our common
type/promotion is not associative, while C and Julia's is associative.

The only thing I can think of would be to find common types within the
same categories kinds first, but I bet that just opens a huge can of
worms (and makes dtype discovery much more complex).

> 
> Hameer Abbasi
> 
> [0] https://github.com/numpy/numpy/issues/9982
> [1] https://en.wikipedia.org/wiki/Cayley%E2%80%93Dickson_construction
> 
> > Best,
> > 
> > Sebastian
> > 
> > 
> > > -----------
> > > 
> > > Currently when you write code such as:
> > > 
> > > arr = np.array([1, 43, 23], dtype=np.uint16)
> > > res = arr + 1
> > > 
> > > Numpy uses fairly sophisticated logic to decide that `1` can be
> > > represented as a uint16, and thus for all unary functions (and
> > > most
> > > others as well), the output will have a `res.dtype` of uint16.
> > > 
> > > Similar logic also exists for floating point types, where a lower
> > > precision floating point can be used:
> > > 
> > > arr = np.array([1, 43, 23], dtype=np.float32)
> > > (arr + np.float64(2.)).dtype  # will be float32
> > > 
> > > Currently, this value based logic is enforced by checking whether
> > > the
> > > cast is possible: "4" can be cast to int8, uint8. So the first
> > > call
> > > above will at some point check if "uint16 + uint16 -> uint16" is
> > > a
> > > valid operation, find that it is, and thus stop searching. (There
> > > is
> > > the additional logic, that when both/all operands are scalars, it
> > > is
> > > not applied).
> > > 
> > > Note that while it is defined in terms of casting "1" to uint8
> > > safely
> > > being possible even though 1 may be typed as int64. This logic
> > > thus
> > > affects all promotion rules as well (i.e. what should the output
> > > dtype
> > > be).
> > > 
> > > 
> > > There 2 main discussion points/issues about it:
> > > 
> > > 1. Should value based casting/promotion logic exist at all?
> > > 
> > > Arguably an `np.int32(3)` has type information attached to it, so
> > > why
> > > should we ignore it. It can also be tricky for users, because a
> > > small
> > > change in values can change the result data type.
> > > Because 0-D arrays and scalars are too close inside numpy (you
> > > will
> > > often not know which one you get). There is not much option but
> > > to
> > > handle them identically. However, it seems pretty odd that:
> > >  * `np.array(3, dtype=np.int32)` + np.arange(10, dtype=int8)
> > >  * `np.array([3], dtype=np.int32)` + np.arange(10, dtype=int8)
> > > 
> > > give a different result.
> > > 
> > > This is a bit different for python scalars, which do not have a
> > > type
> > > attached already.
> > > 
> > > 
> > > 2. Promotion and type resolution in Ufuncs:
> > > 
> > > What is currently bothering me is that the decision what the
> > > output
> > > dtypes should be currently depends on the values in complicated
> > > ways.
> > > It would be nice if we can decide which type signature to use
> > > without
> > > actually looking at values (or at least only very early on).
> > > 
> > > One reason here is caching and simplicity. I would like to be
> > > able
> > > to
> > > cache which loop should be used for what input. Having value
> > > based
> > > casting in there bloats up the problem.
> > > Of course it currently works OK, but especially when user dtypes
> > > come
> > > into play, caching would seem like a nice optimization option.
> > > 
> > > Because `uint8(127)` can also be a `int8`, but uint8(128) it is
> > > not
> > > as
> > > simple as finding the "minimal" dtype once and working with
> > > that." 
> > > Of course Eric and I discussed this a bit before, and you could
> > > create
> > > an internal "uint7" dtype which has the only purpose of flagging
> > > that
> > > a
> > > cast to int8 is safe.
> > > 
> > > I suppose it is possible I am barking up the wrong tree here, and
> > > this
> > > caching/predictability is not vital (or can be solved with such
> > > an
> > > internal dtype easily, although I am not sure it seems elegant).
> > > 
> > > 
> > > Possible options to move forward
> > > --------------------------------
> > > 
> > > I have to still see a bit how trick things are. But there are a
> > > few
> > > possible options. I would like to move the scalar logic to the
> > > beginning of ufunc calls:
> > >   * The uint7 idea would be one solution
> > >   * Simply implement something that works for numpy and all
> > > except
> > >     strange external ufuncs (I can only think of numba as a
> > > plausible
> > >     candidate for creating such).
> > > 
> > > My current plan is to see where the second thing leaves me.
> > > 
> > > We also should see if we cannot move the whole thing forward, in
> > > which
> > > case the main decision would have to be forward to where. My
> > > opinion
> > > is
> > > currently that when a type has a dtype associated with it
> > > clearly,
> > > we
> > > should always use that dtype in the future. This mostly means
> > > that
> > > numpy dtypes such as `np.int64` will always be treated like an
> > > int64,
> > > and never like a `uint8` because they happen to be castable to
> > > that.
> > > 
> > > For values without a dtype attached (read python integers,
> > > floats),
> > > I
> > > see three options, from more complex to simpler:
> > > 
> > > 1. Keep the current logic in place as much as possible
> > > 2. Only support value based promotion for operators, e.g.:
> > >    `arr + scalar` may do it, but `np.add(arr, scalar)` will not.
> > >    The upside is that it limits the complexity to a much simpler
> > >    problem, the downside is that the ufunc call and operator
> > > match
> > >    less clearly.
> > > 3. Just associate python float with float64 and python integers
> > > with
> > >    long/int64 and force users to always type them explicitly if
> > > they
> > >    need to.
> > > 
> > > The downside of 1. is that it doesn't help with simplifying the
> > > current
> > > situation all that much, because we still have the special
> > > casting
> > > around...
> > > 
> > > 
> > > I have realized that this got much too long, so I hope it makes
> > > sense.
> > > I will continue to dabble along on these things a bit, so if
> > > nothing
> > > else maybe writing it helps me to get a bit clearer on things...
> > > 
> > > Best,
> > > 
> > > Sebastian
> > > 
> > > 
> > > _______________________________________________
> > > NumPy-Discussion mailing list
> > > NumPy-Discussion at python.org
> > > https://mail.python.org/mailman/listinfo/numpy-discussion
> > 
> > _______________________________________________
> > NumPy-Discussion mailing list
> > NumPy-Discussion at python.org
> > https://mail.python.org/mailman/listinfo/numpy-discussion
> 
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
> 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: This is a digitally signed message part
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20190618/39ff1bb8/attachment.sig>