[Numpy-discussion] Moving forward with value based casting

Hameer Abbasi einstein.edison at gmail.com
Mon Jun 17 22:28:54 EDT 2019


On Wed, 2019-06-12 at 12:55 -0500, Sebastian Berg wrote:
> On Wed, 2019-06-05 at 15:41 -0500, Sebastian Berg wrote:
> > Hi all,
> > 
> > TL;DR:
> > 
> > Value-based promotion seems complex both for users and for ufunc
> > dispatching/promotion logic. Is there any way we can move forward
> > here, and if we do, could we accept the risk that some possible
> > (maybe non-existent) corner cases break early on, to get on the way?
> > 
> 
> Hi all,
> 
> Just to note: I think I will go forward trying to fill the hole in
> the hierarchy with a non-existent uint7 dtype. That seemed like it
> may be ugly, but if it does not escalate too much, it is probably
> fairly straightforward. And it would allow us to simplify dispatching
> without any logic change at all. After that we could still decide to
> change the logic.

Hi Sebastian!

This seems like the right approach to me as well; I would just add one
additional comment. Earlier on, you mentioned that a lot of "strange"
dtypes will pop up when dealing with floats/ints, e.g. int15, int31,
int63, int52 (for checking double compatibility), int23 (single
compatibility), int10 (half compatibility), and so on and so forth. The
lookup table would get tricky to populate by hand, so it might be worth
using the logic I suggested to autogenerate it in some way, or to
"determine" the temporary underspecified type, as Nathaniel proposed in
his email to the list. That is, for each type we store:

* a flag (0 for numeric, 1 for non-numeric)
* the number of sign bits (0 for unsigned ints, 1 otherwise)
* the number of integer/fraction bits (self-explanatory)
* the number of exponent bits (self-explanatory)
* the log-number of items (0 for real, 1 for complex, 2 for quaternion,
etc.; I propose the log because the Cayley-Dickson algebras [1] require
a power of two)

A type is safely castable to another if all of these numbers are met or
exceeded.
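
To make the idea concrete, here is a minimal sketch of such a schema in
Python (the names TypeDescriptor and can_cast_safely are made up for
illustration, and treating the numeric/non-numeric flag as
must-match-exactly is one plausible reading of the rule):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class TypeDescriptor:
        flag: int            # 0 for numeric, 1 for non-numeric
        sign_bits: int       # 0 for unsigned ints, 1 otherwise
        fraction_bits: int   # integer/fraction (mantissa) bits
        exponent_bits: int   # 0 for integer types
        log_num_items: int   # log2(items): 0 real, 1 complex, ...

    def can_cast_safely(src, dst):
        # Safe iff every number is met or exceeded by the destination.
        return (src.flag == dst.flag
                and src.sign_bits <= dst.sign_bits
                and src.fraction_bits <= dst.fraction_bits
                and src.exponent_bits <= dst.exponent_bits
                and src.log_num_items <= dst.log_num_items)

    int8    = TypeDescriptor(0, 1, 7, 0, 0)   # 1 sign bit, 7 value bits
    uint8   = TypeDescriptor(0, 0, 8, 0, 0)
    uint7   = TypeDescriptor(0, 0, 7, 0, 0)   # the "hole filler"
    float16 = TypeDescriptor(0, 1, 10, 5, 0)
    float32 = TypeDescriptor(0, 1, 23, 8, 0)

    assert can_cast_safely(uint7, int8)       # 7 value bits fit
    assert not can_cast_safely(uint8, int8)   # 8 value bits do not
    assert can_cast_safely(float16, float32)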

This would give us a clean way to register new numeric types, while
also hooking cleanly into the type system and solving the casting
problem. Of course, I'm not proposing that we generate the loops for,
or provide, all of these types ourselves, but simply that we allow
people to define dtypes using such a schema. I do worry that we're
special-casing numbers here, but it is "Num"Py, so I'm also not too
worried.

This flexibility would, for example, allow us to easily define a
bfloat16/bcomplex32 type with all the "can_cast" logic in place, even
if people have to register their own casts or loops (and just to be
clear, we error out if they are missing). It also makes it easy to
define loops for int128 and so on, if those come along.
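
For instance, under the hypothetical schema sketched above, bfloat16
(1 sign bit, 8 exponent bits, 7 fraction bits) would get the right
casting behavior essentially for free:

    bfloat16 = TypeDescriptor(flag=0, sign_bits=1, fraction_bits=7,
                              exponent_bits=8, log_num_items=0)

    assert can_cast_safely(bfloat16, float32)      # 7 <= 23, 8 <= 8
    assert not can_cast_safely(bfloat16, float16)  # 8 exponent bits > 5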

The only open question left here is what to do with a case like int64 +
uint64. What I propose is that we abandon purity for pragmatism here
and tell ourselves that losing one sign bit is tolerable 90% of the
time, and that going to floating-point is probably worse. It's more of
a range-versus-accuracy question, and I would argue that people using
integers expect exactness. I doubt anyone is actually relying on the
fact that adding these two integer types produces a floating-point
result, and it has been the cause of at least one bug, which highlights
that integers can be used in places where floats cannot. [0]
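
For reference, a short demonstration of the current behavior under
discussion (mixing int64 and uint64 promotes to float64, which cannot
represent every 64-bit integer exactly):

    import numpy as np

    x = np.uint64(2**63)          # too large for int64
    y = np.int64(1)
    res = x + y                   # promoted to float64 by current rules
    print(res.dtype)              # float64
    print(int(res) == 2**63 + 1)  # False: the exact sum is lost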

Hameer Abbasi

[0] https://github.com/numpy/numpy/issues/9982
[1] https://en.wikipedia.org/wiki/Cayley%E2%80%93Dickson_construction

> 
> Best,
> 
> Sebastian
> 
> 
> > -----------
> > 
> > Currently when you write code such as:
> > 
> > arr = np.array([1, 43, 23], dtype=np.uint16)
> > res = arr + 1
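> > res.dtype  # uint16, since the scalar 1 fits into a uint16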
> > 
> > Numpy uses fairly sophisticated logic to decide that `1` can be
> > represented as a uint16, and thus for this operation (and most
> > others as well), the output will have a `res.dtype` of uint16.
> > 
> > Similar logic also exists for floating point types, where a
> > lower-precision floating point type can be used:
> > 
> > arr = np.array([1, 43, 23], dtype=np.float32)
> > (arr + np.float64(2.)).dtype  # will be float32
> > 
> > Currently, this value based logic is enforced by checking whether
> > the cast is possible: "4" can be cast to int8 or uint8. So the
> > first call above will at some point check if "uint16 + uint16 ->
> > uint16" is a valid operation, find that it is, and thus stop
> > searching. (There is additional logic so that when both/all
> > operands are scalars, it is not applied.)
> > 
> > Note that this is defined in terms of casting: "1" can safely be
> > cast to uint8, even though 1 may be typed as int64. This logic thus
> > affects all promotion rules as well (i.e. what the output dtype
> > should be).
> > 
> > 
> > There are 2 main discussion points/issues about it:
> > 
> > 1. Should value based casting/promotion logic exist at all?
> > 
> > Arguably an `np.int32(3)` has type information attached to it, so
> > why should we ignore it? It can also be tricky for users, because a
> > small change in values can change the result data type.
> > Because 0-D arrays and scalars are too close inside numpy (you will
> > often not know which one you get), there is not much option but to
> > handle them identically. However, it seems pretty odd that:
> >  * np.array(3, dtype=np.int32) + np.arange(10, dtype=np.int8)
> >  * np.array([3], dtype=np.int32) + np.arange(10, dtype=np.int8)
> > 
> > give a different result (int8 for the first, int32 for the second,
> > under the current rules).
> > 
> > This is a bit different for python scalars, which do not have a
> > type attached already.
> > 
> > 
> > 2. Promotion and type resolution in Ufuncs:
> > 
> > What is currently bothering me is that the decision about what the
> > output dtypes should be depends on the values in complicated ways.
> > It would be nice if we could decide which type signature to use
> > without actually looking at values (or at least only very early
> > on).
> > 
> > One reason here is caching and simplicity. I would like to be able
> > to cache which loop should be used for what input. Having value
> > based casting in there bloats up the problem.
> > Of course it currently works OK, but especially when user dtypes
> > come into play, caching would seem like a nice optimization option.
> > 
> > Because `uint8(127)` can also be an `int8`, but `uint8(128)`
> > cannot, it is not as simple as finding the "minimal" dtype once and
> > working with that. Of course, Eric and I discussed this a bit
> > before, and you could create an internal "uint7" dtype whose only
> > purpose is flagging that a cast to int8 is safe.
> > 
> > I suppose it is possible I am barking up the wrong tree here, and
> > this caching/predictability is not vital (or can be solved with
> > such an internal dtype easily, although I am not sure that seems
> > elegant).
> > 
> > 
> > Possible options to move forward
> > --------------------------------
> > 
> > I still have to see a bit how tricky things are. But there are a
> > few possible options. I would like to move the scalar logic to the
> > beginning of ufunc calls:
> >   * The uint7 idea would be one solution.
> >   * Simply implement something that works for numpy and everything
> >     except strange external ufuncs (I can only think of numba as a
> >     plausible candidate for creating such ufuncs).
> > 
> > My current plan is to see where the second option leaves me.
> > 
> > We should also see if we cannot move the whole thing forward, in
> > which case the main decision would have to be: forward to where? My
> > current opinion is that when a type clearly has a dtype associated
> > with it, we should always use that dtype in the future. This mostly
> > means that numpy dtypes such as `np.int64` will always be treated
> > like an int64, and never like a `uint8` just because they happen to
> > be castable to it.
> > 
> > For values without a dtype attached (read: python integers and
> > floats), I see three options, from more complex to simpler:
> > 
> > 1. Keep the current logic in place as much as possible.
> > 2. Only support value based promotion for operators, e.g.:
> >    `arr + scalar` may do it, but `np.add(arr, scalar)` will not.
> >    The upside is that it limits the complexity to a much simpler
> >    problem; the downside is that the ufunc call and the operator
> >    then match less clearly.
> > 3. Just associate python float with float64 and python integers
> >    with long/int64, and force users to always type them explicitly
> >    if they need something else.
> > 
> > The downside of 1. is that it doesn't help with simplifying the
> > current situation all that much, because we still have the special
> > casting around...
> > 
> > 
> > I have realized that this got much too long, so I hope it makes
> > sense. I will continue to dabble along on these things a bit, so if
> > nothing else, maybe writing it down helps me get a bit clearer on
> > things...
> > 
> > Best,
> > 
> > Sebastian