[Numpy-discussion] Moving forward with value based casting

Nathaniel Smith njs at pobox.com
Thu Jun 6 19:36:37 EDT 2019


I haven't read all the thread super carefully, so I might have missed
something, but I think we might want to look at this together with the
special rule for scalar casting.

IIUC, the basic end-user problem that motivates all this is: when you
have a simple Python constant whose exact dtype is unspecified, people
don't want numpy to first automatically pick a dtype for it, and then
use that automatically chosen dtype to override the explicit dtypes
that the user specified. That's the "x + 1" problem. (This also comes
up a ton for languages trying to figure out how to type manifest
constants.)

Numpy's original solution for this was the special casting rule for
scalars. I don't understand the exact semantics, but it's something
like: in any operation involving a mix of non-zero-dim arrays and
zero-dim arrays, we throw out the exact dtype information for the
scalar ("float64", "int32") and replace it with just the "kind"
("float", "int").

This has several surprising consequences:

- The output dtype depends on not just the input dtypes, but also the
input shapes:

In [19]: (np.array([1, 2], dtype=np.int8) + 1).dtype
Out[19]: dtype('int8')

In [20]: (np.array([1, 2], dtype=np.int8) + [1]).dtype
Out[20]: dtype('int64')

- It doesn't just affect Python scalars with vague dtypes, but also
scalars where the user has specifically set the dtype:

In [21]: (np.array([1, 2], dtype=np.int8) + np.int64(1)).dtype
Out[21]: dtype('int8')

- I'm not sure the "kind" rule even does the right thing, especially
for mixed-kind operations. float16-array + int8-scalar has to do the
same thing as float16-array + int64-scalar, but that feels weird? I
think this is why value-based casting got added (at around the same
time as float16, in fact).

(Kinds are kinda problematic in general... the SAME_KIND casting rule
is very weird – casting int32->int64 is radically different from
casting float64->float32, which is radically different from casting
int64->int32, but SAME_KIND treats them all the same. And it's really
unclear how to generalize the 'kind' concept to new dtypes.)
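
To make that concrete, here's how those three casts compare under the
"safe" vs "same_kind" rules (just describing current behavior, not a
proposal):

  np.can_cast(np.int32, np.int64, casting="safe")          # True
  np.can_cast(np.float64, np.float32, casting="safe")      # False
  np.can_cast(np.int64, np.int32, casting="safe")          # False
  np.can_cast(np.int32, np.int64, casting="same_kind")     # True
  np.can_cast(np.float64, np.float32, casting="same_kind") # True
  np.can_cast(np.int64, np.int32, casting="same_kind")     # True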

My intuition is that what users actually want is for *native Python
types* to be treated as having 'underspecified' dtypes, e.g. int is
happy to coerce to int8/int32/int64/whatever, float is happy to coerce
to float32/float64/whatever, but once you have a fully-specified numpy
dtype, it should stay.

Some cases to think about:

np.array([1, 2], dtype=int8) + [1, 1]
 -> maybe this should have dtype int8, because there's no type info on
the right side to contradict that?

np.array([1, 2], dtype=int8) + 2**40
 -> maybe this should be an error, because you can't cast 2**40 to
int8 (under default casting safety rules)? That would introduce some
value-dependence, but it would only affect whether you get an error or
not, and there's precedent for that (e.g. division by zero).
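
As a rough illustration of that last case, the current value-inspecting
np.can_cast already makes exactly this distinction:

  np.can_cast(1, np.int8)      # True  -- 1 fits, so int8 could be kept
  np.can_cast(2**40, np.int8)  # False -- would be an error under the rule above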

In any case, it would probably be helpful to start by just writing
down the whole set of rules we have now, because I'm not sure anyone
understands all the details...

-n

On Wed, Jun 5, 2019 at 1:42 PM Sebastian Berg
<sebastian at sipsolutions.net> wrote:
>
> Hi all,
>
> TL;DR:
>
> Value based promotion seems complex both for users and ufunc-
> dispatching/promotion logic. Is there any way we can move forward here,
> and if we do, could we just risk some possible (maybe not-existing)
> corner cases to break early to get on the way?
>
> -----------
>
> Currently when you write code such as:
>
> arr = np.array([1, 43, 23], dtype=np.uint16)
> res = arr + 1
>
> Numpy uses fairly sophisticated logic to decide that `1` can be
> represented as a uint16, and thus for binary operations like this one
> (and most others as well), the output will have a `res.dtype` of uint16.
>
> Similar logic also exists for floating point types, where a lower
> precision floating point can be used:
>
> arr = np.array([1, 43, 23], dtype=np.float32)
> (arr + np.float64(2.)).dtype  # will be float32
>
> Currently, this value based logic is enforced by checking whether the
> cast is possible: "4" can be cast to int8, uint8. So the first call
> above will at some point check if "uint16 + uint16 -> uint16" is a
> valid operation, find that it is, and thus stop searching. (There is
> additional logic so that when both/all operands are scalars, this value
> based logic is not applied.)
>
> Note that while this is defined in terms of casting ("1" can safely be
> cast to uint8 even though it may be typed as int64), the logic affects
> all promotion rules as well (i.e. what the output dtype should be).
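>
> To make that concrete, np.result_type exposes the same value-based
> behaviour (the second line is plain dtype promotion, for comparison):
>
> np.result_type(np.array([1], dtype=np.uint16), 1)  # uint16 -- the value 1 fits
> np.result_type(np.uint16, np.int64)                # int64  -- no values involved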
>
>
> There are 2 main discussion points/issues about it:
>
> 1. Should value based casting/promotion logic exist at all?
>
> Arguably an `np.int32(3)` has type information attached to it, so why
> should we ignore it? It can also be tricky for users, because a small
> change in values can change the result data type.
> Because 0-D arrays and scalars are so close inside numpy (you will
> often not know which one you get), there is not much option but to
> handle them identically. However, it seems pretty odd that:
>  * `np.array(3, dtype=np.int32) + np.arange(10, dtype=np.int8)`
>  * `np.array([3], dtype=np.int32) + np.arange(10, dtype=np.int8)`
>
> give different results.
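>
> Concretely, I believe the first currently comes out as int8 and the
> second as int32, because the 0-D array is treated like a scalar:
>
> (np.array(3, dtype=np.int32) + np.arange(10, dtype=np.int8)).dtype    # int8
> (np.array([3], dtype=np.int32) + np.arange(10, dtype=np.int8)).dtype  # int32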
>
> This is a bit different for python scalars, which do not have a type
> attached already.
>
>
> 2. Promotion and type resolution in Ufuncs:
>
> What is currently bothering me is that the decision of what the output
> dtypes should be depends on the values in complicated ways.
> It would be nice if we can decide which type signature to use without
> actually looking at values (or at least only very early on).
>
> One reason here is caching and simplicity. I would like to be able to
> cache which loop should be used for what input. Having value based
> casting in there bloats up the problem.
> Of course it currently works OK, but especially when user dtypes come
> into play, caching would seem like a nice optimization option.
>
> Because `uint8(127)` can also be an `int8`, but `uint8(128)` cannot, it
> is not as simple as finding the "minimal" dtype once and working with
> that.
> Of course Eric and I discussed this a bit before, and you could create
> an internal "uint7" dtype whose only purpose is to flag that a cast to
> int8 is safe.
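>
> For illustration, the current value-inspecting np.can_cast already
> shows this asymmetry:
>
> np.can_cast(np.uint8(127), np.int8)  # True  -- the value fits
> np.can_cast(np.uint8(128), np.int8)  # False -- it does not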
>
> I suppose it is possible I am barking up the wrong tree here, and this
> caching/predictability is not vital (or can be solved with such an
> internal dtype easily, although I am not sure it seems elegant).
>
>
> Possible options to move forward
> --------------------------------
>
> I still have to see a bit how tricky things are. But there are a few
> possible options. I would like to move the scalar logic to the
> beginning of ufunc calls:
>   * The uint7 idea would be one solution
>   * Simply implement something that works for numpy and all except
>     strange external ufuncs (I can only think of numba as a plausible
>     candidate for creating such).
>
> My current plan is to see where the second thing leaves me.
>
> We should also see if we cannot move the whole thing forward, in which
> case the main decision is in which direction to move. My current
> opinion is that when a type clearly has a dtype associated with it, we
> should always use that dtype in the future. This mostly means that
> numpy scalars such as `np.int64(1)` will always be treated like an
> int64, and never like a `uint8` just because they happen to be castable
> to that.
>
> For values without a dtype attached (read: python integers and floats),
> I see three options, from most complex to simplest:
>
> 1. Keep the current logic in place as much as possible
> 2. Only support value based promotion for operators, e.g.:
>    `arr + scalar` may do it, but `np.add(arr, scalar)` will not.
>    The upside is that it limits the complexity to a much simpler
>    problem, the downside is that the ufunc call and operator match
>    less clearly.
> 3. Just associate python float with float64 and python integers with
>    long/int64 and force users to always type them explicitly if they
>    need to.
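>
> To illustrate what option 3 would mean for the float32 example above:
>
> arr32 = np.array([1, 43, 23], dtype=np.float32)
> np.result_type(arr32.dtype, np.float64)  # float64
>
> so `arr32 + 2.0` would presumably come out as float64 under option 3,
> instead of float32 as it does today.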
>
> The downside of 1. is that it doesn't help with simplifying the current
> situation all that much, because we still have the special casting
> around...
>
>
> I have realized that this got much too long, so I hope it makes sense.
> I will continue to dabble along on these things a bit, so if nothing
> else maybe writing it helps me to get a bit clearer on things...
>
> Best,
>
> Sebastian
>
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion



-- 
Nathaniel J. Smith -- https://vorpus.org

