[Pandas-dev] Future of NumPy (and pandas) scalar "promotion" (and concatenate-dtype)

Sebastian Berg sebastian at sipsolutions.net
Mon Mar 8 12:59:32 EST 2021


Hi all,

Summary/Abstract: I am seriously exploring the idea of modifying the
NumPy promotion rules to drop the current value-based logic. This would
probably affect pandas in a similar way as it does NumPy, so I am
wondering what your opinion is on the "value-based" logic and potential
"future" logic.
One of the most annoying things is likely the transition phase (see the
last part about the many warnings I see in the pandas test suite).


** Long Story: **

I am wondering about the future of type promotion in NumPy [1], but
this would probably just as much affect pandas.
The problem is what to do with things like:

    np.array([1, 2], dtype=np.uint8) + 1000

Where the result is currently upcast to a `uint16`.  The rules for this
are pretty arcane, however.
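
A minimal sketch for checking what an installed NumPy does with this
expression. The helper name `add_python_int` is made up for
illustration. Depending on the NumPy version, the out-of-range case may
upcast (as described above) or raise an `OverflowError`, so the sketch
catches that rather than assuming one outcome:

```python
import numpy as np


def add_python_int(arr, value):
    """Return the result dtype name, or the exception name if NumPy refuses."""
    try:
        return (arr + value).dtype.name
    except OverflowError:
        # Some NumPy versions reject a Python int that does not fit the dtype.
        return "OverflowError"


arr = np.array([1, 2], dtype=np.uint8)
small = add_python_int(arr, 1)     # 1 fits into uint8
large = add_python_int(arr, 1000)  # 1000 does not fit into uint8
print(np.__version__, small, large)
```

The point of comparing `1` with `1000` is that only the value of the
Python integer differs, not its type.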

There are a few "worse" things that probably do not affect pandas as
much. For example, the same upcast also happens in this case:

    np.array([1, 2], dtype=np.uint8) + np.int64(1000)

Even though the scalar is explicitly typed as int64, we just drop that
information.
The weirdest things are probably regarding float precision:

    np.array([0.3], dtype=np.float32) == 0.3
    np.array([0.3], dtype=np.float32) == np.float64(0.3)

Where the latter would probably change from `True` to `False` without
the value-based logic, due to the limited precision of float32. (At
least unless we explicitly try to counteract this for comparisons.)
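
To make the precision issue concrete, here is a small sketch that
performs both comparisons with explicit dtypes, so that it does not
depend on which promotion rules are active. The variable names are
only illustrative:

```python
import numpy as np

arr32 = np.array([0.3], dtype=np.float32)

# Comparison carried out in float32: 0.3 is first rounded to float32,
# which matches the value stored in the array.
same_precision = arr32 == np.float32(0.3)

# Comparison carried out in float64: the stored value is widened, but it
# still carries the float32 rounding error, so it differs from 0.3.
wider_precision = arr32.astype(np.float64) == np.float64(0.3)

print(same_precision, wider_precision)  # [ True] [False]
```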


** Solution: **

The basic idea right now is the following:

1. All objects with NumPy dtypes use those strictly. Scalars or 0-D
   arrays will have no special handling.
2. Python integers, floats, and complex numbers are considered to have
   a special "weak" dtype. In the above examples `1000` or `0.3` would
   simply be force-cast to `uint8` or `float32`.  (Potentially with a
   warning/error for integer-rollover)
3. The "additional" rule that all function calls use `np.asarray()`,
   which converts Python types.  That is, `np.add(uint8_arr, 1000)`
   would return the same as `np.add(uint8_arr, np.array(1000))`, while
   `uint8_arr + 1000` would not!
   (I am not sure about this rule, it could be modified but it seems
   easier to limit the "special behaviour" to Python operators)
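
Rules 1 and 2 above can be emulated by hand with explicit casts. This
is only a sketch of the intended results, not of how NumPy would
implement them, and the names `strict_sum` and `weak_sum` are made up
for illustration:

```python
import numpy as np

uint8_arr = np.array([1, 2], dtype=np.uint8)

# Rule 1 (strict dtypes): promote from the dtypes alone, ignoring values.
strict_dtype = np.result_type(np.uint8, np.int64)
strict_sum = uint8_arr.astype(strict_dtype) + np.int64(1000)

# Rule 2 ("weak" Python scalars): force-cast 1000 to the array dtype.
# astype() wraps silently here, so 1000 becomes 1000 % 256 == 232.
weak_scalar = np.array([1000]).astype(np.uint8)[0]
weak_sum = uint8_arr + weak_scalar

print(strict_sum.dtype, strict_sum.tolist())  # int64 [1001, 1002]
print(weak_sum.dtype, weak_sum.tolist())      # uint8 [233, 234]
```

Under rule 3, `np.add(uint8_arr, 1000)` would take the strict path
(after `np.asarray(1000)`), while `uint8_arr + 1000` would take the
weak path.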

I did some initial trials with such behaviour. These did not issue
transition warnings for the "weak" logic (I expect it rarely changes
the result), but did issue warnings where Point 1 probably makes a
difference.

To my surprise the SciPy test suite did not even notice!  The pandas
test suite runs into thousands of warnings (but few or no errors).
Probably mostly due to tests that effectively check ufuncs with:

    binary_ufunc(typed_arr, 3)

NumPy does that a lot in its test suite as well.  Maybe we can deal
with it or rethink rule 3.
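
As a sketch of why this test pattern is sensitive to rule 3, the
following compares the dtype that results from promoting against
`np.asarray(3)` with a call where the scalar is given the array's dtype
explicitly. Using `np.add` and `int16` here is an arbitrary choice for
illustration, and typing the scalar is only one possible way a test
could avoid the ambiguity:

```python
import numpy as np

typed_arr = np.array([1, 2, 3], dtype=np.int16)

# Rule 3 would treat the Python int like np.asarray(3), which carries
# the default integer dtype, so strict promotion widens the result.
promoted = np.result_type(typed_arr.dtype, np.asarray(3).dtype)

# Typing the scalar explicitly keeps the array's dtype.
explicit = np.add(typed_arr, typed_arr.dtype.type(3))

print(promoted, explicit.dtype, explicit.tolist())
```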

Cheers,

Sebastian



[1] I have a conundrum. I don't really want to change things right now,
but I need to reimplement the promotion logic.  Preserving value-based
logic seems tricky to do without introducing technical debt, which
would be wasted effort if we want to get rid of it later anyway...