[Pandas-dev] Future of NumPy (and pandas) scalar "promotion" (and concatenate-dtype)

Mon Mar 8 13:47:26 EST 2021

Hi Sebastian, at a glance this is a scary-looking change. Knowing the
relatively fast-and-loose ways that people have been using NumPy in
industry applications over the last 10+ years, the idea that `arr +
scalar` could cause data loss in "scalar" is pretty worrying. It would
be better to raise an exception than to generate a warning.

I feel like to really understand the impact of this change, you would
need to prepare a set of experimental NumPy wheels that you publish to
PyPI to allow downstream users to run their applications and see what
happens, and engage in outreach efforts to get them to actually do the
testing.

- Wes

On Mon, Mar 8, 2021 at 12:00 PM Sebastian Berg
<sebastian at sipsolutions.net> wrote:
>
> Hi all,
>
> Summary/Abstract: I am seriously exploring the idea of modifying the
> NumPy promotion rules to drop the current value-based logic. This would
> probably affect pandas in a similar way as it does NumPy, so I am
> wondering what your opinion is on the "value-based" logic and potential
> "future" logic.
> One of the most annoying things is likely the transition phase (see the
> last part about the many warnings I see in the pandas test suite).
>
>
> ** Long Story: **
>
> I am wondering about the future of type promotion in NumPy [1], but
> this would probably just as much affect pandas.
> The problem is what to do with things like:
>
>     np.array([1, 2], dtype=np.uint8) + 1000
>
> Where the result is currently upcast to a `uint16`.  The rules for this
> are pretty arcane, however.
>
> There are a few "worse" things that probably do not affect pandas as
> much. That is, the same upcast also happens in this case:
>
>     np.array([1, 2], dtype=np.uint8) + np.int64(1000)
>
> Even though int64 is explicitly typed, we just drop that information.
> The weirdest things are probably regarding float precision:
>
>     np.array([0.3], dtype=np.float32) == 0.3
>     np.array([0.3], dtype=np.float32) == np.float64(0.3)
>
> Where the latter would probably go from `True` to `False` due to the
> limited precision of float32. (At least unless we explicitly try to
> counteract this for comparisons.)
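
The float32 comparison quoted above can be reproduced without NumPy. This is only a rough sketch using the standard-library `struct` module to round `0.3` to single precision; the helper name `to_float32` is made up for illustration and is not a NumPy API.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (a float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# What a float32 array actually stores when it is given 0.3.
stored = to_float32(0.3)

# Comparison carried out in float32: the scalar is rounded the same way,
# so both sides hold the same value.
equal_in_float32 = to_float32(0.3) == stored

# Comparison carried out in float64: the stored value is widened exactly,
# while the scalar keeps its full double-precision value.
equal_in_float64 = 0.3 == stored

print(equal_in_float32, equal_in_float64)  # True False
```

The first comparison models the case where the scalar is cast down to the array's dtype, the second models the case where an explicitly typed float64 forces the array side to be widened.
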
>
>
> ** Solution: **
>
> The basic idea right now is the following:
>
> 1. All objects with NumPy dtypes use those strictly. Scalars or 0-D
>    arrays will have no special handling.
> 2. Python integers, floats, and complex numbers are considered to have a
>    special "weak" dtype. In the above examples `1000` or `0.3` would simply
>    be force-cast to `uint8` or `float32`.  (Potentially with a
>    warning/error for integer rollover)
> 3. The "additional" rule that all function calls use `np.asarray()`,
>    which converts Python types.  That is `np.add(uint8_arr, 1000)` would
>    return the same as `np.add(uint8_arr, np.array(1000))`, while
>    `uint8_arr + 1000` would not!
>    (I am not sure about this rule, it could be modified but it seems
>    easier to limit the "special behaviour" to Python operators)
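
To make the difference between the current value-based logic and the proposed "weak" scalar handling concrete, here is a toy model in plain Python. It is not NumPy's implementation: the function names `value_based_bits` and `weak_cast` are hypothetical, and the model only covers unsigned integer widths and non-negative scalars on the value-based side.

```python
UNSIGNED_BITS = (8, 16, 32, 64)


def value_based_bits(array_bits: int, scalar: int) -> int:
    """Value-based model: widen to the smallest unsigned width holding the scalar."""
    if scalar < 0:
        raise ValueError("this sketch only models non-negative scalars")
    for bits in UNSIGNED_BITS:
        if scalar < 2 ** bits:
            # The result is at least as wide as the array itself.
            return max(array_bits, bits)
    raise OverflowError("scalar does not fit in 64 bits")


def weak_cast(array_bits: int, scalar: int, strict: bool = False) -> int:
    """Weak model: keep the array width and force-cast the scalar into it."""
    wrapped = scalar % (2 ** array_bits)
    if strict and wrapped != scalar:
        # Corresponds to the "warning/error for integer rollover" option.
        raise OverflowError(f"{scalar} does not fit in {array_bits} unsigned bits")
    return wrapped


print(value_based_bits(8, 1000))  # 16: result widened to a 16-bit type
print(weak_cast(8, 1000))         # 232: 1000 wrapped into 8 bits
```

With `strict=True`, `weak_cast(8, 1000)` raises `OverflowError` instead of silently wrapping, which is the alternative to a warning discussed in this thread.
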
>
> I did some initial trials with such behaviour. They do not issue
> transition warnings for the "weak" logic (although I expect it rarely
> changes the result), but they do issue warnings when Point 1 probably
> makes a difference.
>
> To my surprise the SciPy test suite did not even notice!  The pandas
> test suite runs into thousands of warnings (but few or no errors).
> Probably mostly due to tests that effectively check ufuncs with:
>
>     binary_ufunc(typed_arr, 3)
>
> NumPy does that a lot in its test suite as well.  Maybe we can deal
> with it or rethink rule 3.
>
> Cheers,
>
> Sebastian
>
>
>
> [1] I have a conundrum. I don't really want to change things right now,
> but I need to reimplement it.  Preserving value-based logic seems
> tricky to do without introducing technical debt that we would want to
> get rid of later anyway...
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev