[Pandas-dev] Future of NumPy (and pandas) scalar "promotion" (and concatenate-dtype)

Wes McKinney wesmckinn at gmail.com
Tue Mar 9 13:16:11 EST 2021


I see your points — FWIW I think that if people are using small
integers (rather than using int_ or int64 for everything), they have
some responsibility to mind these issues. I'm supportive of you trying
to fix it, with the warning that I think you should engage in extra
efforts to try to obtain feedback before it lands in "pip install
numpy".

On Tue, Mar 9, 2021 at 9:41 AM Sebastian Berg
<sebastian at sipsolutions.net> wrote:
>
> On Mon, 2021-03-08 at 12:47 -0600, Wes McKinney wrote:
> > hi Sebastian — at a glance this is a scary-looking change. Knowing
> > the
> > relatively fast-and-loose ways that people have been using NumPy in
> > industry applications over the last 10+ years, the idea that `arr +
> > scalar` could cause data loss in "scalar" is pretty worrying. It
> > would
> > be better to raise an exception than to generate a warning.
>
>
> Well, some notes:
>
> 1. Obviously there would be transition warnings. Honestly, I am a bit
>    worried that the transition warnings would be far more annoying than
>    the change itself.
>
> 2. Yes, errors or at least warnings on unsafe conversion are better.
>    Mostly we just currently don't have them...
>    So my opinion is that ensuring errors (or maybe just warnings)
>    seems required (when the final transition happens).
>    We also may need something like this:
>
>        np.uint8(value, safe=True)
>
>    To be able to "opt in" to the future behaviour safely. That might
>    be annoying, but it's not a serious blocker or particularly complex.
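No such opt-in constructor exists in NumPy today; the `safe=True` keyword above is a proposal. A minimal sketch of what such a check might do (the helper name and range test are purely illustrative, not NumPy API):

```python
import numpy as np

def safe_uint8(value):
    # Hypothetical sketch of an opt-in "safe" scalar constructor:
    # reject out-of-range values instead of silently wrapping.
    info = np.iinfo(np.uint8)
    if not (info.min <= int(value) <= info.max):
        raise OverflowError(f"{value} does not fit in uint8")
    return np.uint8(value)
```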
>
> 3. The current situation is already ridiculously unsafe, since
>    integers tend to rollover left and right. Try guessing the result
>    for these:
>
>        np.array([100], dtype="uint8") + 200
>        np.array(100, dtype="uint8") + 200
>        np.array([100], dtype="uint8") + 300
>        np.array([100], dtype="uint8") + np.array(200, dtype="int64")
>        np.array(100, dtype="uint8") + np.array(200, dtype="int64")
>
>        np.array([100], dtype="uint8") - 200
>        np.array([100], dtype="uint8") + -200
>
>    They are (ignoring shape):
>
>        44 (uint8), 300, 400 (uint16), 44 (uint8), 300,
>        156 (uint8), -100 (int16)
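The wraparound itself is easy to reproduce with explicitly typed operands on both sides, where no scalar promotion is involved and the result is the same on any NumPy version:

```python
import numpy as np

a = np.array([100], dtype="uint8")
b = np.array([200], dtype="uint8")
# uint8 + uint8 stays uint8, so 300 wraps modulo 256:
print((a + b)[0])  # 44
```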
>
> 4. In `weak->strong` transition the resulting dtype will always have
>    higher precision, which is less likely to cause trouble (but
>    more likely to give spurious warnings). The typical worst case
>    is probably memory bloat.
>
> 5. For floats, the situation seems much less dramatic, reduced
>    precision due to this change should almost never happen
>    (or give an overflow warning).
>    Of course `float32 + large_integer` might occasionally have
>    upcast to 64bit previously...
>
> 6. By the way: The change would largely revert back to the behaviour
>    of NumPy <1.6! So if the code is 10 years old it might suddenly
>    work again.  (I expect ancient NumPy used "weak" logic even for 0-D
>    arrays, so was much worse.)
>
>
> > I feel like to really understand the impact of this change, you would
> > need to prepare a set of experimental NumPy wheels that you publish
> > to
> > PyPI to allow downstream users to run their applications and see what
> > happens, and engage in outreach efforts to get them to actually do
> > the
> > testing.
>
> Right now, I wanted to prod and see whether pandas-devs think that
> this is the right direction and one that they are willing to work
> towards.
>
> I think in NumPy there is a consensus that value-based logic is very
> naughty and some loose consensus that the proposal I posted is the most
> promising angle for fixing it (maybe quite loose, but I don't expect
> more insight from NumPy-discussions at this time).
>
> Of course I can't be 100% sure that this will pan out, but I can spend
> my remaining sanity on other things if it becomes obvious that
> there is serious resistance...
> This is a side battle for me.  But the point is that doing it now may
> be a unique chance, because if we shelve it now it will become even
> harder to change. And that probably means shelving it again for another
> decade or longer.
>
> Cheers,
>
> Sebastian
>
>
>
> >
> > - Wes
> >
> > On Mon, Mar 8, 2021 at 12:00 PM Sebastian Berg
> > <sebastian at sipsolutions.net> wrote:
> > >
> > > Hi all,
> > >
> > > Summary/Abstract: I am seriously exploring the idea of modifying
> > > the
> > > NumPy promotion rules to drop the current value-based logic. This
> > > would
> > > probably affect pandas in a similar way as it does NumPy, so I am
> > > wondering what your opinion is on the "value-based" logic and
> > > potential
> > > "future" logic.
> > > One of the most annoying things is likely the transition phase
> > > (see the last part about the many warnings I see in the pandas
> > > test suite).
> > >
> > >
> > > ** Long Story: **
> > >
> > > I am wondering about the future of type promotion in NumPy [1], but
> > > this would probably just as much affect pandas.
> > > The problem is what to do with things like:
> > >
> > >     np.array([1, 2], dtype=np.uint8) + 1000
> > >
> > > Where the result is currently upcast to a `uint16`.  The rules for
> > > this
> > > are pretty arcane, however.
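The "arcane" part is that the scalar's value, not its type, picks the dtype. NumPy exposes that value-based lookup as `np.min_scalar_type`, which shows why `1000` drags the result up to uint16:

```python
import numpy as np

# The minimal dtype is chosen from the scalar's *value*:
print(np.min_scalar_type(200))    # uint8  (fits in 0..255)
print(np.min_scalar_type(1000))   # uint16 (too big for uint8)
print(np.min_scalar_type(-200))   # int16  (negative, too big for int8)
```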
> > >
> > > There are a few "worse" things that probably do not affect pandas
> > > as much. That is, the same also happens in this case:
> > >
> > >     np.array([1, 2], dtype=np.uint8) + np.int64(1000)
> > >
> > > Even though int64 is explicitly typed, we just drop that
> > > information.
> > > The weirdest things are probably regarding float precision:
> > >
> > >     np.array([0.3], dtype=np.float32) == 0.3
> > >     np.array([0.3], dtype=np.float32) == np.float64(0.3)
> > >
> > > Where the latter would probably go from `True` to `False` due to
> > > the
> > > limited precision of float32. (At least unless we explicitly try to
> > > counteract this for comparisons.)
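The precision issue can be seen with scalars alone, where no value-based promotion is involved: widening `float32(0.3)` to float64 does not recover the float64 rounding of `0.3`:

```python
import numpy as np

a = np.float32(0.3)  # 0.3 rounded to the nearest float32
b = np.float64(0.3)  # 0.3 rounded to the nearest float64
# Comparing promotes a to float64, where the two roundings differ:
print(a == b)              # False
# Rounding b down to float32 makes them match again:
print(a == np.float32(b))  # True
```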
> > >
> > >
> > > ** Solution: **
> > >
> > > The basic idea right now is the following:
> > >
> > > 1. All objects with NumPy dtypes use those strictly. Scalars or 0-D
> > >    arrays will have no special handling.
> > > 2. Python integers, floats, and complex numbers are considered to
> > >    have a special "weak" dtype. In the above example `1000` or
> > >    `0.3` would simply be force-cast to `uint8` or `float32`.
> > >    (Potentially with a warning/error for integer rollover)
> > > 3. The "additional" rule that all function calls use
> > > `np.asarray()`,
> > >    which convert Python types.  That is `np.add(uint8_arr, 1000)`
> > > would
> > >    return the same as `np.add(uint8_arr, np.array(1000))`, while
> > >    `uint8_arr + 1000` would not!
> > >    (I am not sure about this rule, it could be modified but it
> > > seems
> > >    easier to limit the "special behaviour" to Python operators)
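Rule 2 can be emulated today by casting the Python scalar to the array's dtype by hand (using an unsafe `astype` so the cast wraps instead of erroring, purely for illustration), and rule 3 hinges on `np.asarray` giving a Python int a concrete default dtype:

```python
import numpy as np

arr = np.array([1, 2], dtype=np.uint8)

# Rule 2 emulated: the Python int adopts the array's dtype first.
# 1000 wraps modulo 256 to 232 under the unsafe cast:
weak_scalar = np.array(1000).astype(np.uint8)
print(arr + weak_scalar)       # [233 234], still uint8

# Rule 3: np.asarray turns 1000 into a default-integer array,
# so inside function calls it would no longer be "weak":
print(np.asarray(1000).dtype)  # the platform's default integer
```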
> > >
> > > I did some initial trials with such behaviour, without issuing
> > > transition warnings for the "weak" logic (although I expect it
> > > rarely changes the result), but issuing warnings where point 1
> > > probably makes a difference.
> > >
> > > To my surprise, the SciPy test suite did not even notice!  The
> > > pandas test suite runs into thousands of warnings (but few or no
> > > errors).  Probably mostly due to tests that effectively check
> > > ufuncs with:
> > >
> > >     binary_ufunc(typed_arr, 3)
> > >
> > > NumPy does that a lot in its test suite as well.  Maybe we can deal
> > > with it or rethink rule 3.
> > >
> > > Cheers,
> > >
> > > Sebastian
> > >
> > >
> > >
> > > [1] I have a conundrum. I don't really want to change things right
> > > now, but I need to reimplement the logic.  Preserving value-based
> > > behaviour seems tricky to do without introducing technical debt
> > > that we want to get rid of later anyway...
> > > _______________________________________________
> > > Pandas-dev mailing list
> > > Pandas-dev at python.org
> > > https://mail.python.org/mailman/listinfo/pandas-dev
> >
>
>
>

