[Pandas-dev] Future of NumPy (and pandas) scalar "promotion" (and concatenate-dtype)

Joris Van den Bossche jorisvandenbossche at gmail.com
Tue Mar 9 13:27:39 EST 2021


I personally fully support trying to drop any value-based logic. In
pandas we have some additional (custom-to-pandas) value-based logic,
mainly in concat operations, that we are also trying to move away
from.

It will mean some behaviour changes, but as long as there is a
transition period with warnings when there would be data loss (or if
it would result in an error), that seems acceptable to me. Having
consistent rules in the long term will be really beneficial.

What's not fully clear to me is the exact behaviour of this
"weak" dtype for Python numbers. Would it always use the dtype of the
other typed operand?

Joris

On Tue, 9 Mar 2021 at 16:40, Sebastian Berg <sebastian at sipsolutions.net> wrote:
>
> On Mon, 2021-03-08 at 12:47 -0600, Wes McKinney wrote:
> > hi Sebastian — at a glance this is a scary-looking change. Knowing the
> > relatively fast-and-loose ways that people have been using NumPy in
> > industry applications over the last 10+ years, the idea that `arr +
> > scalar` could cause data loss in "scalar" is pretty worrying. It would
> > be better to raise an exception than to generate a warning.
>
>
> Well, some notes:
>
> 1. Obviously there would be transition warnings. Honestly, I am a bit
>    worried that the transition warnings would be far more annoying than
>    the change itself.
>
> 2. Yes, errors or at least warnings on unsafe conversion are better.
>    Mostly we just currently don't have them...
>    So my opinion is that ensuring errors (or maybe just warnings)
>    will be required when the final transition happens.
>    We also may need something like this:
>
>        np.uint8(value, safe=True)
>
>    To be able to "opt-in" into the future behaviour safely. That might
>    be annoying, but it's not a serious blocker or particularly complex.
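For illustration, a minimal sketch of what such an opt-in could check. The `safe=True` keyword does not exist in NumPy today; `uint8_safe` is a hypothetical helper approximating the proposed behaviour:

```python
import numpy as np

def uint8_safe(value):
    # Hypothetical helper approximating the proposed opt-in: convert a
    # Python int to uint8, raising instead of silently wrapping around.
    info = np.iinfo(np.uint8)
    if not (info.min <= value <= info.max):
        raise OverflowError(f"{value} does not fit in uint8")
    return np.uint8(value)

print(uint8_safe(200))  # 200 fits, so this succeeds
```

With this, `uint8_safe(300)` raises an `OverflowError` instead of wrapping to 44.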
>
> 3. The current situation is already ridiculously unsafe, since
>    integers tend to roll over left and right. Try guessing the result
>    for these:
>
>        np.array([100], dtype="uint8") + 200
>        np.array(100, dtype="uint8") + 200
>        np.array([100], dtype="uint8") + 300
>        np.array([100], dtype="uint8") + np.array(200, dtype="int64")
>        np.array(100, dtype="uint8") + np.array(200, dtype="int64")
>
>        np.array([100], dtype="uint8") - 200
>        np.array([100], dtype="uint8") + -200
>
>    They are (ignoring shape):
>
>        44 (uint8), 300, 400 (uint16), 44 (uint8), 300,
>        156 (uint8), -100 (int16)
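The rollover itself can be demonstrated without any promotion at all; when both operands are explicitly uint8, no promotion rules are involved and the result wraps regardless of NumPy version:

```python
import numpy as np

# With both operands explicitly uint8, the addition simply wraps
# modulo 256 -- the hazard behind most of the surprising answers above.
a = np.array([100], dtype="uint8")
b = np.array([200], dtype="uint8")
print(int((a + b)[0]))  # 300 % 256 == 44
```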
>
> 4. In `weak->strong` transition the resulting dtype will always have
>    higher precision, which is less likely to cause trouble (but
>    more likely to give spurious warnings). The typical worst case
>    is probably memory bloat.
>
> 5. For floats, the situation seems much less dramatic, reduced
>    precision due to this change should almost never happen
>    (or give an overflow warning).
>    Of course `float32 + large_integer` might occasionally have
>    upcast to 64-bit previously...
>
> 6. By the way: The change would largely revert back to the behaviour
>    of NumPy <1.6! So if the code is 10 years old it might suddenly
>    work again.  (I expect ancient NumPy used "weak" logic even for 0-D
>    arrays, so it was much worse.)
>
>
> > I feel like to really understand the impact of this change, you would
> > need to prepare a set of experimental NumPy wheels that you publish
> > to
> > PyPI to allow downstream users to run their applications and see what
> > happens, and engage in outreach efforts to get them to actually do
> > the
> > testing.
>
> Right now, I wanted to prod and see whether pandas devs think that this
> is the right direction and one that they are willing to work towards.
>
> I think in NumPy there is a consensus that value-based logic is very
> naughty and some loose consensus that the proposal I posted is the most
> promising angle for fixing it (maybe quite loose, but I don't expect
> more insight from NumPy-discussions at this time).
>
> Of course I can't be 100% sure that this will pan out, but I can spend
> my remaining sanity on other things if it becomes obvious that
> there is serious resistance...
> This is a side battle for me.  But the point is that doing it now may
> be a unique chance, because if we shelve it now it will become even
> harder to change. And that probably means shelving it again for another
> decade or longer.
>
> Cheers,
>
> Sebastian
>
>
>
> >
> > - Wes
> >
> > On Mon, Mar 8, 2021 at 12:00 PM Sebastian Berg
> > <sebastian at sipsolutions.net> wrote:
> > >
> > > Hi all,
> > >
> > > Summary/Abstract: I am seriously exploring the idea of modifying the
> > > NumPy promotion rules to drop the current value-based logic. This
> > > would probably affect pandas in a similar way as it does NumPy, so I
> > > am wondering what your opinion is on the "value-based" logic and the
> > > potential "future" logic.
> > > One of the most annoying things is likely the transition phase (see
> > > the last part about the many warnings I see in the pandas test
> > > suite).
> > >
> > >
> > > ** Long Story: **
> > >
> > > I am wondering about the future of type promotion in NumPy [1], but
> > > this would probably just as much affect pandas.
> > > The problem is what to do with things like:
> > >
> > >     np.array([1, 2], dtype=np.uint8) + 1000
> > >
> > > Where the result is currently upcast to a `uint16`.  The rules for
> > > this are pretty arcane, however.
> > >
> > > There are a few "worse" things that probably do not affect pandas as
> > > much. That is, the above does also happen in this case:
> > >
> > >     np.array([1, 2], dtype=np.uint8) + np.int64(1000)
> > >
> > > Even though int64 is explicitly typed, we just drop that information.
> > > The weirdest things are probably regarding float precision:
> > >
> > >     np.array([0.3], dtype=np.float32) == 0.3
> > >     np.array([0.3], dtype=np.float32) == np.float64(0.3)
> > >
> > > Where the latter would probably go from `True` to `False` due to the
> > > limited precision of float32. (At least unless we explicitly try to
> > > counteract this for comparisons.)
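The precision gap underlying this example is easy to check and does not depend on promotion rules at all; whichever precision the comparison happens in decides the answer:

```python
import numpy as np

# float32(0.3) is not the same number as float64(0.3): the nearest
# float32 to 0.3 differs from the nearest float64 to 0.3.
x32 = np.float32(0.3)
print(float(x32) == 0.3)       # False: compared as float64, the values differ
print(x32 == np.float32(0.3))  # True: compared as the same float32 value
```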
> > >
> > >
> > > ** Solution: **
> > >
> > > The basic idea right now is the following:
> > >
> > > 1. All objects with NumPy dtypes use those strictly. Scalars or 0-D
> > >    arrays will have no special handling.
> > > 2. Python integers, floats, and complex numbers are considered to
> > >    have a special "weak" dtype. In the above example `1000` or `0.3`
> > >    would simply be force-cast to `uint8` or `float32`.  (Potentially
> > >    with a warning/error for integer rollover.)
> > > 3. An "additional" rule that all function calls use `np.asarray()`,
> > >    which converts Python types.  That is, `np.add(uint8_arr, 1000)`
> > >    would return the same as `np.add(uint8_arr, np.array(1000))`, while
> > >    `uint8_arr + 1000` would not!
> > >    (I am not sure about this rule; it could be modified, but it seems
> > >    easier to limit the "special behaviour" to Python operators.)
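The asymmetry in rule 3 comes from `np.asarray()` assigning a concrete dtype to the Python scalar before the ufunc ever sees it. A quick way to see that mechanism under current NumPy:

```python
import numpy as np

# np.asarray() turns a bare Python int into a 0-D array with a concrete
# (platform-default) signed integer dtype -- it is no longer a "weak"
# scalar, so rule 1 would then apply its dtype strictly.
x = np.asarray(1000)
print(x.ndim, x.dtype.kind)  # 0 'i'
```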
> > >
> > > I did some initial trials with such behaviour, without issuing
> > > transition warnings for the "weak" logic (although I expect it rarely
> > > changes the result), but issuing warnings where point 1 probably makes
> > > a difference.
> > >
> > > To my surprise, the SciPy test suite did not even notice!  The pandas
> > > test suite runs into thousands of warnings (but few or no errors),
> > > probably mostly due to tests that effectively check ufuncs with:
> > >
> > >     binary_ufunc(typed_arr, 3)
> > >
> > > NumPy does that a lot in its test suite as well.  Maybe we can deal
> > > with it or rethink rule 3.
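For concreteness, the test pattern being described might look like the following (`np.add` standing in for `binary_ufunc`). Today this returns uint8; under the proposal, rule 3 would route the scalar through `np.asarray()` and rule 1 would then treat the resulting 0-D integer array strictly, promoting the result, which is presumably what triggers the warnings:

```python
import numpy as np

# A ufunc called with a typed array and a bare Python scalar. For a
# small value like 3 the current result dtype is uint8; under the
# proposed rule 3 it would instead behave like
# np.add(typed_arr, np.asarray(3)) and promote to a larger integer.
typed_arr = np.array([1, 2, 3], dtype=np.uint8)
result = np.add(typed_arr, 3)
print(result.dtype)  # uint8 (current behaviour)
```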
> > >
> > > Cheers,
> > >
> > > Sebastian
> > >
> > >
> > >
> > > [1] I have a conundrum. I don't really want to change things right
> > > now, but I need to reimplement it.  Preserving value-based logic seems
> > > tricky to do without introducing technical debt, given that we want to
> > > get rid of it later anyway...
> > > _______________________________________________
> > > Pandas-dev mailing list
> > > Pandas-dev at python.org
> > > https://mail.python.org/mailman/listinfo/pandas-dev
> >
>
>
>

