[Pandas-dev] Future of NumPy (and pandas) scalar "promotion" (and concatenate-dtype)

Sebastian Berg sebastian at sipsolutions.net
Tue Mar 9 10:40:36 EST 2021


On Mon, 2021-03-08 at 12:47 -0600, Wes McKinney wrote:
> hi Sebastian — at a glance this is a scary-looking change. Knowing
> the relatively fast-and-loose ways that people have been using NumPy
> in industry applications over the last 10+ years, the idea that
> `arr + scalar` could cause data loss in "scalar" is pretty worrying.
> It would be better to raise an exception than to generate a warning.


Well, some notes:

1. Obviously there would be transition warnings. Honestly, I am a bit
   worried that the transition warnings would be far more annoying than
   the change itself.

2. Yes, errors or at least warnings on unsafe conversion are better.
   Mostly, we just don't currently have them...
   So my opinion is that ensuring errors (or maybe just warnings)
   seems required (when the final transition happens).
   We may also need something like this:

       np.uint8(value, safe=True)

   to be able to "opt-in" to the future behaviour safely. That might
   be annoying, but it's not a serious blocker or particularly complex.
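   For illustration, a rough sketch of what such an opt-in could look
   like as a plain helper (the `safe=True` keyword does not exist in
   NumPy today; the helper below is purely hypothetical):

       import numpy as np

       def uint8_safe(value):
           # Reject values that would silently wrap modulo 256.
           if not (0 <= value <= 255):
               raise OverflowError(f"{value} does not fit into uint8")
           return np.uint8(value)

       uint8_safe(200)   # returns np.uint8(200)
       uint8_safe(300)   # raises OverflowError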

3. The current situation is already ridiculously unsafe, since
   integers tend to roll over left and right. Try guessing the
   results for these:

       np.array([100], dtype="uint8") + 200
       np.array(100, dtype="uint8") + 200
       np.array([100], dtype="uint8") + 300
       np.array([100], dtype="uint8") + np.array(200, dtype="int64")
       np.array(100, dtype="uint8") + np.array(200, dtype="int64")

       np.array([100], dtype="uint8") - 200
       np.array([100], dtype="uint8") + -200

   They are (ignoring shape), in the same order:

       44 (uint8)
       300
       400 (uint16)
       44 (uint8)
       300

       156 (uint8)
       -100 (int16)
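   You can verify these yourself on a current (value-based) NumPy,
   e.g.:

       res = np.array([100], dtype="uint8") + 200
       print(res, res.dtype)   # [44] uint8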

4. Where the behaviour transitions from `weak` to `strong` (e.g. for
   explicitly typed NumPy scalars), the resulting dtype will always
   have higher precision, which is less likely to cause trouble (but
   more likely to give spurious warnings). The typical worst case
   is probably memory bloat.
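   Concretely, assuming the proposed rules:

       np.array([1], dtype="uint8") + np.int64(1000)
       # currently (value-based):  uint16 result
       # proposed (strong scalar): int64 result (higher precision,
       #                           but four times the memory)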

5. For floats, the situation seems much less dramatic: reduced
   precision due to this change should almost never happen
   (or should give an overflow warning).
   Of course, `float32 + large_integer` may occasionally have
   upcast to 64-bit previously...
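   For example, assuming the proposed rules:

       np.array([1.0], dtype="float32") + 1e300
       # currently (value-based): float64 result, 1e300 is preserved
       # proposed (weak scalar):  1e300 is cast to float32 and
       #                          overflows to inf (presumably with
       #                          an overflow warning)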

6. By the way: the change would largely revert to the behaviour of
   NumPy <1.6! So if the code is 10 years old, it might suddenly
   work again.  (I expect ancient NumPy used "weak" logic even for
   0-D arrays, so it was much worse.)


> I feel like to really understand the impact of this change, you
> would need to prepare a set of experimental NumPy wheels that you
> publish to PyPI to allow downstream users to run their applications
> and see what happens, and engage in outreach efforts to get them to
> actually do the testing.

Right now, I wanted to prod and see whether pandas devs think that
this is the right direction and one that they are willing to work
towards.

I think in NumPy there is a consensus that value-based logic is very
naughty and some loose consensus that the proposal I posted is the most
promising angle for fixing it (maybe quite loose, but I don't expect
more insight from NumPy-discussions at this time).

Of course I can't be 100% sure that this will pan out, and I can
spend my remaining sanity on other things if it becomes obvious that
there is serious resistance...
This is a side battle for me.  But the point is that doing it now may
be a unique chance: if we shelve it now, it will become even harder
to change. And that probably means shelving it again for another
decade or longer.

Cheers,

Sebastian



> 
> - Wes
> 
> On Mon, Mar 8, 2021 at 12:00 PM Sebastian Berg
> <sebastian at sipsolutions.net> wrote:
> > 
> > Hi all,
> > 
> > Summary/Abstract: I am seriously exploring the idea of modifying
> > the NumPy promotion rules to drop the current value-based logic.
> > This would probably affect pandas in a similar way as it does
> > NumPy, so I am wondering what your opinion is on the
> > "value-based" logic and the potential "future" logic.
> > One of the most annoying things is likely the transition phase
> > (see the last part about the many warnings I see in the pandas
> > test suite).
> > 
> > 
> > ** Long Story: **
> > 
> > I am wondering about the future of type promotion in NumPy [1],
> > but this would probably affect pandas just as much.
> > The problem is what to do with things like:
> > 
> >     np.array([1, 2], dtype=np.uint8) + 1000
> > 
> > where the result is currently upcast to a `uint16`.  The rules
> > for this are pretty arcane, however.
> > 
> > There are a few "worse" things that probably do not affect pandas
> > as much. That is, the same upcast also happens in this case:
> > 
> >     np.array([1, 2], dtype=np.uint8) + np.int64(1000)
> > 
> > Even though the int64 is explicitly typed, we just drop that
> > information.
> > The weirdest things are probably regarding float precision:
> > 
> >     np.array([0.3], dtype=np.float32) == 0.3
> >     np.array([0.3], dtype=np.float32) == np.float64(0.3)
> > 
> > where the latter would probably go from `True` to `False` due to
> > the limited precision of float32. (At least unless we explicitly
> > try to counteract this for comparisons.)
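> > For example, doing the comparison in float64 makes the
> > representation error visible:
> > 
> >     np.float64(np.float32(0.3)) == np.float64(0.3)
> >     # False: float32 and float64 round 0.3 differently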
> > 
> > 
> > ** Solution: **
> > 
> > The basic idea right now is the following:
> > 
> > 1. All objects with NumPy dtypes use those strictly. Scalars or
> >    0-D arrays will have no special handling.
> > 2. Python integers, floats, and complex numbers are considered to
> >    have a special "weak" dtype. In the above examples, `1000` or
> >    `0.3` would simply be force-cast to `uint8` or `float32`.
> >    (Potentially with a warning/error for integer rollover.)
> > 3. The "additional" rule that all function calls use
> >    `np.asarray()`, which converts Python types.  That is,
> >    `np.add(uint8_arr, 1000)` would return the same as
> >    `np.add(uint8_arr, np.array(1000))`, while `uint8_arr + 1000`
> >    would not!
> >    (I am not sure about this rule; it could be modified, but it
> >    seems easier to limit the "special behaviour" to Python
> >    operators. See the sketch after this list.)
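> > A minimal sketch of how the "weak" rule could behave for Python
> > integers (not an actual implementation; `weak_op` and its error
> > handling are made up for illustration):
> > 
> >     import numpy as np
> > 
> >     def weak_op(op, arr, py_scalar):
> >         # Rule 2: the Python int is simply cast to the array's
> >         # dtype; values that do not fit raise instead of rolling
> >         # over (stricter than strictly proposed, for clarity).
> >         info = np.iinfo(arr.dtype)
> >         if not (info.min <= py_scalar <= info.max):
> >             raise OverflowError(
> >                 f"{py_scalar} does not fit into {arr.dtype}")
> >         return op(arr, np.asarray(py_scalar, dtype=arr.dtype))
> > 
> >     uint8_arr = np.array([1, 2], dtype=np.uint8)
> >     weak_op(np.add, uint8_arr, 100)    # uint8 result: [101, 102]
> >     weak_op(np.add, uint8_arr, 1000)   # raises OverflowError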
> > 
> > I did some initial trials with such behaviour, though without
> > issuing transition warnings for the "weak" logic (I expect it
> > rarely changes the result), but issuing warnings where Point 1
> > probably makes a difference.
> > 
> > To my surprise, the SciPy test suite did not even notice!  The
> > pandas test suite runs into thousands of warnings (but few or no
> > errors), probably mostly due to tests that effectively check
> > ufuncs with:
> > 
> >     binary_ufunc(typed_arr, 3)
> > 
> > NumPy does that a lot in its own test suite as well.  Maybe we
> > can deal with it, or rethink rule 3.
> > 
> > Cheers,
> > 
> > Sebastian
> > 
> > 
> > 
> > [1] I have a conundrum. I don't really want to change things
> > right now, but I need to reimplement it.  Preserving value-based
> > logic seems tricky to do without introducing technical debt,
> > especially if we want to get rid of it later anyway...
> 


