From simonjayhawkins at gmail.com Tue Mar 2 12:15:15 2021
From: simonjayhawkins at gmail.com (Simon Hawkins)
Date: Tue, 2 Mar 2021 17:15:15 +0000
Subject: [Pandas-dev] ANN: Pandas 1.2.3 Released
Message-ID: 

Hi all,

I'm pleased to announce the release of pandas 1.2.3. This is a patch
release in the 1.2.x series and includes some regression fixes. We
recommend that all users upgrade to this version.

See the release notes for a list of all the changes.

The release can be installed from PyPI:

    python -m pip install --upgrade pandas==1.2.3

or from conda-forge:

    conda install -c conda-forge pandas==1.2.3

Please report any issues with the release on the pandas issue tracker.

Thanks to all the contributors who made this release possible.

From sebastian at sipsolutions.net Mon Mar 8 12:59:32 2021
From: sebastian at sipsolutions.net (Sebastian Berg)
Date: Mon, 08 Mar 2021 11:59:32 -0600
Subject: [Pandas-dev] Future of NumPy (and pandas) scalar "promotion" (and concatenate-dtype)
Message-ID: <35e2f6e7dde83f5d71a2d393fdd2b8644dd90de7.camel@sipsolutions.net>

Hi all,

Summary/Abstract: I am seriously exploring the idea of modifying the
NumPy promotion rules to drop the current value-based logic. This would
probably affect pandas in a similar way as it does NumPy, so I am
wondering what your opinion is on the "value-based" logic and the
potential "future" logic.
One of the most annoying things is likely the transition phase (see the
last part about the many warnings I see in the pandas test suite).

** Long Story: **

I am wondering about the future of type promotion in NumPy [1], but
this would probably affect pandas just as much.
The problem is what to do with things like:

    np.array([1, 2], dtype=np.uint8) + 1000

where the result is currently upcast to a `uint16`. The rules for this
are pretty arcane, however.

There are a few "worse" things that probably do not affect pandas as
much. That is, the above also happens in this case:

    np.array([1, 2], dtype=np.uint8) + np.int64(1000)

Even though int64 is explicitly typed, we just drop that information.
The weirdest things are probably around float precision:

    np.array([0.3], dtype=np.float32) == 0.3
    np.array([0.3], dtype=np.float32) == np.float64(0.3)

where the latter would probably go from `True` to `False` due to the
limited precision of float32. (At least unless we explicitly try to
counteract this for comparisons.)
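(Side note: the current behaviour above is easy to check yourself; a
small runnable snippet, with results from a NumPy 1.20-era build -- the
comparison result is exactly what would flip under the proposal:)

    import numpy as np

    arr = np.array([1, 2], dtype=np.uint8)

    # The Python int 1000 does not fit uint8, so the result upcasts:
    print((arr + 1000).dtype)            # uint16

    # The explicitly typed scalar is still treated by value; its
    # int64 dtype information is simply dropped:
    print((arr + np.int64(1000)).dtype)  # a 16-bit type, not int64

    # The 0.3 is demoted to float32 before comparing, so this is
    # currently True:
    print(np.array([0.3], dtype=np.float32) == 0.3)  # [ True]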
** Solution: **

The basic idea right now is the following:

1. All objects with NumPy dtypes use those strictly. Scalars or 0-D
   arrays will have no special handling.
2. Python integers, floats, and complex numbers are considered to have
   a special "weak" dtype. In the above example, `1000` or `0.3` would
   simply be force-cast to `uint8` or `float32`. (Potentially with a
   warning/error for integer rollover.)
3. The "additional" rule that all function calls use `np.asarray()`,
   which converts Python types. That is, `np.add(uint8_arr, 1000)`
   would return the same as `np.add(uint8_arr, np.array(1000))`, while
   `uint8_arr + 1000` would not!
   (I am not sure about this rule; it could be modified, but it seems
   easier to limit the "special behaviour" to Python operators.)

I did some initial trials with such behaviour, without issuing
transition warnings for the "weak" logic (although I expect it rarely
changes the result), but issuing warnings where Point 1 probably makes
a difference.

To my surprise, the SciPy test suite did not even notice! The pandas
test suite runs into thousands of warnings (but few or no errors),
probably mostly due to tests that effectively check ufuncs with:

    binary_ufunc(typed_arr, 3)

NumPy does that a lot in its test suite as well. Maybe we can deal
with it or rethink rule 3.

Cheers,

Sebastian

[1] I have a conundrum. I don't really want to change things right now,
but I need to reimplement it. Preserving value-based logic seems tricky
to do without introducing technical debt that we will just want to get
rid of later anyway...

From wesmckinn at gmail.com Mon Mar 8 13:47:26 2021
From: wesmckinn at gmail.com (Wes McKinney)
Date: Mon, 8 Mar 2021 12:47:26 -0600
Subject: [Pandas-dev] Future of NumPy (and pandas) scalar "promotion" (and concatenate-dtype)
In-Reply-To: <35e2f6e7dde83f5d71a2d393fdd2b8644dd90de7.camel@sipsolutions.net>
References: <35e2f6e7dde83f5d71a2d393fdd2b8644dd90de7.camel@sipsolutions.net>
Message-ID: 

hi Sebastian - at a glance this is a scary-looking change. Knowing the
relatively fast-and-loose ways that people have been using NumPy in
industry applications over the last 10+ years, the idea that `arr +
scalar` could cause data loss in "scalar" is pretty worrying. It would
be better to raise an exception than to generate a warning.

I feel like to really understand the impact of this change, you would
need to prepare a set of experimental NumPy wheels that you publish to
PyPI to allow downstream users to run their applications and see what
happens, and engage in outreach efforts to get them to actually do the
testing.

- Wes

On Mon, Mar 8, 2021 at 12:00 PM Sebastian Berg wrote:
>
> Hi all,
>
> Summary/Abstract: I am seriously exploring the idea of modifying the
> NumPy promotion rules to drop the current value-based logic. This
> would probably affect pandas in a similar way as it does NumPy, so I
> am wondering what your opinion is on the "value-based" logic and the
> potential "future" logic.
> One of the most annoying things is likely the transition phase (see
> the last part about the many warnings I see in the pandas test suite).
>
> ** Long Story: **
>
> I am wondering about the future of type promotion in NumPy [1], but
> this would probably affect pandas just as much.
> The problem is what to do with things like:
>
>     np.array([1, 2], dtype=np.uint8) + 1000
>
> where the result is currently upcast to a `uint16`. The rules for
> this are pretty arcane, however.
>
> There are a few "worse" things that probably do not affect pandas as
> much. That is, the above also happens in this case:
>
>     np.array([1, 2], dtype=np.uint8) + np.int64(1000)
>
> Even though int64 is explicitly typed, we just drop that information.
> The weirdest things are probably around float precision:
>
>     np.array([0.3], dtype=np.float32) == 0.3
>     np.array([0.3], dtype=np.float32) == np.float64(0.3)
>
> where the latter would probably go from `True` to `False` due to the
> limited precision of float32. (At least unless we explicitly try to
> counteract this for comparisons.)
>
> ** Solution: **
>
> The basic idea right now is the following:
>
> 1. All objects with NumPy dtypes use those strictly. Scalars or 0-D
>    arrays will have no special handling.
> 2. Python integers, floats, and complex numbers are considered to
>    have a special "weak" dtype.
>    In the above example, `1000` or `0.3` would simply be force-cast
>    to `uint8` or `float32`. (Potentially with a warning/error for
>    integer rollover.)
> 3. The "additional" rule that all function calls use `np.asarray()`,
>    which converts Python types. That is, `np.add(uint8_arr, 1000)`
>    would return the same as `np.add(uint8_arr, np.array(1000))`,
>    while `uint8_arr + 1000` would not!
>    (I am not sure about this rule; it could be modified, but it seems
>    easier to limit the "special behaviour" to Python operators.)
>
> I did some initial trials with such behaviour, without issuing
> transition warnings for the "weak" logic (although I expect it rarely
> changes the result), but issuing warnings where Point 1 probably
> makes a difference.
>
> To my surprise, the SciPy test suite did not even notice! The pandas
> test suite runs into thousands of warnings (but few or no errors),
> probably mostly due to tests that effectively check ufuncs with:
>
>     binary_ufunc(typed_arr, 3)
>
> NumPy does that a lot in its test suite as well. Maybe we can deal
> with it or rethink rule 3.
>
> Cheers,
>
> Sebastian
>
> [1] I have a conundrum. I don't really want to change things right
> now, but I need to reimplement it. Preserving value-based logic seems
> tricky to do without introducing technical debt that we will just
> want to get rid of later anyway...
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev

From sebastian at sipsolutions.net Tue Mar 9 10:40:36 2021
From: sebastian at sipsolutions.net (Sebastian Berg)
Date: Tue, 09 Mar 2021 09:40:36 -0600
Subject: [Pandas-dev] Future of NumPy (and pandas) scalar "promotion" (and concatenate-dtype)
In-Reply-To: (sfid-20210308_194804_686894_1D1C5F59)
References: <35e2f6e7dde83f5d71a2d393fdd2b8644dd90de7.camel@sipsolutions.net> (sfid-20210308_194804_686894_1D1C5F59)
Message-ID: 

On Mon, 2021-03-08 at 12:47 -0600, Wes McKinney wrote:
> hi Sebastian - at a glance this is a scary-looking change. Knowing
> the relatively fast-and-loose ways that people have been using NumPy
> in industry applications over the last 10+ years, the idea that `arr
> + scalar` could cause data loss in "scalar" is pretty worrying. It
> would be better to raise an exception than to generate a warning.

Well, some notes:

1. Obviously there would be transition warnings. Honestly, I am a bit
   worried that the transition warnings would be far more annoying
   than the change itself.

2. Yes, errors or at least warnings on unsafe conversion are better;
   mostly we just currently don't have them... So for me, ensuring
   errors (or maybe just warnings) seems required (when the final
   transition happens). We may also need something like this:

       np.uint8(value, safe=True)

   to be able to opt in to the future behaviour safely. That might be
   annoying, but it's not a serious blocker or particularly complex.

3. The current situation is already ridiculously unsafe, since
   integers tend to roll over left and right. Try guessing the results
   for these (a snippet after this list shows the mechanism behind
   them):

       np.array([100], dtype="uint8") + 200
       np.array(100, dtype="uint8") + 200
       np.array([100], dtype="uint8") + 300
       np.array([100], dtype="uint8") + np.array(200, dtype="int64")
       np.array(100, dtype="uint8") + np.array(200, dtype="int64")

       np.array([100], dtype="uint8") - 200
       np.array([100], dtype="uint8") + -200

   They are (ignoring shape):

       44 (uint8), 300, 400 (uint16), 44 (uint8), 300,
       156 (uint8), -100 (int16)

4. In the `weak->strong` transition the resulting dtype will always
   have higher precision, which is less likely to cause trouble (but
   more likely to give spurious warnings). The typical worst case is
   probably memory bloat.

5. For floats, the situation seems much less dramatic; reduced
   precision due to this change should almost never happen (or will
   give an overflow warning). Of course, `float32 + large_integer`
   might occasionally have upcast to 64-bit previously...

6. By the way: the change would largely revert back to the behaviour
   of NumPy <1.6! So if the code is 10 years old it might suddenly
   work again. (I expect ancient NumPy used "weak" logic even for 0-D
   arrays, so it was much worse.)
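(As promised above: most of the surprises in point 3 fall directly out
of NumPy choosing a minimal dtype for each scalar/0-D value; you can
probe that mechanism on current NumPy 1.x like this:)

    import numpy as np

    # The minimal dtype is chosen from the *value* (and its sign),
    # ignoring the scalar's actual dtype:
    print(np.min_scalar_type(200))   # uint8
    print(np.min_scalar_type(300))   # uint16
    print(np.min_scalar_type(-200))  # int16

    # Promotion then combines that with the array's dtype:
    print(np.result_type(np.uint8, 200))   # uint8  -> 100 + 200 rolls over to 44
    print(np.result_type(np.uint8, 300))   # uint16 -> 100 + 300 == 400
    print(np.result_type(np.uint8, -200))  # int16  -> 100 + -200 == -100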
> I feel like to really understand the impact of this change, you
> would need to prepare a set of experimental NumPy wheels that you
> publish to PyPI to allow downstream users to run their applications
> and see what happens, and engage in outreach efforts to get them to
> actually do the testing.

Right now, I wanted to prod and see whether pandas devs think that
this seems like the right direction, and one that they are willing to
work towards.

I think in NumPy there is a consensus that value-based logic is very
naughty, and some loose consensus that the proposal I posted is the
most promising angle for fixing it (maybe quite loose, but I don't
expect more insight from NumPy-discussions at this time).

Of course I can't be 100% sure that this will pan out, but I can spend
my remaining sanity on other things if it becomes obvious that there
is serious resistance... This is a side battle for me. But the point
is that doing it now may be a unique chance, because if we shelve it
now it will become even harder to change. And that probably means
shelving it again for another decade or longer.

Cheers,

Sebastian

> - Wes
>
> On Mon, Mar 8, 2021 at 12:00 PM Sebastian Berg wrote:
> >
> > Hi all,
> >
> > Summary/Abstract: I am seriously exploring the idea of modifying
> > the NumPy promotion rules to drop the current value-based logic.
> > This would probably affect pandas in a similar way as it does
> > NumPy, so I am wondering what your opinion is on the "value-based"
> > logic and the potential "future" logic.
> > One of the most annoying things is likely the transition phase
> > (see the last part about the many warnings I see in the pandas
> > test suite).
> >
> > ** Long Story: **
> >
> > I am wondering about the future of type promotion in NumPy [1],
> > but this would probably affect pandas just as much.
> > The problem is what to do with things like:
> >
> >     np.array([1, 2], dtype=np.uint8) + 1000
> >
> > where the result is currently upcast to a `uint16`. The rules for
> > this are pretty arcane, however.
> >
> > There are a few "worse" things that probably do not affect pandas
> > as much. That is, the above also happens in this case:
> >
> >     np.array([1, 2], dtype=np.uint8) + np.int64(1000)
> >
> > Even though int64 is explicitly typed, we just drop that
> > information. The weirdest things are probably around float
> > precision:
> >
> >     np.array([0.3], dtype=np.float32) == 0.3
> >     np.array([0.3], dtype=np.float32) == np.float64(0.3)
> >
> > where the latter would probably go from `True` to `False` due to
> > the limited precision of float32. (At least unless we explicitly
> > try to counteract this for comparisons.)
> >
> > ** Solution: **
> >
> > The basic idea right now is the following:
> >
> > 1. All objects with NumPy dtypes use those strictly.
> >    Scalars or 0-D arrays will have no special handling.
> > 2. Python integers, floats, and complex numbers are considered to
> >    have a special "weak" dtype. In the above example, `1000` or
> >    `0.3` would simply be force-cast to `uint8` or `float32`.
> >    (Potentially with a warning/error for integer rollover.)
> > 3. The "additional" rule that all function calls use
> >    `np.asarray()`, which converts Python types. That is,
> >    `np.add(uint8_arr, 1000)` would return the same as
> >    `np.add(uint8_arr, np.array(1000))`, while `uint8_arr + 1000`
> >    would not!
> >    (I am not sure about this rule; it could be modified, but it
> >    seems easier to limit the "special behaviour" to Python
> >    operators.)
> >
> > I did some initial trials with such behaviour, without issuing
> > transition warnings for the "weak" logic (although I expect it
> > rarely changes the result), but issuing warnings where Point 1
> > probably makes a difference.
> >
> > To my surprise, the SciPy test suite did not even notice! The
> > pandas test suite runs into thousands of warnings (but few or no
> > errors), probably mostly due to tests that effectively check
> > ufuncs with:
> >
> >     binary_ufunc(typed_arr, 3)
> >
> > NumPy does that a lot in its test suite as well. Maybe we can deal
> > with it or rethink rule 3.
> >
> > Cheers,
> >
> > Sebastian
> >
> > [1] I have a conundrum. I don't really want to change things right
> > now, but I need to reimplement it. Preserving value-based logic
> > seems tricky to do without introducing technical debt that we will
> > just want to get rid of later anyway...
> > _______________________________________________
> > Pandas-dev mailing list
> > Pandas-dev at python.org
> > https://mail.python.org/mailman/listinfo/pandas-dev

From wesmckinn at gmail.com Tue Mar 9 13:16:11 2021
From: wesmckinn at gmail.com (Wes McKinney)
Date: Tue, 9 Mar 2021 12:16:11 -0600
Subject: [Pandas-dev] Future of NumPy (and pandas) scalar "promotion" (and concatenate-dtype)
In-Reply-To: 
References: <35e2f6e7dde83f5d71a2d393fdd2b8644dd90de7.camel@sipsolutions.net> 
Message-ID: 

I see your points - FWIW I think that if people are using small
integers (rather than using int_ or int64 for everything), they have
some responsibility to mind these issues. I'm supportive of you trying
to fix it, with the warning that I think you should engage in extra
efforts to try to obtain feedback before it lands in "pip install
numpy".

On Tue, Mar 9, 2021 at 9:41 AM Sebastian Berg wrote:
>
> On Mon, 2021-03-08 at 12:47 -0600, Wes McKinney wrote:
> > hi Sebastian - at a glance this is a scary-looking change. Knowing
> > the relatively fast-and-loose ways that people have been using
> > NumPy in industry applications over the last 10+ years, the idea
> > that `arr + scalar` could cause data loss in "scalar" is pretty
> > worrying. It would be better to raise an exception than to generate
> > a warning.
>
> Well, some notes:
>
> 1. Obviously there would be transition warnings. Honestly, I am a bit
>    worried that the transition warnings would be far more annoying
>    than the change itself.
>
> 2. Yes, errors or at least warnings on unsafe conversion are better;
>    mostly we just currently don't have them...
>    So for me, ensuring errors (or maybe just warnings) seems required
>    (when the final transition happens). We may also need something
>    like this:
>
>        np.uint8(value, safe=True)
>
>    to be able to opt in to the future behaviour safely. That might be
>    annoying, but it's not a serious blocker or particularly complex.
>
> 3. The current situation is already ridiculously unsafe, since
>    integers tend to roll over left and right. Try guessing the
>    results for these:
>
>        np.array([100], dtype="uint8") + 200
>        np.array(100, dtype="uint8") + 200
>        np.array([100], dtype="uint8") + 300
>        np.array([100], dtype="uint8") + np.array(200, dtype="int64")
>        np.array(100, dtype="uint8") + np.array(200, dtype="int64")
>
>        np.array([100], dtype="uint8") - 200
>        np.array([100], dtype="uint8") + -200
>
>    They are (ignoring shape):
>
>        44 (uint8), 300, 400 (uint16), 44 (uint8), 300,
>        156 (uint8), -100 (int16)
>
> 4. In the `weak->strong` transition the resulting dtype will always
>    have higher precision, which is less likely to cause trouble (but
>    more likely to give spurious warnings). The typical worst case is
>    probably memory bloat.
>
> 5. For floats, the situation seems much less dramatic; reduced
>    precision due to this change should almost never happen (or will
>    give an overflow warning). Of course, `float32 + large_integer`
>    might occasionally have upcast to 64-bit previously...
>
> 6. By the way: the change would largely revert back to the behaviour
>    of NumPy <1.6! So if the code is 10 years old it might suddenly
>    work again. (I expect ancient NumPy used "weak" logic even for 0-D
>    arrays, so it was much worse.)
>
> > I feel like to really understand the impact of this change, you
> > would need to prepare a set of experimental NumPy wheels that you
> > publish to PyPI to allow downstream users to run their applications
> > and see what happens, and engage in outreach efforts to get them to
> > actually do the testing.
>
> Right now, I wanted to prod and see whether pandas devs think that
> this seems like the right direction, and one that they are willing to
> work towards.
>
> I think in NumPy there is a consensus that value-based logic is very
> naughty, and some loose consensus that the proposal I posted is the
> most promising angle for fixing it (maybe quite loose, but I don't
> expect more insight from NumPy-discussions at this time).
>
> Of course I can't be 100% sure that this will pan out, but I can
> spend my remaining sanity on other things if it becomes obvious that
> there is serious resistance... This is a side battle for me. But the
> point is that doing it now may be a unique chance, because if we
> shelve it now it will become even harder to change. And that probably
> means shelving it again for another decade or longer.
>
> Cheers,
>
> Sebastian
>
> > - Wes
> >
> > On Mon, Mar 8, 2021 at 12:00 PM Sebastian Berg wrote:
> > >
> > > Hi all,
> > >
> > > Summary/Abstract: I am seriously exploring the idea of modifying
> > > the NumPy promotion rules to drop the current value-based logic.
> > > This would probably affect pandas in a similar way as it does
> > > NumPy, so I am wondering what your opinion is on the
> > > "value-based" logic and the potential "future" logic.
> > > One of the most annoying things is likely the transition phase
> > > (see the last part about the many warnings I see in the pandas
> > > test suite).
> > >
> > > ** Long Story: **
> > >
> > > I am wondering about the future of type promotion in NumPy [1],
> > > but this would probably affect pandas just as much.
> > > The problem is what to do with things like:
> > >
> > >     np.array([1, 2], dtype=np.uint8) + 1000
> > >
> > > where the result is currently upcast to a `uint16`. The rules for
> > > this are pretty arcane, however.
> > >
> > > There are a few "worse" things that probably do not affect pandas
> > > as much. That is, the above also happens in this case:
> > >
> > >     np.array([1, 2], dtype=np.uint8) + np.int64(1000)
> > >
> > > Even though int64 is explicitly typed, we just drop that
> > > information. The weirdest things are probably around float
> > > precision:
> > >
> > >     np.array([0.3], dtype=np.float32) == 0.3
> > >     np.array([0.3], dtype=np.float32) == np.float64(0.3)
> > >
> > > where the latter would probably go from `True` to `False` due to
> > > the limited precision of float32. (At least unless we explicitly
> > > try to counteract this for comparisons.)
> > >
> > > ** Solution: **
> > >
> > > The basic idea right now is the following:
> > >
> > > 1. All objects with NumPy dtypes use those strictly. Scalars or
> > >    0-D arrays will have no special handling.
> > > 2. Python integers, floats, and complex numbers are considered to
> > >    have a special "weak" dtype. In the above example, `1000` or
> > >    `0.3` would simply be force-cast to `uint8` or `float32`.
> > >    (Potentially with a warning/error for integer rollover.)
> > > 3. The "additional" rule that all function calls use
> > >    `np.asarray()`, which converts Python types. That is,
> > >    `np.add(uint8_arr, 1000)` would return the same as
> > >    `np.add(uint8_arr, np.array(1000))`, while `uint8_arr + 1000`
> > >    would not!
> > >    (I am not sure about this rule; it could be modified, but it
> > >    seems easier to limit the "special behaviour" to Python
> > >    operators.)
> > >
> > > I did some initial trials with such behaviour, without issuing
> > > transition warnings for the "weak" logic (although I expect it
> > > rarely changes the result), but issuing warnings where Point 1
> > > probably makes a difference.
> > >
> > > To my surprise, the SciPy test suite did not even notice! The
> > > pandas test suite runs into thousands of warnings (but few or no
> > > errors), probably mostly due to tests that effectively check
> > > ufuncs with:
> > >
> > >     binary_ufunc(typed_arr, 3)
> > >
> > > NumPy does that a lot in its test suite as well. Maybe we can
> > > deal with it or rethink rule 3.
> > >
> > > Cheers,
> > >
> > > Sebastian
> > >
> > > [1] I have a conundrum. I don't really want to change things
> > > right now, but I need to reimplement it. Preserving value-based
> > > logic seems tricky to do without introducing technical debt that
> > > we will just want to get rid of later anyway...
> > > _______________________________________________
> > > Pandas-dev mailing list
> > > Pandas-dev at python.org
> > > https://mail.python.org/mailman/listinfo/pandas-dev
>
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev

From jorisvandenbossche at gmail.com Tue Mar 9 13:27:39 2021
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Tue, 9 Mar 2021 19:27:39 +0100
Subject: [Pandas-dev] Future of NumPy (and pandas) scalar "promotion" (and concatenate-dtype)
In-Reply-To: 
References: <35e2f6e7dde83f5d71a2d393fdd2b8644dd90de7.camel@sipsolutions.net> 
Message-ID: 

I personally fully support trying to drop any value-based logic. In
pandas, too, we have some additional (custom to pandas) value-based
logic, mainly in concat operations, that we are also trying to move
away from.

It will mean some behaviour changes, but as long as there is a
transition period with warnings when there would be data loss (or when
it would result in an error), that seems acceptable to me. Having
consistent rules in the long term will be really beneficial.

What's not fully clear to me is what the exact behaviour of this
"weak" dtype for Python numbers is. Would it always use the dtype of
the other typed operand?

Joris

On Tue, 9 Mar 2021 at 16:40, Sebastian Berg wrote:
>
> On Mon, 2021-03-08 at 12:47 -0600, Wes McKinney wrote:
> > hi Sebastian - at a glance this is a scary-looking change. Knowing
> > the relatively fast-and-loose ways that people have been using
> > NumPy in industry applications over the last 10+ years, the idea
> > that `arr + scalar` could cause data loss in "scalar" is pretty
> > worrying. It would be better to raise an exception than to generate
> > a warning.
>
> Well, some notes:
>
> 1. Obviously there would be transition warnings. Honestly, I am a bit
>    worried that the transition warnings would be far more annoying
>    than the change itself.
>
> 2. Yes, errors or at least warnings on unsafe conversion are better;
>    mostly we just currently don't have them... So for me, ensuring
>    errors (or maybe just warnings) seems required (when the final
>    transition happens). We may also need something like this:
>
>        np.uint8(value, safe=True)
>
>    to be able to opt in to the future behaviour safely. That might be
>    annoying, but it's not a serious blocker or particularly complex.
>
> 3. The current situation is already ridiculously unsafe, since
>    integers tend to roll over left and right. Try guessing the
>    results for these:
>
>        np.array([100], dtype="uint8") + 200
>        np.array(100, dtype="uint8") + 200
>        np.array([100], dtype="uint8") + 300
>        np.array([100], dtype="uint8") + np.array(200, dtype="int64")
>        np.array(100, dtype="uint8") + np.array(200, dtype="int64")
>
>        np.array([100], dtype="uint8") - 200
>        np.array([100], dtype="uint8") + -200
>
>    They are (ignoring shape):
>
>        44 (uint8), 300, 400 (uint16), 44 (uint8), 300,
>        156 (uint8), -100 (int16)
>
> 4. In the `weak->strong` transition the resulting dtype will always
>    have higher precision, which is less likely to cause trouble (but
>    more likely to give spurious warnings). The typical worst case is
>    probably memory bloat.
>
> 5. For floats, the situation seems much less dramatic; reduced
>    precision due to this change should almost never happen (or will
>    give an overflow warning). Of course, `float32 + large_integer`
>    might occasionally have upcast to 64-bit previously...
> 6. By the way: the change would largely revert back to the behaviour
>    of NumPy <1.6! So if the code is 10 years old it might suddenly
>    work again. (I expect ancient NumPy used "weak" logic even for 0-D
>    arrays, so it was much worse.)
>
> > I feel like to really understand the impact of this change, you
> > would need to prepare a set of experimental NumPy wheels that you
> > publish to PyPI to allow downstream users to run their applications
> > and see what happens, and engage in outreach efforts to get them to
> > actually do the testing.
>
> Right now, I wanted to prod and see whether pandas devs think that
> this seems like the right direction, and one that they are willing to
> work towards.
>
> I think in NumPy there is a consensus that value-based logic is very
> naughty, and some loose consensus that the proposal I posted is the
> most promising angle for fixing it (maybe quite loose, but I don't
> expect more insight from NumPy-discussions at this time).
>
> Of course I can't be 100% sure that this will pan out, but I can
> spend my remaining sanity on other things if it becomes obvious that
> there is serious resistance... This is a side battle for me. But the
> point is that doing it now may be a unique chance, because if we
> shelve it now it will become even harder to change. And that probably
> means shelving it again for another decade or longer.
>
> Cheers,
>
> Sebastian
>
> > - Wes
> >
> > On Mon, Mar 8, 2021 at 12:00 PM Sebastian Berg wrote:
> > >
> > > Hi all,
> > >
> > > Summary/Abstract: I am seriously exploring the idea of modifying
> > > the NumPy promotion rules to drop the current value-based logic.
> > > This would probably affect pandas in a similar way as it does
> > > NumPy, so I am wondering what your opinion is on the
> > > "value-based" logic and the potential "future" logic.
> > > One of the most annoying things is likely the transition phase
> > > (see the last part about the many warnings I see in the pandas
> > > test suite).
> > >
> > > ** Long Story: **
> > >
> > > I am wondering about the future of type promotion in NumPy [1],
> > > but this would probably affect pandas just as much.
> > > The problem is what to do with things like:
> > >
> > >     np.array([1, 2], dtype=np.uint8) + 1000
> > >
> > > where the result is currently upcast to a `uint16`. The rules for
> > > this are pretty arcane, however.
> > >
> > > There are a few "worse" things that probably do not affect pandas
> > > as much. That is, the above also happens in this case:
> > >
> > >     np.array([1, 2], dtype=np.uint8) + np.int64(1000)
> > >
> > > Even though int64 is explicitly typed, we just drop that
> > > information. The weirdest things are probably around float
> > > precision:
> > >
> > >     np.array([0.3], dtype=np.float32) == 0.3
> > >     np.array([0.3], dtype=np.float32) == np.float64(0.3)
> > >
> > > where the latter would probably go from `True` to `False` due to
> > > the limited precision of float32. (At least unless we explicitly
> > > try to counteract this for comparisons.)
> > >
> > > ** Solution: **
> > >
> > > The basic idea right now is the following:
> > >
> > > 1. All objects with NumPy dtypes use those strictly. Scalars or
> > >    0-D arrays will have no special handling.
> > > 2. Python integers, floats, and complex numbers are considered to
> > >    have a special "weak" dtype. In the above example, `1000` or
> > >    `0.3` would simply be force-cast to `uint8` or `float32`.
> > >    (Potentially with a warning/error for integer rollover.)
> > > 3. The "additional" rule that all function calls use
> > >    `np.asarray()`, which converts Python types. That is,
> > >    `np.add(uint8_arr, 1000)` would return the same as
> > >    `np.add(uint8_arr, np.array(1000))`, while `uint8_arr + 1000`
> > >    would not!
> > >    (I am not sure about this rule; it could be modified, but it
> > >    seems easier to limit the "special behaviour" to Python
> > >    operators.)
> > >
> > > I did some initial trials with such behaviour, without issuing
> > > transition warnings for the "weak" logic (although I expect it
> > > rarely changes the result), but issuing warnings where Point 1
> > > probably makes a difference.
> > >
> > > To my surprise, the SciPy test suite did not even notice! The
> > > pandas test suite runs into thousands of warnings (but few or no
> > > errors), probably mostly due to tests that effectively check
> > > ufuncs with:
> > >
> > >     binary_ufunc(typed_arr, 3)
> > >
> > > NumPy does that a lot in its test suite as well. Maybe we can
> > > deal with it or rethink rule 3.
> > >
> > > Cheers,
> > >
> > > Sebastian
> > >
> > > [1] I have a conundrum. I don't really want to change things
> > > right now, but I need to reimplement it. Preserving value-based
> > > logic seems tricky to do without introducing technical debt that
> > > we will just want to get rid of later anyway...
> > > _______________________________________________
> > > Pandas-dev mailing list
> > > Pandas-dev at python.org
> > > https://mail.python.org/mailman/listinfo/pandas-dev
>
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev

From sebastian at sipsolutions.net Tue Mar 9 14:10:12 2021
From: sebastian at sipsolutions.net (Sebastian Berg)
Date: Tue, 09 Mar 2021 13:10:12 -0600
Subject: [Pandas-dev] Future of NumPy (and pandas) scalar "promotion" (and concatenate-dtype)
In-Reply-To: (sfid-20210309_192753_127578_57BFC3E5)
References: <35e2f6e7dde83f5d71a2d393fdd2b8644dd90de7.camel@sipsolutions.net> (sfid-20210309_192753_127578_57BFC3E5)
Message-ID: 

On Tue, 2021-03-09 at 19:27 +0100, Joris Van den Bossche wrote:
> I personally fully support trying to drop any value-based logic. In
> pandas, too, we have some additional (custom to pandas) value-based
> logic, mainly in concat operations, that we are also trying to move
> away from.
>
> It will mean some behaviour changes, but as long as there is a
> transition period with warnings when there would be data loss (or
> when it would result in an error), that seems acceptable to me.
> Having consistent rules in the long term will be really beneficial.
>
> What's not fully clear to me is what the exact behaviour of this
> "weak" dtype for Python numbers is. Would it always use the dtype of
> the other typed operand?

The "weak DType" is the abstract DType, like `Integer` or `Floating`
(we don't have those right now). That means `int8 + 1.0 -> float64`
(see details below)! The term "weak" is something I borrowed from JAX,
which uses this type of promotion, at least roughly (I think there are
probably some subtle differences).

My reasoning for this "weak logic" is that I expect code like:

    res = int8_arr + 2     # probably expects an int8 output?
    res = float32_arr * 2  # clearly should remain float32

might exist quite often, so using it will allow this as a convenience
and, in the vast majority of cases, retain its behaviour faithfully.

The reason for suggesting to (mainly) limit it to Python operators is
an attempt at a fairly clear rule. There could be exceptions, and
frankly, there will be accidental ones. But the thought is that for
most functions, the user should expect an implicit `np.asarray()` call
on all inputs. And `np.asarray()` will not allow "weak logic" to pass.
(But you could get it by calling e.g. `np.result_type` explicitly.)

** DETAILS: **

The other dtype is prioritized, so to speak. But it has to decide what
to do! `Integer + 1.0` still needs to go to some floating point value
after all. For NumPy dtypes this means:

    uint8 + 1    -> uint8
    uint8 + 1.0  -> float64     (default precision)
    uint8 + 1j   -> complex128  (default precision)
    # same for all integers
    float32 + 1  -> float32
    float32 + 1. -> float32
    float32 + 1j -> complex64   (retains 32-bit precision)
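(To spell the table out: a rough sketch in plain Python that just
encodes the cases above; `proposed_result_dtype` is a hypothetical
helper for illustration, not an actual NumPy API:)

    import numpy as np

    def proposed_result_dtype(arr_dtype, py_scalar):
        # Hypothetical sketch of the proposed "weak scalar" rule;
        # only the cases from the table above are encoded.
        arr_dtype = np.dtype(arr_dtype)
        if isinstance(py_scalar, int):        # weak integer
            return arr_dtype                  # uint8 + 1 -> uint8
        if isinstance(py_scalar, float):      # weak float
            if arr_dtype.kind in "fc":
                return arr_dtype              # float32 + 1. -> float32
            return np.dtype(np.float64)       # uint8 + 1.0 -> float64
        if isinstance(py_scalar, complex):    # weak complex
            if arr_dtype.kind == "c":
                return arr_dtype
            if arr_dtype == np.dtype(np.float32):
                return np.dtype(np.complex64)   # float32 + 1j -> complex64
            return np.dtype(np.complex128)      # uint8 + 1j -> complex128
        raise TypeError("not a weak Python scalar")

    assert proposed_result_dtype(np.uint8, 1) == np.uint8
    assert proposed_result_dtype(np.uint8, 1.0) == np.float64
    assert proposed_result_dtype(np.float32, 1j) == np.complex64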
In general, the other DType will have to decide what to do by
implementing the correct promotion/common (concatenate) DType rules.
My aim is that through "common DType" + casting rules, we can also
allow code like this (which would use the identical logic):

    uint8.astype(AbstractFloating) -> float64

So, it isn't quite true that we blindly convert ahead of time. That
would not always be correct; for example, `timedelta64 / 2` cannot
convert the `2` the same way `timedelta64 + 2` might attempt to do it.

Cheers,

Sebastian

> Joris
>
> On Tue, 9 Mar 2021 at 16:40, Sebastian Berg
> <sebastian at sipsolutions.net> wrote:
> >
> > On Mon, 2021-03-08 at 12:47 -0600, Wes McKinney wrote:
> > > hi Sebastian - at a glance this is a scary-looking change.
> > > Knowing the relatively fast-and-loose ways that people have been
> > > using NumPy in industry applications over the last 10+ years, the
> > > idea that `arr + scalar` could cause data loss in "scalar" is
> > > pretty worrying. It would be better to raise an exception than to
> > > generate a warning.
> >
> > Well, some notes:
> >
> > 1. Obviously there would be transition warnings. Honestly, I am a
> >    bit worried that the transition warnings would be far more
> >    annoying than the change itself.
> >
> > 2. Yes, errors or at least warnings on unsafe conversion are
> >    better; mostly we just currently don't have them... So for me,
> >    ensuring errors (or maybe just warnings) seems required (when
> >    the final transition happens). We may also need something like
> >    this:
> >
> >        np.uint8(value, safe=True)
> >
> >    to be able to opt in to the future behaviour safely. That might
> >    be annoying, but it's not a serious blocker or particularly
> >    complex.
> >
> > 3. The current situation is already ridiculously unsafe, since
> >    integers tend to roll over left and right. Try guessing the
> >    results for these:
> >
> >        np.array([100], dtype="uint8") + 200
> >        np.array(100, dtype="uint8") + 200
> >        np.array([100], dtype="uint8") + 300
> >        np.array([100], dtype="uint8") + np.array(200, dtype="int64")
> >        np.array(100, dtype="uint8") + np.array(200, dtype="int64")
> >
> >        np.array([100], dtype="uint8") - 200
> >        np.array([100], dtype="uint8") + -200
> >
> >    They are (ignoring shape):
> >
> >        44 (uint8), 300, 400 (uint16), 44 (uint8), 300,
> >        156 (uint8), -100 (int16)
> > 4. In the `weak->strong` transition the resulting dtype will always
> >    have higher precision, which is less likely to cause trouble
> >    (but more likely to give spurious warnings). The typical worst
> >    case is probably memory bloat.
> >
> > 5. For floats, the situation seems much less dramatic; reduced
> >    precision due to this change should almost never happen (or will
> >    give an overflow warning). Of course, `float32 + large_integer`
> >    might occasionally have upcast to 64-bit previously...
> >
> > 6. By the way: the change would largely revert back to the
> >    behaviour of NumPy <1.6! So if the code is 10 years old it might
> >    suddenly work again. (I expect ancient NumPy used "weak" logic
> >    even for 0-D arrays, so it was much worse.)
> >
> > > I feel like to really understand the impact of this change, you
> > > would need to prepare a set of experimental NumPy wheels that you
> > > publish to PyPI to allow downstream users to run their
> > > applications and see what happens, and engage in outreach efforts
> > > to get them to actually do the testing.
> >
> > Right now, I wanted to prod and see whether pandas devs think that
> > this seems like the right direction, and one that they are willing
> > to work towards.
> >
> > I think in NumPy there is a consensus that value-based logic is
> > very naughty, and some loose consensus that the proposal I posted
> > is the most promising angle for fixing it (maybe quite loose, but I
> > don't expect more insight from NumPy-discussions at this time).
> >
> > Of course I can't be 100% sure that this will pan out, but I can
> > spend my remaining sanity on other things if it becomes obvious
> > that there is serious resistance... This is a side battle for me.
> > But the point is that doing it now may be a unique chance, because
> > if we shelve it now it will become even harder to change. And that
> > probably means shelving it again for another decade or longer.
> >
> > Cheers,
> >
> > Sebastian
> >
> > > - Wes
> > >
> > > On Mon, Mar 8, 2021 at 12:00 PM Sebastian Berg wrote:
> > > >
> > > > Hi all,
> > > >
> > > > Summary/Abstract: I am seriously exploring the idea of
> > > > modifying the NumPy promotion rules to drop the current
> > > > value-based logic. This would probably affect pandas in a
> > > > similar way as it does NumPy, so I am wondering what your
> > > > opinion is on the "value-based" logic and the potential
> > > > "future" logic.
> > > > One of the most annoying things is likely the transition phase
> > > > (see the last part about the many warnings I see in the pandas
> > > > test suite).
> > > >
> > > > ** Long Story: **
> > > >
> > > > I am wondering about the future of type promotion in NumPy [1],
> > > > but this would probably affect pandas just as much.
> > > > The problem is what to do with things like:
> > > >
> > > >     np.array([1, 2], dtype=np.uint8) + 1000
> > > >
> > > > where the result is currently upcast to a `uint16`. The rules
> > > > for this are pretty arcane, however.
> > > >
> > > > There are a few "worse" things that probably do not affect
> > > > pandas as much. That is, the above also happens in this case:
> > > >
> > > >     np.array([1, 2], dtype=np.uint8) + np.int64(1000)
> > > >
> > > > Even though int64 is explicitly typed, we just drop that
> > > > information.
> > > > The weirdest things are probably around float precision:
> > > >
> > > >     np.array([0.3], dtype=np.float32) == 0.3
> > > >     np.array([0.3], dtype=np.float32) == np.float64(0.3)
> > > >
> > > > where the latter would probably go from `True` to `False` due
> > > > to the limited precision of float32. (At least unless we
> > > > explicitly try to counteract this for comparisons.)
> > > >
> > > > ** Solution: **
> > > >
> > > > The basic idea right now is the following:
> > > >
> > > > 1. All objects with NumPy dtypes use those strictly. Scalars or
> > > >    0-D arrays will have no special handling.
> > > > 2. Python integers, floats, and complex numbers are considered
> > > >    to have a special "weak" dtype. In the above example, `1000`
> > > >    or `0.3` would simply be force-cast to `uint8` or `float32`.
> > > >    (Potentially with a warning/error for integer rollover.)
> > > > 3. The "additional" rule that all function calls use
> > > >    `np.asarray()`, which converts Python types. That is,
> > > >    `np.add(uint8_arr, 1000)` would return the same as
> > > >    `np.add(uint8_arr, np.array(1000))`, while
> > > >    `uint8_arr + 1000` would not!
> > > >    (I am not sure about this rule; it could be modified, but it
> > > >    seems easier to limit the "special behaviour" to Python
> > > >    operators.)
> > > >
> > > > I did some initial trials with such behaviour, without issuing
> > > > transition warnings for the "weak" logic (although I expect it
> > > > rarely changes the result), but issuing warnings where Point 1
> > > > probably makes a difference.
> > > >
> > > > To my surprise, the SciPy test suite did not even notice! The
> > > > pandas test suite runs into thousands of warnings (but few or
> > > > no errors), probably mostly due to tests that effectively check
> > > > ufuncs with:
> > > >
> > > >     binary_ufunc(typed_arr, 3)
> > > >
> > > > NumPy does that a lot in its test suite as well. Maybe we can
> > > > deal with it or rethink rule 3.
> > > >
> > > > Cheers,
> > > >
> > > > Sebastian
> > > >
> > > > [1] I have a conundrum. I don't really want to change things
> > > > right now, but I need to reimplement it. Preserving value-based
> > > > logic seems tricky to do without introducing technical debt
> > > > that we will just want to get rid of later anyway...
> > > > _______________________________________________
> > > > Pandas-dev mailing list
> > > > Pandas-dev at python.org
> > > > https://mail.python.org/mailman/listinfo/pandas-dev
> >
> > _______________________________________________
> > Pandas-dev mailing list
> > Pandas-dev at python.org
> > https://mail.python.org/mailman/listinfo/pandas-dev

From jorisvandenbossche at gmail.com Wed Mar 10 10:07:25 2021
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Wed, 10 Mar 2021 16:07:25 +0100
Subject: [Pandas-dev] March 2021 monthly community meeting (today, March 10, UTC 18:00)
Message-ID: 

Hi all,

Late notice, but the next monthly dev call is in a few hours (today,
March 10th) at 18:00 UTC (12 pm Central).
Our calendar is at
https://pandas.pydata.org/docs/development/meeting.html#calendar;
check it for your local time. All are welcome to attend!

Video Call:
https://us02web.zoom.us/j/81542460994?pwd=NktGRGd4aVNYeDZhUi96cVZDeTZmdz09

Minutes:
https://docs.google.com/document/u/1/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit?ouid=102771015311436394588&usp=docs_home&ths=true

Joris

From sebastian at sipsolutions.net Wed Mar 10 12:42:36 2021
From: sebastian at sipsolutions.net (Sebastian Berg)
Date: Wed, 10 Mar 2021 11:42:36 -0600
Subject: [Pandas-dev] Future of NumPy (and pandas) scalar "promotion" (and concatenate-dtype)
In-Reply-To: (sfid-20210310_105259_207187_1212CE22)
References: <35e2f6e7dde83f5d71a2d393fdd2b8644dd90de7.camel@sipsolutions.net> (sfid-20210310_105259_207187_1212CE22)
Message-ID: <698d946a5681dbc19a1742d7e281dccdf971ca7e.camel@sipsolutions.net>

On Wed, 2021-03-10 at 10:52 +0100, Ralf Gommers wrote:
> On Tue, Mar 9, 2021 at 8:10 PM Sebastian Berg
> <sebastian at sipsolutions.net> wrote:
>
> This all sounds quite good and desirable to me (assuming it's
> introduced with the required care, as Wes pointed out).

Sure, there are many details I am slowly hitting. E.g. maybe it's just
too noisy to attempt to change ufunc calls; and NumPy's comparison
behaviour (mainly `==`, but that carries over to the others) is bad,
and we probably need to fix it without any real transition, etc.:
https://github.com/numpy/numpy/issues/10322

> > In general, the other DType will have to decide what to do by
> > implementing the correct promotion/common (concatenate) DType
> > rules. My aim is that through "common DType" + casting rules, we
> > can also allow code like this (which would use the identical
> > logic):
> >
> >     uint8.astype(AbstractFloating) -> float64
> >
> > So, it isn't quite true that we blindly convert ahead of time. That
> > would not always be correct; for example, `timedelta64 / 2` cannot
> > convert the `2` the same way `timedelta64 + 2` might attempt to do
> > it.
>
> Can you elaborate on "it isn't quite true"? It seems to me like the
> conversion is the same (step 1) but it's a two-step process:
>
> 1. `2` is treated as the weak dtype AbstractInteger
> 2. The `/` operator is true division and produces a floating-point
>    dtype as output
>
> Step 2 here is true independent of whether, for `timedelta64 / x`, x
> is a Python scalar or an array.

Yes; what I mean is that you cannot do it in a ufunc-agnostic
pre-processing step (in the sense of "common DType"). Rather, it has
to happen in the ufunc-specific promotion step; in particular, there
is no generic rule that will work for timedelta64 and all ufuncs.
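(To make that concrete, current NumPy already has to special-case this
per ufunc; a quick, runnable illustration on NumPy 1.x:)

    import numpy as np

    td = np.timedelta64(10, "s")

    # For addition, the integer effectively acts as a timedelta:
    print(td + 2)                       # 12 seconds

    # For true division, it must stay a plain integer instead:
    print(td / 2)                       # 5 seconds (still a timedelta!)

    # ... while timedelta / timedelta drops the unit entirely:
    print(td / np.timedelta64(2, "s"))  # 5.0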
Cheers,

Sebastian

PS: Anyway, we are getting beyond my initial intention, which was just
a small probing of appetite and a request for any important insights
that I might be completely missing. So I will try to steer any further
things back to NumPy when they come in.

> Is that correct, or am I missing some subtlety?
>
> Cheers,
> Ralf

From jbrockmendel at gmail.com Sat Mar 27 11:44:32 2021
From: jbrockmendel at gmail.com (Brock Mendel)
Date: Sat, 27 Mar 2021 08:44:32 -0700
Subject: [Pandas-dev] Index Constructor Performance
Message-ID: 

In optimizing the non-cython groupby.apply code
(https://github.com/pandas-dev/pandas/issues/40263,
https://github.com/pandas-dev/pandas/pull/40171#issuecomment-789116039),
I'm finding that an awful lot of overhead is coming from
Index._simple_new*. This email is about what it would take to get rid
of that overhead.

* Note that the particular code snippet being profiled is chosen to be
worst-case for the non-cython path. It ends up creating a _lot_ of
very small Index objects. We don't particularly care about this case,
but I'm thinking about this as micro-optimization of code that affects
just about every use case under the sun.

All of the options I have in mind involve moving some of the
constructors to cython. There is a tradeoff in how invasive that is
vs. how much perf benefit we gain from it.

For a baseline, we can trim 10-13% off the benchmark linked above by
implementing in cython and mixing into NumericIndex (implementation
abbreviated for brevity; the full implementation is 65 lines in
cython):

```
@cython.freelist(32)   # keep a freelist so tiny Index objects are recycled
cdef class NumpyIndex:
    cdef:
        public ndarray _data   # backing ndarray, stored without __dict__

    @classmethod
    def _simple_new(cls, values, name=None):
        ...

    cpdef NumpyIndex _getitem_slice(self, slice slobj):
        ...
```

10-13% is pretty good, but this only affects Int64Index, UInt64Index,
and Float64Index. See Appendix 1 for discussion of what it would take
to extend this to other subclasses.

To get much further than this would require using __cinit__, which
(absent some gymnastics) would require the FooIndex.__new__ methods to
behave a lot more like the existing FooIndex._simple_new methods.
TL;DR: this really isn't feasible absent a) refactoring RangeIndex to
not subclass Int64Index (easy) and b) breaking API changes on the
constructors for affected Index subclasses (hard).

Appendix 1: Extending to Other Subclasses

a) Mixing libindex.NumpyIndex into pd.Index doesn't work because
ExtensionIndex._data is not an ndarray. AFAICT, getting the
performance benefit for object dtype would require implementing a
separate subclass, e.g. ObjectIndex.

b) RangeIndex would not benefit, but something similar could be done
for it following https://github.com/cython/cython/issues/4040 (or if
we basically re-implement range ourselves in cython).

c) MultiIndex could be made to benefit from this by changing ._codes
to be a 2D ndarray instead of a FrozenList of ndarrays. This would
actually allow for some nice cleanups in MultiIndex. The downside is
that the memory footprint may be bigger with mismatched level sizes.

d) With modest additional effort, this can be extended to
DTI/TDI/PI/CategoricalIndex.

Appendix 2: __cinit__

__cinit__ gets called implicitly before __init__ or __new__, and with
whatever arguments are passed to init/new, i.e. we can't do validation
before passing arguments like we could with an explicit
super().__init__(...) call.
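A tiny sketch of that constraint (`Demo` here is hypothetical, not
pandas code):

```
cdef class Demo:
    cdef public object _data

    def __cinit__(self, data, name=None):
        # __cinit__ receives the raw constructor arguments as-is,
        # before __init__ runs -- there is no earlier hook where we
        # could validate or convert `data` first
        self._data = data
```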
For NumpyIndex we _could_ define __cinit__ without breaking the world,
but we wouldn't get much use out of it unless we also tightened what
we accept in the constructor.

Appendix 3: Notes on cython-related constraints

- We cannot mix a cython cdef class into pd.Index, because that will
break 3rd-party subclasses that use object.__new__(cls) (in particular
I'm thinking of xarray's CFDatetimeIndex).
- A Python class cannot inherit from two separate cython cdef classes;
i.e., if we mix something into NumericIndex, that precludes mixing
something else into Int64Index.

From jreback at yahoo.com Tue Mar 30 09:55:18 2021
From: jreback at yahoo.com (Jeff Reback)
Date: Tue, 30 Mar 2021 13:55:18 +0000 (UTC)
Subject: [Pandas-dev] welcome new pandas committer
References: <1769743654.1084182.1617112518789.ref@mail.yahoo.com>
Message-ID: <1769743654.1084182.1617112518789@mail.yahoo.com>

Patrick,

Your contributions and help to others are amazing. We'd love to have
you as a pandas committer and would like to welcome you to the core
team!

Here is a wiki that shows a little about how to engage as a
maintainer:
https://github.com/pandas-dev/pandas/wiki/Maintainer-Overview

Jeff