From simonjayhawkins at gmail.com Tue Mar 2 12:15:15 2021
From: simonjayhawkins at gmail.com (Simon Hawkins)
Date: Tue, 2 Mar 2021 17:15:15 +0000
Subject: [Pandas-dev] ANN: Pandas 1.2.3 Released
Message-ID: 

Hi all,

I'm pleased to announce the release of pandas 1.2.3. This is a patch
release in the 1.2.x series and includes some regression fixes. We
recommend that all users upgrade to this version.

See the release notes for a list of all the changes.

The release can be installed from PyPI:

    python -m pip install --upgrade pandas==1.2.3

or from conda-forge:

    conda install -c conda-forge pandas==1.2.3

Please report any issues with the release on the pandas issue tracker.

Thanks to all the contributors who made this release possible.

From sebastian at sipsolutions.net Mon Mar 8 12:59:32 2021
From: sebastian at sipsolutions.net (Sebastian Berg)
Date: Mon, 08 Mar 2021 11:59:32 -0600
Subject: [Pandas-dev] Future of NumPy (and pandas) scalar "promotion" (and concatenate-dtype)
Message-ID: <35e2f6e7dde83f5d71a2d393fdd2b8644dd90de7.camel@sipsolutions.net>

Hi all,

Summary/Abstract: I am seriously exploring the idea of modifying the
NumPy promotion rules to drop the current value-based logic. This would
probably affect pandas in a similar way as it does NumPy, so I am
wondering what your opinion is on the "value-based" logic and the
potential "future" logic.
One of the most annoying things is likely the transition phase (see the
last part about the many warnings I see in the pandas test suite).

** Long Story: **

I am wondering about the future of type promotion in NumPy [1], but
this would probably affect pandas just as much.
The problem is what to do with things like:

    np.array([1, 2], dtype=np.uint8) + 1000

where the result is currently upcast to a `uint16`. The rules for this
are pretty arcane, however.

There are a few "worse" things that probably do not affect pandas as
much. That is, the above also happens in this case:

    np.array([1, 2], dtype=np.uint8) + np.int64(1000)

Even though int64 is explicitly typed, we just drop that information.
The weirdest things are probably around float precision:

    np.array([0.3], dtype=np.float32) == 0.3
    np.array([0.3], dtype=np.float32) == np.float64(0.3)

where the latter would probably go from `True` to `False` due to the
limited precision of float32. (At least unless we explicitly try to
counteract this for comparisons.)
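(Side note: the current behaviour above is easy to check yourself; a
small runnable snippet, with results from a NumPy 1.20-era build -- the
comparison result is exactly what would flip under the proposal:)

    import numpy as np

    arr = np.array([1, 2], dtype=np.uint8)

    # The Python int 1000 does not fit uint8, so the result upcasts:
    print((arr + 1000).dtype)            # uint16

    # The explicitly typed scalar is still treated by value; its
    # int64 dtype information is simply dropped:
    print((arr + np.int64(1000)).dtype)  # a 16-bit type, not int64

    # The 0.3 is demoted to float32 before comparing, so this is
    # currently True:
    print(np.array([0.3], dtype=np.float32) == 0.3)  # [ True]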
** Solution: **

The basic idea right now is the following:

1. All objects with NumPy dtypes use those strictly. Scalars or 0-D
   arrays will have no special handling.
2. Python integers, floats, and complex numbers are considered to have
   a special "weak" dtype. In the above example, `1000` or `0.3` would
   simply be force-cast to `uint8` or `float32`. (Potentially with a
   warning/error for integer rollover.)
3. The "additional" rule that all function calls use `np.asarray()`,
   which converts Python types. That is, `np.add(uint8_arr, 1000)`
   would return the same as `np.add(uint8_arr, np.array(1000))`, while
   `uint8_arr + 1000` would not!
   (I am not sure about this rule; it could be modified, but it seems
   easier to limit the "special behaviour" to Python operators.)

I did some initial trials with such behaviour, without issuing
transition warnings for the "weak" logic (although I expect it rarely
changes the result), but issuing warnings where Point 1 probably makes
a difference.

To my surprise, the SciPy test suite did not even notice! The pandas
test suite runs into thousands of warnings (but few or no errors),
probably mostly due to tests that effectively check ufuncs with:

    binary_ufunc(typed_arr, 3)

NumPy does that a lot in its test suite as well. Maybe we can deal
with it or rethink rule 3.

Cheers,

Sebastian

[1] I have a conundrum. I don't really want to change things right now,
but I need to reimplement it. Preserving value-based logic seems tricky
to do without introducing technical debt that we will just want to get
rid of later anyway...

From wesmckinn at gmail.com Mon Mar 8 13:47:26 2021
From: wesmckinn at gmail.com (Wes McKinney)
Date: Mon, 8 Mar 2021 12:47:26 -0600
Subject: [Pandas-dev] Future of NumPy (and pandas) scalar "promotion" (and concatenate-dtype)
In-Reply-To: <35e2f6e7dde83f5d71a2d393fdd2b8644dd90de7.camel@sipsolutions.net>
References: <35e2f6e7dde83f5d71a2d393fdd2b8644dd90de7.camel@sipsolutions.net>
Message-ID: 

hi Sebastian - at a glance this is a scary-looking change. Knowing the
relatively fast-and-loose ways that people have been using NumPy in
industry applications over the last 10+ years, the idea that `arr +
scalar` could cause data loss in "scalar" is pretty worrying. It would
be better to raise an exception than to generate a warning.

I feel like to really understand the impact of this change, you would
need to prepare a set of experimental NumPy wheels that you publish to
PyPI to allow downstream users to run their applications and see what
happens, and engage in outreach efforts to get them to actually do the
testing.

- Wes

On Mon, Mar 8, 2021 at 12:00 PM Sebastian Berg wrote:
>
> Hi all,
>
> Summary/Abstract: I am seriously exploring the idea of modifying the
> NumPy promotion rules to drop the current value-based logic. This
> would probably affect pandas in a similar way as it does NumPy, so I
> am wondering what your opinion is on the "value-based" logic and the
> potential "future" logic.
> One of the most annoying things is likely the transition phase (see
> the last part about the many warnings I see in the pandas test suite).
>
> ** Long Story: **
>
> I am wondering about the future of type promotion in NumPy [1], but
> this would probably affect pandas just as much.
> The problem is what to do with things like:
>
>     np.array([1, 2], dtype=np.uint8) + 1000
>
> where the result is currently upcast to a `uint16`. The rules for
> this are pretty arcane, however.
>
> There are a few "worse" things that probably do not affect pandas as
> much. That is, the above also happens in this case:
>
>     np.array([1, 2], dtype=np.uint8) + np.int64(1000)
>
> Even though int64 is explicitly typed, we just drop that information.
> The weirdest things are probably around float precision:
>
>     np.array([0.3], dtype=np.float32) == 0.3
>     np.array([0.3], dtype=np.float32) == np.float64(0.3)
>
> where the latter would probably go from `True` to `False` due to the
> limited precision of float32. (At least unless we explicitly try to
> counteract this for comparisons.)
>
> ** Solution: **
>
> The basic idea right now is the following:
>
> 1. All objects with NumPy dtypes use those strictly. Scalars or 0-D
>    arrays will have no special handling.
> 2. Python integers, floats, and complex numbers are considered to
>    have a special "weak" dtype.
>    In the above example, `1000` or `0.3` would simply be force-cast
>    to `uint8` or `float32`. (Potentially with a warning/error for
>    integer rollover.)
> 3. The "additional" rule that all function calls use `np.asarray()`,
>    which converts Python types. That is, `np.add(uint8_arr, 1000)`
>    would return the same as `np.add(uint8_arr, np.array(1000))`,
>    while `uint8_arr + 1000` would not!
>    (I am not sure about this rule; it could be modified, but it seems
>    easier to limit the "special behaviour" to Python operators.)
>
> I did some initial trials with such behaviour, without issuing
> transition warnings for the "weak" logic (although I expect it rarely
> changes the result), but issuing warnings where Point 1 probably
> makes a difference.
>
> To my surprise, the SciPy test suite did not even notice! The pandas
> test suite runs into thousands of warnings (but few or no errors),
> probably mostly due to tests that effectively check ufuncs with:
>
>     binary_ufunc(typed_arr, 3)
>
> NumPy does that a lot in its test suite as well. Maybe we can deal
> with it or rethink rule 3.
>
> Cheers,
>
> Sebastian
>
> [1] I have a conundrum. I don't really want to change things right
> now, but I need to reimplement it. Preserving value-based logic seems
> tricky to do without introducing technical debt that we will just
> want to get rid of later anyway...
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev

From sebastian at sipsolutions.net Tue Mar 9 10:40:36 2021
From: sebastian at sipsolutions.net (Sebastian Berg)
Date: Tue, 09 Mar 2021 09:40:36 -0600
Subject: [Pandas-dev] Future of NumPy (and pandas) scalar "promotion" (and concatenate-dtype)
In-Reply-To: (sfid-20210308_194804_686894_1D1C5F59)
References: <35e2f6e7dde83f5d71a2d393fdd2b8644dd90de7.camel@sipsolutions.net> (sfid-20210308_194804_686894_1D1C5F59)
Message-ID: 

On Mon, 2021-03-08 at 12:47 -0600, Wes McKinney wrote:
> hi Sebastian - at a glance this is a scary-looking change. Knowing
> the relatively fast-and-loose ways that people have been using NumPy
> in industry applications over the last 10+ years, the idea that `arr
> + scalar` could cause data loss in "scalar" is pretty worrying. It
> would be better to raise an exception than to generate a warning.

Well, some notes:

1. Obviously there would be transition warnings. Honestly, I am a bit
   worried that the transition warnings would be far more annoying
   than the change itself.

2. Yes, errors or at least warnings on unsafe conversion are better;
   mostly we just currently don't have them... So for me, ensuring
   errors (or maybe just warnings) seems required (when the final
   transition happens). We may also need something like this:

       np.uint8(value, safe=True)

   to be able to opt in to the future behaviour safely. That might be
   annoying, but it's not a serious blocker or particularly complex.

3. The current situation is already ridiculously unsafe, since
   integers tend to roll over left and right. Try guessing the results
   for these (a snippet after this list shows the mechanism behind
   them):

       np.array([100], dtype="uint8") + 200
       np.array(100, dtype="uint8") + 200
       np.array([100], dtype="uint8") + 300
       np.array([100], dtype="uint8") + np.array(200, dtype="int64")
       np.array(100, dtype="uint8") + np.array(200, dtype="int64")

       np.array([100], dtype="uint8") - 200
       np.array([100], dtype="uint8") + -200

   They are (ignoring shape):

       44 (uint8), 300, 400 (uint16), 44 (uint8), 300,
       156 (uint8), -100 (int16)

4. In the `weak->strong` transition the resulting dtype will always
   have higher precision, which is less likely to cause trouble (but
   more likely to give spurious warnings). The typical worst case is
   probably memory bloat.

5. For floats, the situation seems much less dramatic; reduced
   precision due to this change should almost never happen (or will
   give an overflow warning). Of course, `float32 + large_integer`
   might occasionally have upcast to 64-bit previously...

6. By the way: the change would largely revert back to the behaviour
   of NumPy <1.6! So if the code is 10 years old it might suddenly
   work again. (I expect ancient NumPy used "weak" logic even for 0-D
   arrays, so it was much worse.)
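(As promised above: most of the surprises in point 3 fall directly out
of NumPy choosing a minimal dtype for each scalar/0-D value; you can
probe that mechanism on current NumPy 1.x like this:)

    import numpy as np

    # The minimal dtype is chosen from the *value* (and its sign),
    # ignoring the scalar's actual dtype:
    print(np.min_scalar_type(200))   # uint8
    print(np.min_scalar_type(300))   # uint16
    print(np.min_scalar_type(-200))  # int16

    # Promotion then combines that with the array's dtype:
    print(np.result_type(np.uint8, 200))   # uint8  -> 100 + 200 rolls over to 44
    print(np.result_type(np.uint8, 300))   # uint16 -> 100 + 300 == 400
    print(np.result_type(np.uint8, -200))  # int16  -> 100 + -200 == -100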
> I feel like to really understand the impact of this change, you
> would need to prepare a set of experimental NumPy wheels that you
> publish to PyPI to allow downstream users to run their applications
> and see what happens, and engage in outreach efforts to get them to
> actually do the testing.

Right now, I wanted to prod and see whether pandas devs think that
this seems like the right direction, and one that they are willing to
work towards.

I think in NumPy there is a consensus that value-based logic is very
naughty, and some loose consensus that the proposal I posted is the
most promising angle for fixing it (maybe quite loose, but I don't
expect more insight from NumPy-discussions at this time).

Of course I can't be 100% sure that this will pan out, but I can spend
my remaining sanity on other things if it becomes obvious that there
is serious resistance... This is a side battle for me. But the point
is that doing it now may be a unique chance, because if we shelve it
now it will become even harder to change. And that probably means
shelving it again for another decade or longer.

Cheers,

Sebastian

> - Wes
>
> On Mon, Mar 8, 2021 at 12:00 PM Sebastian Berg wrote:
> >
> > Hi all,
> >
> > Summary/Abstract: I am seriously exploring the idea of modifying
> > the NumPy promotion rules to drop the current value-based logic.
> > This would probably affect pandas in a similar way as it does
> > NumPy, so I am wondering what your opinion is on the "value-based"
> > logic and the potential "future" logic.
> > One of the most annoying things is likely the transition phase
> > (see the last part about the many warnings I see in the pandas
> > test suite).
> >
> > ** Long Story: **
> >
> > I am wondering about the future of type promotion in NumPy [1],
> > but this would probably affect pandas just as much.
> > The problem is what to do with things like:
> >
> >     np.array([1, 2], dtype=np.uint8) + 1000
> >
> > where the result is currently upcast to a `uint16`. The rules for
> > this are pretty arcane, however.
> >
> > There are a few "worse" things that probably do not affect pandas
> > as much. That is, the above also happens in this case:
> >
> >     np.array([1, 2], dtype=np.uint8) + np.int64(1000)
> >
> > Even though int64 is explicitly typed, we just drop that
> > information. The weirdest things are probably around float
> > precision:
> >
> >     np.array([0.3], dtype=np.float32) == 0.3
> >     np.array([0.3], dtype=np.float32) == np.float64(0.3)
> >
> > where the latter would probably go from `True` to `False` due to
> > the limited precision of float32. (At least unless we explicitly
> > try to counteract this for comparisons.)
> >
> > ** Solution: **
> >
> > The basic idea right now is the following:
> >
> > 1. All objects with NumPy dtypes use those strictly.
> >    Scalars or 0-D arrays will have no special handling.
> > 2. Python integers, floats, and complex numbers are considered to
> >    have a special "weak" dtype. In the above example, `1000` or
> >    `0.3` would simply be force-cast to `uint8` or `float32`.
> >    (Potentially with a warning/error for integer rollover.)
> > 3. The "additional" rule that all function calls use
> >    `np.asarray()`, which converts Python types. That is,
> >    `np.add(uint8_arr, 1000)` would return the same as
> >    `np.add(uint8_arr, np.array(1000))`, while `uint8_arr + 1000`
> >    would not!
> >    (I am not sure about this rule; it could be modified, but it
> >    seems easier to limit the "special behaviour" to Python
> >    operators.)
> >
> > I did some initial trials with such behaviour, without issuing
> > transition warnings for the "weak" logic (although I expect it
> > rarely changes the result), but issuing warnings where Point 1
> > probably makes a difference.
> >
> > To my surprise, the SciPy test suite did not even notice! The
> > pandas test suite runs into thousands of warnings (but few or no
> > errors), probably mostly due to tests that effectively check
> > ufuncs with:
> >
> >     binary_ufunc(typed_arr, 3)
> >
> > NumPy does that a lot in its test suite as well. Maybe we can deal
> > with it or rethink rule 3.
> >
> > Cheers,
> >
> > Sebastian
> >
> > [1] I have a conundrum. I don't really want to change things right
> > now, but I need to reimplement it. Preserving value-based logic
> > seems tricky to do without introducing technical debt that we will
> > just want to get rid of later anyway...
> > _______________________________________________
> > Pandas-dev mailing list
> > Pandas-dev at python.org
> > https://mail.python.org/mailman/listinfo/pandas-dev

From wesmckinn at gmail.com Tue Mar 9 13:16:11 2021
From: wesmckinn at gmail.com (Wes McKinney)
Date: Tue, 9 Mar 2021 12:16:11 -0600
Subject: [Pandas-dev] Future of NumPy (and pandas) scalar "promotion" (and concatenate-dtype)
In-Reply-To: 
References: <35e2f6e7dde83f5d71a2d393fdd2b8644dd90de7.camel@sipsolutions.net> 
Message-ID: 

I see your points - FWIW I think that if people are using small
integers (rather than using int_ or int64 for everything), they have
some responsibility to mind these issues. I'm supportive of you trying
to fix it, with the warning that I think you should engage in extra
efforts to try to obtain feedback before it lands in "pip install
numpy".

On Tue, Mar 9, 2021 at 9:41 AM Sebastian Berg wrote:
>
> On Mon, 2021-03-08 at 12:47 -0600, Wes McKinney wrote:
> > hi Sebastian - at a glance this is a scary-looking change. Knowing
> > the relatively fast-and-loose ways that people have been using
> > NumPy in industry applications over the last 10+ years, the idea
> > that `arr + scalar` could cause data loss in "scalar" is pretty
> > worrying. It would be better to raise an exception than to generate
> > a warning.
>
> Well, some notes:
>
> 1. Obviously there would be transition warnings. Honestly, I am a bit
>    worried that the transition warnings would be far more annoying
>    than the change itself.
>
> 2. Yes, errors or at least warnings on unsafe conversion are better;
>    mostly we just currently don't have them...
>    So for me, ensuring errors (or maybe just warnings) seems required
>    (when the final transition happens). We may also need something
>    like this:
>
>        np.uint8(value, safe=True)
>
>    to be able to opt in to the future behaviour safely. That might be
>    annoying, but it's not a serious blocker or particularly complex.
>
> 3. The current situation is already ridiculously unsafe, since
>    integers tend to roll over left and right. Try guessing the
>    results for these:
>
>        np.array([100], dtype="uint8") + 200
>        np.array(100, dtype="uint8") + 200
>        np.array([100], dtype="uint8") + 300
>        np.array([100], dtype="uint8") + np.array(200, dtype="int64")
>        np.array(100, dtype="uint8") + np.array(200, dtype="int64")
>
>        np.array([100], dtype="uint8") - 200
>        np.array([100], dtype="uint8") + -200
>
>    They are (ignoring shape):
>
>        44 (uint8), 300, 400 (uint16), 44 (uint8), 300,
>        156 (uint8), -100 (int16)
>
> 4. In the `weak->strong` transition the resulting dtype will always
>    have higher precision, which is less likely to cause trouble (but
>    more likely to give spurious warnings). The typical worst case is
>    probably memory bloat.
>
> 5. For floats, the situation seems much less dramatic; reduced
>    precision due to this change should almost never happen (or will
>    give an overflow warning). Of course, `float32 + large_integer`
>    might occasionally have upcast to 64-bit previously...
>
> 6. By the way: the change would largely revert back to the behaviour
>    of NumPy <1.6! So if the code is 10 years old it might suddenly
>    work again. (I expect ancient NumPy used "weak" logic even for 0-D
>    arrays, so it was much worse.)
>
> > I feel like to really understand the impact of this change, you
> > would need to prepare a set of experimental NumPy wheels that you
> > publish to PyPI to allow downstream users to run their applications
> > and see what happens, and engage in outreach efforts to get them to
> > actually do the testing.
>
> Right now, I wanted to prod and see whether pandas devs think that
> this seems like the right direction, and one that they are willing to
> work towards.
>
> I think in NumPy there is a consensus that value-based logic is very
> naughty, and some loose consensus that the proposal I posted is the
> most promising angle for fixing it (maybe quite loose, but I don't
> expect more insight from NumPy-discussions at this time).
>
> Of course I can't be 100% sure that this will pan out, but I can
> spend my remaining sanity on other things if it becomes obvious that
> there is serious resistance... This is a side battle for me. But the
> point is that doing it now may be a unique chance, because if we
> shelve it now it will become even harder to change. And that probably
> means shelving it again for another decade or longer.
>
> Cheers,
>
> Sebastian
>
> > - Wes
> >
> > On Mon, Mar 8, 2021 at 12:00 PM Sebastian Berg wrote:
> > >
> > > Hi all,
> > >
> > > Summary/Abstract: I am seriously exploring the idea of modifying
> > > the NumPy promotion rules to drop the current value-based logic.
> > > This would probably affect pandas in a similar way as it does
> > > NumPy, so I am wondering what your opinion is on the
> > > "value-based" logic and the potential "future" logic.
> > > One of the most annoying things is likely the transition phase
> > > (see the last part about the many warnings I see in the pandas
> > > test suite).
> > >
> > > ** Long Story: **
> > >
> > > I am wondering about the future of type promotion in NumPy [1],
> > > but this would probably affect pandas just as much.
> > > The problem is what to do with things like:
> > >
> > >     np.array([1, 2], dtype=np.uint8) + 1000
> > >
> > > where the result is currently upcast to a `uint16`. The rules for
> > > this are pretty arcane, however.
> > >
> > > There are a few "worse" things that probably do not affect pandas
> > > as much. That is, the above also happens in this case:
> > >
> > >     np.array([1, 2], dtype=np.uint8) + np.int64(1000)
> > >
> > > Even though int64 is explicitly typed, we just drop that
> > > information. The weirdest things are probably around float
> > > precision:
> > >
> > >     np.array([0.3], dtype=np.float32) == 0.3
> > >     np.array([0.3], dtype=np.float32) == np.float64(0.3)
> > >
> > > where the latter would probably go from `True` to `False` due to
> > > the limited precision of float32. (At least unless we explicitly
> > > try to counteract this for comparisons.)
> > >
> > > ** Solution: **
> > >
> > > The basic idea right now is the following:
> > >
> > > 1. All objects with NumPy dtypes use those strictly. Scalars or
> > >    0-D arrays will have no special handling.
> > > 2. Python integers, floats, and complex numbers are considered to
> > >    have a special "weak" dtype. In the above example, `1000` or
> > >    `0.3` would simply be force-cast to `uint8` or `float32`.
> > >    (Potentially with a warning/error for integer rollover.)
> > > 3. The "additional" rule that all function calls use
> > >    `np.asarray()`, which converts Python types. That is,
> > >    `np.add(uint8_arr, 1000)` would return the same as
> > >    `np.add(uint8_arr, np.array(1000))`, while `uint8_arr + 1000`
> > >    would not!
> > >    (I am not sure about this rule; it could be modified, but it
> > >    seems easier to limit the "special behaviour" to Python
> > >    operators.)
> > >
> > > I did some initial trials with such behaviour, without issuing
> > > transition warnings for the "weak" logic (although I expect it
> > > rarely changes the result), but issuing warnings where Point 1
> > > probably makes a difference.
> > >
> > > To my surprise, the SciPy test suite did not even notice! The
> > > pandas test suite runs into thousands of warnings (but few or no
> > > errors), probably mostly due to tests that effectively check
> > > ufuncs with:
> > >
> > >     binary_ufunc(typed_arr, 3)
> > >
> > > NumPy does that a lot in its test suite as well. Maybe we can
> > > deal with it or rethink rule 3.
> > >
> > > Cheers,
> > >
> > > Sebastian
> > >
> > > [1] I have a conundrum. I don't really want to change things
> > > right now, but I need to reimplement it. Preserving value-based
> > > logic seems tricky to do without introducing technical debt that
> > > we will just want to get rid of later anyway...
> > > _______________________________________________
> > > Pandas-dev mailing list
> > > Pandas-dev at python.org
> > > https://mail.python.org/mailman/listinfo/pandas-dev
>
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev

From jorisvandenbossche at gmail.com Tue Mar 9 13:27:39 2021
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Tue, 9 Mar 2021 19:27:39 +0100
Subject: [Pandas-dev] Future of NumPy (and pandas) scalar "promotion" (and concatenate-dtype)
In-Reply-To: 
References: <35e2f6e7dde83f5d71a2d393fdd2b8644dd90de7.camel@sipsolutions.net> 
Message-ID: 

I personally fully support trying to drop any value-based logic. In
pandas, too, we have some additional (custom to pandas) value-based
logic, mainly in concat operations, that we are also trying to move
away from.

It will mean some behaviour changes, but as long as there is a
transition period with warnings when there would be data loss (or when
it would result in an error), that seems acceptable to me. Having
consistent rules in the long term will be really beneficial.

What's not fully clear to me is what the exact behaviour of this
"weak" dtype for Python numbers is. Would it always use the dtype of
the other typed operand?

Joris

On Tue, 9 Mar 2021 at 16:40, Sebastian Berg wrote:
>
> On Mon, 2021-03-08 at 12:47 -0600, Wes McKinney wrote:
> > hi Sebastian - at a glance this is a scary-looking change. Knowing
> > the relatively fast-and-loose ways that people have been using
> > NumPy in industry applications over the last 10+ years, the idea
> > that `arr + scalar` could cause data loss in "scalar" is pretty
> > worrying. It would be better to raise an exception than to generate
> > a warning.
>
> Well, some notes:
>
> 1. Obviously there would be transition warnings. Honestly, I am a bit
>    worried that the transition warnings would be far more annoying
>    than the change itself.
>
> 2. Yes, errors or at least warnings on unsafe conversion are better;
>    mostly we just currently don't have them... So for me, ensuring
>    errors (or maybe just warnings) seems required (when the final
>    transition happens). We may also need something like this:
>
>        np.uint8(value, safe=True)
>
>    to be able to opt in to the future behaviour safely. That might be
>    annoying, but it's not a serious blocker or particularly complex.
>
> 3. The current situation is already ridiculously unsafe, since
>    integers tend to roll over left and right. Try guessing the
>    results for these:
>
>        np.array([100], dtype="uint8") + 200
>        np.array(100, dtype="uint8") + 200
>        np.array([100], dtype="uint8") + 300
>        np.array([100], dtype="uint8") + np.array(200, dtype="int64")
>        np.array(100, dtype="uint8") + np.array(200, dtype="int64")
>
>        np.array([100], dtype="uint8") - 200
>        np.array([100], dtype="uint8") + -200
>
>    They are (ignoring shape):
>
>        44 (uint8), 300, 400 (uint16), 44 (uint8), 300,
>        156 (uint8), -100 (int16)
>
> 4. In the `weak->strong` transition the resulting dtype will always
>    have higher precision, which is less likely to cause trouble (but
>    more likely to give spurious warnings). The typical worst case is
>    probably memory bloat.
>
> 5. For floats, the situation seems much less dramatic; reduced
>    precision due to this change should almost never happen (or will
>    give an overflow warning). Of course, `float32 + large_integer`
>    might occasionally have upcast to 64-bit previously...
> 6. By the way: the change would largely revert back to the behaviour
>    of NumPy <1.6! So if the code is 10 years old it might suddenly
>    work again. (I expect ancient NumPy used "weak" logic even for 0-D
>    arrays, so it was much worse.)
>
> > I feel like to really understand the impact of this change, you
> > would need to prepare a set of experimental NumPy wheels that you
> > publish to PyPI to allow downstream users to run their applications
> > and see what happens, and engage in outreach efforts to get them to
> > actually do the testing.
>
> Right now, I wanted to prod and see whether pandas devs think that
> this seems like the right direction, and one that they are willing to
> work towards.
>
> I think in NumPy there is a consensus that value-based logic is very
> naughty, and some loose consensus that the proposal I posted is the
> most promising angle for fixing it (maybe quite loose, but I don't
> expect more insight from NumPy-discussions at this time).
>
> Of course I can't be 100% sure that this will pan out, but I can
> spend my remaining sanity on other things if it becomes obvious that
> there is serious resistance... This is a side battle for me. But the
> point is that doing it now may be a unique chance, because if we
> shelve it now it will become even harder to change. And that probably
> means shelving it again for another decade or longer.
>
> Cheers,
>
> Sebastian
>
> > - Wes
> >
> > On Mon, Mar 8, 2021 at 12:00 PM Sebastian Berg wrote:
> > >
> > > Hi all,
> > >
> > > Summary/Abstract: I am seriously exploring the idea of modifying
> > > the NumPy promotion rules to drop the current value-based logic.
> > > This would probably affect pandas in a similar way as it does
> > > NumPy, so I am wondering what your opinion is on the
> > > "value-based" logic and the potential "future" logic.
> > > One of the most annoying things is likely the transition phase
> > > (see the last part about the many warnings I see in the pandas
> > > test suite).
> > >
> > > ** Long Story: **
> > >
> > > I am wondering about the future of type promotion in NumPy [1],
> > > but this would probably affect pandas just as much.
> > > The problem is what to do with things like:
> > >
> > >     np.array([1, 2], dtype=np.uint8) + 1000
> > >
> > > where the result is currently upcast to a `uint16`. The rules for
> > > this are pretty arcane, however.
> > >
> > > There are a few "worse" things that probably do not affect pandas
> > > as much. That is, the above also happens in this case:
> > >
> > >     np.array([1, 2], dtype=np.uint8) + np.int64(1000)
> > >
> > > Even though int64 is explicitly typed, we just drop that
> > > information. The weirdest things are probably around float
> > > precision:
> > >
> > >     np.array([0.3], dtype=np.float32) == 0.3
> > >     np.array([0.3], dtype=np.float32) == np.float64(0.3)
> > >
> > > where the latter would probably go from `True` to `False` due to
> > > the limited precision of float32. (At least unless we explicitly
> > > try to counteract this for comparisons.)
> > >
> > > ** Solution: **
> > >
> > > The basic idea right now is the following:
> > >
> > > 1. All objects with NumPy dtypes use those strictly. Scalars or
> > >    0-D arrays will have no special handling.
> > > 2. Python integers, floats, and complex numbers are considered to
> > >    have a special "weak" dtype. In the above example, `1000` or
> > >    `0.3` would simply be force-cast to `uint8` or `float32`.
> > >    (Potentially with a warning/error for integer rollover.)
> > > 3. The "additional" rule that all function calls use
> > >    `np.asarray()`, which converts Python types. That is,
> > >    `np.add(uint8_arr, 1000)` would return the same as
> > >    `np.add(uint8_arr, np.array(1000))`, while `uint8_arr + 1000`
> > >    would not!
> > >    (I am not sure about this rule; it could be modified, but it
> > >    seems easier to limit the "special behaviour" to Python
> > >    operators.)
> > >
> > > I did some initial trials with such behaviour, without issuing
> > > transition warnings for the "weak" logic (although I expect it
> > > rarely changes the result), but issuing warnings where Point 1
> > > probably makes a difference.
> > >
> > > To my surprise, the SciPy test suite did not even notice! The
> > > pandas test suite runs into thousands of warnings (but few or no
> > > errors), probably mostly due to tests that effectively check
> > > ufuncs with:
> > >
> > >     binary_ufunc(typed_arr, 3)
> > >
> > > NumPy does that a lot in its test suite as well. Maybe we can
> > > deal with it or rethink rule 3.
> > >
> > > Cheers,
> > >
> > > Sebastian
> > >
> > > [1] I have a conundrum. I don't really want to change things
> > > right now, but I need to reimplement it. Preserving value-based
> > > logic seems tricky to do without introducing technical debt that
> > > we will just want to get rid of later anyway...
> > > _______________________________________________
> > > Pandas-dev mailing list
> > > Pandas-dev at python.org
> > > https://mail.python.org/mailman/listinfo/pandas-dev
>
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev

From sebastian at sipsolutions.net Tue Mar 9 14:10:12 2021
From: sebastian at sipsolutions.net (Sebastian Berg)
Date: Tue, 09 Mar 2021 13:10:12 -0600
Subject: [Pandas-dev] Future of NumPy (and pandas) scalar "promotion" (and concatenate-dtype)
In-Reply-To: (sfid-20210309_192753_127578_57BFC3E5)
References: <35e2f6e7dde83f5d71a2d393fdd2b8644dd90de7.camel@sipsolutions.net> (sfid-20210309_192753_127578_57BFC3E5)
Message-ID: 

On Tue, 2021-03-09 at 19:27 +0100, Joris Van den Bossche wrote:
> I personally fully support trying to drop any value-based logic. In
> pandas, too, we have some additional (custom to pandas) value-based
> logic, mainly in concat operations, that we are also trying to move
> away from.
>
> It will mean some behaviour changes, but as long as there is a
> transition period with warnings when there would be data loss (or
> when it would result in an error), that seems acceptable to me.
> Having consistent rules in the long term will be really beneficial.
>
> What's not fully clear to me is what the exact behaviour of this
> "weak" dtype for Python numbers is. Would it always use the dtype of
> the other typed operand?

The "weak DType" is the abstract DType, like `Integer` or `Floating`
(we don't have those right now). That means `int8 + 1.0 -> float64`
(see details below)! The term "weak" is something I borrowed from JAX,
which uses this type of promotion, at least roughly (I think there are
probably some subtle differences).

My reasoning for this "weak logic" is that I expect code like:

    res = int8_arr + 2     # probably expects an int8 output?
    res = float32_arr * 2  # clearly should remain float32

might exist quite often, so using it will allow this as a convenience
and, in the vast majority of cases, retain its behaviour faithfully.

The reason for suggesting to (mainly) limit it to Python operators is
an attempt at a fairly clear rule. There could be exceptions, and
frankly, there will be accidental ones. But the thought is that for
most functions, the user should expect an implicit `np.asarray()` call
on all inputs. And `np.asarray()` will not allow "weak logic" to pass.
(But you could get it by calling e.g. `np.result_type` explicitly.)

** DETAILS: **

The other dtype is prioritized, so to speak. But it has to decide what
to do! `Integer + 1.0` still needs to go to some floating point value
after all. For NumPy dtypes this means:

    uint8 + 1    -> uint8
    uint8 + 1.0  -> float64     (default precision)
    uint8 + 1j   -> complex128  (default precision)
    # same for all integers
    float32 + 1  -> float32
    float32 + 1. -> float32
    float32 + 1j -> complex64   (retains 32-bit precision)
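(To spell the table out: a rough sketch in plain Python that just
encodes the cases above; `proposed_result_dtype` is a hypothetical
helper for illustration, not an actual NumPy API:)

    import numpy as np

    def proposed_result_dtype(arr_dtype, py_scalar):
        # Hypothetical sketch of the proposed "weak scalar" rule;
        # only the cases from the table above are encoded.
        arr_dtype = np.dtype(arr_dtype)
        if isinstance(py_scalar, int):        # weak integer
            return arr_dtype                  # uint8 + 1 -> uint8
        if isinstance(py_scalar, float):      # weak float
            if arr_dtype.kind in "fc":
                return arr_dtype              # float32 + 1. -> float32
            return np.dtype(np.float64)       # uint8 + 1.0 -> float64
        if isinstance(py_scalar, complex):    # weak complex
            if arr_dtype.kind == "c":
                return arr_dtype
            if arr_dtype == np.dtype(np.float32):
                return np.dtype(np.complex64)   # float32 + 1j -> complex64
            return np.dtype(np.complex128)      # uint8 + 1j -> complex128
        raise TypeError("not a weak Python scalar")

    assert proposed_result_dtype(np.uint8, 1) == np.uint8
    assert proposed_result_dtype(np.uint8, 1.0) == np.float64
    assert proposed_result_dtype(np.float32, 1j) == np.complex64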
In general, the other DType will have to decide what to do by
implementing the correct promotion/common (concatenate) DType rules.
My aim is that through "common DType" + casting rules, we can also
allow code like this (which would use the identical logic):

    uint8.astype(AbstractFloating) -> float64

So, it isn't quite true that we blindly convert ahead of time. That
would not always be correct; for example, `timedelta64 / 2` cannot
convert the `2` the same way `timedelta64 + 2` might attempt to do it.

Cheers,

Sebastian

> Joris
>
> On Tue, 9 Mar 2021 at 16:40, Sebastian Berg
> <sebastian at sipsolutions.net> wrote:
> >
> > On Mon, 2021-03-08 at 12:47 -0600, Wes McKinney wrote:
> > > hi Sebastian - at a glance this is a scary-looking change.
> > > Knowing the relatively fast-and-loose ways that people have been
> > > using NumPy in industry applications over the last 10+ years, the
> > > idea that `arr + scalar` could cause data loss in "scalar" is
> > > pretty worrying. It would be better to raise an exception than to
> > > generate a warning.
> >
> > Well, some notes:
> >
> > 1. Obviously there would be transition warnings. Honestly, I am a
> >    bit worried that the transition warnings would be far more
> >    annoying than the change itself.
> >
> > 2. Yes, errors or at least warnings on unsafe conversion are
> >    better; mostly we just currently don't have them... So for me,
> >    ensuring errors (or maybe just warnings) seems required (when
> >    the final transition happens). We may also need something like
> >    this:
> >
> >        np.uint8(value, safe=True)
> >
> >    to be able to opt in to the future behaviour safely. That might
> >    be annoying, but it's not a serious blocker or particularly
> >    complex.
> >
> > 3. The current situation is already ridiculously unsafe, since
> >    integers tend to roll over left and right. Try guessing the
> >    results for these:
> >
> >        np.array([100], dtype="uint8") + 200
> >        np.array(100, dtype="uint8") + 200
> >        np.array([100], dtype="uint8") + 300
> >        np.array([100], dtype="uint8") + np.array(200, dtype="int64")
> >        np.array(100, dtype="uint8") + np.array(200, dtype="int64")
> >
> >        np.array([100], dtype="uint8") - 200
> >        np.array([100], dtype="uint8") + -200
> >
> >    They are (ignoring shape):
> >
> >        44 (uint8), 300, 400 (uint16), 44 (uint8), 300,
> >        156 (uint8), -100 (int16)
> > 4. In the `weak->strong` transition the resulting dtype will always
> >    have higher precision, which is less likely to cause trouble
> >    (but more likely to give spurious warnings). The typical worst
> >    case is probably memory bloat.
> >
> > 5. For floats, the situation seems much less dramatic; reduced
> >    precision due to this change should almost never happen (or will
> >    give an overflow warning). Of course, `float32 + large_integer`
> >    might occasionally have upcast to 64-bit previously...
> >
> > 6. By the way: the change would largely revert back to the
> >    behaviour of NumPy <1.6! So if the code is 10 years old it might
> >    suddenly work again. (I expect ancient NumPy used "weak" logic
> >    even for 0-D arrays, so it was much worse.)
> >
> > > I feel like to really understand the impact of this change, you
> > > would need to prepare a set of experimental NumPy wheels that you
> > > publish to PyPI to allow downstream users to run their
> > > applications and see what happens, and engage in outreach efforts
> > > to get them to actually do the testing.
> >
> > Right now, I wanted to prod and see whether pandas devs think that
> > this seems like the right direction, and one that they are willing
> > to work towards.
> >
> > I think in NumPy there is a consensus that value-based logic is
> > very naughty, and some loose consensus that the proposal I posted
> > is the most promising angle for fixing it (maybe quite loose, but I
> > don't expect more insight from NumPy-discussions at this time).
> >
> > Of course I can't be 100% sure that this will pan out, but I can
> > spend my remaining sanity on other things if it becomes obvious
> > that there is serious resistance... This is a side battle for me.
> > But the point is that doing it now may be a unique chance, because
> > if we shelve it now it will become even harder to change. And that
> > probably means shelving it again for another decade or longer.
> >
> > Cheers,
> >
> > Sebastian
> >
> > > - Wes
> > >
> > > On Mon, Mar 8, 2021 at 12:00 PM Sebastian Berg wrote:
> > > >
> > > > Hi all,
> > > >
> > > > Summary/Abstract: I am seriously exploring the idea of
> > > > modifying the NumPy promotion rules to drop the current
> > > > value-based logic. This would probably affect pandas in a
> > > > similar way as it does NumPy, so I am wondering what your
> > > > opinion is on the "value-based" logic and the potential
> > > > "future" logic.
> > > > One of the most annoying things is likely the transition phase
> > > > (see the last part about the many warnings I see in the pandas
> > > > test suite).
> > > >
> > > > ** Long Story: **
> > > >
> > > > I am wondering about the future of type promotion in NumPy [1],
> > > > but this would probably affect pandas just as much.
> > > > The problem is what to do with things like:
> > > >
> > > >     np.array([1, 2], dtype=np.uint8) + 1000
> > > >
> > > > where the result is currently upcast to a `uint16`. The rules
> > > > for this are pretty arcane, however.
> > > >
> > > > There are a few "worse" things that probably do not affect
> > > > pandas as much. That is, the above also happens in this case:
> > > >
> > > >     np.array([1, 2], dtype=np.uint8) + np.int64(1000)
> > > >
> > > > Even though int64 is explicitly typed, we just drop that
> > > > information.
> > > > The weirdest things are probably around float precision:
> > > >
> > > >     np.array([0.3], dtype=np.float32) == 0.3
> > > >     np.array([0.3], dtype=np.float32) == np.float64(0.3)
> > > >
> > > > where the latter would probably go from `True` to `False` due
> > > > to the limited precision of float32. (At least unless we
> > > > explicitly try to counteract this for comparisons.)
> > > >
> > > > ** Solution: **
> > > >
> > > > The basic idea right now is the following:
> > > >
> > > > 1. All objects with NumPy dtypes use those strictly. Scalars or
> > > >    0-D arrays will have no special handling.
> > > > 2. Python integers, floats, and complex numbers are considered
> > > >    to have a special "weak" dtype. In the above example, `1000`
> > > >    or `0.3` would simply be force-cast to `uint8` or `float32`.
> > > >    (Potentially with a warning/error for integer rollover.)
> > > > 3. The "additional" rule that all function calls use
> > > >    `np.asarray()`, which converts Python types. That is,
> > > >    `np.add(uint8_arr, 1000)` would return the same as
> > > >    `np.add(uint8_arr, np.array(1000))`, while
> > > >    `uint8_arr + 1000` would not!
> > > >    (I am not sure about this rule; it could be modified, but it
> > > >    seems easier to limit the "special behaviour" to Python
> > > >    operators.)
> > > >
> > > > I did some initial trials with such behaviour, without issuing
> > > > transition warnings for the "weak" logic (although I expect it
> > > > rarely changes the result), but issuing warnings where Point 1
> > > > probably makes a difference.
> > > >
> > > > To my surprise, the SciPy test suite did not even notice! The
> > > > pandas test suite runs into thousands of warnings (but few or
> > > > no errors), probably mostly due to tests that effectively check
> > > > ufuncs with:
> > > >
> > > >     binary_ufunc(typed_arr, 3)
> > > >
> > > > NumPy does that a lot in its test suite as well. Maybe we can
> > > > deal with it or rethink rule 3.
> > > >
> > > > Cheers,
> > > >
> > > > Sebastian
> > > >
> > > > [1] I have a conundrum. I don't really want to change things
> > > > right now, but I need to reimplement it. Preserving value-based
> > > > logic seems tricky to do without introducing technical debt
> > > > that we will just want to get rid of later anyway...
> > > > _______________________________________________
> > > > Pandas-dev mailing list
> > > > Pandas-dev at python.org
> > > > https://mail.python.org/mailman/listinfo/pandas-dev
> >
> > _______________________________________________
> > Pandas-dev mailing list
> > Pandas-dev at python.org
> > https://mail.python.org/mailman/listinfo/pandas-dev

From jorisvandenbossche at gmail.com Wed Mar 10 10:07:25 2021
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Wed, 10 Mar 2021 16:07:25 +0100
Subject: [Pandas-dev] March 2021 monthly community meeting (today, March 10, UTC 18:00)
Message-ID: 

Hi all,

Late notice, but the next monthly dev call is in a few hours (today,
March 10th) at 18:00 UTC (12 pm Central).
Our calendar is at
https://pandas.pydata.org/docs/development/meeting.html#calendar;
check it for your local time. All are welcome to attend!

Video Call:
https://us02web.zoom.us/j/81542460994?pwd=NktGRGd4aVNYeDZhUi96cVZDeTZmdz09

Minutes:
https://docs.google.com/document/u/1/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit?ouid=102771015311436394588&usp=docs_home&ths=true

Joris

From sebastian at sipsolutions.net Wed Mar 10 12:42:36 2021
From: sebastian at sipsolutions.net (Sebastian Berg)
Date: Wed, 10 Mar 2021 11:42:36 -0600
Subject: [Pandas-dev] Future of NumPy (and pandas) scalar "promotion" (and concatenate-dtype)
In-Reply-To: (sfid-20210310_105259_207187_1212CE22)
References: <35e2f6e7dde83f5d71a2d393fdd2b8644dd90de7.camel@sipsolutions.net> (sfid-20210310_105259_207187_1212CE22)
Message-ID: <698d946a5681dbc19a1742d7e281dccdf971ca7e.camel@sipsolutions.net>

On Wed, 2021-03-10 at 10:52 +0100, Ralf Gommers wrote:
> On Tue, Mar 9, 2021 at 8:10 PM Sebastian Berg
> <sebastian at sipsolutions.net> wrote:
>
> This all sounds quite good and desirable to me (assuming it's
> introduced with the required care, as Wes pointed out).

Sure, there are many details I am slowly hitting. E.g. maybe it's just
too noisy to attempt to change ufunc calls; and NumPy's comparison
behaviour (mainly `==`, but that carries over to the others) is bad,
and we probably need to fix it without any real transition, etc.:
https://github.com/numpy/numpy/issues/10322

> > In general, the other DType will have to decide what to do by
> > implementing the correct promotion/common (concatenate) DType
> > rules. My aim is that through "common DType" + casting rules, we
> > can also allow code like this (which would use the identical
> > logic):
> >
> >     uint8.astype(AbstractFloating) -> float64
> >
> > So, it isn't quite true that we blindly convert ahead of time. That
> > would not always be correct; for example, `timedelta64 / 2` cannot
> > convert the `2` the same way `timedelta64 + 2` might attempt to do
> > it.
>
> Can you elaborate on "it isn't quite true"? It seems to me like the
> conversion is the same (step 1) but it's a two-step process:
>
> 1. `2` is treated as the weak dtype AbstractInteger
> 2. The `/` operator is true division and produces a floating-point
>    dtype as output
>
> Step 2 here is true independent of whether, for `timedelta64 / x`, x
> is a Python scalar or an array.

Yes; what I mean is that you cannot do it in a ufunc-agnostic
pre-processing step (in the sense of "common DType"). Rather, it has
to happen in the ufunc-specific promotion step; in particular, there
is no generic rule that will work for timedelta64 and all ufuncs.
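(To make that concrete, current NumPy already has to special-case this
per ufunc; a quick, runnable illustration on NumPy 1.x:)

    import numpy as np

    td = np.timedelta64(10, "s")

    # For addition, the integer effectively acts as a timedelta:
    print(td + 2)                       # 12 seconds

    # For true division, it must stay a plain integer instead:
    print(td / 2)                       # 5 seconds (still a timedelta!)

    # ... while timedelta / timedelta drops the unit entirely:
    print(td / np.timedelta64(2, "s"))  # 5.0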
Cheers,

Sebastian

PS: Anyway, we are getting beyond my initial intention, which was just
a small probing of appetite and a request for any important insights
that I might be completely missing. So I will try to steer any further
things back to NumPy when they come in.

> Is that correct, or am I missing some subtlety?
>
> Cheers,
> Ralf

From jbrockmendel at gmail.com Sat Mar 27 11:44:32 2021
From: jbrockmendel at gmail.com (Brock Mendel)
Date: Sat, 27 Mar 2021 08:44:32 -0700
Subject: [Pandas-dev] Index Constructor Performance
Message-ID: 

In optimizing the non-cython groupby.apply code
(https://github.com/pandas-dev/pandas/issues/40263,
https://github.com/pandas-dev/pandas/pull/40171#issuecomment-789116039),
I'm finding that an awful lot of overhead is coming from
Index._simple_new*. This email is about what it would take to get rid
of that overhead.

* Note that the particular code snippet being profiled is chosen to be
worst-case for the non-cython path. It ends up creating a _lot_ of
very small Index objects. We don't particularly care about this case,
but I'm thinking about this as micro-optimization of code that affects
just about every use case under the sun.

All of the options I have in mind involve moving some of the
constructors to cython. There is a tradeoff in how invasive that is
vs. how much perf benefit we gain from it.

For a baseline, we can trim 10-13% off the benchmark linked above by
implementing in cython and mixing into NumericIndex (implementation
abbreviated for brevity; the full implementation is 65 lines in
cython):

```
@cython.freelist(32)   # keep a freelist so tiny Index objects are recycled
cdef class NumpyIndex:
    cdef:
        public ndarray _data   # backing ndarray, stored without __dict__

    @classmethod
    def _simple_new(cls, values, name=None):
        ...

    cpdef NumpyIndex _getitem_slice(self, slice slobj):
        ...
```

10-13% is pretty good, but this only affects Int64Index, UInt64Index,
and Float64Index. See Appendix 1 for discussion of what it would take
to extend this to other subclasses.

To get much further than this would require using __cinit__, which
(absent some gymnastics) would require the FooIndex.__new__ methods to
behave a lot more like the existing FooIndex._simple_new methods.
TL;DR: this really isn't feasible absent a) refactoring RangeIndex to
not subclass Int64Index (easy) and b) breaking API changes on the
constructors for affected Index subclasses (hard).

Appendix 1: Extending to Other Subclasses

a) Mixing libindex.NumpyIndex into pd.Index doesn't work because
ExtensionIndex._data is not an ndarray. AFAICT, getting the
performance benefit for object dtype would require implementing a
separate subclass, e.g. ObjectIndex.

b) RangeIndex would not benefit, but something similar could be done
for it following https://github.com/cython/cython/issues/4040 (or if
we basically re-implement range ourselves in cython).

c) MultiIndex could be made to benefit from this by changing ._codes
to be a 2D ndarray instead of a FrozenList of ndarrays. This would
actually allow for some nice cleanups in MultiIndex. The downside is
that the memory footprint may be bigger with mismatched level sizes.

d) With modest additional effort, this can be extended to
DTI/TDI/PI/CategoricalIndex.

Appendix 2: __cinit__

__cinit__ gets called implicitly before __init__ or __new__, and with
whatever arguments are passed to init/new, i.e. we can't do validation
before passing arguments like we could with an explicit
super().__init__(...) call.
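A tiny sketch of that constraint (`Demo` here is hypothetical, not
pandas code):

```
cdef class Demo:
    cdef public object _data

    def __cinit__(self, data, name=None):
        # __cinit__ receives the raw constructor arguments as-is,
        # before __init__ runs -- there is no earlier hook where we
        # could validate or convert `data` first
        self._data = data
```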
For NumpyIndex we _could_ define __cinit__ without breaking the world,
but we wouldn't get much use out of it unless we also tightened what
we accept in the constructor.

Appendix 3: Notes on cython-related constraints

- We cannot mix a cython cdef class into pd.Index, because that will
break 3rd-party subclasses that use object.__new__(cls) (in particular
I'm thinking of xarray's CFDatetimeIndex).
- A Python class cannot inherit from two separate cython cdef classes;
i.e., if we mix something into NumericIndex, that precludes mixing
something else into Int64Index.

From jreback at yahoo.com Tue Mar 30 09:55:18 2021
From: jreback at yahoo.com (Jeff Reback)
Date: Tue, 30 Mar 2021 13:55:18 +0000 (UTC)
Subject: [Pandas-dev] welcome new pandas committer
References: <1769743654.1084182.1617112518789.ref@mail.yahoo.com>
Message-ID: <1769743654.1084182.1617112518789@mail.yahoo.com>

Patrick,

Your contributions and help to others are amazing. We'd love to have
you as a pandas committer and would like to welcome you to the core
team!

Here is a wiki that shows a little about how to engage as a
maintainer:
https://github.com/pandas-dev/pandas/wiki/Maintainer-Overview

Jeff