From simonjayhawkins at gmail.com Mon Jul 5 06:25:59 2021
From: simonjayhawkins at gmail.com (Simon Hawkins)
Date: Mon, 5 Jul 2021 11:25:59 +0100
Subject: [Pandas-dev] ANN: pandas v1.3.0
Message-ID:

Hi all,

The pandas team is pleased to announce the release of pandas 1.3.0. This release includes some new features, bug fixes, and performance improvements. We recommend that all users upgrade to this version.

See the release notes for a list of all the changes.

The release can be installed from PyPI

    python -m pip install --upgrade pandas==1.3.0

Or from conda-forge

    conda install -c conda-forge pandas==1.3.0

Please report any issues with the release on the pandas issue tracker.

Thanks to all the contributors who made this release possible.

From jorisvandenbossche at gmail.com Sun Jul 11 18:58:00 2021
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Mon, 12 Jul 2021 00:58:00 +0200
Subject: [Pandas-dev] Proposal for consistent, clear copy/view semantics in pandas with Copy-on-Write
Message-ID:

*(a.k.a. getting rid of the SettingWithCopyWarning)*

Hi all,

As you are probably aware, it's not always straightforward to understand the copy or view semantics of indexing methods in pandas: when you get a view and when you don't, why you get a SettingWithCopyWarning, or how to get rid of it.

It's also something that has already been discussed regularly (e.g. the discussion and implementation from 2015 started by Nick Eubank at gh-10954). Last year, we again started to discuss this, which is tracked at https://github.com/pandas-dev/pandas/issues/36195. Based on those discussions, I have a concrete proposal to change the copy/view semantics of pandas.

Short summary of the proposal:

1. The result of *any* indexing operation (subsetting a DataFrame or Series in any way) or any method returning a new DataFrame, always *behaves as if it were* a copy in terms of user API.
2. We implement Copy-on-Write. This way, we can actually use views as much as possible under the hood, while ensuring the user API behaves as a copy.

This addresses multiple aspects: 1) a clear and consistent user API (a clear rule: *any* subset or returned series/dataframe is *always* a copy of the original, and thus never modifies the original) and 2) improving performance by avoiding excessive copies (e.g. a chained method workflow would no longer return an actual data copy at each step).

Longer version of this proposal: https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit?usp=sharing
Proof-of-concept implementation: https://github.com/pandas-dev/pandas/pull/41878
GitHub issue with relevant discussion: https://github.com/pandas-dev/pandas/issues/36195

*Since this would be a change with a large impact on users, I think it is important to get broad feedback on this*. So comments, thoughts, concerns, ideas etc. are very welcome (you can comment on the Google doc, reply to this email, or comment on the GitHub issue).

Best,
Joris

From garcia.marc at gmail.com Mon Jul 12 13:42:13 2021
From: garcia.marc at gmail.com (Marc Garcia)
Date: Mon, 12 Jul 2021 11:42:13 -0600
Subject: [Pandas-dev] Proposal for consistent, clear copy/view semantics in pandas with Copy-on-Write
In-Reply-To:
References:
Message-ID:

+1 on the approach of the proposal, and also +1 to release in a major version, and not raise deprecation warnings.

Thanks for working on this, it'll make users' lives much easier.

On Sun, Jul 11, 2021 at 4:58 PM Joris Van den Bossche <jorisvandenbossche at gmail.com> wrote:

> *(a.k.a. getting rid of the SettingWithCopyWarning)*
>
> Hi all,
>
> As you are probably aware, it's not always straightforward to understand
> the copy or view semantics of indexing methods in pandas.
> > Best, > Joris > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wesmckinn at gmail.com Mon Jul 12 14:28:56 2021 From: wesmckinn at gmail.com (Wes McKinney) Date: Mon, 12 Jul 2021 13:28:56 -0500 Subject: [Pandas-dev] Proposal for consistent, clear copy/view semantics in pandas with Copy-on-Write In-Reply-To: References: Message-ID: I think this is an important initiative, and I indeed wish we had designed around copy-on-write ideas from the very beginning. As one protection against improper mutation of views, it may be necessary to introduce defensive copies into APIs that expose internal data, e.g. NumPy arrays that are slices of the parent, or who have had slices taken of them. On Mon, Jul 12, 2021 at 12:42 PM Marc Garcia wrote: > > +1 on the approach of the proposal, and also +1 to release in a major version, and not raise deprecation warnings. > > Thanks for working on this, it'll make users life much easier. > > On Sun, Jul 11, 2021 at 4:58 PM Joris Van den Bossche wrote: >> >> (a.k.a. getting rid of the SettingWithCopyWarning) >> >> Hi all, >> >> As you are probably aware, it's not always straightforward to understand the copy or view semantics of indexing methods in pandas. To understand when you get a view and when not, or why you get a SettingWithCopyWarning or how to get rid of it? >> It's also something that has already been discussed regularly (e.g. the discussion and implementation from 2015 started by Nick Eubank at gh-10954). Last year, we again started to discuss this, which is tracked at https://github.com/pandas-dev/pandas/issues/36195. Based on those discussions, I have a concrete proposal to change the copy/view semantics of pandas. 
>> >> Short summary of the proposal: >> >> The result of any indexing operation (subsetting a DataFrame or Series in any way) or any method returning a new DataFrame, always behaves as if it were a copy in terms of user API. >> We implement Copy-on-Write. This way, we can actually use views as much as possible under the hood, while ensuring the user API behaves as a copy. >> >> This addresses multiple aspects: 1) a clear and consistent user API (a clear rule: any subset or returned series/dataframe is always a copy of the original, and thus never modifies the original) and 2) improving performance by avoiding excessive copies (eg a chained method workflow would no longer return an actual data copy at each step). >> >> Longer version of this proposal: https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit?usp=sharing >> Proof-of-concept implementation: https://github.com/pandas-dev/pandas/pull/41878 >> GitHub issue with relevant discussion: https://github.com/pandas-dev/pandas/issues/36195 >> >> Since this would be a change with a large impact on users, I think it is important to get broad feedback on this. So comments, thoughts, concerns, ideas etc are very welcome (you can comment on the google doc, answer to this email or on the github issue). >> >> Best, >> Joris >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev From shoyer at gmail.com Mon Jul 12 23:55:01 2021 From: shoyer at gmail.com (Stephan Hoyer) Date: Mon, 12 Jul 2021 20:55:01 -0700 Subject: [Pandas-dev] Proposal for consistent, clear copy/view semantics in pandas with Copy-on-Write In-Reply-To: References: Message-ID: I agree with Wes and Marc. This is an important change for the long term future of pandas. 
> >> > >> This addresses multiple aspects: 1) a clear and consistent user API (a > clear rule: any subset or returned series/dataframe is always a copy of the > original, and thus never modifies the original) and 2) improving > performance by avoiding excessive copies (eg a chained method workflow > would no longer return an actual data copy at each step). > >> > >> Longer version of this proposal: > https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit?usp=sharing > >> Proof-of-concept implementation: > https://github.com/pandas-dev/pandas/pull/41878 > >> GitHub issue with relevant discussion: > https://github.com/pandas-dev/pandas/issues/36195 > >> > >> Since this would be a change with a large impact on users, I think it > is important to get broad feedback on this. So comments, thoughts, > concerns, ideas etc are very welcome (you can comment on the google doc, > answer to this email or on the github issue). > >> > >> Best, > >> Joris > >> _______________________________________________ > >> Pandas-dev mailing list > >> Pandas-dev at python.org > >> https://mail.python.org/mailman/listinfo/pandas-dev > > > > _______________________________________________ > > Pandas-dev mailing list > > Pandas-dev at python.org > > https://mail.python.org/mailman/listinfo/pandas-dev > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Fri Jul 16 08:15:29 2021 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Fri, 16 Jul 2021 14:15:29 +0200 Subject: [Pandas-dev] Proposal for consistent, clear copy/view semantics in pandas with Copy-on-Write In-Reply-To: References: Message-ID: Thanks for the feedback. Regarding protection against improper mutation of views via numpy (or in general arrays), that's indeed a risk. 
Since this is Python, a user will always find some (private) way to incorrectly mutate data without triggering the copy-on-write paths, but there are indeed some ways we could try to prevent that. Listing the possible ways to get the "array" data from DataFrame/Series objects:

* Series.values / Series.array -> returning a numpy array or pandas ExtensionArray. These currently return the stored data as mutable arrays, as is (or as views). Mutating such an array wouldn't trigger Copy-on-Write, which is managed on the DataFrame/Series level. To prevent users from doing this, we could return those arrays as "read-only"? (to avoid always doing a defensive copy here)
* Series.to_numpy() -> returning a numpy array. This method has a `copy` keyword that currently defaults to False. We could either make this copy=True by default, or, similarly to the above, make the result read-only by default, leaving the copy=True/False options to choose from explicitly.
* DataFrame.to_numpy() / DataFrame.values -> returning a 2D numpy array, which is by definition always a copy (by concatenating multiple 1D arrays), except for the 1-column case, where it could still be a view. For simplicity, I would make this case return a copy as well (a user who wants a view can get the Series). Alternatively, this case could follow the logic of Series.to_numpy above.

On Tue, 13 Jul 2021 at 05:55, Stephan Hoyer wrote: > I agree with Wes and Marc. This is an important change for the long term > future of pandas. > > On Mon, Jul 12, 2021 at 11:29 AM Wes McKinney wrote: > >> I think this is an important initiative, and I indeed wish we had >> designed around copy-on-write ideas from the very beginning. >> >> As one protection against improper mutation of views, it may be >> necessary to introduce defensive copies into APIs that expose internal >> data, e.g. NumPy arrays that are slices of the parent, or who have had >> slices taken of them.
>> >> >> >> Longer version of this proposal: >> https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit?usp=sharing >> >> Proof-of-concept implementation: >> https://github.com/pandas-dev/pandas/pull/41878 >> >> GitHub issue with relevant discussion: >> https://github.com/pandas-dev/pandas/issues/36195 >> >> >> >> Since this would be a change with a large impact on users, I think it >> is important to get broad feedback on this. So comments, thoughts, >> concerns, ideas etc are very welcome (you can comment on the google doc, >> answer to this email or on the github issue). >> >> >> >> Best, >> >> Joris >> >> _______________________________________________ >> >> Pandas-dev mailing list >> >> Pandas-dev at python.org >> >> https://mail.python.org/mailman/listinfo/pandas-dev >> > >> > _______________________________________________ >> > Pandas-dev mailing list >> > Pandas-dev at python.org >> > https://mail.python.org/mailman/listinfo/pandas-dev >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Fri Jul 16 08:23:35 2021 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Fri, 16 Jul 2021 14:23:35 +0200 Subject: [Pandas-dev] Proposal for consistent, clear copy/view semantics in pandas with Copy-on-Write In-Reply-To: References: Message-ID: On Mon, 12 Jul 2021 at 00:58, Joris Van den Bossche < jorisvandenbossche at gmail.com> wrote: > Short summary of the proposal: > > 1. 
The result of *any* indexing operation (subsetting a DataFrame or > Series in any way) or any method returning a new DataFrame, always *behaves > as if it were* a copy in terms of user API.

To explicitly call out the column-as-Series case (since this is a typical case that right now *always* is a view): "any" indexing operation thus also includes accessing a DataFrame column as a Series (or slicing a Series).

So something like s = df["col"] and then mutating s will no longer update df. Similarly for series_subset = series[1:5], mutating series_subset will no longer update series.

From jbrockmendel at gmail.com Fri Jul 16 12:02:19 2021
From: jbrockmendel at gmail.com (Brock Mendel)
Date: Fri, 16 Jul 2021 09:02:19 -0700
Subject: [Pandas-dev] Proposal for consistent, clear copy/view semantics in pandas with Copy-on-Write
In-Reply-To:
References:
Message-ID:

[xposting from https://github.com/pandas-dev/pandas/issues/36195]

I'm glad there is a proof of concept to help clarify what this looks like.

I do not like the fact that nothing can ever be "just a view" with these semantics, including series[::-1], frame[col], frame[:]. Users reasonably expect numpy semantics for these.

We should revisit the alternative "clear/simple rules" approach that is "indexing on columns always gives a view" (https://github.com/pandas-dev/pandas/pull/33597). This is simpler to explain/grok, simpler to implement, and not dependent on BlockManager vs ArrayManager.

On Fri, Jul 16, 2021 at 5:26 AM Joris Van den Bossche <jorisvandenbossche at gmail.com> wrote:

> On Mon, 12 Jul 2021 at 00:58, Joris Van den Bossche <
> jorisvandenbossche at gmail.com> wrote:
>
>> Short summary of the proposal:
>>
>> 1. The result of *any* indexing operation (subsetting a DataFrame or
>> Series in any way) or any method returning a new DataFrame, always *behaves
>> as if it were* a copy in terms of user API.
>> >> To explicitly call out the column-as-Series case (since this is a > typical case that right now *always* is a view): "any" indexing operation > thus also included accessing a DataFrame column as a Series (or slicing a > Series). > > So something like s = df["col"] and then mutating s will no longer update > df. Similarly for series_subset = series[1:5], mutating series_subset > will no longer update s.

From tom.augspurger88 at gmail.com Fri Jul 16 12:28:23 2021
From: tom.augspurger88 at gmail.com (Tom Augspurger)
Date: Fri, 16 Jul 2021 11:28:23 -0500
Subject: [Pandas-dev] Proposal for consistent, clear copy/view semantics in pandas with Copy-on-Write
In-Reply-To:
References:
Message-ID:

On Fri, Jul 16, 2021 at 11:04 AM Brock Mendel wrote:

> [xposting from https://github.com/pandas-dev/pandas/issues/36195]
>
> I'm glad there is a proof of concept to help clarify what this looks like.
>
> I do not like the fact that nothing can ever be "just a view" with these
> semantics, including series[::-1], frame[col], frame[:]. Users reasonably
> expect numpy semantics for these.

I wonder if we can validate what users (new and old) *actually* expect? Users coming from R, which IIRC implements Copy on Write for matrices, might be OK with indexing always being (behaving like) a copy. I'm not sure what users coming from NumPy would expect, since I don't know how many NumPy users really understand *a.)* when a NumPy slice is a view or a copy, and *b.)* how a pandas indexing operation translates to a NumPy slice.

> We should revisit the alternative "clear/simple rules" approach that is
> "indexing on columns always gives a view" (
> https://github.com/pandas-dev/pandas/pull/33597).
This is simpler to > explain/grok, simpler to implement, and not dependent on BlockManager vs > ArrayManager. > > On Fri, Jul 16, 2021 at 5:26 AM Joris Van den Bossche < > jorisvandenbossche at gmail.com> wrote: > >> >> >> On Mon, 12 Jul 2021 at 00:58, Joris Van den Bossche < >> jorisvandenbossche at gmail.com> wrote: >> >>> Short summary of the proposal: >>> >>> 1. The result of *any* indexing operation (subsetting a DataFrame or >>> Series in any way) or any method returning a new DataFrame, always *behaves >>> as if it were* a copy in terms of user API. >>> >>> To explicitly call out the column-as-Series case (since this is a >> typical case that right now *always* is a view): "any" indexing >> operation thus also included accessing a DataFrame column as a Series (or >> slicing a Series). >> >> So something like s = df["col"] and then mutating s will no longer >> update df. Similarly for series_subset = series[1:5], mutating >> series_subset will no longer update s. >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Fri Jul 16 12:58:17 2021 From: shoyer at gmail.com (Stephan Hoyer) Date: Fri, 16 Jul 2021 09:58:17 -0700 Subject: [Pandas-dev] Proposal for consistent, clear copy/view semantics in pandas with Copy-on-Write In-Reply-To: References: Message-ID: On Fri, Jul 16, 2021 at 9:04 AM Brock Mendel wrote: > I do not like the fact that nothing can ever be "just a view" with these > semantics, including series[::-1], frame[col], frame[:]. Users reasonably > expect numpy semantics for these. 
> > We should revisit the alternative "clear/simple rules" approach that is > "indexing on columns always gives a view" ( > https://github.com/pandas-dev/pandas/pull/33597). This is simpler to > explain/grok, simpler to implement, and not dependent on BlockManager vs > ArrayManager. > I don't know if it is worth the trouble for complex multi-column selections, but I do see the appeal here. A simpler variant would be to make indexing out a single Series from a DataFrame return a view, with everything else doing copy on write. Then the existing pattern df.column_one[:] = ... would still work. > > On Fri, Jul 16, 2021 at 5:26 AM Joris Van den Bossche < > jorisvandenbossche at gmail.com> wrote: > >> >> >> On Mon, 12 Jul 2021 at 00:58, Joris Van den Bossche < >> jorisvandenbossche at gmail.com> wrote: >> >>> Short summary of the proposal: >>> >>> 1. The result of *any* indexing operation (subsetting a DataFrame or >>> Series in any way) or any method returning a new DataFrame, always *behaves >>> as if it were* a copy in terms of user API. >>> >>> To explicitly call out the column-as-Series case (since this is a >> typical case that right now *always* is a view): "any" indexing >> operation thus also included accessing a DataFrame column as a Series (or >> slicing a Series). >> >> So something like s = df["col"] and then mutating s will no longer >> update df. Similarly for series_subset = series[1:5], mutating >> series_subset will no longer update s. >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... 
From irv at princeton.com Fri Jul 16 14:49:46 2021
From: irv at princeton.com (Irv Lustig)
Date: Fri, 16 Jul 2021 14:49:46 -0400
Subject: [Pandas-dev] Proposal for consistent, clear copy/view semantics in pandas with Copy-on-Write
In-Reply-To:
References:
Message-ID:

Tom Augspurger wrote:

> I wonder if we can validate what users (new and old) *actually* expect?
> Users coming from R, which IIRC implements Copy on Write for matrices,
> might be OK with indexing always being (behaving like) a copy.
> I'm not sure what users coming from NumPy would expect, since I don't know
> how many NumPy users really understand *a.)* when a NumPy slice is a view
> or copy, and *b.)* how a pandas indexing operation translates to a NumPy
> slice.

IMHO, we should concentrate on the "new" users. For my team, there is no numpy or R background. They learn pandas, and what pandas does needs to be really clear in behavior and documentation. I would also hazard a guess that most pandas users are like that - pandas is the first tool they see, not numpy or R.

The places where I think confusion could happen are things like this with a DataFrame df:

1. s = df["a"]
2. s.iloc[3:6] = [1, 2, 3]
3. df["a"].iloc[3:6] = [1, 2, 3]
4. df["b"] = df["a"]
5. df["b"].iloc[3:6] = [4, 5, 6]
6. s2 = df["b"]
7. df["c"] = s2
8. s2.iloc[3:6] = [7, 8, 9]

As I understand it (please correct me if I'm wrong), these lines would be interpreted as follows with the current proposal:

1. Creates a view into the DataFrame df. No copying is done at all.
2. Modifies the series s and the underlying DataFrame df. (copy-on-write)
3. Modifies the dataframe.
4. Copies the series from "a" to "b".
5. Modifies "b" in the DataFrame, but not "a".
6. Creates a view into the DataFrame df. No copying is done at all.
7. Copies the series from "b" to "c".
8. Modifies s2, which modifies "b", but NOT "c".

I think the challenge is explaining the sequence 6, 7, 8 above in comparison to the other sequences.
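One way to reason about the eight steps above is to apply the rule stated elsewhere in this thread (the result of any indexing operation behaves as a copy) to a toy model. The sketch below is not pandas: `ToyFrame` is a hypothetical dict-of-lists stand-in that copies eagerly on every read and write, and the slices use `3:6` so that three values fit. Under that assumption, steps 2, 3, 5 and 8 would leave `df` untouched, which differs from the reading given in the list above; how the actual proposal treats each step is for the proposal document to settle.

```python
class ToyFrame:
    """Hypothetical stand-in for a DataFrame: column name -> list of values.

    Assumption: every read and every write copies eagerly, so each result
    "behaves as a copy". Real Copy-on-Write would delay these copies.
    """

    def __init__(self, columns):
        self._columns = {name: list(values) for name, values in columns.items()}

    def __getitem__(self, name):
        # Reading a column hands back an independent list.
        return list(self._columns[name])

    def __setitem__(self, name, values):
        # Storing a column keeps an independent copy of the values.
        self._columns[name] = list(values)


df = ToyFrame({"a": [0, 0, 0, 0, 0, 0]})

s = df["a"]                 # 1. s behaves as a copy of column "a"
s[3:6] = [1, 2, 3]          # 2. changes s only
df["a"][3:6] = [1, 2, 3]    # 3. chained assignment: changes a temporary only
df["b"] = df["a"]           # 4. new column "b" with the values of "a"
df["b"][3:6] = [4, 5, 6]    # 5. chained assignment again: no effect on df
s2 = df["b"]                # 6. s2 behaves as a copy of column "b"
df["c"] = s2                # 7. new column "c" with the values of s2
s2[3:6] = [7, 8, 9]         # 8. changes s2 only, neither "b" nor "c"

print(s)    # [0, 0, 0, 1, 2, 3]
print(s2)   # [0, 0, 0, 7, 8, 9]
print(df["a"], df["b"], df["c"])  # all three columns are still zeros
```

In this model the sequence 6, 7, 8 needs no special explanation: it follows the same rule as steps 1 and 2.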
-Irv

From jorisvandenbossche at gmail.com Sat Jul 17 11:16:32 2021
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Sat, 17 Jul 2021 17:16:32 +0200
Subject: [Pandas-dev] Proposal for consistent, clear copy/view semantics in pandas with Copy-on-Write
In-Reply-To:
References:
Message-ID:

On Fri, 16 Jul 2021 at 18:58, Stephan Hoyer wrote:

> On Fri, Jul 16, 2021 at 9:04 AM Brock Mendel wrote:
>
>> I do not like the fact that nothing can ever be "just a view" with these
>> semantics, including series[::-1], frame[col], frame[:]. Users reasonably
>> expect numpy semantics for these.

I am personally not sure what "users" in general expect for those (as Tom and Irv already mentioned, they might expect different things depending on their background). For example, a user who knows basic Python could actually expect all those examples to give a copy, since `a_list[:]` is a typical way to make a copy of a list. (It might be interesting to reach out to educators, who might have more experience with the expectations and typical errors of novice users, or to do some kind of experiment on this topic.)

Personally, I cannot remember that I ever relied on the mutability aspect of e.g. `series[1:3]` or `frame[:]` being a view. I think there are generally 2 reasons for users caring about a view: 1) for performance (less copying) and 2) for being able to mutate the view with the explicit goal to mutate the parent (and not as an irrelevant side-effect). I think the first reason is by far the most common one (but that's my subjective opinion from my experience using pandas, so that can certainly vary), and in the current proposal, all of the examples mentioned will be actual views under the hood (and thus cover this first reason).

The only case where I know I explicitly rely on this is with chained assignment (e.g. `frame[col][1:3] = ..`).
That's certainly a very important use case (and probably the most impacted usage pattern with the current proposal), but it's also a case where 1) there is a clear alternative (don't use chained assignment, but do it in one step, e.g. `frame.loc[1:3, col] = ..`; some corner cases of mixed positional/label-based indexing aside, for which we should find an alternative) and 2) we might be able to detect this and raise an informative error message (specifically for chained assignment). I think it can be easier to explain "chained assignment never works" than "chained assignment only works if first selecting the column(s)" (depending on the exact rules).

>> We should revisit the alternative "clear/simple rules" approach that is
>> "indexing on columns always gives a view" (
>> https://github.com/pandas-dev/pandas/pull/33597). This is simpler to
>> explain/grok, simpler to implement
>
> I don't know if it is worth the trouble for complex multi-column
> selections, but I do see the appeal here.
>
> A simpler variant would be to make indexing out a single Series from a
> DataFrame return a view, with everything else doing copy on write. Then the
> existing pattern df.column_one[:] = ... would still work.

I was initially thinking about this as well. In the end, I didn't (yet) try to implement this, because while thinking it through, it seemed that this might give rise to quite a few tricky cases. Consider the following example:

    df = pd.DataFrame(..)
    df_subset = df[["col1", "col2"]]
    s1 = df["col1"]
    s1_subset = s1[0:3]
    # modifying s1 should modify df, but not df_subset and s1_subset?
    s1[0] = 0

If we take "only accessing a single Series from a DataFrame is a view, everything else uses copy-on-write", that gives rise to questions like the above, where some parents/children get modified and some do not. This is both harder to explain to users and harder to implement.
In the proof-of-concept implementation, the copy-on-write happens "locally" in the series/dataframe that gets modified (meaning: when modifying a given object, its internal array data first gets copied and replaced *if* the object is viewing another or is being viewed by another object). In the above case, by contrast, modifying a given object would need to trigger a copy in other (potentially many) objects, and not in the object being modified. It's probably possible to implement this, but certainly harder/trickier to do.

>> On Fri, Jul 16, 2021 at 5:26 AM Joris Van den Bossche <jorisvandenbossche at gmail.com> wrote:
>>
>>> On Mon, 12 Jul 2021 at 00:58, Joris Van den Bossche <jorisvandenbossche at gmail.com> wrote:
>>>
>>>> Short summary of the proposal:
>>>>
>>>> 1. The result of *any* indexing operation (subsetting a DataFrame or Series in any way) or any method returning a new DataFrame, always *behaves as if it were* a copy in terms of user API.
>>>
>>> To explicitly call out the column-as-Series case (since this is a typical case that right now *always* is a view): "any" indexing operation thus also includes accessing a DataFrame column as a Series (or slicing a Series).
>>>
>>> So something like s = df["col"] and then mutating s will no longer update df. Similarly for series_subset = series[1:5], mutating series_subset will no longer update series.
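
Editorial sketch (not part of the original message): the "local" copy-on-write rule described above can be illustrated with a toy model in plain Python. The `Column` class, its `view()` method and the `_sharers` bookkeeping are hypothetical names made up for this illustration; they are not pandas API, and the proof-of-concept PR may well track references differently. The sketch assumes one weak set per shared buffer: before a write, the object being written to copies its buffer if any other live object shares it, and nothing is copied in the other objects.

```python
import weakref


class Column:
    """Toy 1-D container; a plain list stands in for the array buffer."""

    def __init__(self, data, _sharers=None):
        self._data = data
        # Weak set of all live Column objects that share self._data.
        self._sharers = weakref.WeakSet() if _sharers is None else _sharers
        self._sharers.add(self)

    def view(self):
        """Return a new Column sharing the same buffer (no copy yet)."""
        return Column(self._data, _sharers=self._sharers)

    def __setitem__(self, index, value):
        if len(self._sharers) > 1:
            # Viewing or being viewed: copy locally, then detach.
            self._sharers.discard(self)
            self._data = list(self._data)
            self._sharers = weakref.WeakSet([self])
        self._data[index] = value

    def to_list(self):
        return list(self._data)


parent = Column([10, 20, 30])
buffer = parent._data
child = parent.view()       # like s = df["col"]: shares the buffer
child[0] = 99               # child copies its own data before writing
parent[1] = 5               # no sharer left, so parent writes in place

print(parent.to_list(), child.to_list())
```

In this toy model the write to `child` never touches `parent`, and the later write to `parent` needs no copy because nothing shares its buffer any more. Weak references are used so that a view that has been garbage collected no longer forces a copy.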
From garcia.marc at gmail.com Sat Jul 17 12:12:32 2021
From: garcia.marc at gmail.com (Marc Garcia)
Date: Sat, 17 Jul 2021 10:12:32 -0600
Subject: [Pandas-dev] Proposal for consistent, clear copy/view semantics in pandas with Copy-on-Write
In-Reply-To:
References:
Message-ID:

Based on my experience (not sure how biased it is), modifying dataframes with something like `df[col][1:3] = ...` (or the equivalent with `.loc`) is rare, except for boolean arrays. From my experience, when the values of a dataframe column are changed, what I think is way more common is to use `df[col] = df[col].str.upper()`, `df[col] = df[col].fillna(0)`...

While I'm personally happy with Joris's proposal, I see two other options that could complement or replace it:

Option 1) Deprecate assigning to a subset of rows, and only allow assigning to whole columns. Something like `df[col][1:3] = ...` could be replaced by, for example, `df[col] = df[col].mask(slice(1, 3), ...)`. Using `mask` and `where` is already supported for boolean arrays, so support for slices would need to be added, and they'd be the only way to replace a subset of values. I think that makes the problem narrower, and easier to understand for users. The main thing to decide and be clear about is what happens if the dataframe is a subset of another one:

```
df2 = df[cond]
df2[col] = df2[col].str.upper()
```

Option 2) If assigning with the current syntax (`df[col][1:3] = ...` or the `.loc` equivalent) is something we want to keep (I wouldn't if we move in this direction), maybe it could be moved to a `DataFrame` subclass. That way, the main dataframe class behaves like in option 1, so expectations are much easier to manage. But users who really want to assign with indexing can still do so, knowing that having a mutable dataframe comes at a cost (copies, more complex behavior...). The `MutableDataFrame` could be in pandas, or a third-party extension.

```
df_mutable = df.to_mutable()
df_mutable[col][1:3] = ...
```

From jorisvandenbossche at gmail.com Sat Jul 17 14:51:39 2021
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Sat, 17 Jul 2021 20:51:39 +0200
Subject: [Pandas-dev] Proposal for consistent, clear copy/view semantics in pandas with Copy-on-Write
Message-ID:

On Fri, 16 Jul 2021 at 20:50, Irv Lustig wrote:

> Tom Augspurger wrote:
>
>> I wonder if we can validate what users (new and old) *actually* expect? Users coming from R, which IIRC implements Copy on Write for matrices, might be OK with indexing always being (behaving like) a copy. I'm not sure what users coming from NumPy would expect, since I don't know how many NumPy users really understand *a.)* when a NumPy slice is a view or copy, and *b.)* how a pandas indexing operation translates to a NumPy slice.
>
> IMHO, we should concentrate on the "new" users. For my team, there is no numpy or R background. They learn pandas, and what pandas does needs to be really clear in behavior and documentation. I would also hazard a guess that most pandas users are like that - pandas is the first tool they see, not numpy or R.
>
> The places where I think confusion could happen are things like this with a DataFrame df:
>
> 1. s = df["a"]
> 2. s.iloc[3:5] = [1, 2, 3]
> 3. df["a"].iloc[3:5] = [1, 2, 3]
> 4. df["b"] = df["a"]
> 5. df["b"].iloc[3:5] = [4, 5, 6]
> 6. s2 = df["b"]
> 7. df["c"] = s2
> 8. s2.iloc[3:5] = [7, 8, 9]
>
> As I understand it (please correct me if I'm wrong), these lines would be interpreted as follows with the current proposal:

It's a bit different (to reiterate, with the *current* proposal, *any* indexing operation (including series selection) behaves as a copy; and also to be clear, this is one possible proposal, there are certainly other possibilities). Answering case by case:

> 1. s = df["a"]
> Creates a view into the DataFrame df. No copying is done at all

Indeed a view (but that's an implementation detail).

> 2. s.iloc[3:5] = [1, 2, 3]
> Modifies the series s and the underlying DataFrame df. (copy-on-write)

Due to copy-on-write, it does *not* modify the DataFrame df. Copy-on-write means that only when s is being written to, its data get copied (so at that point breaking the view-relation with the parent df).

> 3. df["a"].iloc[3:5] = [1, 2, 3]
> Modifies the dataframe

This is an example of chained assignment, which in the current proposal never works (see the example in the google doc). This is because chained assignment can always be written as:

    temp = df["a"]
    temp.iloc[3:5] = [1, 2, 3]

and `temp` uses copy-on-write (and then it is the same example as the one above in 2.). (What you describe is the current behaviour of pandas.)

> 4. df["b"] = df["a"]
> Copies the series from "a" to "b"

It would indeed behave as a copy, but under the hood we can actually keep this as a view (delay the copy thanks to copy-on-write).

> 5. df["b"].iloc[3:5] = [4, 5, 6]
> Modifies "b" in the DataFrame, but not "a"

Also doesn't modify "b" (see example 3. above), but indeed does not modify "a".

> 6. s2 = df["b"]
> Create a view into the DataFrame df. No copying is done at all.

Same as 1.

> 7. df["c"] = s2
> Copies the series from "b" to "c"

Same as 4.

> 8. s2.iloc[3:5] = [7, 8, 9]
> Modifies s2, which modifies "b", but NOT "c"

Doesn't modify "b" and "c". Similar to 3.
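
Editorial sketch (not part of the original message): the eight cases above can be tried out in a pandas version that ships an opt-in Copy-on-Write mode. This assumes pandas 2.x, where the mode is enabled through the `mode.copy_on_write` option; that option did not exist when this thread was written, and the behaviour that was eventually released may differ in details from the proposal discussed here. The column contents are made up, and two values are assigned per step because the slice `3:5` selects two rows.

```python
import warnings

import pandas as pd

# Assumption: pandas 2.x, where Copy-on-Write is opt-in through this option.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame({"a": list(range(10))})
original = df["a"].tolist()

with warnings.catch_warnings():
    # Chained assignment may emit a warning in this mode; silenced here.
    warnings.simplefilter("ignore")

    s = df["a"]                    # case 1
    s.iloc[3:5] = [1, 2]           # case 2: s gets its own data, df untouched
    df["a"].iloc[3:5] = [1, 2]     # case 3: chained assignment, df untouched

    df["b"] = df["a"]              # case 4
    df["b"].iloc[3:5] = [4, 5]     # case 5: neither "a" nor "b" changes
    s2 = df["b"]                   # case 6
    df["c"] = s2                   # case 7
    s2.iloc[3:5] = [7, 8]          # case 8: only s2 changes

print(s.tolist())
print(s2.tolist())
print(df)
```

Under these assumptions only `s` and `s2` end up modified, while all three columns of `df` keep their original values, which matches the case-by-case answers above.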
> I think the challenge is explaining the sequence 6,7,8 above in comparison to the other sequences.

So with the current proposal, the sequence 6, 7, 8 actually doesn't behave differently. But it is mainly 2 and 3 that would be quite different compared to the current pandas behaviour.

> -Irv

From adrin.jalali at gmail.com Tue Jul 20 10:10:10 2021
From: adrin.jalali at gmail.com (Adrin)
Date: Tue, 20 Jul 2021 16:10:10 +0200
Subject: [Pandas-dev] Proposal for consistent, clear copy/view semantics in pandas with Copy-on-Write
In-Reply-To:
References:
Message-ID:

I guess one question I have is what are the memory and time performance implications of the proposed change.

I guess I belong to the group of users who think of a pandas DataFrame more as a numpy array with column names attached to them, and hence I'd expect very similar semantics when indexing, and I think copy on write semantics would have a significant impact on our workflows.

From jorisvandenbossche at gmail.com Fri Jul 23 15:31:33 2021
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Fri, 23 Jul 2021 21:31:33 +0200
Subject: [Pandas-dev] Proposal for consistent, clear copy/view semantics in pandas with Copy-on-Write
In-Reply-To:
References:
Message-ID:

On Tue, 20 Jul 2021 at 16:10, Adrin wrote:

> I guess one question I have is what are the memory and time performance implications of the proposed change.

Memory implications should be positive (less copying).
The performance impact of the additional logic (adding/checking of the weak references) is something I haven't checked yet (it's on my to-do list), but I suspect it will not be significant.

Based on your comment of "numpy array with column names", I think the potential change of the ArrayManager is much more relevant for you than the Copy-on-Write. And to be clear, the current proposal is not tied to the ArrayManager (it's only the proof of concept that is implemented for that). So I would prefer to keep the discussion focused on the copy/view semantics, at least for now (it's only later, when discussing practical ways to get this released, that we need to decide whether we want to combine this with an ArrayManager refactor or not).

> I guess I belong to the group of users who think of a pandas DataFrame more as a numpy array with column names attached to them, and hence I'd expect very similar semantics when indexing, and I think copy on write semantics would have a significant impact on our workflows.

I assume you are also thinking of scikit-learn-like workflows? Can you give an example of how you think copy-on-write impacts such (or other) workflows? In any case, thanks already for your feedback!

From jbrockmendel at gmail.com Fri Jul 23 16:09:46 2021
From: jbrockmendel at gmail.com (Brock Mendel)
Date: Fri, 23 Jul 2021 13:09:46 -0700
Subject: [Pandas-dev] Proposal for consistent, clear copy/view semantics in pandas with Copy-on-Write
In-Reply-To:
References:
Message-ID:

> Memory implications should be positive (less copying).

This is accurate _only_ in cases where we currently make copies. In cases where we currently make views, the perf effect goes the other way.

On the flip side, Always-Views improves perf in cases where we currently make copies, but if you want a copy then you'll have to make one explicitly, which will claw back that gain. (In the long-out-of-date proof of concept https://github.com/pandas-dev/pandas/pull/33597, df[np.random.randint(0, 30, 30)] was ~92% faster than the status quo at the time.)

> The performance impact of the additional logic (adding/checking of the weak references) is something I didn't yet check (on my to do list), but I suspect it to not be significant.

Agreed, the CoW logic itself should be negligible outside of microbenchmarks.

On Fri, Jul 23, 2021 at 12:32 PM Joris Van den Bossche <jorisvandenbossche at gmail.com> wrote:

> On Tue, 20 Jul 2021 at 16:10, Adrin wrote:
>
>> I guess one question I have is what are the memory and time performance implications of the proposed change.
>
> Memory implications should be positive (less copying). The performance impact of the additional logic (adding/checking of the weak references) is something I didn't yet check (on my to do list), but I suspect it to not be significant.
>
> Based on your comment of "numpy array with column names", I think the potential change of the ArrayManager is much more relevant for you than the Copy-on-Write. And to be clear, the current proposal is not tied to the ArrayManager (it's only the proof of concept that is implemented for that).
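Brock's point about views and copies can be illustrated with a toy model. The following is a minimal sketch, not pandas' implementation: `CowSeries`, `_Block`, `view` and `shares_memory_with` are hypothetical names invented for this illustration, and a `weakref.WeakSet` stands in for the weak-reference bookkeeping mentioned in the thread. It shows a derived object sharing data with its parent until its first write, at which point it pays for a copy that a plain view would not need.

```python
import weakref


class _Block:
    """Holds the actual values plus weak references to every object using them."""

    def __init__(self, values):
        self.values = values
        self.owners = weakref.WeakSet()


class CowSeries:
    """Toy 1-D container with copy-on-write semantics (illustration only)."""

    def __init__(self, values):
        self._attach(_Block(list(values)))

    def _attach(self, block):
        self._block = block
        block.owners.add(self)

    def view(self):
        """Return a new object sharing the same data; nothing is copied yet."""
        other = CowSeries.__new__(CowSeries)
        other._attach(self._block)
        return other

    def shares_memory_with(self, other):
        return self._block is other._block

    def __getitem__(self, i):
        return self._block.values[i]

    def __setitem__(self, i, value):
        # The copy happens "locally", in the object being modified, and only
        # if some other live object still shares the same data.
        if len(self._block.owners) > 1:
            self._block.owners.discard(self)
            self._attach(_Block(list(self._block.values)))
        self._block.values[i] = value


parent = CowSeries([1, 2, 3])
child = parent.view()
assert child.shares_memory_with(parent)      # cheap: data is shared so far

child[0] = 100                               # first write triggers the copy
assert parent[0] == 1 and child[0] == 100    # parent is left untouched
assert not child.shares_memory_with(parent)
```

Reads through `child` cost nothing extra, which is the "less copying" case. The write to `child` is where this model pays for a copy that a true view would avoid.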
From simonjayhawkins at gmail.com Mon Jul 26 05:37:39 2021
From: simonjayhawkins at gmail.com (Simon Hawkins)
Date: Mon, 26 Jul 2021 10:37:39 +0100
Subject: [Pandas-dev] ANN: pandas v1.3.1
Message-ID:

Hi all,

I'm pleased to announce the release of pandas v1.3.1.

This is the first patch release in the 1.3.x series and includes some regression fixes and bug fixes. We recommend that all users upgrade to this version.

See the release notes for a list of all the changes.

The release can be installed from PyPI

    python -m pip install --upgrade pandas==1.3.1

Or from conda-forge

    conda install -c conda-forge pandas==1.3.1

Please report any issues with the release on the pandas issue tracker.

Thanks to all the contributors who made this release possible.

From adrin.jalali at gmail.com Mon Jul 26 05:51:43 2021
From: adrin.jalali at gmail.com (Adrin)
Date: Mon, 26 Jul 2021 11:51:43 +0200
Subject: [Pandas-dev] Proposal for consistent, clear copy/view semantics in pandas with Copy-on-Write
In-Reply-To:
References:
Message-ID:

There are two cases that I think are relevant here (as opposed to the ArrayManager discussion), but I may be wrong.

The two cases I'm thinking of, written in a simple, non-optimized way, are, in pseudocode:

    for column in columns_of(data):
        data[:, column] = (data[:, column] - mean(data[:, column])) / std(data[:, column])

And the other one is the same as above, but for rows.

Also, one issue I have is that if we're doing copy-on-write, then what does the above mean? As in, if I do `df["column_A"] = ....`, where is that copy? How do I access the new one as opposed to the old one?

On Fri, Jul 23, 2021 at 10:09 PM Brock Mendel wrote:

> > Memory implications should be positive (less copying).
>
> This is accurate _only_ in cases where we currently make copies. In cases
> where we currently make views, the perf effect goes the other way.
From jbrockmendel at gmail.com Mon Jul 26 12:38:11 2021
From: jbrockmendel at gmail.com (Brock Mendel)
Date: Mon, 26 Jul 2021 09:38:11 -0700
Subject: [Pandas-dev] Proposal for consistent, clear copy/view semantics in pandas with Copy-on-Write
In-Reply-To:
References:
Message-ID:

> data.iloc[:, c] = (data.iloc[:, c] - data.iloc[:, c].mean()) / data.iloc[:, c].std()

This would not make any copies under any of the scenarios being discussed, including the status quo.

> And the other one is the same as above, but for rows.

With ArrayManager, the `data.iloc[r]` will make a copy, but the CoW doesn't affect that. No copies with BlockManager, regardless of CoW.

On Mon, Jul 26, 2021 at 2:51 AM Adrin wrote:

> There are two cases that I think are relevant here (as opposed to the
> ArrayManager discussion), but I may be wrong.
> > The two cases I'm thinking, in a simple not-optimized way are, in > psuedocode: > > for column in columns_of(data): > data[:, column] = (data[:, column] - mean(data[:, column])) / > std(data[:, column]) > > And the other one is the same as above, but for rows. > > Also, one issue I have, is that if we're doing copy-on-write, then what > does the above mean? As in, if I do `df["column_A"] = ....`, where is that > copy? How do I access the new one as opposed to the old one? > > On Fri, Jul 23, 2021 at 10:09 PM Brock Mendel > wrote: > >> > Memory implications should be positive (less copying). >> >> This is accurate _only_ in cases where we currently make copies. In >> cases where we currently make views, the perf effect goes the other way. >> >> On the flip side, Always-Views improves perf in cases where we currently >> make copies, but if you want a copy then you'll have to make one explicitly >> which will claw back that gain. (In the long-out-of-date proof of concept >> https://github.com/pandas-dev/pandas/pull/33597 df[np.random.randint(0, >> 30, 30)] was ~92% faster than the status quo at the time) >> >> > The performance impact of the additional logic (adding/checking of the >> weak references) is something I didn't yet check (on my to do list), but I >> suspect it to not be significant. >> >> Agreed the CoW logic itself should be negligible outside of >> microbenchmarks. >> >> On Fri, Jul 23, 2021 at 12:32 PM Joris Van den Bossche < >> jorisvandenbossche at gmail.com> wrote: >> >>> On Tue, 20 Jul 2021 at 16:10, Adrin wrote: >>> >>>> I guess one question I have is what are the memory and time performance >>>> implications of the proposed change. >>>> >>> >>> Memory implications should be positive (less copying). The performance >>> impact of the additional logic (adding/checking of the weak references) is >>> something I didn't yet check (on my to do list), but I suspect it to not be >>> significant. 
>>> >>> Based on your comment of "numpy array with column names", I think the >>> potential change of the ArrayManager is much more relevant for you than the >>> Copy-on-Write. And to be clear, the current proposal is not tied to the >>> ArrayManager (it's only the proof of concept that is implemented for that). >>> So I would prefer to keep the discussion focused on the copy/view >>> semantics, at least for now (it's only later, when discussing practical >>> ways to get this released, that we need to decide whether we want to >>> combine this with an ArrayManager refactor or not). >>> >>> >>>> I guess I belong to the group of users who think of a pandas DataFrame >>>> more as a numpy array with column names attached to them, and hence I'd >>>> expect very similar semantics when indexing, and I think copy on write >>>> semantics would have a significant impact on our workflows. >>>> >>> >>> I assuming you are also thinking of scikit-learn like worflows? Can you >>> give an example of what your are thinking about how copy-on-write impacts >>> such (or other) workflows? >>> In any case thanks already for your feedback! >>> >>> >>>> >>>> On Sat, Jul 17, 2021 at 6:12 PM Marc Garcia >>>> wrote: >>>> >>>>> Based on my experience (not sure how biased it is), modifying >>>>> dataframes with something like `df[col][1:3] = ...` is rare (or the >>>>> equivalent with `.loc`) except for boolean arrays. From my >>>>> experience, when the values of a dataframe column are changed, what I think >>>>> it's way more common is to use `df[col] = df[col].str.upper()`, `df[col] >>>>> = df[col].fillna(0)`... >>>>> >>>>> While I'm personally happy with Joris proposal, I see two other >>>>> options that could complement or replace it: >>>>> >>>>> Option 1) Deprecate assigning to a subset of rows, and only allow >>>>> assigning to whole columns. Something like `df[col][1:3] = ...` >>>>> could be replaced by for example `df[col] = df[col].mask(slice(1, 3), >>>>> ...)`. 
Using `mask` and `where` is already supported for boolean >>>>> arrays, so slices should be added, and they'd be the only way to replace a >>>>> subset of values. I think that makes the problem narrower, and easier to >>>>> understand for users. The main thing to decide and be clear about is what >>>>> happens if the dataframe is a subset of another one: >>>>> >>>>> ``` >>>>> df2 = df[cond] >>>>> df2[col] = df2[col].str.upper() >>>>> ``` >>>>> >>>>> Option 2) If assigning with the current syntax (`df[col][1:3] = ...` >>>>> or `.loc` equivalent) is something we want to keep (I wouldn't if we >>>>> move in this direction), maybe it could be moved to a `DataFrame` >>>>> subclass So, the main dataframe class behaves like in option 1, so >>>>> expectations are much easier to manage. But users who really want to assign >>>>> with indexing, can still use it, knowing that having a mutable dataframe >>>>> comes at a cost (copies, more complex behavior...). The ` >>>>> MutableDataFrame` could be in pandas, or a third-party extension. >>>>> >>>>> ``` >>>>> df_mutable = df.to_mutable() >>>>> df_mutable[col][1:3] = ... >>>>> ``` >>>>> >>>>> On Sat, Jul 17, 2021 at 9:16 AM Joris Van den Bossche < >>>>> jorisvandenbossche at gmail.com> wrote: >>>>> >>>>>> >>>>>> On Fri, 16 Jul 2021 at 18:58, Stephan Hoyer wrote: >>>>>> >>>>>>> On Fri, Jul 16, 2021 at 9:04 AM Brock Mendel >>>>>>> wrote: >>>>>>> >>>>>>>> I do not like the fact that nothing can ever be "just a view" with >>>>>>>> these semantics, including series[::-1], frame[col], frame[:]. Users >>>>>>>> reasonably expect numpy semantics for these. >>>>>>>> >>>>>>>> I am personally not sure what "users" in general expect for those >>>>>> (as also mentioned by Tom and Irv already, depending on their background, >>>>>> they might expect different things). 
>>>>>> For example, for a user that knows basic Python, they could actually >>>>>> expect all those examples to give a copy since `a_list[:]` is a typical way >>>>>> to make a copy of a list. >>>>>> >>>>>> (it might be interesting to reach out to educators (who might have >>>>>> more experience with expectations/typical errors of novice users) or to do >>>>>> some kind of experiment on this topic) >>>>>> >>>>>> Personally, I cannot remember that I ever relied on the >>>>>> mutability-aspect of eg `series[1:3]` or `frame[:]` being a view. I think >>>>>> there are generally 2 reasons for users caring about a view: 1) for >>>>>> performance (less copying) and 2) for being able to mutate the view with >>>>>> the explicit goal to mutate the parent (and not as an irrelevant >>>>>> side-effect). >>>>>> I think the first reason is by far the most common one (but that's my >>>>>> subjective opinion from my experience using pandas, so that can certainly >>>>>> depend), and in the current proposal, all those mentioned example will be >>>>>> actual views under the hood (and thus cover this first reason). >>>>>> >>>>>> The only case where I know I explicitly rely on this is with chained >>>>>> assignment (eg `frame[col][1:3] = ..`). That's certainly a very important >>>>>> use case (and probably the most impacted usage pattern with the current >>>>>> proposal), but it's also a case where 1) there is a clear alternative >>>>>> (don't use chained assignment, but do it in one step (e.g. `frame.loc[1:3, >>>>>> col] = ..`), some corner cases of mixed positional/label-based indexing >>>>>> aside, for which we should find an alternative) and 2) we might be able to >>>>>> detect this and raise an informative error message (specifically for >>>>>> chained assignment). >>>>>> >>>>>> I think it can be easier to explain "chained assignment never works" >>>>>> than "chained assignment only works if first selecting the column(s)" >>>>>> (depending on the exact rules). 
>>>>>> >>>>>> >>>>>>> We should revisit the alternative "clear/simple rules" approach that >>>>>>>> is "indexing on columns always gives a view" ( >>>>>>>> https://github.com/pandas-dev/pandas/pull/33597). This is simpler >>>>>>>> to explain/grok, simpler to implement >>>>>>>> >>>>>>> >>>>>>> I don't know if it is worth the trouble for complex multi-column >>>>>>> selections, but I do see the appeal here. >>>>>>> >>>>>>> A simpler variant would be to make indexing out a single Series from >>>>>>> a DataFrame return a view, with everything else doing copy on write. Then >>>>>>> the existing pattern df.column_one[:] = ... would still work. >>>>>>> >>>>>> >>>>>> I was initially thinking about this as well. In the end, I didn't >>>>>> (yet) try to implement this, because while thinking it through, it seemed >>>>>> that this might give quite some tricky cases. Consider the following >>>>>> example: >>>>>> >>>>>> df = pd.DataFrame(..) >>>>>> df_subset = df[["col1", "col2"]] >>>>>> s1 = df["col1"] >>>>>> s1_subset = s1[0:3] >>>>>> # modifying s1 should modify df, but not df_subset and s1_subset? >>>>>> s1[0] = 0 >>>>>> >>>>>> If we take "only accessing a single Series from a DataFrame is a >>>>>> view, everything else uses copy-on-write", that gives rise to questions >>>>>> like the above where some parents/childs get modified, and some not. >>>>>> This is both harder to explain to users, as harder to implement. For >>>>>> the implementation of the proof-of-concept, the copy-on-write happens >>>>>> "locally" in the series/dataframe that gets modified (meaning: when >>>>>> modifying a given object, its internal array data first gets copied and >>>>>> replaced *if* the object is viewing another or is being viewed by another >>>>>> object). While in the above case, modifying a given object would need to >>>>>> trigger a copy in other (potentially many) objects, and not in the object >>>>>> being modified. 
>>>>>> It's probably possible to implement this, but certainly harder/trickier to do.
>>>>>>
>>>>>>>> On Fri, Jul 16, 2021 at 5:26 AM Joris Van den Bossche <jorisvandenbossche at gmail.com> wrote:
>>>>>>>>
>>>>>>>>> On Mon, 12 Jul 2021 at 00:58, Joris Van den Bossche <jorisvandenbossche at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Short summary of the proposal:
>>>>>>>>>>
>>>>>>>>>> 1. The result of *any* indexing operation (subsetting a DataFrame or Series in any way) or any method returning a new DataFrame, always *behaves as if it were* a copy in terms of user API.
>>>>>>>>>
>>>>>>>>> To explicitly call out the column-as-Series case (since this is a typical case that right now *always* is a view): "any" indexing operation thus also includes accessing a DataFrame column as a Series (or slicing a Series).
>>>>>>>>>
>>>>>>>>> So something like s = df["col"] and then mutating s will no longer update df. Similarly for series_subset = series[1:5], mutating series_subset will no longer update series.
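The "local" copy-on-write rule described above (an object copies its own data before a write if it is viewing, or being viewed by, another object) can be illustrated with a toy model in plain Python. This is only a sketch under stated assumptions: `CowSeries`, `view()` and `shares_memory_with()` are hypothetical names invented here, the model only covers whole-buffer views rather than slices, and it is not how the proof-of-concept in pandas is implemented.

```python
import weakref


class CowSeries:
    """Toy stand-in for a Series: a list buffer plus the set of its sharers."""

    def __init__(self, values, _buffer=None, _sharers=None):
        self._buffer = list(values) if _buffer is None else _buffer
        # Every live object that currently uses self._buffer (including self).
        self._sharers = weakref.WeakSet() if _sharers is None else _sharers
        self._sharers.add(self)

    def view(self):
        """Return a new object sharing the same buffer; nothing is copied yet."""
        return CowSeries((), _buffer=self._buffer, _sharers=self._sharers)

    def __getitem__(self, index):
        return self._buffer[index]

    def __setitem__(self, index, value):
        if len(self._sharers) > 1:
            # The buffer is shared: detach *this* object by copying its data.
            # The other sharers are left untouched.
            self._sharers.discard(self)
            self._buffer = list(self._buffer)
            self._sharers = weakref.WeakSet([self])
        self._buffer[index] = value

    def tolist(self):
        return list(self._buffer)

    def shares_memory_with(self, other):
        return self._buffer is other._buffer


parent = CowSeries([1, 2, 3])
child = parent.view()      # cheap: shares the buffer with parent
child[0] = 99              # triggers the copy, in child only
print(parent.tolist(), child.tolist())
```

The copy is triggered only in the object being written to, which matches the "local" rule. The single-column-as-view variant would instead require a write to `s1` to force copies in other objects such as `df_subset` and `s1_subset`, which this simple bookkeeping does not support.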
>>>>>>>>> _______________________________________________
>>>>>>>>> Pandas-dev mailing list
>>>>>>>>> Pandas-dev at python.org
>>>>>>>>> https://mail.python.org/mailman/listinfo/pandas-dev
-------------- next part -------------- An HTML attachment was scrubbed... URL: