From jorisvandenbossche at gmail.com Mon Aug 9 12:53:30 2021
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Mon, 9 Aug 2021 18:53:30 +0200
Subject: [Pandas-dev] Proposal for consistent, clear copy/view semantics in pandas with Copy-on-Write

On Fri, 23 Jul 2021 at 22:09, Brock Mendel wrote:

> > Memory implications should be positive (less copying).
>
> This is accurate _only_ in cases where we currently make copies. In cases
> where we currently make views, the perf effect goes the other way.

Yes, but to be clear: only when you mutate an object. As long as you don't do
that (which I think covers the majority of operations), we will keep making
views where we currently do so already.

On Mon, 26 Jul 2021 at 18:38, Brock Mendel wrote:

> > data.iloc[:, c] = (data.iloc[:, c] - data.iloc[:, c].mean()) /
> > data.iloc[:, c].std()
>
> This would not make any copies under any of the scenarios being discussed,
> including the status quo.

One small point: this might depend on whether we keep `[:, col]` as a special
case that replaces the column altogether (as we currently still do, I think,
related to some recent discussions), or whether we see it as an in-place
mutation of the existing column with a slice (which just happens to be a
"full" slice). In the second case, this could actually trigger copy-on-write,
since the same column is also accessed (only as a temporary variable, but
Python might not yet have garbage collected it).

On Mon, 26 Jul 2021 at 11:51, Adrin wrote:

> ....
> Also, one issue I have, is that if we're doing copy-on-write, then what
> does the above mean? As in, if I do `df["column_A"] = ....`, where is that
> copy? How do I access the new one as opposed to the old one?

I am not fully sure if I understand your question correctly, but something
like `df["column_A"] = ....` still edits the DataFrame in place. So here there
is no "new" or "old" version of the DataFrame.
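
The "views are free, a copy happens only when you mutate" idea from the replies
above can be sketched in pure Python. This is a toy model under stated
assumptions, not pandas internals: `ToyColumn`, its `view()` method and the
one-element reference counter are hypothetical names made up for illustration.

```python
class ToyColumn:
    """A column handle; several handles may share one underlying list."""

    def __init__(self, buffer, refs):
        self._buffer = buffer  # storage, possibly shared with other handles
        self._refs = refs      # one-element list: how many handles share it
        self._refs[0] += 1

    @classmethod
    def from_values(cls, values):
        return cls(list(values), [0])

    def view(self):
        # Taking a view copies no data; it only bumps the shared counter.
        return ToyColumn(self._buffer, self._refs)

    def __setitem__(self, index, value):
        if self._refs[0] > 1:
            # Another handle looks at the same buffer: copy before mutating.
            self._refs[0] -= 1
            self._buffer = list(self._buffer)
            self._refs = [1]
        self._buffer[index] = value

    def shares_buffer_with(self, other):
        return self._buffer is other._buffer

    def tolist(self):
        return list(self._buffer)


col = ToyColumn.from_values([1, 2, 3])
subset = col.view()
print(col.shares_buffer_with(subset))  # True: nothing has been copied yet

col[0] = 100                           # the mutation triggers the copy
print(col.shares_buffer_with(subset))  # False
print(col.tolist())                    # [100, 2, 3]
print(subset.tolist())                 # [1, 2, 3]
```

Note that this toy counter is lowered only when a copy is made, never when a
handle goes away, so a leftover temporary would still force a copy. That is the
same concern raised above about a temporary that has not yet been garbage
collected.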
That `df["column_A"] = ...` example replaces a full column and will not trigger
a copy (as it doesn't edit the specific column's data in place), but if you
take something like `df.loc[mask, "column_A"] = ...`, the possible copy happens
inside df: if "column_A" is a view / being viewed, then the underlying array
for this column first gets copied before being mutated. So the copy happens at
the level of the array. But the DataFrame df itself is still mutated in place
(the array for "column_A" gets replaced with a copy of it), so here, too, there
is no "old"/"new" version of the DataFrame.

Does that answer the question, or can you otherwise clarify your question?

Joris

From adrin.jalali at gmail.com Tue Aug 10 06:52:35 2021
From: adrin.jalali at gmail.com (Adrin)
Date: Tue, 10 Aug 2021 12:52:35 +0200
Subject: [Pandas-dev] Proposal for consistent, clear copy/view semantics in pandas with Copy-on-Write

> I am not fully sure if I understand your question correctly, but
> something like `df["column_A"] = ....` still edits the DataFrame in
> place. So here there is no "new" or "old" version of the DataFrame.
> That specific example replaces a full column and will not trigger a copy
> (as it doesn't edit the specific column's data inplace), but if you take
> something like `df.loc[mask, "column_A"] = ...`, the possible copy happens
> inside df: if "column_A" is a view / being viewed, then the underlying
> array for this column first gets copied before being mutated. So the copy
> happens on the level of the array. But the DataFrame df itself is still
> mutated in place (the array for "column_A" gets replaced with a copy of it),
> so also here there is no "old"/"new" version of the DataFrame.
> Does that answer the question, or can you otherwise clarify your question?

I guess as a user, I find it odd that the behavior differs with and without a
mask. So does that mean `df.loc[mask, "column_A"] = ...` is not a valid
operation? Because I guess I've lost the copy that holds the modified data,
right?

Silly question: why not go the other way around, i.e. always modify the
original data unless the user does a `copy()`? Is that not more intuitive to
people?

From jorisvandenbossche at gmail.com Tue Aug 10 17:16:15 2021
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Tue, 10 Aug 2021 23:16:15 +0200
Subject: [Pandas-dev] August 2021 monthly community meeting (Wednesday August 11, UTC 18:00)

Hi all,

A reminder that the next monthly dev call is tomorrow (Wednesday, August 11th)
at 18:00 UTC (1 pm Central). Our calendar is at
https://pandas.pydata.org/docs/development/meeting.html#calendar to check your
local time.

All are welcome to attend!

Video Call: https://zoom.us/j/96753852910?pwd=OEgwbUkwOE9kejcwOGdLd09TallTdz09

Minutes: https://docs.google.com/document/u/1/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit?ouid=102771015311436394588&usp=docs_home&ths=true

Joris

From jorisvandenbossche at gmail.com Tue Aug 10 18:13:34 2021
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Wed, 11 Aug 2021 00:13:34 +0200
Subject: [Pandas-dev] Proposal for consistent, clear copy/view semantics in pandas with Copy-on-Write

On Tue, 10 Aug 2021 at 12:52, Adrin wrote:

> Silly question: why not move the other way around, i.e. always modify the
> original data, unless the user does a `copy()`? Is that not more intuitive
> to people?

That's certainly not a silly question :) That's an option as well, and
somewhat related to the "indexing on columns always gives a view" mentioned by
Brock above. The alternatives section in the google doc also mentions a few
reasons to prefer copy-on-write IMO. Some points on this:

1) First, we can't "always modify the original data", since that is only
possible when we have a view of the original data. That might be obvious for
someone (like you and me) familiar with numpy, but if you don't have this
background, that's not necessarily the case (I am not sure numpy's copy/view
rules are necessarily intuitive, unless you are familiar with memory layout).
So we still need some rules. The selection of columns can always be a view, as
proposed by Brock. But someone should then make a more complete proposal for
how to handle row selection: always copy, or follow numpy rules (i.e.
basically a slice is a view, anything else a copy)?

You also get things like `df.iloc[[0, 1, 2], :]` being a copy and
`df.iloc[:, [0, 1, 2]]` being a view. Of course that's explainable (i.e. since
the storage is columnar, different copy/view rules apply to selecting rows vs
columns), but IMO not necessarily simpler than the proposal, where both cases
act as a copy. Or that `df[0:5]['col'] = ..` works but `df[mask]['col'] = ...`
doesn't work.

2) For indexing it's certainly an open question what is most intuitive, but I
think for *methods* that return a new DataFrame, people generally expect that
the original and the result don't modify each other. And for me, this is one
of the main reasons for this proposal: I want to improve the efficiency of
methods so that they don't have to copy the dataframe by default (methods like
rename, (re)set_index, drop columns, etc).
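
The expectation in point 2 can be illustrated with a small pandas sketch. The
frame and the column names here are made up for illustration. As far as I know
the observable result is the same whether the methods copy eagerly or would
defer the copy under the proposal; the difference would only be in how much
data gets copied along the way.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
snapshot = df.copy()  # deep copy, kept only so we can compare afterwards

# A chain of methods, each of which returns a new DataFrame object.
result = df.rename(columns={"a": "x"}).reset_index(drop=True).set_index("x")

# Mutating the result (row with index label 1, column "b") ...
result.loc[1, "b"] = -1

# ... is not expected to show up in the original.
print(df.equals(snapshot))
print(result["b"].tolist())
```

Only the final `result` is ever modified here, which is why deferring the
copies made by the intermediate steps looks attractive for this usage pattern.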
In my mind, the most logical thing to do for this is copy-on-write.

Of course, wanting copy-on-write for methods doesn't mean that we can't do
something different for indexing (although what about methods that are
basically equivalent to an indexing operation?). But, from an implementation
point of view, I am not sure it would actually be technically possible to
sometimes do copy-on-write and sometimes not (probably possible in theory, but
a lot more complicated; see also one of my previous answers
(https://mail.python.org/pipermail/pandas-dev/2021-July/001368.html) on having
a single column as a view).

3) Personally, I don't think that I ever (at least not often) had the use case
where I intentionally wanted to modify a parent dataframe by modifying a
subsetted child dataframe (explicit chained indexing aside). So also from that
point of view, I find the "always (if possible) modify the original data"
option less interesting than the potential performance benefits / the IMO
simpler rule of never modifying.

Joris

From jbrockmendel at gmail.com Wed Aug 11 19:44:59 2021
From: jbrockmendel at gmail.com (Brock Mendel)
Date: Wed, 11 Aug 2021 16:44:59 -0700
Subject: [Pandas-dev] Proposal for consistent, clear copy/view semantics in pandas with Copy-on-Write

A couple of thoughts from the discussion on today's call:

1) A lot of the discussion about the indexing behavior revolved around "users
expect X". I fundamentally do *not* want to be in the business of speculating
about this.

2) I find the case for CoW more compelling for the chained methods usage
`frame.rename(...).reset_index(...).set_index(...)`. If we had a viable way to
implement CoW for these independently of the indexing, that would be a slam
dunk. Alternatively, we could get a lot of the benefits from a `copy` keyword
in the pertinent methods (explicit, better than implicit).

From jorisvandenbossche at gmail.com Thu Aug 12 16:59:06 2021
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Thu, 12 Aug 2021 22:59:06 +0200
Subject: [Pandas-dev] Proposal for consistent, clear copy/view semantics in pandas with Copy-on-Write

Another follow-up to the discussion we had yesterday: we talked about when
objects get modified and when not (in this proposal), and basically the rule
would be: *"the only way to modify an object (DataFrame or Series) is to modify
the object itself directly"*, or stated another way: you can never modify an
object by modifying a different object (modifications are never propagated, as
they would be with numpy views).

In Python, we then need to take "object identity" into account (because you
can still have multiple variables/names pointing to the same object), and I
added a section trying to explain that with an example in the google doc:
https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit#heading=h.ejidwnify2zo

From simonjayhawkins at gmail.com Mon Aug 16 14:57:48 2021
From: simonjayhawkins at gmail.com (Simon Hawkins)
Date: Mon, 16 Aug 2021 19:57:48 +0100
Subject: [Pandas-dev] ANN: pandas v1.3.2

Hi all,

I'm pleased to announce the release of pandas v1.3.2.

This is a patch release in the 1.3.x series and includes some regression fixes
and bug fixes. We recommend that all users upgrade to this version.

See the release notes for a list of all the changes.

The release can be installed from PyPI

    python -m pip install --upgrade pandas==1.3.2

Or from conda-forge

    conda install -c conda-forge pandas==1.3.2

Please report any issues with the release on the pandas issue tracker.

Thanks to all the contributors who made this release possible.