[Pandas-dev] Proposal for consistent, clear copy/view semantics in pandas with Copy-on-Write

Mon Jul 12 14:28:56 EDT 2021

I think this is an important initiative, and I indeed wish we had
designed around copy-on-write ideas from the very beginning.

As one protection against improper mutation of views, it may be
necessary to introduce defensive copies into APIs that expose internal
data, e.g. NumPy arrays that are slices of a parent, or that have had
slices taken of them.
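
To make the concern concrete, here is a minimal sketch in plain NumPy of how an exposed slice can mutate its parent, and how a defensive copy prevents that. The `expose_values` helper and its `defensive` flag are hypothetical names used only for illustration; they are not an existing pandas API.

```python
import numpy as np

parent = np.arange(6)      # array([0, 1, 2, 3, 4, 5])
view = parent[2:5]         # basic slicing returns a view on the same buffer
assert np.shares_memory(parent, view)


def expose_values(internal, defensive=True):
    """Hypothetical accessor that hands internal data to the caller.

    With defensive=True the caller receives an independent copy, so
    writes through the returned array cannot reach the internal buffer.
    """
    return internal.copy() if defensive else internal


safe = expose_values(view)
safe[0] = 99               # only modifies the copy
assert parent[2] == 2      # parent is untouched

unsafe = expose_values(view, defensive=False)
unsafe[0] = -1             # writes through the view into the parent
assert parent[2] == -1
```

The trade-off is the usual one: the defensive copy costs memory and time on every access, which is why it would only be worth adding where the exposed array is known to share memory with something else.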

On Mon, Jul 12, 2021 at 12:42 PM Marc Garcia <garcia.marc at gmail.com> wrote:
>
> +1 on the approach of the proposal, and also +1 to release in a major version, and not raise deprecation warnings.
>
> Thanks for working on this, it'll make users' lives much easier.
>
> On Sun, Jul 11, 2021 at 4:58 PM Joris Van den Bossche <jorisvandenbossche at gmail.com> wrote:
>>
>> (a.k.a. getting rid of the SettingWithCopyWarning)
>>
>> Hi all,
>>
>> As you are probably aware, it's not always straightforward to understand the copy or view semantics of indexing methods in pandas: when you get a view and when you don't, why you get a SettingWithCopyWarning, or how to get rid of it.
>> It's also something that has already been discussed regularly (e.g. the discussion and implementation from 2015 started by Nick Eubank at gh-10954). Last year, we again started to discuss this, which is tracked at https://github.com/pandas-dev/pandas/issues/36195. Based on those discussions, I have a concrete proposal to change the copy/view semantics of pandas.
>>
>> Short summary of the proposal:
>>
>> The result of any indexing operation (subsetting a DataFrame or Series in any way) or any method returning a new DataFrame, always behaves as if it were a copy in terms of user API.
>> We implement Copy-on-Write. This way, we can actually use views as much as possible under the hood, while ensuring the user API behaves as a copy.
>>
>> This addresses multiple aspects: 1) a clear and consistent user API (a clear rule: any subset or returned Series/DataFrame is always a copy of the original, and thus never modifies the original) and 2) improving performance by avoiding excessive copies (e.g. a chained method workflow would no longer return an actual data copy at each step).
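
The copy-on-write idea in the summary above can be sketched with a toy wrapper around a NumPy array. This is only an illustration of the general technique under simplifying assumptions (slice indexing only, no tracking of dropped references); the `CowArray` class and its attributes are hypothetical and do not reflect the proof-of-concept implementation.

```python
import numpy as np


class CowArray:
    """Toy copy-on-write wrapper (hypothetical, slice indexing only)."""

    def __init__(self, data, sharers=None):
        self._data = data
        # One-element list acting as a shared counter of the wrappers
        # that currently alias the same underlying buffer.
        self._sharers = sharers if sharers is not None else [0]
        self._sharers[0] += 1

    def __getitem__(self, key):
        # Subsetting hands out a NumPy view; no data is copied here.
        return CowArray(self._data[key], self._sharers)

    def __setitem__(self, key, value):
        # Copy only when writing to a buffer that someone else shares.
        if self._sharers[0] > 1:
            self._data = self._data.copy()
            self._sharers[0] -= 1
            self._sharers = [1]
        self._data[key] = value

    def to_numpy(self):
        # Return a copy so callers cannot bypass the write tracking.
        return self._data.copy()


parent = CowArray(np.arange(5))
child = parent[1:4]                       # still a view under the hood
shared_before = np.shares_memory(parent._data, child._data)

child[0] = 100                            # first write triggers the copy
shared_after = np.shares_memory(parent._data, child._data)
```

After the write, `child` holds 100, 2, 3 while `parent` still holds 0 through 4, so from the user's point of view the subset behaved as a copy even though no data was copied until the write.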
>>
>> Longer version of this proposal: https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit?usp=sharing
>> Proof-of-concept implementation: https://github.com/pandas-dev/pandas/pull/41878
>> GitHub issue with relevant discussion: https://github.com/pandas-dev/pandas/issues/36195
>>
>> Since this would be a change with a large impact on users, I think it is important to get broad feedback on this. So comments, thoughts, concerns, ideas, etc. are very welcome (you can comment on the Google doc, reply to this email, or comment on the GitHub issue).
>>
>> Best,
>> Joris
>> _______________________________________________
>> Pandas-dev mailing list
>> Pandas-dev at python.org
>> https://mail.python.org/mailman/listinfo/pandas-dev