[Pandas-dev] Proposal for consistent, clear copy/view semantics in pandas with Copy-on-Write

Stephan Hoyer shoyer at gmail.com
Mon Jul 12 23:55:01 EDT 2021


I agree with Wes and Marc. This is an important change for the long-term
future of pandas.

On Mon, Jul 12, 2021 at 11:29 AM Wes McKinney <wesmckinn at gmail.com> wrote:

> I think this is an important initiative, and I indeed wish we had
> designed around copy-on-write ideas from the very beginning.
>
> As one protection against improper mutation of views, it may be
> necessary to introduce defensive copies into APIs that expose internal
> data, e.g. NumPy arrays that are slices of the parent, or that have had
> slices taken of them.
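
To illustrate the defensive-copy idea above, here is a minimal sketch, not taken from pandas itself. The `Column` class and its three accessor methods are hypothetical names invented for this example; only the NumPy calls (`copy`, `view`, `flags.writeable`) are real API. The read-only view is shown as one possible alternative to copying, which is an assumption and not something proposed in the thread.

```python
import numpy as np


class Column:
    """Hypothetical container that owns a NumPy buffer (not a pandas class)."""

    def __init__(self, data):
        self._data = np.asarray(data)

    def to_numpy_unsafe(self):
        # Hands out the internal buffer itself: callers can mutate our data.
        return self._data

    def to_numpy_copy(self):
        # Defensive copy: callers get their own memory.
        return self._data.copy()

    def to_numpy_readonly(self):
        # Alternative (assumption): share memory but forbid writes via the view.
        view = self._data.view()
        view.flags.writeable = False
        return view


col = Column([1, 2, 3])

leaked = col.to_numpy_unsafe()
leaked[0] = 99          # silently changes the container's data

safe = col.to_numpy_copy()
safe[1] = -1            # only changes the caller's copy

readonly = col.to_numpy_readonly()
try:
    readonly[2] = 0     # NumPy refuses to write through a read-only view
except ValueError:
    write_rejected = True
else:
    write_rejected = False
```

The copy costs memory on every call, while the read-only view is free but pushes an error onto callers who expected a writable array.
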
>
> On Mon, Jul 12, 2021 at 12:42 PM Marc Garcia <garcia.marc at gmail.com>
> wrote:
> >
> > +1 on the approach of the proposal, and also +1 to release in a major
> version, and not raise deprecation warnings.
> >
> > Thanks for working on this, it'll make users' lives much easier.
> >
> > On Sun, Jul 11, 2021 at 4:58 PM Joris Van den Bossche <
> jorisvandenbossche at gmail.com> wrote:
> >>
> >> (a.k.a. getting rid of the SettingWithCopyWarning)
> >>
> >> Hi all,
> >>
> >> As you are probably aware, it's not always straightforward to
> understand the copy or view semantics of indexing methods in pandas: when
> you get a view and when you don't, why you get a SettingWithCopyWarning,
> and how to get rid of it.
> >> It's also something that has already been discussed regularly (e.g. the
> discussion and implementation from 2015 started by Nick Eubank at
> gh-10954). Last year, we again started to discuss this, which is tracked at
> https://github.com/pandas-dev/pandas/issues/36195. Based on those
> discussions, I have a concrete proposal to change the copy/view semantics
> of pandas.
> >>
> >> Short summary of the proposal:
> >>
> >> The result of any indexing operation (subsetting a DataFrame or Series
> in any way) or any method returning a new DataFrame, always behaves as if
> it were a copy in terms of user API.
> >> We implement Copy-on-Write. This way, we can actually use views as much
> as possible under the hood, while ensuring the user API behaves as a copy.
> >>
> >> This addresses multiple aspects: 1) a clear and consistent user API (a
> clear rule: any subset or returned series/dataframe is always a copy of the
> original, and thus never modifies the original) and 2) improving
> performance by avoiding excessive copies (e.g. a chained method workflow
> would no longer return an actual data copy at each step).
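
The Copy-on-Write mechanism summarized above can be sketched with a toy container. This is an illustration only, assuming a simple scheme in which objects that share a buffer are tracked through weak references; `CowSeries` and `view` are hypothetical names, and this is not the proof-of-concept implementation linked in the proposal.

```python
import weakref


class CowSeries:
    """Toy 1-D container with copy-on-write semantics (hypothetical)."""

    def __init__(self, values, _refs=None):
        self._values = values  # a list, possibly shared with other objects
        # Weak references to every CowSeries currently sharing self._values.
        self._refs = _refs if _refs is not None else weakref.WeakSet()
        self._refs.add(self)

    def view(self):
        """Return a new object that shares the buffer; no data is copied."""
        return CowSeries(self._values, self._refs)

    def __getitem__(self, i):
        return self._values[i]

    def __setitem__(self, i, value):
        if len(self._refs) > 1:
            # Someone else shares the buffer: detach and copy before writing.
            self._refs.discard(self)
            self._values = list(self._values)
            self._refs = weakref.WeakSet([self])
        self._values[i] = value


parent = CowSeries([1, 2, 3])
child = parent.view()
shared_before_write = child._values is parent._values

child[0] = 100                       # triggers the copy; parent is untouched
shared_after_write = child._values is parent._values

parent_buffer = parent._values
parent[1] = 20                       # parent is sole owner now: no copy
parent_wrote_in_place = parent._values is parent_buffer
```

From the user's point of view `child` always behaves like a copy, yet the data is only duplicated at the first write, and never duplicated if nobody writes.
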
> >>
> >> Longer version of this proposal:
> https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit?usp=sharing
> >> Proof-of-concept implementation:
> https://github.com/pandas-dev/pandas/pull/41878
> >> GitHub issue with relevant discussion:
> https://github.com/pandas-dev/pandas/issues/36195
> >>
> >> Since this would be a change with a large impact on users, I think it
> is important to get broad feedback on this. So comments, thoughts,
> concerns, ideas etc. are very welcome (you can comment on the Google Doc,
> reply to this email, or comment on the GitHub issue).
> >>
> >> Best,
> >> Joris
> >> _______________________________________________
> >> Pandas-dev mailing list
> >> Pandas-dev at python.org
> >> https://mail.python.org/mailman/listinfo/pandas-dev