[Pandas-dev] Proposal for consistent, clear copy/view semantics in pandas with Copy-on-Write

Mon Oct 11 17:06:18 EDT 2021

(trying to revive this discussion)

Some assorted comments on the last emails in this thread / comments on the
google doc (and I will follow-up with a separate email about the
single-Series-from-DataFrame-as-view issue).

- A small note about "users' expectations": I am not going to say this is
easy (on the contrary, it is one of the hardest parts of being a library
author, IMO), but we are creating tools to be used by users. So while
designing those tools, I think it is essential to think about how users
will use the library / how they think something works / what they need /
what they find intuitive / etc (thus, related to their expectations).
And because this is a hard (and subjective) problem, it would be good to
get some more feedback from others on the proposed semantics from the usage
point of view. I think the current proposal will be simpler to grasp and
reason about, especially for new users, but I certainly don't hold the
truth on this aspect (and there are different options that are all simpler
than the current situation).

- On the google doc, Adrin made an interesting comment, quoting a part of
that:

> I understand a slice and a mask are fundamentally different, but I don't
> think from the perspective of a user they're different. The user is
> selecting a subset of the original data.
> ...
> Reading through this document I understand why users (and I occasionally)
> would get the pandas warnings telling us we're modifying something which is
> not the original object, but it always puzzled me since I didn't expect a
> slice or a mask to create a copy.
>

This is an interesting point, and I think one of the crucial aspects that
the proposal tries to address.

In short: while using a slice or mask are both methods to select a subset
of your original data, when it comes to copy/view semantics they *are*
fundamentally different for numpy arrays (a slice gives a view, a mask
gives a copy). Currently, those numpy rules "leak" through to pandas,
although not exactly the same and fully consistently. So we expect a pandas
user to know those numpy concepts (views / fancy indexing), and know the
differences in rules with pandas. If we want pandas users not to have to
know this, I think the most sensible option is to make them both behave
as a copy (which is what the copy-on-write proposal does).
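
As a minimal illustration of the numpy rules mentioned above (the array and
its values are made up for this example; only standard numpy calls are
used), a slice shares memory with the original array while a boolean mask
does not:

```python
import numpy as np

arr = np.arange(6)            # array([0, 1, 2, 3, 4, 5])

sliced = arr[1:4]             # basic slicing returns a view
masked = arr[arr > 2]         # boolean-mask indexing returns a copy

sliced[0] = 100               # writes through to `arr`
masked[0] = -1                # only changes the copy

slice_shares = np.shares_memory(arr, sliced)   # True: same buffer
mask_shares = np.shares_memory(arr, masked)    # False: separate buffer

print(arr.tolist())           # the slice write is visible, the mask write is not
print(slice_shares, mask_shares)
```

So two operations that both "select a subset" behave differently on
mutation, which is the distinction a pandas user currently has to be aware
of.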

I added a new section about this (relation with numpy views and
differences) in the google doc:
https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit#heading=h.yud4azltfua5

On Thu, 12 Aug 2021 at 01:45, Brock Mendel <jbrockmendel at gmail.com> wrote:

>
> 2) I find the case for CoW more compelling for the chained methods usage
> `frame.rename(...).reset_index(...).set_index(...)`.  If we had a viable
> way to implement CoW for these independently of the indexing, that would be
> a slam dunk.  Alternatively, we could get a lot of the benefits from a
> `copy` keyword in the pertinent methods (explicit, better than implicit).
>

Based on my intuition from implementing the POC, I don't think it would be
feasible to have both CoW in some cases, and normal views (eg when
selecting columns from a DataFrame) in other cases (but you are certainly
welcome to experiment with it as well).

Personally, I think adding keywords alone would not be a
sufficient/satisfying solution, as I would like those methods not to
copy by default, while keeping the behaviour of returning a new object
(one that doesn't modify the parent if mutated).
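
To make "returning a new object that doesn't modify the parent" concrete,
here is a small sketch (the frame and column names are made up for
illustration). It only shows the user-visible behaviour of a method chain
with default arguments; whether the underlying data is copied eagerly or
lazily is the implementation question under discussion, and the snippet
does not depend on it:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Chained methods, each returning a new DataFrame object.
result = df.rename(columns={"a": "x"}).reset_index(drop=True)

# Mutating the result ...
result.loc[0, "x"] = 100

# ... should leave the parent untouched.
parent_value = int(df.loc[0, "a"])
result_value = int(result.loc[0, "x"])
print(parent_value, result_value)
```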

In addition, there are also methods that do indexing-like operations
(reindex on columns, filter), and I think it would be surprising if those
behaved differently from the indexing operations (getitem).

On Thu, 12 Aug 2021 at 01:45, Brock Mendel <jbrockmendel at gmail.com> wrote:

> A couple of thoughts from the discussion on today's call:
>
> 1) A lot of the discussion about the indexing behavior revolved around
> "users expect X".  I fundamentally do *not* want to be in the business of
> speculating about this.
>
> 2) I find the case for CoW more compelling for the chained methods usage
> `frame.rename(...).reset_index(...).set_index(...)`.  If we had a viable
> way to implement CoW for these independently of the indexing, that would be
> a slam dunk.  Alternatively, we could get a lot of the benefits from a
> `copy` keyword in the pertinent methods (explicit, better than implicit).
>
>