[Pandas-dev] Proposal for consistent, clear copy/view semantics in pandas with Copy-on-Write

Sun Dec 19 16:37:30 EST 2021

Thanks for testing the branch and the feedback, Irv!

Related to your concern about how users will know or get notified about
behaviour that will change: the branch you tested is a proof-of-concept for
the *final* behaviour, and so I didn't (yet) add warnings for such cases.
So that's the simple reason why a case like df["a"].loc[2] = 112 didn't
trigger a warning.

But I agree that this is important, and the idea is certainly that we
will have a pandas release (before actually changing the behaviour) in which
cases like the one above, whose behaviour will change, trigger a deprecation
warning. We still need to work out how to implement this, and it might
become quite complex. But if we are convinced that the final behaviour is
better, I think this is certainly worth it (and the warnings are only
temporary).
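
To make the idea of such a transition release concrete, here is a minimal
sketch in plain Python. It is not pandas code and not a proposed
implementation: ToyFrame, ToySeries, the set() method and the warning text
are all hypothetical names made up for illustration. The write still reaches
the parent (today's behaviour), but the user is told that this will change.

```python
import warnings


class ToySeries:
    """Hypothetical stand-in for a column selected from a parent frame."""

    def __init__(self, values, parent=None):
        self._values = values  # list shared with the parent (a "view")
        self._parent = parent  # None if the data is not shared with a frame

    def set(self, pos, value):
        if self._parent is not None:
            # Transition release: keep the current behaviour (the write
            # reaches the parent), but warn that this is going to change.
            warnings.warn(
                "Setting values through a selected column will stop "
                "modifying the parent in a future version.",
                FutureWarning,
                stacklevel=2,
            )
        self._values[pos] = value


class ToyFrame:
    """Hypothetical stand-in for a DataFrame holding named columns."""

    def __init__(self, columns):
        self._columns = {name: list(vals) for name, vals in columns.items()}

    def __getitem__(self, name):
        # Hand out the underlying list itself, i.e. a view on the column.
        return ToySeries(self._columns[name], parent=self)

    def column(self, name):
        return list(self._columns[name])


df = ToyFrame({"a": [10, 11, 12, 13, 14]})
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    df["a"].set(2, 112)  # toy equivalent of df["a"].loc[2] = 112
```

The hard part in pandas itself is knowing reliably when a write goes through
a shared selection; the sketch sidesteps that by tracking the parent
explicitly.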

On Wed, 15 Dec 2021 at 15:54, Irv Lustig <irv at princeton.com> wrote:

> Joris:
> I finally had some time to study our conversation from July, reread the
> Google docs proposal, and I tried out the PR as well.
>
> What I'm struggling with is how we document where behavior will change.
> As an example, the following sequence will give different results:
>
> Current behavior:
> >>> df = pd.DataFrame({"a":[10,11,12,13,14], "b": [100,101,102,103,104]})
> >>> df["a"].loc[2] = 112
> >>> df
>      a    b
> 0   10  100
> 1   11  101
> 2  112  102
> 3   13  103
> 4   14  104
>
>
> New behavior (from the PR):
> >>> df = pd.DataFrame({"a":[10,11,12,13,14], "b": [100,101,102,103,104]})
> >>> df["a"].loc[2] = 112
> >>> df
>     a    b
> 0  10  100
> 1  11  101
> 2  12  102
> 3  13  103
> 4  14  104
>
> But in both cases, the following works:
>
> >>> df.loc[3,"b"] = 999
> >>> df
>     a    b
> 0  10  100
> 1  11  101
> 2  12  102
> 3  13  999
> 4  14  104
>
> So my concern is that if you had existing code that used the pattern
> df["a"].loc[2] = 112, you'd get no warning that the behavior had changed.
> What I don't know is how much code in the wild assumes the current behavior.
>
> So my questions are now:
> 1. How will we document, in a clean and concise way, the new behavior for
> people with existing pandas code?
>

Given that the new behaviour makes more sense than the current behaviour
(in my opinion, and I think yours as well based on your email), it should
actually be easier to document it properly :)
But joking aside, yes, we will certainly need to put effort into creating a
very good set of documentation on this topic (the Google doc could be a
starting point).

> 2. How can people find pandas code where the behavior will change?  Can we
> list all patterns that would produce different results?  Can we detect
> chained indexing with setitem calls?
>

The documentation can certainly list lots of patterns, but such a list is
of course always based on examples. As mentioned above, I think we should be
able to catch most or all cases in setitem where behaviour will change, and
trigger a warning about them. This will be quite some work (probably even
more than the implementation I have done so far), but I am convinced it is
possible and worth it.
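
Separately from runtime warnings, a code base can be scanned statically for
assignment targets that look like chained indexing. The sketch below uses
only the standard library ast module; find_chained_setitem and INDEXERS are
hypothetical names, and the check is a purely syntactic heuristic. It knows
nothing about types, so it will also flag nested dict or list assignments,
and it misses chains that go through an intermediate variable.

```python
import ast

# Indexer attributes that can sit between the two subscripts.
INDEXERS = {"loc", "iloc", "at", "iat"}


def _is_chained_target(node):
    """True for targets shaped like obj[...][...] or obj[...].loc[...]."""
    if not isinstance(node, ast.Subscript):
        return False
    inner = node.value
    # obj[...].loc[...]: strip the indexer attribute to reach obj[...]
    if isinstance(inner, ast.Attribute) and inner.attr in INDEXERS:
        inner = inner.value
    return isinstance(inner, ast.Subscript)


def find_chained_setitem(source):
    """Return sorted line numbers of assignments whose target looks chained."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign):
            targets = node.targets
        elif isinstance(node, ast.AugAssign):
            targets = [node.target]
        else:
            continue
        if any(_is_chained_target(t) for t in targets):
            hits.append(node.lineno)
    return sorted(hits)


sample = "\n".join(
    [
        'df["a"].loc[2] = 112',  # line 1: chained, flagged
        'df.loc[3, "b"] = 999',  # line 2: single setitem, not flagged
        'df["b"][0] = 1',        # line 3: chained, flagged
    ]
)
flagged = find_chained_setitem(sample)
```

A scan like this can only produce candidates for manual review; it cannot
decide whether the object being indexed is actually a DataFrame.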

> 3. I'm guessing there is lots of code where people use DataFrame.copy() to
> avoid the SettingWithCopy warning.  Can they just remove those copies now
> and their code will work?
>

Yes, I think so. Especially if you called copy() to avoid the warning, you
were never modifying the original parent dataframe, and that will become the
default/automatic behaviour with the proposal.
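
As a rough illustration of why the explicit copy becomes unnecessary, here
is a toy copy-on-write model in plain Python. It is not how pandas
implements this: CowFrame, CowColumn and set_value are hypothetical names,
and plain lists stand in for the underlying arrays. A selection shares data
with its parent until one of the two is written to, and only then is the
data copied.

```python
class CowColumn:
    """Toy copy-on-write column: shares a list until the first write."""

    def __init__(self, data, shared=False):
        self._data = data
        self._shared = shared  # True while the buffer may be shared

    def __setitem__(self, pos, value):
        if self._shared:
            self._data = list(self._data)  # copy only now, on first write
            self._shared = False
        self._data[pos] = value

    def tolist(self):
        return list(self._data)


class CowFrame:
    """Toy frame whose column selections behave like copies."""

    def __init__(self, columns):
        self._columns = {name: list(vals) for name, vals in columns.items()}
        self._shared_out = set()  # columns whose buffer was handed out

    def __getitem__(self, name):
        # No copy here: the selection starts out as a view on the buffer.
        self._shared_out.add(name)
        return CowColumn(self._columns[name], shared=True)

    def set_value(self, row, name, value):
        # Toy equivalent of df.loc[row, name] = value. If a selection may
        # still share this buffer, copy first so the selection is unaffected.
        if name in self._shared_out:
            self._columns[name] = list(self._columns[name])
            self._shared_out.discard(name)
        self._columns[name][row] = value


df = CowFrame({"a": [10, 11, 12, 13, 14], "b": [100, 101, 102, 103, 104]})

s = df["a"]   # a view under the hood, no data copied yet
s[2] = 112    # first write: s gets its own copy, df is untouched

b_before = df["b"]
df.set_value(3, "b", 999)  # direct write on the frame still works
```

In this model the user never has to call copy() to protect the parent: the
selection s ends up with 112 at position 2 while column "a" of df keeps 12,
and b_before keeps the old values of "b" after the frame is modified.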

> I agree that for new users, this new way of doing things makes sense.  I'm
> worried about how we make the transition easier for people with large code
> bases that use pandas.
>

It's indeed a big change that will affect quite a lot of people, and updating
large code bases can be a big task. So I think we need to take this seriously
and really put effort into this aspect: ensuring we have good deprecation
warnings and a very good migration guide, reaching out to (big) users to
check how the migration goes so we can improve the migration path, etc.
This is a lot of work of course, but I think it is a necessity if we want
this to be a success, and we also have some funding from the CZI grant
specifically for this aspect of the larger roadmap items.

Joris

>
> -Irv
>
>
>
>
> >> On Sat, 17 Jul 2021 at 20:51, Joris Van den Bossche <jorisvandenbossche at gmail.com> wrote:
> >>>
> >>> On Fri, 16 Jul 2021 at 20:50, Irv Lustig <irv at princeton.com> wrote:
> >>>>
> >>>>
> >>>> Tom Augspurger <tom.augspurger88 at gmail.com> wrote:
> >>>>
> >>>>> I wonder if we can validate what users (new and old) *actually* expect?
> >>>>> Users coming from R, which IIRC implements Copy on Write for matrices,
> >>>>> might be OK with indexing always being (behaving like) a copy.
> >>>>> I'm not sure what users coming from NumPy would expect, since I don't
> >>>>> know how many NumPy users really understand *a.)* when a NumPy slice
> >>>>> is a view or copy, and *b.)* how a pandas indexing operation
> >>>>> translates to a NumPy slice.
> >>>>>
> >>>>
> >>>> IMHO, we should concentrate on the "new" users.  For my team, there is
> >>>> no numpy or R background.  They learn pandas, and what pandas does
> >>>> needs to be really clear in behavior and documentation.  I would also
> >>>> hazard a guess that most pandas users are like that - pandas is the
> >>>> first tool they see, not numpy or R.
> >>>>
> >>>> The places where I think confusion could happen are things like this
> >>>> with a DataFrame df:
> >>>>
> >>>> s = df["a"]
> >>>> s.iloc[3:5] = [1, 2, 3]
> >>>> df["a"].iloc[3:5] = [1, 2, 3]
> >>>> df["b"] = df["a"]
> >>>> df["b"].iloc[3:5] = [4, 5, 6]
> >>>> s2 = df["b"]
> >>>> df["c"] = s2
> >>>> s2.iloc[3:5] = [7, 8, 9]
> >>>>
> >>>> As I understand it (please correct me if I'm wrong), these lines would
> >>>> be interpreted as follows with the current proposal:
> >>>
> >>>
> >>> It's a bit different (to reiterate, with the *current* proposal, *any*
> >>> indexing operation (including series selection) behaves as a copy; and
> >>> also to be clear, this is one possible proposal, there are certainly
> >>> other possibilities). Answering case by case:
> >>>
> >>>>
> >>>> 1. s = df["a"]
> >>>> Creates a view into the DataFrame df.  No copying is done at all
> >>>
> >>>
> >>> Indeed a view (but that's an implementation detail)
> >>>
> >>>> 2. s.iloc[3:5] = [1, 2, 3]
> >>>> Modifies the series s and the underlying DataFrame df. (copy-on-write)
> >>>
> >>>
> >>> Due to copy-on-write, it does *not* modify the DataFrame df.
> >>> Copy-on-write means that only when s is being written to, its data get
> >>> copied (so at that point breaking the view-relation with the parent df)
> >>>
> >>>>
> >>>> 3. df["a"].iloc[3:5] = [1, 2, 3]
> >>>> Modifies the dataframe
> >>>
> >>>
> >>> This is an example of chained assignment, which in the current proposal
> >>> never works (see the example in the google doc). This is because chained
> >>> assignment can always be written as:
> >>>
> >>> temp = df["a"]
> >>> temp.iloc[3:5] = [1, 2, 3]
> >>>
> >>> and `temp` uses copy-on-write (and then it is the same example as the
> >>> one above in 2.).
> >>>
> >>> (what you describe is the current behaviour of pandas)
> >>>
> >>>>
> >>>> 4. df["b"] = df["a"]
> >>>> Copies the series from "a" to "b"
> >>>
> >>>
> >>> It would indeed behave as a copy, but under the hood we can actually
> >>> keep this as a view (delay the copy thanks to copy-on-write).
> >>>
> >>>>
> >>>> 5. df["b"].iloc[3:5] = [4, 5, 6]
> >>>> Modifies "b" in the DataFrame, but not "a"
> >>>
> >>>
> >>> Also doesn't modify "b" (see example 3. above), but indeed does not
> >>> modify "a"
> >>>
> >>>>
> >>>> 6. s2 = df["b"]
> >>>> Create a view into the DataFrame df.  No copying is done at all.
> >>>
> >>>
> >>> Same as 1.
> >>>
> >>>>
> >>>> 7. df["c"] = s2
> >>>> Copies the series from "b" to "c"
> >>>
> >>>
> >>> Same as 4.
> >>>
> >>>>
> >>>> 8. s2.iloc[3:5] = [7, 8, 9]
> >>>> Modifies s2, which modifies "b", but NOT "c"
> >>>
> >>>
> >>> Doesn't modify "b" or "c". Similar to 3.
> >>>
> >>>> I think the challenge is explaining the sequence 6,7,8 above in
> >>>> comparison to the other sequences.
> >>>
> >>>
> >>> So with the current proposal, the sequence 6, 7, 8 actually doesn't
> >>> behave differently. But it is mainly 2 and 3 that would be quite
> >>> different compared to the current pandas behaviour.
> >>>
> >>>>
> >>>>
> >>>> -Irv
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev
>