[Pandas-dev] Proposal for consistent, clear copy/view semantics in pandas with Copy-on-Write

Irv Lustig irv at princeton.com
Wed Dec 15 09:54:18 EST 2021


Joris:
I finally had some time to study our conversation from July, reread the
Google docs proposal, and I tried out the PR as well.

What I'm struggling with is how we document where behavior will change.  As
an example, the following sequence will give different results:

Current behavior:
>>> df = pd.DataFrame({"a":[10,11,12,13,14], "b": [100,101,102,103,104]})
>>> df["a"].loc[2] = 112
>>> df
     a    b
0   10  100
1   11  101
2  112  102
3   13  103
4   14  104


New behavior (from the PR):
>>> df = pd.DataFrame({"a":[10,11,12,13,14], "b": [100,101,102,103,104]})
>>> df["a"].loc[2] = 112
>>> df
    a    b
0  10  100
1  11  101
2  12  102
3  13  103
4  14  104

But in both cases, the following works:

>>> df.loc[3,"b"] = 999
>>> df
    a    b
0  10  100
1  11  101
2  12  102
3  13  999
4  14  104
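
For reference, the chained write from the first example can be expressed as
a single .loc call on the frame. It has the same form as the
df.loc[3, "b"] statement above, so it should update df under both the
current behavior and the PR. A minimal sketch, reusing the same frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [10, 11, 12, 13, 14], "b": [100, 101, 102, 103, 104]})

# One indexing step on df itself: row label 2, column "a".
# No intermediate Series is created, so the write lands in df.
df.loc[2, "a"] = 112

print(df["a"].tolist())  # [10, 11, 112, 13, 14]
```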

So my concern is that if you had existing code that used the pattern
df["a"].loc[2] = 112, you'd get no warning that the behavior had changed.
What I don't know is how much code in the wild assumes the current behavior.
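
To make the silent no-op concrete, here is a toy model in plain Python of
the copy-on-write rule as I understand it from the proposal. ToyFrame and
ToySeries are hypothetical names invented for this sketch; this is not how
the PR is implemented, only an illustration of why the chained write is
lost without any error.

```python
class ToySeries:
    """Hypothetical stand-in for a Series selected from a frame."""

    def __init__(self, values, shared):
        self._values = values  # may be the parent's own list (a "view")
        self._shared = shared  # True while the list is shared with a parent

    def __setitem__(self, pos, value):
        if self._shared:
            # Copy-on-write: detach from the parent before the first write.
            self._values = list(self._values)
            self._shared = False
        self._values[pos] = value

    def tolist(self):
        return list(self._values)


class ToyFrame:
    """Hypothetical stand-in for a DataFrame: a dict of column lists."""

    def __init__(self, columns):
        self._columns = {name: list(vals) for name, vals in columns.items()}

    def __getitem__(self, name):
        # Selection hands out the parent's list without copying.
        return ToySeries(self._columns[name], shared=True)

    def set_value(self, pos, name, value):
        # Single-step write on the frame (analogue of df.loc[pos, name] = value).
        # Crude but safe: always write into a fresh list so outstanding views
        # never see it; a real implementation would copy only when needed.
        new_values = list(self._columns[name])
        new_values[pos] = value
        self._columns[name] = new_values

    def column(self, name):
        return list(self._columns[name])


df = ToyFrame({"a": [10, 11, 12, 13, 14]})

s = df["a"]
s[2] = 112        # lands in s's private copy, not in df
df["a"][2] = 112  # same mechanism, but the temporary is discarded at once

after_chained = df.column("a")
df.set_value(2, "a", 112)
after_direct = df.column("a")

print(s.tolist())     # [10, 11, 112, 13, 14]
print(after_chained)  # [10, 11, 12, 13, 14]
print(after_direct)   # [10, 11, 112, 13, 14]
```

In this model the chained statement is valid and raises nothing; it just
writes to an object nobody keeps, which is the part that worries me.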

So my questions are now:
1. How will we document, in a clean and concise way, the new behavior for
people with existing pandas code?
2. How can people find pandas code where the behavior will change?  Can we
list all patterns that would produce different results?  Can we detect
chained indexing with setitem calls?
3. I'm guessing there is lots of code where people use DataFrame.copy() to
avoid the SettingWithCopy warning.  Can they just remove those copies now
and their code will work?
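
On question 2, one possible starting point is a purely syntactic scan with
the standard-library ast module. The sketch below is an assumption about
what such a check could look like, not an existing pandas tool;
find_chained_setitem and INDEXERS are names made up for illustration. It
flags any assignment whose target indexes the result of another indexing
step, optionally through .loc/.iloc/.at/.iat.

```python
import ast

# Accessor names treated as part of an indexing step (assumed list).
INDEXERS = {"loc", "iloc", "at", "iat"}


def _base(node):
    """Return the object being indexed, skipping a .loc/.iloc/.at/.iat accessor."""
    value = node.value
    if isinstance(value, ast.Attribute) and value.attr in INDEXERS:
        return value.value
    return value


def find_chained_setitem(source):
    """Return sorted line numbers of assignments that write through two indexing steps."""
    hits = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign):
            targets = node.targets
        elif isinstance(node, ast.AugAssign):
            targets = [node.target]
        else:
            continue
        for target in targets:
            # Target is x[...] or x.loc[...], and x is itself a subscript.
            if isinstance(target, ast.Subscript) and isinstance(_base(target), ast.Subscript):
                hits.add(node.lineno)
    return sorted(hits)


sample = '''
df["a"].loc[2] = 112
df.loc[3, "b"] = 999
df["b"].iloc[3:5] = [4, 5]
s = df["a"]
s.iloc[3:5] = [1, 2]
'''
print(find_chained_setitem(sample))  # [2, 4]
```

Two limits are worth stating. The scan knows nothing about types, so it
also flags ordinary nested containers such as d["x"]["y"] = 1. And it
misses the two-statement form (s = df["a"] followed by s.iloc[...] = ...),
which changes behavior in the same way.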

I agree that for new users, this new way of doing things makes sense.  I'm
worried about how we make the transition easier for people with large code
bases that use pandas.

-Irv




>> On Sat, 17 Jul 2021 at 20:51, Joris Van den Bossche <jorisvandenbossche at gmail.com> wrote:
>>>
>>> On Fri, 16 Jul 2021 at 20:50, Irv Lustig <irv at princeton.com> wrote:
>>>>
>>>>
>>>> Tom Augspurger <tom.augspurger88 at gmail.com> wrote:
>>>>
>>>>> I wonder if we can validate what users (new and old) *actually* expect?
>>>>> Users coming from R, which IIRC implements Copy on Write for matrices,
>>>>> might be OK with indexing always being (behaving like) a copy.
>>>>> I'm not sure what users coming from NumPy would expect, since I don't
>>>>> know how many NumPy users really understand *a.)* when a NumPy slice is
>>>>> a view or copy, and *b.)* how a pandas indexing operation translates to
>>>>> a NumPy slice.
>>>>>
>>>>
>>>> IMHO, we should concentrate on the "new" users.  For my team, there is
>>>> no numpy or R background.  They learn pandas, and what pandas does needs
>>>> to be really clear in behavior and documentation.  I would also hazard a
>>>> guess that most pandas users are like that - pandas is the first tool
>>>> they see, not numpy or R.
>>>>
>>>> The places where I think confusion could happen are things like this
>>>> with a DataFrame df:
>>>>
>>>> s = df["a"]
>>>> s.iloc[3:5] = [1, 2, 3]
>>>> df["a"].iloc[3:5] = [1, 2, 3]
>>>> df["b"] = df["a"]
>>>> df["b"].iloc[3:5] = [4, 5, 6]
>>>> s2 = df["b"]
>>>> df["c"] = s2
>>>> s2.iloc[3:5] = [7, 8, 9]
>>>>
>>>> As I understand it (please correct me if I'm wrong), these lines would
>>>> be interpreted as follows with the current proposal:
>>>
>>>
>>> It's a bit different (to reiterate, with the *current* proposal, *any*
>>> indexing operation (including series selection) behaves as a copy; and
>>> also to be clear, this is one possible proposal, there are certainly
>>> other possibilities). Answering case by case:
>>>
>>>>
>>>> 1. s = df["a"]
>>>> Creates a view into the DataFrame df.  No copying is done at all
>>>
>>>
>>> Indeed a view (but that's an implementation detail)
>>>
>>>> 2. s.iloc[3:5] = [1, 2, 3]
>>>> Modifies the series s and the underlying DataFrame df.  (copy-on-write)
>>>
>>>
>>> Due to copy-on-write, it does *not* modify the DataFrame df.
>>> Copy-on-write means that only when s is being written to, its data get
>>> copied (so at that point breaking the view-relation with the parent df)
>>>
>>>>
>>>> 3. df["a"].iloc[3:5] = [1, 2, 3]
>>>> Modifies the dataframe
>>>
>>>
>>> This is an example of chained assignment, which in the current proposal
>>> never works (see the example in the google doc). This is because chained
>>> assignment can always be written as:
>>>
>>> temp = df["a"]
>>> temp.iloc[3:5] = [1, 2, 3]
>>>
>>> and `temp` uses copy-on-write (and then it is the same example as the
>>> one above in 2.).
>>>
>>> (what you describe is the current behaviour of pandas)
>>>
>>>>
>>>> 4. df["b"] = df["a"]
>>>> Copies the series from "a" to "b"
>>>
>>>
>>> It would indeed behave as a copy, but under the hood we can actually
>>> keep this as a view (delay the copy thanks to copy-on-write).
>>>
>>>>
>>>> 5. df["b"].iloc[3:5] = [4, 5, 6]
>>>> Modifies "b" in the DataFrame, but not "a"
>>>
>>>
>>> Also doesn't modify "b" (see example 3. above), but indeed does not
>>> modify "a"
>>>
>>>>
>>>> 6. s2 = df["b"]
>>>> Create a view into the DataFrame df.  No copying is done at all.
>>>
>>>
>>> Same as 1.
>>>
>>>>
>>>> 7. df["c"] = s2
>>>> Copies the series from "b" to "c"
>>>
>>>
>>> Same as 4.
>>>
>>>>
>>>> 8. s2.iloc[3:5] = [7, 8, 9]
>>>> Modifies s2, which modifies "b", but NOT "c"
>>>
>>>
>>> Doesn't modify "b" and "c". Similar to 3.
>>>
>>>> I think the challenge is explaining the sequence 6,7,8 above in
>>>> comparison to the other sequences.
>>>
>>>
>>> So with the current proposal, the sequence 6, 7, 8 actually doesn't
>>> behave differently. But it is mainly 2 and 3 that would be quite
>>> different compared to the current pandas behaviour.
>>>
>>>>
>>>>
>>>> -Irv