[Pandas-dev] Proposal for consistent, clear copy/view semantics in pandas with Copy-on-Write

Joris Van den Bossche jorisvandenbossche at gmail.com
Sat Jul 17 14:51:39 EDT 2021


On Fri, 16 Jul 2021 at 20:50, Irv Lustig <irv at princeton.com> wrote:

>
> Tom Augspurger <tom.augspurger88 at gmail.com> wrote:
>
>> I wonder if we can validate what users (new and old) *actually* expect?
>> Users coming from R, which IIRC implements Copy on Write for matrices,
>> might be OK with indexing always being (behaving like) a copy.
>> I'm not sure what users coming from NumPy would expect, since I don't know
>> how many NumPy users really understand *a.)* when a NumPy slice is a view
>> or copy, and *b.)* how a pandas indexing operation translates to a NumPy
>> slice.
>>
>>
> IMHO, we should concentrate on the "new" users.  For my team, there is no
> numpy or R background.  They learn pandas, and what pandas does needs to be
> really clear in behavior and documentation.  I would also hazard a guess
> that most pandas users are like that - pandas is the first tool they see,
> not numpy or R.
>
> The places where I think confusion could happen are things like this with
> a DataFrame df :
>
>    1. s = df["a"]
>    2. s.iloc[3:5] = [1, 2, 3]
>    3. df["a"].iloc[3:5] = [1, 2, 3]
>    4. df["b"] = df["a"]
>    5. df["b"].iloc[3:5] = [4, 5, 6]
>    6. s2 = df["b"]
>    7. df["c"] = s2
>    8. s2.iloc[3:5] = [7, 8, 9]
>
> As I understand it (please correct me if I'm wrong), these lines would be
> interpreted as follows with the current proposal:
>

It's a bit different. To reiterate: with the *current* proposal, *any*
indexing operation (including series selection) behaves as a copy. Also, to
be clear, this is one possible proposal; there are certainly other
possibilities. Answering case by case:


> 1. s = df["a"]
> Creates a view into the DataFrame df.  No copying is done at all
>

Indeed a view under the hood, but that's an implementation detail: to the
user it behaves as a copy.

> 2. s.iloc[3:5] = [1, 2, 3]
> Modifies the series s and the underlying DataFrame df.  (copy-on-write)
>

Due to copy-on-write, it does *not* modify the DataFrame df. Copy-on-write
means that the data of s get copied only when s is written to, and that copy
is what breaks the view relation with the parent df.
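
To make the mechanism concrete, here is a toy sketch in plain Python. It is
not how pandas implements this; the names `Buffer`, `CowColumn`, `view` and
`shares_memory_with` are made up for illustration, and the explicit owner
count is a simplification. The slice gets two values here so that it matches
the two positions 3:5.

```python
class Buffer:
    """Holds the actual values plus a count of columns pointing at it."""

    def __init__(self, values):
        self.values = list(values)
        self.owners = 1


class CowColumn:
    """Toy column that shares its buffer until one side writes to it."""

    def __init__(self, values):
        self._buf = Buffer(values)

    def view(self):
        # Selecting does not copy: the new column points at the same buffer.
        other = CowColumn.__new__(CowColumn)
        other._buf = self._buf
        self._buf.owners += 1
        return other

    def shares_memory_with(self, other):
        return self._buf is other._buf

    def __setitem__(self, key, value):
        if self._buf.owners > 1:
            # Another column still reads this buffer: detach before writing.
            self._buf.owners -= 1
            self._buf = Buffer(self._buf.values)
        self._buf.values[key] = value

    def tolist(self):
        return list(self._buf.values)


parent = CowColumn([0, 10, 20, 30, 40, 50])  # stands in for df["a"] inside df
s = parent.view()                            # like s = df["a"]
shared_before_write = s.shares_memory_with(parent)

s[3:5] = [1, 2]                              # like s.iloc[3:5] = [1, 2]
shared_after_write = s.shares_memory_with(parent)
```

Before the write both objects point at one buffer; the write makes s take
its own copy first, so the parent never sees the new values.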


> 3. df["a"].iloc[3:5] = [1, 2, 3]
> Modifies the dataframe
>

This is an example of chained assignment, which in the current proposal
never works (see the example in the google doc
<https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit#heading=h.dbqr0xsneytk>).
This is because chained assignment can always be written as:

temp = df["a"]
temp.iloc[3:5] = [1, 2, 3]

and `temp` uses copy-on-write (and then it is the same example as the one
above in 2.).

(what you describe is the current behaviour of pandas)


> 4. df["b"] = df["a"]
> Copies the series from "a" to "b"
>

It would indeed behave as a copy, but under the hood we can actually keep
this as a view (delay the copy thanks to copy-on-write).


> 5. df["b"].iloc[3:5] = [4, 5, 6]
> Modifies "b" in the DataFrame, but not "a"
>

This doesn't modify "b" either (see example 3 above), and it indeed does not
modify "a".


> 6. s2 = df["b"]
> Create a view into the DataFrame df.  No copying is done at all.
>

Same as 1.


> 7. df["c"] = s2
> Copies the series from "b" to "c"
>

Same as 4.


> 8. s2.iloc[3:5] = [7, 8, 9]
> Modifies s2, which modifies "b", but NOT "c"
>

Doesn't modify "b" or "c". Similar to 3.

> I think the challenge is explaining the sequence 6,7,8 above in comparison
> to the other sequences.
>

So with the current proposal, the sequence 6, 7, 8 actually doesn't behave
differently from the other sequences. It is mainly 2 and 3 that would be
quite different compared to the current pandas behaviour.
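
For anyone who wants to try the cases above, here is a small sketch. It
assumes a pandas version that ships the opt-in `mode.copy_on_write` option
(pandas 1.5 or later, which postdates this thread); the sample values are
made up, and each assignment uses two values so that it matches the two
positions 3:5.

```python
import pandas as pd

# Assumption: pandas >= 1.5, where this opt-in option exists.
pd.set_option("mode.copy_on_write", True)

original = [0, 10, 20, 30, 40, 50]
df = pd.DataFrame({"a": original})

# Cases 1-2: writing to the selected Series leaves df untouched.
s = df["a"]
s.iloc[3:5] = [1, 2]
assert s.tolist() == [0, 10, 20, 1, 2, 50]
assert df["a"].tolist() == original

# Case 3: chained assignment, written out with an explicit temporary.
temp = df["a"]
temp.iloc[3:5] = [1, 2]
assert df["a"].tolist() == original

# Case 4: behaves as a copy of "a" into "b".
df["b"] = df["a"]

# Cases 6-8: s2 changes, but neither "b" nor "c" does.
s2 = df["b"]
df["c"] = s2
s2.iloc[3:5] = [7, 8]
assert s2.tolist() == [0, 10, 20, 7, 8, 50]
assert df["b"].tolist() == original
assert df["c"].tolist() == original
```

Without the option, on the pandas versions current at the time of writing,
cases 2 and 3 would be expected to write through to df instead.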


>
> -Irv
>