[Pandas-dev] Proposal for consistent, clear copy/view semantics in pandas with Copy-on-Write

Joris Van den Bossche jorisvandenbossche at gmail.com
Sat Jul 17 11:16:32 EDT 2021


On Fri, 16 Jul 2021 at 18:58, Stephan Hoyer <shoyer at gmail.com> wrote:

> On Fri, Jul 16, 2021 at 9:04 AM Brock Mendel <jbrockmendel at gmail.com>
> wrote:
>
>> I do not like the fact that nothing can ever be "just a view" with these
>> semantics, including series[::-1], frame[col], frame[:]. Users reasonably
>> expect numpy semantics for these.
>>

I am personally not sure what "users" in general expect for those (as
also mentioned by Tom and Irv already, depending on their background, they
might expect different things).
For example, a user who knows basic Python could actually expect all of
those examples to give a copy, since `a_list[:]` is a typical way to
make a copy of a list.
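
To make the two possible expectations concrete, here is a small sketch using
only the Python standard library (the variable names are just for
illustration): a full slice of a list is an independent copy, while a full
slice of a `memoryview` still writes through to its parent buffer, which is
closer to what numpy does.

```python
# Full slice of a list: a new (shallow) list, independent of the original.
a_list = [1, 2, 3]
a_copy = a_list[:]
a_copy[0] = 99

# Full slice of a memoryview: still a view on the same underlying buffer.
buf = bytearray(b"abc")
view = memoryview(buf)[:]
view[0] = ord("z")

print(a_list)  # the original list is unchanged
print(a_copy)  # only the copy was modified
print(buf)     # the parent buffer was modified through the view
```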

(it might be interesting to reach out to educators (who might have more
experience with expectations/typical errors of novice users) or to do some
kind of experiment on this topic)

Personally, I cannot remember ever relying on the mutability aspect of
e.g. `series[1:3]` or `frame[:]` being a view. I think there are generally
two reasons for users to care about a view: 1) performance (less copying)
and 2) being able to mutate the view with the explicit goal of mutating
the parent (and not as an irrelevant side effect).
I think the first reason is by far the most common one (but that's my
subjective opinion from my experience using pandas, so that can certainly
differ), and in the current proposal, all of the examples mentioned will be
actual views under the hood (and thus cover this first reason).

The only case where I know I explicitly rely on this is chained
assignment (e.g. `frame[col][1:3] = ..`). That's certainly a very important
use case (and probably the usage pattern most impacted by the current
proposal), but it's also a case where 1) there is a clear alternative:
don't use chained assignment, but do it in one step (e.g. `frame.loc[1:3,
col] = ..`), some corner cases of mixed positional/label-based indexing
aside, for which we should find an alternative; and 2) we might be able to
detect this and raise an informative error message (specifically for
chained assignment).
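
As a minimal sketch of the one-step alternative (the frame contents and the
column names "col" and "other" are made up for illustration), both a
label-based and a purely positional assignment avoid the intermediate
object entirely. Note that `.loc` slices include both endpoints, which is
one of the positional/label-based differences mentioned above.

```python
import pandas as pd

frame = pd.DataFrame({"col": [10, 20, 30, 40], "other": [1, 2, 3, 4]})

# One-step, label-based assignment. With the default integer index,
# labels 1 and 2 cover the same rows as the positional slice [1:3],
# because .loc slices are inclusive of the end label.
frame.loc[1:2, "col"] = 0

# One-step, purely positional assignment on another column.
frame.iloc[1:3, frame.columns.get_loc("other")] = -1

print(frame)
```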

I think it can be easier to explain "chained assignment never works" than
"chained assignment only works if first selecting the column(s)" (depending
on the exact rules).


>> We should revisit the alternative "clear/simple rules" approach that is
>> "indexing on columns always gives a view" (
>> https://github.com/pandas-dev/pandas/pull/33597). This is simpler to
>> explain/grok, simpler to implement
>>
>
> I don't know if it is worth the trouble for complex multi-column
> selections, but I do see the appeal here.
>
> A simpler variant would be to make indexing out a single Series from a
> DataFrame return a view, with everything else doing copy on write. Then the
> existing pattern df.column_one[:] = ... would still work.
>

I was initially thinking about this as well. In the end, I didn't (yet) try
to implement this, because while thinking it through, it seemed that this
might give rise to quite a few tricky cases. Consider the following example:

import pandas as pd

# illustrative data
df = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": [5, 6, 7, 8]})
df_subset = df[["col1", "col2"]]
s1 = df["col1"]
s1_subset = s1[0:3]
# modifying s1 should modify df, but not df_subset and s1_subset?
s1[0] = 0

If we take "only accessing a single Series from a DataFrame is a view,
everything else uses copy-on-write", that gives rise to questions like the
above, where some parents/children get modified and some do not.
This is both harder to explain to users and harder to implement. In the
implementation of the proof-of-concept, the copy-on-write happens "locally"
in the series/dataframe that gets modified (meaning: when modifying a given
object, its internal array data first gets copied and replaced *if* the
object is viewing another object or is being viewed by another object). In
the above case, by contrast, modifying a given object would need to trigger
a copy in other (potentially many) objects, and not in the object being
modified. It's probably possible to implement this, but certainly
harder/trickier to do.
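
To illustrate what "local" copy-on-write means here, below is a toy sketch
in plain Python. It is not the proof-of-concept code: the `CowArray` class,
its `view()` method and the weak-reference bookkeeping are all hypothetical
names and choices made up for this illustration. The point is only that the
copy is triggered inside the object being modified, and only when its
buffer is shared with another live object.

```python
import weakref


class CowArray:
    """Toy 1-D container with 'local' copy-on-write (hypothetical)."""

    def __init__(self, data, _shared=None):
        # A plain list stands in for the internal array buffer.
        self._data = data
        # Weak references to every CowArray currently sharing self._data.
        self._shared = _shared if _shared is not None else weakref.WeakSet()
        self._shared.add(self)

    def view(self):
        # Share the buffer and the bookkeeping set; nothing is copied yet.
        return CowArray(self._data, self._shared)

    def __getitem__(self, i):
        return self._data[i]

    def __setitem__(self, i, value):
        # Copy only in the object being modified, and only if it is
        # viewing another object or being viewed by one.
        if len(self._shared) > 1:
            self._shared.discard(self)
            self._data = list(self._data)
            self._shared = weakref.WeakSet([self])
        self._data[i] = value


parent = CowArray([1, 2, 3])
child = parent.view()

child[0] = 99                 # child copies its buffer before writing
parent_buffer = parent._data
parent[1] = 5                 # parent is no longer shared: no copy needed

print(parent[0], parent[1], child[0], child[1])
```

In the "single column is a real view" variant, by contrast, a write to one
object would have to locate the other objects sharing the buffer and
trigger the copy in them, which this simple local check cannot express.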


>
>
>>
>> On Fri, Jul 16, 2021 at 5:26 AM Joris Van den Bossche <
>> jorisvandenbossche at gmail.com> wrote:
>>
>>>
>>>
>>> On Mon, 12 Jul 2021 at 00:58, Joris Van den Bossche <
>>> jorisvandenbossche at gmail.com> wrote:
>>>
>>>> Short summary of the proposal:
>>>>
>>>>    1. The result of *any* indexing operation (subsetting a DataFrame
>>>>    or Series in any way) or any method returning a new DataFrame, always *behaves
>>>>    as if it were* a copy in terms of user API.
>>>>
>>> To explicitly call out the column-as-Series case (since this is a
>>> typical case that right now *always* is a view): "any" indexing
>>> operation thus also includes accessing a DataFrame column as a Series (or
>>> slicing a Series).
>>>
>>> So something like s = df["col"] and then mutating s will no longer
>>> update df. Similarly for series_subset = series[1:5], mutating
>>> series_subset will no longer update series.
>>> _______________________________________________
>>> Pandas-dev mailing list
>>> Pandas-dev at python.org
>>> https://mail.python.org/mailman/listinfo/pandas-dev
>>>
>