[Pandas-dev] Proposal for consistent, clear copy/view semantics in pandas with Copy-on-Write

Sat Jul 17 12:12:32 EDT 2021

In my experience (not sure how biased it is), modifying dataframes with
something like `df[col][1:3] = ...` (or the equivalent with `.loc`) is rare,
except with boolean arrays. When the values of a dataframe column are
changed, I think it's far more common to use
`df[col] = df[col].str.upper()`, `df[col] = df[col].fillna(0)`...

While I'm personally happy with Joris' proposal, I see two other options
that could complement or replace it:

Option 1) Deprecate assigning to a subset of rows, and only allow assigning
to whole columns. Something like `df[col][1:3] = ...` could be replaced by,
for example, `df[col] = df[col].mask(slice(1, 3), ...)`. `mask` and `where`
already support boolean arrays, so support for slices would need to be
added, and they'd be the only way to replace a subset of values. I think
that makes the problem narrower, and easier for users to understand. The
main thing to decide and be clear about is what happens if the dataframe is
a subset of another one:

```
df2 = df[cond]
df2[col] = df2[col].str.upper()
```
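
As a rough sketch of what option 1 looks like with what pandas already
supports today (a boolean array passed to `mask`, since slices are not
accepted there), assuming a toy dataframe with a single string column
named `col` that I made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration only.
df = pd.DataFrame({"col": ["a", "b", "c", "d"]})

# Boolean array selecting positions 1 and 2, i.e. the rows that the
# positional slice [1:3] would select.
positions = np.zeros(len(df), dtype=bool)
positions[1:3] = True

# Replace the selected values and assign the whole column back,
# instead of assigning into a subset of rows.
df["col"] = df["col"].mask(positions, "X")
```

The `slice(1, 3)` spelling from option 1 would just be shorthand for
building `positions` by hand.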

Option 2) If assigning with the current syntax (`df[col][1:3] = ...` or the
`.loc` equivalent) is something we want to keep (I wouldn't if we move in
this direction), maybe it could be moved to a `DataFrame` subclass. The
main dataframe class would then behave as in option 1, so expectations are
much easier to manage. Users who really want to assign with indexing could
still do so, knowing that having a mutable dataframe comes at a cost
(copies, more complex behavior...). The `MutableDataFrame` could be in
pandas, or a third-party extension.

```
df_mutable = df.to_mutable()
df_mutable[col][1:3] = ...
```

On Sat, Jul 17, 2021 at 9:16 AM Joris Van den Bossche <
jorisvandenbossche at gmail.com> wrote:

>
> On Fri, 16 Jul 2021 at 18:58, Stephan Hoyer <shoyer at gmail.com> wrote:
>
>> On Fri, Jul 16, 2021 at 9:04 AM Brock Mendel <jbrockmendel at gmail.com>
>> wrote:
>>
>>> I do not like the fact that nothing can ever be "just a view" with these
>>> semantics, including series[::-1], frame[col], frame[:]. Users reasonably
>>> expect numpy semantics for these.
>>>
> I am personally not sure what "users" in general expect for those (as
> also mentioned by Tom and Irv already, depending on their background, they
> might expect different things).
> For example, for a user that knows basic Python, they could actually
> expect all those examples to give a copy since `a_list[:]` is a typical way
> to make a copy of a list.
>
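For reference, the two expectations being contrasted here can be seen
side by side in a small sketch (plain Python and NumPy only; the variable
names are made up): slicing a list with `[:]` gives an independent copy,
while the same slice on a NumPy array gives a view on the same buffer.

```python
import numpy as np

a_list = [1, 2, 3]
list_slice = a_list[:]   # new list object (shallow copy)
list_slice[0] = 99       # does not touch a_list

arr = np.array([1, 2, 3])
arr_slice = arr[:]       # view sharing memory with arr
arr_slice[0] = 99        # also changes arr
```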
> (it might be interesting to reach out to educators (who might have more
> experience with expectations/typical errors of novice users) or to do some
> kind of experiment on this topic)
>
> Personally, I cannot remember that I ever relied on the mutability-aspect
> of eg `series[1:3]` or `frame[:]` being a view. I think there are generally
> 2 reasons for users caring about a view: 1) for performance (less copying)
> and 2) for being able to mutate the view with the explicit goal to mutate
> the parent (and not as an irrelevant side-effect).
> I think the first reason is by far the most common one (but that's my
> subjective opinion from my experience using pandas, so that can certainly
> differ), and in the current proposal, all the examples mentioned will be
> actual views under the hood (and thus cover this first reason).
>
> The only case where I know I explicitly rely on this is with chained
> assignment (e.g. `frame[col][1:3] = ..`). That's certainly a very important
> use case (and probably the usage pattern most impacted by the current
> proposal), but it's also a case where 1) there is a clear alternative:
> don't use chained assignment, but do it in one step, e.g.
> `frame.loc[1:3, col] = ..` (some corner cases of mixed positional/label-based
> indexing aside, for which we should find an alternative), and 2) we might
> be able to detect this and raise an informative error message
> (specifically for chained assignment).
>
> I think it can be easier to explain "chained assignment never works" than
> "chained assignment only works if first selecting the column(s)" (depending
> on the exact rules).
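
A minimal sketch of the one-step alternative, assuming a toy frame with a
default integer index and one column named `col` (both made up here). The
label/position difference is where the corner cases come from: `.loc`
slices include both endpoints, whereas the positional `[1:3]` is
half-open.

```python
import pandas as pd

# Toy data, made up for illustration only.
frame = pd.DataFrame({"col": [10, 20, 30, 40, 50]})

# One-step, label-based assignment. .loc slices include both endpoints,
# so the rows labelled 1, 2 and 3 are updated.
frame.loc[1:3, "col"] = 0
after_loc = frame["col"].tolist()

# Positional equivalent of the half-open slice [1:3]: rows 1 and 2 only.
frame.iloc[1:3, frame.columns.get_loc("col")] = -1
after_iloc = frame["col"].tolist()
```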
>
>
>>> We should revisit the alternative "clear/simple rules" approach that is
>>> "indexing on columns always gives a view" (
>>> https://github.com/pandas-dev/pandas/pull/33597). This is simpler to
>>> explain/grok, simpler to implement
>>>
>>
>> I don't know if it is worth the trouble for complex multi-column
>> selections, but I do see the appeal here.
>>
>> A simpler variant would be to make indexing out a single Series from a
>> DataFrame return a view, with everything else doing copy on write. Then the
>> existing pattern df.column_one[:] = ... would still work.
>>
>
> I was initially thinking about this as well. In the end, I didn't (yet)
> try to implement this, because while thinking it through, it seemed that
> this might lead to quite a few tricky cases. Consider the following example:
>
> df = pd.DataFrame(..)
> df_subset = df[["col1", "col2"]]
> s1 = df["col1"]
> s1_subset = s1[0:3]
> # modifying s1 should modify df, but not df_subset and s1_subset?
> s1[0] = 0
>
> If we take "only accessing a single Series from a DataFrame is a view,
> everything else uses copy-on-write", that gives rise to questions like the
> above, where some parents/children get modified and some do not.
> This is both harder to explain to users and harder to implement. For the
> implementation of the proof-of-concept, the copy-on-write happens "locally"
> in the series/dataframe that gets modified (meaning: when modifying a given
> object, its internal array data first gets copied and replaced *if* the
> object is viewing another or is being viewed by another object). While in
> the above case, modifying a given object would need to trigger a copy in
> other (potentially many) objects, and not in the object being modified.
> It's probably possible to implement this, but certainly harder/trickier to
> do.
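
To illustrate what "locally" means in the paragraph above, here is a toy
sketch. It is not the proof-of-concept code and not how pandas does it:
`ToySeries` and its `_shared` list are hypothetical names, it only handles
slices, and it uses strong references where a real implementation would
likely want weak ones. The point is only that the copy is triggered in the
object being written to, never in the others.

```python
import numpy as np


class ToySeries:
    """Toy stand-in for a Series (hypothetical, slices only)."""

    def __init__(self, values, shared=None):
        self._values = values
        # List of every object currently sharing the same buffer.
        self._shared = shared if shared is not None else []
        self._shared.append(self)

    def __getitem__(self, key):
        # Slicing returns a NumPy view, registered in the same group.
        return ToySeries(self._values[key], self._shared)

    def __setitem__(self, key, value):
        if len(self._shared) > 1:
            # This object views, or is viewed by, another object:
            # leave the group and copy its own data before writing.
            self._shared.remove(self)
            self._values = self._values.copy()
            self._shared = [self]
        self._values[key] = value


parent = ToySeries(np.array([1, 2, 3, 4]))
child = parent[1:3]   # shares memory with parent until someone writes
child[0] = 99         # triggers a copy in child only
```

With "a single column is always a view" semantics, a write to `s1` in the
example above would instead have to trigger copies in `df_subset` and
`s1_subset`, which this local check cannot express.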
>
>
>>
>>
>>>
>>> On Fri, Jul 16, 2021 at 5:26 AM Joris Van den Bossche <
>>> jorisvandenbossche at gmail.com> wrote:
>>>
>>>>
>>>>
>>>> On Mon, 12 Jul 2021 at 00:58, Joris Van den Bossche <
>>>> jorisvandenbossche at gmail.com> wrote:
>>>>
>>>>> Short summary of the proposal:
>>>>>
>>>>>    1. The result of *any* indexing operation (subsetting a DataFrame
>>>>>    or Series in any way) or any method returning a new DataFrame, always *behaves
>>>>>    as if it were* a copy in terms of user API.
>>>>>
>>>> To explicitly call out the column-as-Series case (since this is a
>>>> typical case that right now *always* is a view): "any" indexing
>>>> operation thus also includes accessing a DataFrame column as a Series (or
>>>> slicing a Series).
>>>>
>>>> So something like s = df["col"] and then mutating s will no longer
>>>> update df. Similarly for series_subset = series[1:5], mutating
>>>> series_subset will no longer update series.
>>>> _______________________________________________
>>>> Pandas-dev mailing list
>>>> Pandas-dev at python.org
>>>> https://mail.python.org/mailman/listinfo/pandas-dev
>>>>