[Pandas-dev] Proposal for consistent, clear copy/view semantics in pandas with Copy-on-Write

Fri Jul 23 16:09:46 EDT 2021

> Memory implications should be positive (less copying).

This is accurate _only_ in cases where we currently make copies.  In cases
where we currently make views, the perf effect goes the other way.

On the flip side, Always-Views improves perf in cases where we currently
make copies, but if you want a copy then you'll have to make one explicitly
which will claw back that gain.  (In the long-out-of-date proof of concept
https://github.com/pandas-dev/pandas/pull/33597,
df[np.random.randint(0, 30, 30)] was ~92% faster than the status quo at the
time.)

> The performance impact of the additional logic (adding/checking of the
> weak references) is something I didn't yet check (on my to do list), but I
> suspect it to not be significant.

Agreed the CoW logic itself should be negligible outside of microbenchmarks.
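
To make the bookkeeping concrete, here is a minimal pure-Python sketch of the
kind of weak-reference check under discussion.  It is an illustration under
stated assumptions, not the proof-of-concept's code: `CowArray`, `view` and
`copies` are hypothetical names, and a plain list stands in for array data.
The per-write overhead is one length check on a weak set; the copy itself is
paid only on the first write to shared data, which is also the case where
something that is a plain view today would get slower.

```python
import weakref


class CowArray:
    """Toy copy-on-write container (hypothetical; a list stands in for array data)."""

    def __init__(self, values):
        self._data = list(values)
        # Weak set of every live object sharing self._data (including self).
        self._owners = weakref.WeakSet([self])
        self.copies = 0  # how many times this object had to copy its data

    def view(self):
        """Return a new object sharing the same data (nothing is copied yet)."""
        child = CowArray.__new__(CowArray)
        child._data = self._data
        child._owners = self._owners
        child.copies = 0
        self._owners.add(child)
        return child

    def __getitem__(self, index):
        return self._data[index]

    def __setitem__(self, index, value):
        # The check: is any other live object still sharing our data?
        if len(self._owners) > 1:
            self._owners.discard(self)
            self._data = list(self._data)  # the copy happens locally, here
            self._owners = weakref.WeakSet([self])
            self.copies += 1
        self._data[index] = value


parent = CowArray([1, 2, 3])
child = parent.view()   # shares data, nothing copied
child[0] = 99           # first write to shared data triggers one copy
assert (parent[0], child[0]) == (1, 99)
assert child.copies == 1

parent[1] = 7           # parent is the sole owner again: written in place
assert parent.copies == 0 and child[1] == 2

tmp = parent.view()
del tmp                 # a dropped view disappears from the weak set
parent[2] = 0
assert parent.copies == 0
```

The `del tmp` step relies on CPython's reference counting to drop the view
immediately; on other interpreters the weak set may shrink later.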

On Fri, Jul 23, 2021 at 12:32 PM Joris Van den Bossche <
jorisvandenbossche at gmail.com> wrote:

> On Tue, 20 Jul 2021 at 16:10, Adrin <adrin.jalali at gmail.com> wrote:
>
>> I guess one question I have is what are the memory and time performance
>> implications of the proposed change.
>>
>
> Memory implications should be positive (less copying). The performance
> impact of the additional logic (adding/checking of the weak references) is
> something I didn't yet check (on my to do list), but I suspect it to not be
> significant.
>
> Based on your comment of "numpy array with column names", I think the
> potential change of the ArrayManager is much more relevant for you than the
> Copy-on-Write. And to be clear, the current proposal is not tied to the
> ArrayManager (it's only the proof of concept that is implemented for that).
> So I would prefer to keep the discussion focused on the copy/view
> semantics, at least for now (it's only later, when discussing practical
> ways to get this released, that we need to decide whether we want to
> combine this with an ArrayManager refactor or not).
>
>
>> I guess I belong to the group of users who think of a pandas DataFrame
>> more as a numpy array with column names attached to them, and hence I'd
>> expect very similar semantics when indexing, and I think copy on write
>> semantics would have a significant impact on our workflows.
>>
>
> I assume you are also thinking of scikit-learn-like workflows? Can you
> give an example of how you think copy-on-write would impact
> such (or other) workflows?
> In any case thanks already for your feedback!
>
>
>>
>> On Sat, Jul 17, 2021 at 6:12 PM Marc Garcia <garcia.marc at gmail.com>
>> wrote:
>>
>>> Based on my experience (not sure how biased it is), modifying dataframes
>>> with something like `df[col][1:3] = ...` is rare (or the equivalent
>>> with `.loc`) except for boolean arrays. From my experience, when the
>>> values of a dataframe column are changed, it is far more common
>>> to use `df[col] = df[col].str.upper()`, `df[col] = df[col].fillna(0)`,
>>> etc.
>>>
>>> While I'm personally happy with Joris's proposal, I see two other options
>>> that could complement or replace it:
>>>
>>> Option 1) Deprecate assigning to a subset of rows, and only allow
>>> assigning to whole columns.  Something like `df[col][1:3] = ...` could
>>> be replaced by for example `df[col] = df[col].mask(slice(1, 3), ...)`.
>>> Using `mask` and `where` is already supported for boolean arrays, so
>>> slices should be added, and they'd be the only way to replace a subset of
>>> values. I think that makes the problem narrower, and easier to understand
>>> for users. The main thing to decide and be clear about is what happens if
>>> the dataframe is a subset of another one:
>>>
>>> ```
>>> df2 = df[cond]
>>> df2[col] = df2[col].str.upper()
>>> ```
>>>
>>> Option 2) If assigning with the current syntax (`df[col][1:3] = ...`
>>> or `.loc` equivalent) is something we want to keep (I wouldn't if we
>>> move in this direction), maybe it could be moved to a `DataFrame`
>>> subclass. The main dataframe class would then behave as in option 1, so
>>> expectations are much easier to manage. But users who really want to assign
>>> with indexing can still do so, knowing that having a mutable dataframe
>>> comes at a cost (copies, more complex behavior...). The
>>> `MutableDataFrame` could be in pandas, or a third-party extension.
>>>
>>> ```
>>> df_mutable = df.to_mutable()
>>> df_mutable[col][1:3] = ...
>>> ```
>>>
>>> On Sat, Jul 17, 2021 at 9:16 AM Joris Van den Bossche <
>>> jorisvandenbossche at gmail.com> wrote:
>>>
>>>>
>>>> On Fri, 16 Jul 2021 at 18:58, Stephan Hoyer <shoyer at gmail.com> wrote:
>>>>
>>>>> On Fri, Jul 16, 2021 at 9:04 AM Brock Mendel <jbrockmendel at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I do not like the fact that nothing can ever be "just a view" with
>>>>>> these semantics, including series[::-1], frame[col], frame[:]. Users
>>>>>> reasonably expect numpy semantics for these.
>>>>>>
>>>> I am personally not sure what "users" in general expect for those (as
>>>> also mentioned by Tom and Irv already, depending on their background, they
>>>> might expect different things).
>>>> For example, for a user that knows basic Python, they could actually
>>>> expect all those examples to give a copy since `a_list[:]` is a typical way
>>>> to make a copy of a list.
>>>>
>>>> (it might be interesting to reach out to educators (who might have more
>>>> experience with expectations/typical errors of novice users) or to do some
>>>> kind of experiment on this topic)
>>>>
>>>> Personally, I cannot remember that I ever relied on the
>>>> mutability-aspect of eg `series[1:3]` or `frame[:]` being a view. I think
>>>> there are generally 2 reasons for users caring about a view: 1) for
>>>> performance (less copying) and 2) for being able to mutate the view with
>>>> the explicit goal to mutate the parent (and not as an irrelevant
>>>> side-effect).
>>>> I think the first reason is by far the most common one (but that's my
>>>> subjective opinion from my experience using pandas, so that can certainly
>>>> depend), and in the current proposal, all those mentioned examples will be
>>>> actual views under the hood (and thus cover this first reason).
>>>>
>>>> The only case where I know I explicitly rely on this is with chained
>>>> assignment (eg `frame[col][1:3] = ..`). That's certainly a very important
>>>> use case (and probably the most impacted usage pattern with the current
>>>> proposal), but it's also a case where 1) there is a clear alternative
>>>> (don't use chained assignment, but do it in one step, e.g.
>>>> `frame.loc[1:3, col] = ..`, some corner cases of mixed positional/label-based
>>>> indexing aside, for which we should find an alternative) and 2) we might be able to
>>>> detect this and raise an informative error message (specifically for
>>>> chained assignment).
>>>>
>>>> I think it can be easier to explain "chained assignment never works"
>>>> than "chained assignment only works if first selecting the column(s)"
>>>> (depending on the exact rules).
>>>>
>>>>
>>>>>> We should revisit the alternative "clear/simple rules" approach that
>>>>>> is "indexing on columns always gives a view" (
>>>>>> https://github.com/pandas-dev/pandas/pull/33597). This is simpler to
>>>>>> explain/grok, simpler to implement
>>>>>>
>>>>>
>>>>> I don't know if it is worth the trouble for complex multi-column
>>>>> selections, but I do see the appeal here.
>>>>>
>>>>> A simpler variant would be to make indexing out a single Series from a
>>>>> DataFrame return a view, with everything else doing copy on write. Then the
>>>>> existing pattern df.column_one[:] = ... would still work.
>>>>>
>>>>
>>>> I was initially thinking about this as well. In the end, I didn't (yet)
>>>> try to implement this, because while thinking it through, it seemed that
>>>> this might give quite some tricky cases. Consider the following example:
>>>>
>>>> df = pd.DataFrame(..)
>>>> df_subset = df[["col1", "col2"]]
>>>> s1 = df["col1"]
>>>> s1_subset = s1[0:3]
>>>> # modifying s1 should modify df, but not df_subset and s1_subset?
>>>> s1[0] = 0
>>>>
>>>> If we take "only accessing a single Series from a DataFrame is a view,
>>>> everything else uses copy-on-write", that gives rise to questions like the
>>>> above where some parents/children get modified, and some not.
>>>> This is both harder to explain to users, and harder to implement. For
>>>> the implementation of the proof-of-concept, the copy-on-write happens
>>>> "locally" in the series/dataframe that gets modified (meaning: when
>>>> modifying a given object, its internal array data first gets copied and
>>>> replaced *if* the object is viewing another or is being viewed by another
>>>> object). While in the above case, modifying a given object would need to
>>>> trigger a copy in other (potentially many) objects, and not in the object
>>>> being modified. It's probably possible to implement this, but certainly
>>>> harder/trickier to do.
>>>>
>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> On Fri, Jul 16, 2021 at 5:26 AM Joris Van den Bossche <
>>>>>> jorisvandenbossche at gmail.com> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, 12 Jul 2021 at 00:58, Joris Van den Bossche <
>>>>>>> jorisvandenbossche at gmail.com> wrote:
>>>>>>>
>>>>>>>> Short summary of the proposal:
>>>>>>>>
>>>>>>>>    1. The result of *any* indexing operation (subsetting a
>>>>>>>>    DataFrame or Series in any way) or any method returning a new DataFrame,
>>>>>>>>    always *behaves as if it were* a copy in terms of user API.
>>>>>>>>
>>>>>>> To explicitly call out the column-as-Series case (since this is a
>>>>>>> typical case that right now *always* is a view): "any" indexing
>>>>>>> operation thus also includes accessing a DataFrame column as a Series (or
>>>>>>> slicing a Series).
>>>>>>>
>>>>>>> So something like s = df["col"] and then mutating s will no longer
>>>>>>> update df. Similarly for series_subset = series[1:5], mutating
>>>>>>> series_subset will no longer update series.
>>>>>>> _______________________________________________
>>>>>>> Pandas-dev mailing list
>>>>>>> Pandas-dev at python.org
>>>>>>> https://mail.python.org/mailman/listinfo/pandas-dev