[Pandas-dev] Proposal for consistent, clear copy/view semantics in pandas with Copy-on-Write

Mon Jul 26 05:51:43 EDT 2021

There are two cases that I think are relevant here (as opposed to the
ArrayManager discussion), but I may be wrong.

The two cases I'm thinking of, written in a simple, non-optimized way in
pseudocode, are:

for column in columns_of(data):
    data[:, column] = (
        (data[:, column] - mean(data[:, column])) / std(data[:, column])
    )

And the other one is the same as above, but for rows.
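
To make the two cases concrete, here is a minimal pandas sketch. The toy
frame and its column names are made up for illustration, and the loops are
deliberately not optimized, same as the pseudocode above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are placeholders.
data = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 50.0]})

# Case 1: standardize column by column, assigning each whole column back.
by_column = data.copy()
for column in by_column.columns:
    values = by_column[column]
    by_column[column] = (values - values.mean()) / values.std()

# Case 2: the same operation per row, writing each row back through .loc.
by_row = data.copy()
for label in by_row.index:
    row = by_row.loc[label]
    by_row.loc[label] = (row - row.mean()) / row.std()
```

The column loop only ever assigns whole columns back, while the row loop
writes into a subset of rows through `.loc`.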

Also, one issue I have is this: if we're doing copy-on-write, then what
does the above mean? If I do `df["column_A"] = ....`, where is that
copy? How do I access the new one as opposed to the old one?
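
To pin down what I mean, here is a small sketch of the part I am fairly sure
about, independent of the proposal: after a whole-column assignment, the name
`df` still refers to the same DataFrame object, which now holds the new
values, and an explicit `.copy()` taken beforehand keeps the old ones. The
frame and the `old` variable are made up for illustration; what happens to a
Series obtained earlier *without* `.copy()` is exactly what the proposal
would need to spell out.

```python
import pandas as pd

df = pd.DataFrame({"column_A": [1.0, 2.0, 3.0], "column_B": [4.0, 5.0, 6.0]})

# An explicit copy keeps the old values no matter which semantics apply.
old = df["column_A"].copy()

frame_id = id(df)
df["column_A"] = df["column_A"] * 10   # whole-column assignment

# The assignment did not create a new DataFrame object behind the name `df`.
same_object = id(df) == frame_id
```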

On Fri, Jul 23, 2021 at 10:09 PM Brock Mendel <jbrockmendel at gmail.com>
wrote:

> > Memory implications should be positive (less copying).
>
> This is accurate _only_ in cases where we currently make copies.  In cases
> where we currently make views, the perf effect goes the other way.
>
> On the flip side, Always-Views improves perf in cases where we currently
> make copies, but if you want a copy then you'll have to make one explicitly
> which will claw back that gain.  (In the long-out-of-date proof of concept
> https://github.com/pandas-dev/pandas/pull/33597  df[np.random.randint(0,
> 30, 30)] was ~92% faster than the status quo at the time)
>
> > The performance impact of the additional logic (adding/checking of the
> weak references) is something I didn't yet check (on my to do list), but I
> suspect it to not be significant.
>
> Agreed the CoW logic itself should be negligible outside of
> microbenchmarks.
>
> On Fri, Jul 23, 2021 at 12:32 PM Joris Van den Bossche <
> jorisvandenbossche at gmail.com> wrote:
>
>> On Tue, 20 Jul 2021 at 16:10, Adrin <adrin.jalali at gmail.com> wrote:
>>
>>> I guess one question I have is what are the memory and time performance
>>> implications of the proposed change.
>>>
>>
>> Memory implications should be positive (less copying). The performance
>> impact of the additional logic (adding/checking of the weak references) is
>> something I didn't yet check (on my to do list), but I suspect it to not be
>> significant.
>>
>> Based on your comment of "numpy array with column names", I think the
>> potential change of the ArrayManager is much more relevant for you than the
>> Copy-on-Write. And to be clear, the current proposal is not tied to the
>> ArrayManager (it's only the proof of concept that is implemented for that).
>> So I would prefer to keep the discussion focused on the copy/view
>> semantics, at least for now (it's only later, when discussing practical
>> ways to get this released, that we need to decide whether we want to
>> combine this with an ArrayManager refactor or not).
>>
>>
>>> I guess I belong to the group of users who think of a pandas DataFrame
>>> more as a numpy array with column names attached to them, and hence I'd
>>> expect very similar semantics when indexing, and I think copy on write
>>> semantics would have a significant impact on our workflows.
>>>
>>
>> I assume you are also thinking of scikit-learn-like workflows? Can you
>> give an example of how you think copy-on-write impacts
>> such (or other) workflows?
>> In any case thanks already for your feedback!
>>
>>
>>>
>>> On Sat, Jul 17, 2021 at 6:12 PM Marc Garcia <garcia.marc at gmail.com>
>>> wrote:
>>>
>>>> Based on my experience (not sure how biased it is), modifying
>>>> dataframes with something like `df[col][1:3] = ...` (or the
>>>> equivalent with `.loc`) is rare, except with boolean arrays. From my
>>>> experience, when the values of a dataframe column are changed, what is
>>>> way more common is `df[col] = df[col].str.upper()`, `df[col] =
>>>> df[col].fillna(0)`...
>>>>
>>>> While I'm personally happy with Joris's proposal, I see two other options
>>>> that could complement or replace it:
>>>>
>>>> Option 1) Deprecate assigning to a subset of rows, and only allow
>>>> assigning to whole columns.  Something like `df[col][1:3] = ...` could
>>>> be replaced by for example `df[col] = df[col].mask(slice(1, 3), ...)`.
>>>> Using `mask` and `where` is already supported for boolean arrays, so
>>>> slices should be added, and they'd be the only way to replace a subset of
>>>> values. I think that makes the problem narrower, and easier to understand
>>>> for users. The main thing to decide and be clear about is what happens if
>>>> the dataframe is a subset of another one:
>>>>
>>>> ```
>>>> df2 = df[cond]
>>>> df2[col] = df2[col].str.upper()
>>>> ```
>>>>
>>>> Option 2) If assigning with the current syntax (`df[col][1:3] = ...`
>>>> or the `.loc` equivalent) is something we want to keep (I wouldn't if we
>>>> move in this direction), maybe it could be moved to a `DataFrame`
>>>> subclass. The main dataframe class would then behave as in option 1, so
>>>> expectations are much easier to manage. But users who really want to assign
>>>> with indexing can still do so, knowing that having a mutable dataframe
>>>> comes at a cost (copies, more complex behavior...). The
>>>> `MutableDataFrame` could be in pandas, or a third-party extension.
>>>>
>>>> ```
>>>> df_mutable = df.to_mutable()
>>>> df_mutable[col][1:3] = ...
>>>> ```
>>>>
>>>> On Sat, Jul 17, 2021 at 9:16 AM Joris Van den Bossche <
>>>> jorisvandenbossche at gmail.com> wrote:
>>>>
>>>>>
>>>>> On Fri, 16 Jul 2021 at 18:58, Stephan Hoyer <shoyer at gmail.com> wrote:
>>>>>
>>>>>> On Fri, Jul 16, 2021 at 9:04 AM Brock Mendel <jbrockmendel at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I do not like the fact that nothing can ever be "just a view" with
>>>>>>> these semantics, including series[::-1], frame[col], frame[:]. Users
>>>>>>> reasonably expect numpy semantics for these.
>>>>>>>
>>>>> I am personally not sure what "users" in general expect for those
>>>>> (as also mentioned by Tom and Irv already, depending on their background,
>>>>> they might expect different things).
>>>>> For example, for a user that knows basic Python, they could actually
>>>>> expect all those examples to give a copy since `a_list[:]` is a typical way
>>>>> to make a copy of a list.
>>>>>
>>>>> (it might be interesting to reach out to educators (who might have
>>>>> more experience with expectations/typical errors of novice users) or to do
>>>>> some kind of experiment on this topic)
>>>>>
>>>>> Personally, I cannot remember that I ever relied on the
>>>>> mutability-aspect of eg `series[1:3]` or `frame[:]` being a view. I think
>>>>> there are generally 2 reasons for users caring about a view: 1) for
>>>>> performance (less copying) and 2) for being able to mutate the view with
>>>>> the explicit goal to mutate the parent (and not as an irrelevant
>>>>> side-effect).
>>>>> I think the first reason is by far the most common one (but that's my
>>>>> subjective opinion from my experience using pandas, so that can certainly
>>>>> depend), and in the current proposal, all those mentioned examples will be
>>>>> actual views under the hood (and thus cover this first reason).
>>>>>
>>>>> The only case where I know I explicitly rely on this is with chained
>>>>> assignment (eg `frame[col][1:3] = ..`). That's certainly a very important
>>>>> use case (and probably the most impacted usage pattern with the current
>>>>> proposal), but it's also a case where 1) there is a clear alternative
>>>>> (don't use chained assignment, but do it in one step (e.g. `frame.loc[1:3,
>>>>> col] = ..`), some corner cases of mixed positional/label-based indexing
>>>>> aside, for which we should find an alternative) and 2) we might be able to
>>>>> detect this and raise an informative error message (specifically for
>>>>> chained assignment).
>>>>>
>>>>> I think it can be easier to explain "chained assignment never works"
>>>>> than "chained assignment only works if first selecting the column(s)"
>>>>> (depending on the exact rules).
>>>>>
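A minimal sketch of the one-step alternative mentioned above, on a made-up
frame. It also shows the positional/label corner case: with a default integer
index, `.loc[1:3]` selects labels 1 through 3 inclusive, whereas the slice in
`frame["col"][1:3]` is positional and covers only positions 1 and 2, so
`.iloc` is the closer one-step match there.

```python
import pandas as pd

frame = pd.DataFrame({"col": [0, 0, 0, 0, 0], "other": [1, 2, 3, 4, 5]})

# One step, label-based: .loc slices include both endpoints (labels 1, 2, 3).
by_label = frame.copy()
by_label.loc[1:3, "col"] = 9

# One step, positional: .iloc slices exclude the stop (positions 1 and 2),
# which matches what the positional slice in frame["col"][1:3] selects.
by_position = frame.copy()
by_position.iloc[1:3, by_position.columns.get_loc("col")] = 7
```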
>>>>>
>>>>>>> We should revisit the alternative "clear/simple rules" approach that
>>>>>>> is "indexing on columns always gives a view" (
>>>>>>> https://github.com/pandas-dev/pandas/pull/33597). This is simpler
>>>>>>> to explain/grok, simpler to implement
>>>>>>>
>>>>>>
>>>>>> I don't know if it is worth the trouble for complex multi-column
>>>>>> selections, but I do see the appeal here.
>>>>>>
>>>>>> A simpler variant would be to make indexing out a single Series from
>>>>>> a DataFrame return a view, with everything else doing copy on write. Then
>>>>>> the existing pattern df.column_one[:] = ... would still work.
>>>>>>
>>>>>
>>>>> I was initially thinking about this as well. In the end, I didn't
>>>>> (yet) try to implement this, because while thinking it through, it seemed
>>>>> that this might give quite some tricky cases. Consider the following
>>>>> example:
>>>>>
>>>>> df = pd.DataFrame(..)
>>>>> df_subset = df[["col1", "col2"]]
>>>>> s1 = df["col1"]
>>>>> s1_subset = s1[0:3]
>>>>> # modifying s1 should modify df, but not df_subset and s1_subset?
>>>>> s1[0] = 0
>>>>>
>>>>> If we take "only accessing a single Series from a DataFrame is a view,
>>>>> everything else uses copy-on-write", that gives rise to questions like the
>>>>> above where some parents/children get modified, and some not.
>>>>> This is both harder to explain to users and harder to implement. For
>>>>> the implementation of the proof-of-concept, the copy-on-write happens
>>>>> "locally" in the series/dataframe that gets modified (meaning: when
>>>>> modifying a given object, its internal array data first gets copied and
>>>>> replaced *if* the object is viewing another or is being viewed by another
>>>>> object). While in the above case, modifying a given object would need to
>>>>> trigger a copy in other (potentially many) objects, and not in the object
>>>>> being modified. It's probably possible to implement this, but certainly
>>>>> harder/trickier to do.
>>>>>
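The "local" copy-on-write described above can be illustrated with a toy model
in pure Python. This is only a sketch under stated assumptions, not the
proof-of-concept code: `ToySeries` and `_Buffer` are hypothetical names, the
data is a plain list, and sharing is tracked with a `weakref.WeakSet` so that
views that no longer exist do not force a copy.

```python
import weakref


class _Buffer:
    """Hypothetical shared storage: the data plus weak refs to its users."""

    def __init__(self, data):
        self.data = list(data)
        self.owners = weakref.WeakSet()


class ToySeries:
    """Toy stand-in for a Series; supports step-1 slicing and item setting."""

    def __init__(self, data=None, _buffer=None, _start=0, _stop=None):
        self._buf = _Buffer(data) if _buffer is None else _buffer
        self._start = _start
        self._stop = len(self._buf.data) if _stop is None else _stop
        self._buf.owners.add(self)

    def __getitem__(self, key):
        # Slicing shares the buffer: no data is copied at this point.
        start, stop, _ = key.indices(self._stop - self._start)
        return ToySeries(None, self._buf, self._start + start, self._start + stop)

    def __setitem__(self, i, value):
        # Copy locally, in the object being modified, only if the buffer
        # is currently shared with another live object.
        if len(self._buf.owners) > 1:
            old = self._buf
            old.owners.discard(self)
            self._buf = _Buffer(old.data[self._start:self._stop])
            self._buf.owners.add(self)
            self._start, self._stop = 0, len(self._buf.data)
        self._buf.data[self._start + i] = value

    def to_list(self):
        return self._buf.data[self._start:self._stop]


parent = ToySeries([1, 2, 3, 4])
child = parent[1:3]        # shares the parent's buffer
child[0] = 99              # child copies its own slice first, then writes

parent_buffer = parent._buf
parent[0] = -1             # buffer no longer shared, so this writes in place
```

The variant where modifying one object must trigger copies in *other* objects
has no equivalent in this model, which is the asymmetry described above.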
>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> On Fri, Jul 16, 2021 at 5:26 AM Joris Van den Bossche <
>>>>>>> jorisvandenbossche at gmail.com> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, 12 Jul 2021 at 00:58, Joris Van den Bossche <
>>>>>>>> jorisvandenbossche at gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Short summary of the proposal:
>>>>>>>>>
>>>>>>>>>    1. The result of *any* indexing operation (subsetting a
>>>>>>>>>    DataFrame or Series in any way) or any method returning a new DataFrame,
>>>>>>>>>    always *behaves as if it were* a copy in terms of user API.
>>>>>>>>>
>>>>>>>> To explicitly call out the column-as-Series case (since this is a
>>>>>>>> typical case that right now *always* is a view): "any" indexing
>>>>>>>> operation thus also includes accessing a DataFrame column as a Series (or
>>>>>>>> slicing a Series).
>>>>>>>>
>>>>>>>> So something like s = df["col"] and then mutating s will no longer
>>>>>>>> update df. Similarly for series_subset = series[1:5], mutating
>>>>>>>> series_subset will no longer update series.
>>>>>>>> _______________________________________________
>>>>>>>> Pandas-dev mailing list
>>>>>>>> Pandas-dev at python.org
>>>>>>>> https://mail.python.org/mailman/listinfo/pandas-dev
>>>>>>>>