[Pandas-dev] Proposal for consistent, clear copy/view semantics in pandas with Copy-on-Write

Tue Aug 10 18:13:34 EDT 2021

On Tue, 10 Aug 2021 at 12:52, Adrin <adrin.jalali at gmail.com> wrote:

>
> Silly question: why not move the other way around, i.e. always modify the
> original data, unless the user does a `copy()`? Is that not more intuitive
> to people?
>

That's certainly not a silly question :) That's an option as well, and
somewhat related to the "indexing on columns always gives a view" mentioned
by Brock above. The alternatives section in the google doc
<https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit#heading=h.5lz2xdiax72i>
also mentions a few reasons to prefer copy-on-write IMO. Some points on
this:

1) First, we can't "always modify the original data", since that is only
possible when we have a view of the original data. That might be obvious
for someone (like you and me) familiar with numpy, but if you don't have
this background, that's not necessarily the case (I am not sure numpy's
copy/view rules are necessarily intuitive, unless you are familiar with
memory layout).
So we still need some rules. The selection of columns can always be a view,
as proposed by Brock. But someone should then make a more complete proposal
for how to handle row selection: always copy, or follow numpy rules? (i.e.
basically a slice is a view, otherwise a copy)

You also get things like `df.iloc[[0, 1, 2], :]` being a copy and
`df.iloc[:, [0, 1, 2]]` being a view. Of course that's explainable (i.e.
since the storage is columnar, different copy/view rules apply to selecting
rows vs columns), but IMO not necessarily simpler than the proposal, where
both cases act as a copy.
Or you get that `df[0:5]['col'] = ...` works (a slice gives a view) but
`df[mask]['col'] = ...` doesn't (a boolean mask gives a copy).
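
As a small illustration of the numpy rules referred to in this point
(basically a slice is a view, otherwise a copy), here is a sketch using
plain numpy rather than pandas. The array and variable names are made up
for the example, and it only shows the numpy side: what pandas does in the
cases above depends on the pandas version and its internal layout.

```python
import numpy as np

arr = np.arange(10)

sliced = arr[2:5]        # basic slice: numpy returns a view
fancy = arr[[2, 3, 4]]   # integer-list indexing: numpy returns a copy
masked = arr[arr > 6]    # boolean-mask indexing: numpy returns a copy

# np.shares_memory reports whether two arrays overlap in memory.
slice_is_view = np.shares_memory(arr, sliced)
fancy_is_view = np.shares_memory(arr, fancy)
mask_is_view = np.shares_memory(arr, masked)

sliced[0] = -1   # writes through to arr[2], because sliced is a view
fancy[1] = -2    # leaves arr[3] untouched, because fancy is a copy
```

So the same-looking assignment either reaches the original array or
silently doesn't, depending on how the selection was made.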

2) For indexing it's certainly an open question what is most intuitive, but
I think for *methods* that return a new DataFrame, people generally expect
that those don't modify each other. And for me, one of the main reasons
for this proposal is that I want to improve the efficiency of methods, so
that they don't have to copy the dataframe by default (methods like rename,
(re)set_index, drop columns, etc). In my mind, the most logical way to do
this is copy-on-write.
Of course, wanting copy-on-write for methods doesn't mean that we can't do
something different for indexing (although what about methods that are
basically equivalent to an indexing operation?). But, from an
implementation point of view, I am not sure it would actually be
technically possible to sometimes do copy-on-write, and sometimes not
(probably possible in theory, but a lot more complicated; see also one of
my previous answers (
https://mail.python.org/pipermail/pandas-dev/2021-July/001368.html) on
having a single column as view).
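
To make the copy-on-write idea for methods a bit more concrete, here is a
toy sketch in pure Python. It is not how pandas implements it: the
`CowColumn` class, its `shallow_copy` method and the shared counter are
hypothetical names invented for this illustration. The point is only that
a "new" object can share data for free, and the copy is paid by whoever
writes first.

```python
class CowColumn:
    """Toy copy-on-write container (hypothetical, not pandas internals)."""

    def __init__(self, values, sharers=None):
        self._values = values
        # One-element list used as a mutable counter, shared by every
        # wrapper that currently points at the same underlying values.
        self._sharers = sharers if sharers is not None else [0]
        self._sharers[0] += 1

    def shallow_copy(self):
        # Cheap: returns a new object without copying the data.
        return CowColumn(self._values, self._sharers)

    def __getitem__(self, i):
        return self._values[i]

    def __setitem__(self, i, value):
        if self._sharers[0] > 1:
            # The data is shared: detach with a real copy before writing.
            self._sharers[0] -= 1
            self._values = list(self._values)
            self._sharers = [1]
        self._values[i] = value


parent = CowColumn([1, 2, 3])
child = parent.shallow_copy()
shared_before_write = child._values is parent._values

child[0] = 100   # triggers the copy; parent is not modified
shared_after_write = child._values is parent._values
```

Once the child has detached, the parent is the only remaining user of its
data, so a later write to the parent happens in place without a copy.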

3) Personally, I don't think that I have ever (or at least not often) had
the use case where I intentionally wanted to modify a parent dataframe by
modifying a subsetted child dataframe (explicit chained indexing aside). So
also from that point of view, I find the "always (if possible) modify the
original data" option less interesting than the potential performance
benefits and the IMO simpler rule of never modifying.

Joris