From simonjayhawkins at gmail.com  Mon Jul  5 06:25:59 2021
From: simonjayhawkins at gmail.com (Simon Hawkins)
Date: Mon, 5 Jul 2021 11:25:59 +0100
Subject: [Pandas-dev] ANN: pandas v1.3.0
Message-ID: <CAFs9qGsSYobAVAOrsgYAkboBTNvTmcp2J94Yt_3unVMwra_v4A@mail.gmail.com>

Hi all,

The pandas team is pleased to announce the release of pandas 1.3.0.

This release includes some new features, bug fixes, and performance
improvements. We recommend that all users upgrade to this version.

See the release notes
<https://pandas.pydata.org/pandas-docs/version/1.3/whatsnew/v1.3.0.html> for
a list of all the changes.


The release can be installed from PyPI

    python -m pip install --upgrade pandas==1.3.0

Or from conda-forge

    conda install -c conda-forge pandas==1.3.0


Please report any issues with the release on the pandas issue tracker
<https://github.com/pandas-dev/pandas/issues>.

Thanks to all the contributors who made this release possible.

From jorisvandenbossche at gmail.com  Sun Jul 11 18:58:00 2021
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Mon, 12 Jul 2021 00:58:00 +0200
Subject: [Pandas-dev] Proposal for consistent,
 clear copy/view semantics in pandas with Copy-on-Write
Message-ID: <CALQtMBYqKg+HHb5pJZmYG6hbUg9M5f0b1ycPWYda_EvsWyHe=A@mail.gmail.com>

*(a.k.a. getting rid of the SettingWithCopyWarning)*

Hi all,

As you are probably aware, it's not always straightforward to understand
the copy or view semantics of indexing methods in pandas: when do you get a
view and when not, why do you get a SettingWithCopyWarning, and how do you
get rid of it?
It's also something that has already been discussed regularly (e.g. the
discussion and implementation from 2015 started by Nick Eubank at gh-10954
<https://github.com/pandas-dev/pandas/issues/10954>). Last year, we again
started to discuss this, which is tracked at
https://github.com/pandas-dev/pandas/issues/36195. Based on those
discussions, I have a concrete proposal to change the copy/view semantics
of pandas.

Short summary of the proposal:

   1. The result of *any* indexing operation (subsetting a DataFrame or
   Series in any way) or any method returning a new DataFrame, always *behaves
   as if it were* a copy in terms of user API.
   2. We implement Copy-on-Write. This way, we can actually use views as
   much as possible under the hood, while ensuring the user API behaves as a
   copy.

This addresses multiple aspects: 1) a clear and consistent user API (a
clear rule: *any* subset or returned series/dataframe is *always* a copy of
the original, and thus never modifies the original) and 2) improving
performance by avoiding excessive copies (e.g. a chained method workflow
would no longer return an actual data copy at each step).
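
To make the two points above concrete, here is a minimal pure-Python sketch
of copy-on-write. It is only an illustration under stated assumptions:
CowList, shares_buffer_with and tolist are hypothetical names, not pandas
API, and the sketch does not reflect how the proof-of-concept PR is
implemented. A subset shares the parent's buffer until one of the two is
written to, at which point the writer first takes its own copy.

```python
class CowList:
    """Toy container: slices share the parent's buffer until someone writes."""

    def __init__(self, values):
        self._buf = list(values)          # underlying storage
        self._start, self._stop = 0, len(self._buf)
        self._refs = [1]                  # shared count of objects using _buf

    def __len__(self):
        return self._stop - self._start

    def _pos(self, i):
        # Translate a (possibly negative) index into a buffer position.
        if not -len(self) <= i < len(self):
            raise IndexError(i)
        return self._start + (i % len(self))

    def __getitem__(self, key):
        if isinstance(key, slice):
            start, stop, step = key.indices(len(self))
            if step != 1:
                raise NotImplementedError("toy example: contiguous slices only")
            child = CowList.__new__(CowList)
            child._buf = self._buf        # no data copied here
            child._refs = self._refs
            child._refs[0] += 1
            child._start = self._start + start
            child._stop = self._start + max(stop, start)
            return child
        return self._buf[self._pos(key)]

    def __setitem__(self, i, value):
        if self._refs[0] > 1:
            # Buffer is shared: detach by copying our own window first.
            self._refs[0] -= 1
            self._buf = self._buf[self._start:self._stop]
            self._start, self._stop = 0, len(self._buf)
            self._refs = [1]
        self._buf[self._pos(i)] = value

    def shares_buffer_with(self, other):
        return self._buf is other._buf

    def tolist(self):
        return self._buf[self._start:self._stop]


parent = CowList([10, 20, 30, 40, 50])
subset = parent[1:4]
shared_before_write = subset.shares_buffer_with(parent)  # still a view
subset[0] = 99                                           # triggers the copy
shared_after_write = subset.shares_buffer_with(parent)
```

From the user's point of view the subset always behaves like a copy (point
1), while the actual copy is delayed until the first write (point 2).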

Longer version of this proposal:
https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit?usp=sharing
Proof-of-concept implementation:
https://github.com/pandas-dev/pandas/pull/41878
GitHub issue with relevant discussion:
https://github.com/pandas-dev/pandas/issues/36195

*Since this would be a change with a large impact on users, I think it is
important to get broad feedback on this*. So comments, thoughts, concerns,
ideas etc. are very welcome (you can comment on the Google doc, reply to
this email, or comment on the GitHub issue).

Best,
Joris

From garcia.marc at gmail.com  Mon Jul 12 13:42:13 2021
From: garcia.marc at gmail.com (Marc Garcia)
Date: Mon, 12 Jul 2021 11:42:13 -0600
Subject: [Pandas-dev] Proposal for consistent,
 clear copy/view semantics in pandas with Copy-on-Write
In-Reply-To: <CALQtMBYqKg+HHb5pJZmYG6hbUg9M5f0b1ycPWYda_EvsWyHe=A@mail.gmail.com>
References: <CALQtMBYqKg+HHb5pJZmYG6hbUg9M5f0b1ycPWYda_EvsWyHe=A@mail.gmail.com>
Message-ID: <CAEk5N5twcR0-4XtsqRjDTjr=Rm_N_JUXm0pGLFasFDNzJa3ExQ@mail.gmail.com>

+1 on the approach of the proposal, and also +1 to release in a major
version, and not raise deprecation warnings.

Thanks for working on this, it'll make users' lives much easier.


From wesmckinn at gmail.com  Mon Jul 12 14:28:56 2021
From: wesmckinn at gmail.com (Wes McKinney)
Date: Mon, 12 Jul 2021 13:28:56 -0500
Subject: [Pandas-dev] Proposal for consistent,
 clear copy/view semantics in pandas with Copy-on-Write
In-Reply-To: <CAEk5N5twcR0-4XtsqRjDTjr=Rm_N_JUXm0pGLFasFDNzJa3ExQ@mail.gmail.com>
References: <CALQtMBYqKg+HHb5pJZmYG6hbUg9M5f0b1ycPWYda_EvsWyHe=A@mail.gmail.com>
 <CAEk5N5twcR0-4XtsqRjDTjr=Rm_N_JUXm0pGLFasFDNzJa3ExQ@mail.gmail.com>
Message-ID: <CAJPUwMCEZ5KJG3qXyHx6dqLTujTEhA2=itjMhCkCbz77G6ioKw@mail.gmail.com>

I think this is an important initiative, and I indeed wish we had
designed around copy-on-write ideas from the very beginning.

As one protection against improper mutation of views, it may be
necessary to introduce defensive copies into APIs that expose internal
data, e.g. NumPy arrays that are slices of the parent, or that have had
slices taken of them.


From shoyer at gmail.com  Mon Jul 12 23:55:01 2021
From: shoyer at gmail.com (Stephan Hoyer)
Date: Mon, 12 Jul 2021 20:55:01 -0700
Subject: [Pandas-dev] Proposal for consistent,
 clear copy/view semantics in pandas with Copy-on-Write
In-Reply-To: <CAJPUwMCEZ5KJG3qXyHx6dqLTujTEhA2=itjMhCkCbz77G6ioKw@mail.gmail.com>
References: <CALQtMBYqKg+HHb5pJZmYG6hbUg9M5f0b1ycPWYda_EvsWyHe=A@mail.gmail.com>
 <CAEk5N5twcR0-4XtsqRjDTjr=Rm_N_JUXm0pGLFasFDNzJa3ExQ@mail.gmail.com>
 <CAJPUwMCEZ5KJG3qXyHx6dqLTujTEhA2=itjMhCkCbz77G6ioKw@mail.gmail.com>
Message-ID: <CAEQ_TvcnzoXNtocLxtDwWROGo2qqwyzhRi9EdV1TU7_N7HBfpg@mail.gmail.com>

I agree with Wes and Marc. This is an important change for the long-term
future of pandas.


From jorisvandenbossche at gmail.com  Fri Jul 16 08:15:29 2021
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Fri, 16 Jul 2021 14:15:29 +0200
Subject: [Pandas-dev] Proposal for consistent,
 clear copy/view semantics in pandas with Copy-on-Write
In-Reply-To: <CAEQ_TvcnzoXNtocLxtDwWROGo2qqwyzhRi9EdV1TU7_N7HBfpg@mail.gmail.com>
References: <CALQtMBYqKg+HHb5pJZmYG6hbUg9M5f0b1ycPWYda_EvsWyHe=A@mail.gmail.com>
 <CAEk5N5twcR0-4XtsqRjDTjr=Rm_N_JUXm0pGLFasFDNzJa3ExQ@mail.gmail.com>
 <CAJPUwMCEZ5KJG3qXyHx6dqLTujTEhA2=itjMhCkCbz77G6ioKw@mail.gmail.com>
 <CAEQ_TvcnzoXNtocLxtDwWROGo2qqwyzhRi9EdV1TU7_N7HBfpg@mail.gmail.com>
Message-ID: <CALQtMBaJCK2-hLmghpmwEp4WrqsjPwC5WavuyFA_nUYoc02Rfw@mail.gmail.com>

Thanks for the feedback.

Regarding protection against improper mutation of views via numpy (or in
general arrays), that's indeed a risk. Since this is Python, a user will
always find some (private) way to incorrectly mutate data without
triggering the copy-on-write paths, but there are indeed some ways we could
try to prevent that. Listing the possible ways to get the "array" data from
DataFrame/Series objects:

* Series.values / Series.array -> returning a numpy array or pandas
ExtensionArray. These currently return the stored data as mutable arrays,
either as is or as views. Mutating such an array wouldn't trigger
Copy-on-Write, which is managed at the DataFrame/Series level. To prevent
users from doing this, we could return those arrays as "read-only" (to
avoid always doing a defensive copy here)?

* Series.to_numpy() -> returning a numpy array. This method has a `copy`
keyword whose default is currently False. We could either make this
copy=True by default, or, similarly to the above, make the result read-only
by default, leaving the copy=True/False options to choose from explicitly.

* DataFrame.to_numpy() / DataFrame.values -> returning a 2D numpy array,
which in general is always a copy (made by concatenating multiple 1D
arrays). Only in the 1-column case could this still be a view. For
simplicity, I would make this case return a copy as well (if the user wants
a view, they can get the Series). Alternatively, this case could follow the
logic of Series.to_numpy above.
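
As a rough sketch of the two options discussed above (read-only view versus
defensive copy), using plain NumPy only. The helper names values_readonly
and values_copy are hypothetical, and `data` merely stands in for the
buffer a Series would hold; this is not how pandas exposes its data today.

```python
import numpy as np

# Stand-in for the array stored inside a Series (an assumption for this sketch).
data = np.array([1.0, 2.0, 3.0])


def values_readonly(buf):
    """Return a view on buf that refuses writes; no data is copied."""
    view = buf.view()
    view.flags.writeable = False
    return view


def values_copy(buf):
    """Return a defensive copy; writes never reach buf."""
    return buf.copy()


ro = values_readonly(data)
try:
    ro[0] = 100.0
    write_rejected = False
except ValueError:
    # NumPy raises ValueError when assigning into a read-only array.
    write_rejected = True

cp = values_copy(data)
cp[0] = 100.0  # allowed, but only changes the copy
```

The read-only view avoids the cost of a copy, at the price of an error for
code that used to mutate the result in place. Note that a determined user
can still flip the writeable flag back on such a view, so this is a guard
against accidents rather than a hard guarantee.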



From jorisvandenbossche at gmail.com  Fri Jul 16 08:23:35 2021
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Fri, 16 Jul 2021 14:23:35 +0200
Subject: [Pandas-dev] Proposal for consistent,
 clear copy/view semantics in pandas with Copy-on-Write
In-Reply-To: <CALQtMBYqKg+HHb5pJZmYG6hbUg9M5f0b1ycPWYda_EvsWyHe=A@mail.gmail.com>
References: <CALQtMBYqKg+HHb5pJZmYG6hbUg9M5f0b1ycPWYda_EvsWyHe=A@mail.gmail.com>
Message-ID: <CALQtMBbY7t7F3eddhT5DN4UhNdr7GVzyKTWr4FxcdFJPiNw2PA@mail.gmail.com>

On Mon, 12 Jul 2021 at 00:58, Joris Van den Bossche <
jorisvandenbossche at gmail.com> wrote:

> Short summary of the proposal:
>
>    1. The result of *any* indexing operation (subsetting a DataFrame or
>    Series in any way) or any method returning a new DataFrame, always *behaves
>    as if it were* a copy in terms of user API.
>
>  To explicitly call out the column-as-Series case (since this is a typical
case that right now *always* is a view): "any" indexing operation thus also
includes accessing a DataFrame column as a Series (or slicing a Series).

So something like s = df["col"] and then mutating s will no longer update df.
Similarly for series_subset = series[1:5], mutating series_subset will no
longer update series.
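
For illustration, the behaviour described above can already be obtained in
current pandas by taking an explicit copy, as in the sketch below. The
column names and values are made up for the example, and .iloc is used for
the positional slice. Under the proposal the same outcome would hold
without the explicit .copy() calls, and without copying data up front.

```python
import pandas as pd

df = pd.DataFrame({"col": [1, 2, 3], "other": [4, 5, 6]})

# Explicit copy: s is independent of df regardless of pandas version.
s = df["col"].copy()
s.iloc[0] = 100

series = pd.Series([10, 20, 30, 40, 50, 60])
series_subset = series.iloc[1:5].copy()
series_subset.iloc[0] = -1
```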

From jbrockmendel at gmail.com  Fri Jul 16 12:02:19 2021
From: jbrockmendel at gmail.com (Brock Mendel)
Date: Fri, 16 Jul 2021 09:02:19 -0700
Subject: [Pandas-dev] Proposal for consistent,
 clear copy/view semantics in pandas with Copy-on-Write
In-Reply-To: <CALQtMBbY7t7F3eddhT5DN4UhNdr7GVzyKTWr4FxcdFJPiNw2PA@mail.gmail.com>
References: <CALQtMBYqKg+HHb5pJZmYG6hbUg9M5f0b1ycPWYda_EvsWyHe=A@mail.gmail.com>
 <CALQtMBbY7t7F3eddhT5DN4UhNdr7GVzyKTWr4FxcdFJPiNw2PA@mail.gmail.com>
Message-ID: <CAKf8g9RjaC8b2XHDELHa452pQ+JUVNg4aDDXTKPOJANBTWQZjQ@mail.gmail.com>

[xposting from https://github.com/pandas-dev/pandas/issues/36195]

I'm glad there is a proof of concept to help clarify what this looks like.

I do not like the fact that nothing can ever be "just a view" with these
semantics, including series[::-1], frame[col], frame[:]. Users reasonably
expect numpy semantics for these.

We should revisit the alternative "clear/simple rules" approach that is
"indexing on columns always gives a view" (
https://github.com/pandas-dev/pandas/pull/33597). This is simpler to
explain/grok, simpler to implement, and not dependent on BlockManager vs
ArrayManager.


From tom.augspurger88 at gmail.com  Fri Jul 16 12:28:23 2021
From: tom.augspurger88 at gmail.com (Tom Augspurger)
Date: Fri, 16 Jul 2021 11:28:23 -0500
Subject: [Pandas-dev] Proposal for consistent,
 clear copy/view semantics in pandas with Copy-on-Write
In-Reply-To: <CAKf8g9RjaC8b2XHDELHa452pQ+JUVNg4aDDXTKPOJANBTWQZjQ@mail.gmail.com>
References: <CALQtMBYqKg+HHb5pJZmYG6hbUg9M5f0b1ycPWYda_EvsWyHe=A@mail.gmail.com>
 <CALQtMBbY7t7F3eddhT5DN4UhNdr7GVzyKTWr4FxcdFJPiNw2PA@mail.gmail.com>
 <CAKf8g9RjaC8b2XHDELHa452pQ+JUVNg4aDDXTKPOJANBTWQZjQ@mail.gmail.com>
Message-ID: <CAE1aY-=j8NB=+G7XqreLvKBjVPvNBpyD0XNX8JypHphHob9MJg@mail.gmail.com>

On Fri, Jul 16, 2021 at 11:04 AM Brock Mendel <jbrockmendel at gmail.com>
wrote:

> [xposting from https://github.com/pandas-dev/pandas/issues/36195]
>
> I'm glad there is a proof of concept to help clarify what this looks like.
>
> I do not like the fact that nothing can ever be "just a view" with these
> semantics, including series[::-1], frame[col], frame[:]. Users reasonably
> expect numpy semantics for these.
>

I wonder if we can validate what users (new and old) *actually* expect?
Users coming from R, which IIRC implements Copy on Write for matrices,
might be OK with indexing always being (behaving like) a copy.
I'm not sure what users coming from NumPy would expect, since I don't know
how many NumPy users really understand *a.)* when a NumPy slice is a view
or copy, and *b.)* how a pandas indexing operation translates to a NumPy
slice.
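
As a reminder of point a.), here is a small NumPy-only sketch of which
indexing forms give a view and which give a copy. The variable names are
just for illustration; how a given pandas indexing operation maps onto
these cases (point b.) is exactly the part that is hard to predict.

```python
import numpy as np

arr = np.arange(6)

basic = arr[1:4]           # basic slice: a view on arr
reversed_view = arr[::-1]  # a negative step is still a view
fancy = arr[[1, 2, 3]]     # integer-array indexing: a copy
masked = arr[arr > 2]      # boolean mask: a copy

basic[0] = -1              # writes through to arr
fancy[0] = -2              # only changes the copy
```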


> We should revisit the alternative "clear/simple rules" approach that is
> "indexing on columns always gives a view" (
> https://github.com/pandas-dev/pandas/pull/33597). This is simpler to
> explain/grok, simpler to implement, and not dependent on BlockManager vs
> ArrayManager.
>
> On Fri, Jul 16, 2021 at 5:26 AM Joris Van den Bossche <
> jorisvandenbossche at gmail.com> wrote:
>
>>
>>
>> On Mon, 12 Jul 2021 at 00:58, Joris Van den Bossche <
>> jorisvandenbossche at gmail.com> wrote:
>>
>>> Short summary of the proposal:
>>>
>>>    1. The result of *any* indexing operation (subsetting a DataFrame or
>>>    Series in any way) or any method returning a new DataFrame, always *behaves
>>>    as if it were* a copy in terms of user API.
>>>
>>>  To explicitly call out the column-as-Series case (since this is a
>> typical case that right now *always* is a view): "any" indexing
>> operation thus also included accessing a DataFrame column as a Series (or
>> slicing a Series).
>>
>> So something like s = df["col"] and then mutating s will no longer
>> update df. Similarly for series_subset = series[1:5], mutating
>> series_subset will no longer update s.
>> _______________________________________________
>> Pandas-dev mailing list
>> Pandas-dev at python.org
>> https://mail.python.org/mailman/listinfo/pandas-dev
>>
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev
>

From shoyer at gmail.com  Fri Jul 16 12:58:17 2021
From: shoyer at gmail.com (Stephan Hoyer)
Date: Fri, 16 Jul 2021 09:58:17 -0700
Subject: [Pandas-dev] Proposal for consistent,
 clear copy/view semantics in pandas with Copy-on-Write
In-Reply-To: <CAKf8g9RjaC8b2XHDELHa452pQ+JUVNg4aDDXTKPOJANBTWQZjQ@mail.gmail.com>
References: <CALQtMBYqKg+HHb5pJZmYG6hbUg9M5f0b1ycPWYda_EvsWyHe=A@mail.gmail.com>
 <CALQtMBbY7t7F3eddhT5DN4UhNdr7GVzyKTWr4FxcdFJPiNw2PA@mail.gmail.com>
 <CAKf8g9RjaC8b2XHDELHa452pQ+JUVNg4aDDXTKPOJANBTWQZjQ@mail.gmail.com>
Message-ID: <CAEQ_TvcZ55BXfh9J--6Su_iLrdy2BL=juzrkFZGhvf1tQCcrbQ@mail.gmail.com>

On Fri, Jul 16, 2021 at 9:04 AM Brock Mendel <jbrockmendel at gmail.com> wrote:

> I do not like the fact that nothing can ever be "just a view" with these
> semantics, including series[::-1], frame[col], frame[:]. Users reasonably
> expect numpy semantics for these.
>
> We should revisit the alternative "clear/simple rules" approach that is
> "indexing on columns always gives a view" (
> https://github.com/pandas-dev/pandas/pull/33597). This is simpler to
> explain/grok, simpler to implement, and not dependent on BlockManager vs
> ArrayManager.
>

I don't know if it is worth the trouble for complex multi-column
selections, but I do see the appeal here.

A simpler variant would be to make indexing out a single Series from a
DataFrame return a view, with everything else doing copy on write. Then the
existing pattern df.column_one[:] = ... would still work.


>
> On Fri, Jul 16, 2021 at 5:26 AM Joris Van den Bossche <
> jorisvandenbossche at gmail.com> wrote:
>
>>
>>
>> On Mon, 12 Jul 2021 at 00:58, Joris Van den Bossche <
>> jorisvandenbossche at gmail.com> wrote:
>>
>>> Short summary of the proposal:
>>>
>>>    1. The result of *any* indexing operation (subsetting a DataFrame or
>>>    Series in any way) or any method returning a new DataFrame, always *behaves
>>>    as if it were* a copy in terms of user API.
>>>
>>>  To explicitly call out the column-as-Series case (since this is a
>> typical case that right now *always* is a view): "any" indexing
>> operation thus also included accessing a DataFrame column as a Series (or
>> slicing a Series).
>>
>> So something like s = df["col"] and then mutating s will no longer
>> update df. Similarly for series_subset = series[1:5], mutating
>> series_subset will no longer update s.
>> _______________________________________________
>> Pandas-dev mailing list
>> Pandas-dev at python.org
>> https://mail.python.org/mailman/listinfo/pandas-dev
>>
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev
>

From irv at princeton.com  Fri Jul 16 14:49:46 2021
From: irv at princeton.com (Irv Lustig)
Date: Fri, 16 Jul 2021 14:49:46 -0400
Subject: [Pandas-dev] Proposal for consistent,
 clear copy/view semantics in pandas with Copy-on-Write
In-Reply-To: <mailman.6009.1626454710.4167.pandas-dev@python.org>
References: <mailman.6009.1626454710.4167.pandas-dev@python.org>
Message-ID: <CAG4hgvMLG8d-jZ29ZQ5JHE91zszj=zyUhQOQGivxYLTZ2FApFg@mail.gmail.com>

Tom Augspurger <tom.augspurger88 at gmail.com> wrote:

> I wonder if we can validate what users (new and old) *actually* expect?
> Users coming from R, which IIRC implements Copy on Write for matrices,
> might be OK with indexing always being (behaving like) a copy.
> I'm not sure what users coming from NumPy would expect, since I don't know
> how many NumPy users really understand *a.)* when a NumPy slice is a view
> or copy, and *b.)* how a pandas indexing operation translates to a NumPy
> slice.
>
>
IMHO, we should concentrate on the "new" users.  For my team, there is no
numpy or R background.  They learn pandas, and what pandas does needs to be
really clear in behavior and documentation.  I would also hazard a guess
that most pandas users are like that - pandas is the first tool they see,
not numpy or R.

The places where I think confusion could happen are things like this with a
DataFrame df :

   1. s = df["a"]
   2. s.iloc[3:5] = [1, 2, 3]
   3. df["a"].iloc[3:5] = [1, 2, 3]
   4. df["b"] = df["a"]
   5. df["b"].iloc[3:5] = [4, 5, 6]
   6. s2 = df["b"]
   7. df["c"] = s2
   8. s2.iloc[3:5] = [7, 8, 9]

As I understand it (please correct me if I'm wrong), these lines would be
interpreted as follows with the current proposal:

   1. Creates a view into the DataFrame df.  No copying is done at all
   2. Modifies the series s and the underlying DataFrame df.
   (copy-on-write)
   3. Modifies the dataframe
   4. Copies the series from "a" to "b"
   5. Modifies "b" in the DataFrame, but not "a"
   6. Create a view into the DataFrame df.  No copying is done at all.
   7. Copies the series from "b" to "c"
   8. Modifies s2, which modifies "b", but NOT "c"

I think the challenge is explaining the sequence 6,7,8 above in comparison
to the other sequences.

-Irv

From jorisvandenbossche at gmail.com  Sat Jul 17 11:16:32 2021
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Sat, 17 Jul 2021 17:16:32 +0200
Subject: [Pandas-dev] Proposal for consistent,
 clear copy/view semantics in pandas with Copy-on-Write
In-Reply-To: <CAEQ_TvcZ55BXfh9J--6Su_iLrdy2BL=juzrkFZGhvf1tQCcrbQ@mail.gmail.com>
References: <CALQtMBYqKg+HHb5pJZmYG6hbUg9M5f0b1ycPWYda_EvsWyHe=A@mail.gmail.com>
 <CALQtMBbY7t7F3eddhT5DN4UhNdr7GVzyKTWr4FxcdFJPiNw2PA@mail.gmail.com>
 <CAKf8g9RjaC8b2XHDELHa452pQ+JUVNg4aDDXTKPOJANBTWQZjQ@mail.gmail.com>
 <CAEQ_TvcZ55BXfh9J--6Su_iLrdy2BL=juzrkFZGhvf1tQCcrbQ@mail.gmail.com>
Message-ID: <CALQtMBY=yCMvdUNE4yi_3Z1FG0vrzZT2CxSQYYN-9B+dHy2ruQ@mail.gmail.com>

On Fri, 16 Jul 2021 at 18:58, Stephan Hoyer <shoyer at gmail.com> wrote:

> On Fri, Jul 16, 2021 at 9:04 AM Brock Mendel <jbrockmendel at gmail.com>
> wrote:
>
>> I do not like the fact that nothing can ever be "just a view" with these
>> semantics, including series[::-1], frame[col], frame[:]. Users reasonably
>> expect numpy semantics for these.
>>

I am personally not sure what "users" in general expect for those (as
also mentioned by Tom and Irv already, depending on their background, they
might expect different things).
For example, for a user that knows basic Python, they could actually expect
all those examples to give a copy since `a_list[:]` is a typical way to
make a copy of a list.
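
To make the contrast concrete, here is a minimal sketch (the variable names
are only illustrative) of how slicing copies for a built-in list but gives a
view for a NumPy array, which is where the differing expectations come from:

```python
import numpy as np

# Built-in list: slicing always creates a new list (a shallow copy).
a_list = [1, 2, 3, 4]
list_slice = a_list[:]
list_slice[0] = 99          # a_list is unaffected

# NumPy array: basic slicing returns a view on the same memory.
arr = np.array([1, 2, 3, 4])
arr_slice = arr[:]
arr_slice[0] = 99           # arr[0] is now 99 as well

print(a_list[0], arr[0])
print(np.shares_memory(arr, arr_slice))
```

A user who only knows the first half would expect a copy, a user who knows
the second half would expect a view.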

(it might be interesting to reach out to educators (who might have more
experience with expectations/typical errors of novice users) or to do some
kind of experiment on this topic)

Personally, I cannot remember ever relying on the mutability aspect
of e.g. `series[1:3]` or `frame[:]` being a view. I think there are generally
2 reasons for users caring about a view: 1) for performance (less copying)
and 2) for being able to mutate the view with the explicit goal of mutating
the parent (and not as an irrelevant side-effect).
I think the first reason is by far the most common one (but that's my
subjective opinion from my experience using pandas, so that can certainly
vary), and in the current proposal, all of the examples mentioned will be
actual views under the hood (and thus cover this first reason).

The only case where I know I explicitly rely on this is with chained
assignment (e.g. `frame[col][1:3] = ..`). That's certainly a very important
use case (and probably the most impacted usage pattern with the current
proposal), but it's also a case where 1) there is a clear alternative:
don't use chained assignment, but do it in one step, e.g.
`frame.loc[1:3, col] = ..` (some corner cases of mixed
positional/label-based indexing aside, for which we should find an
alternative), and 2) we might be able to detect this and raise an
informative error message (specifically for chained assignment).

I think it can be easier to explain "chained assignment never works" than
"chained assignment only works if first selecting the column(s)" (depending
on the exact rules).
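
As a rough illustration of the one-step alternative (the frame and its
column names "col" and "other" are made up for this sketch), both the
label-based and the positional form can be written as a single indexing
call on the frame itself:

```python
import pandas as pd

frame = pd.DataFrame({"col": [10, 20, 30, 40, 50],
                      "other": [1, 2, 3, 4, 5]})

# Chained form discussed above (two indexing steps):
#     frame["col"][1:3] = 0

# Label-based, one step. Note that .loc slices include the end label,
# so 1:3 selects the rows labelled 1, 2 and 3.
frame.loc[1:3, "col"] = 0

# Positional, one step: look up the column position and use .iloc,
# which excludes the end position (rows 1 and 2 only).
frame.iloc[1:3, frame.columns.get_loc("other")] = 0

print(frame)
```

The different end-point handling of `.loc` and `.iloc` is one example of
the mixed positional/label-based corner cases that need some care when
rewriting chained assignments.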


>> We should revisit the alternative "clear/simple rules" approach that is
>> "indexing on columns always gives a view" (
>> https://github.com/pandas-dev/pandas/pull/33597). This is simpler to
>> explain/grok, simpler to implement
>>
>
> I don't know if it is worth the trouble for complex multi-column
> selections, but I do see the appeal here.
>
> A simpler variant would be to make indexing out a single Series from a
> DataFrame return a view, with everything else doing copy on write. Then the
> existing pattern df.column_one[:] = ... would still work.
>

I was initially thinking about this as well. In the end, I didn't (yet) try
to implement this, because while thinking it through, it seemed that this
might give quite some tricky cases. Consider the following example:

df = pd.DataFrame(..)
df_subset = df[["col1", "col2"]]
s1 = df["col1"]
s1_subset = s1[0:3]
# modifying s1 should modify df, but not df_subset and s1_subset?
s1[0] = 0

If we take "only accessing a single Series from a DataFrame is a view,
everything else uses copy-on-write", that gives rise to questions like the
above, where some parents/children get modified and some do not.
This is both harder to explain to users and harder to implement. For the
implementation of the proof-of-concept, the copy-on-write happens "locally"
in the series/dataframe that gets modified (meaning: when modifying a given
object, its internal array data first gets copied and replaced *if* the
object is viewing another or is being viewed by another object). While in
the above case, modifying a given object would need to trigger a copy in
other (potentially many) objects, and not in the object being modified.
It's probably possible to implement this, but certainly harder/trickier to
do.
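
The "local" copy-on-write idea can be sketched with a toy container. This
is only an illustration under my own assumptions, not the proof-of-concept
code: `CowArray` and `_Buffer` are hypothetical names, a plain list stands
in for the array data, and only whole-array views are modelled. Weak
references track which objects currently share a buffer, and the object
being written to copies its own data first if anybody else shares it:

```python
import weakref


class _Buffer:
    """Shared storage plus weak references to the objects using it."""

    def __init__(self, values):
        self.values = list(values)
        self.owners = weakref.WeakSet()


class CowArray:
    """Toy 1-D container with copy-on-write views (hypothetical)."""

    def __init__(self, values):
        self._attach(_Buffer(values))

    def _attach(self, buf):
        self._buf = buf
        buf.owners.add(self)

    def view(self):
        # No data is copied here: the new object shares the buffer.
        other = CowArray.__new__(CowArray)
        other._attach(self._buf)
        return other

    def __getitem__(self, i):
        return self._buf.values[i]

    def __setitem__(self, i, value):
        # The copy happens "locally", in the object being modified, and
        # only if the buffer is viewed by / is viewing another object.
        if len(self._buf.owners) > 1:
            old = self._buf
            old.owners.discard(self)
            self._attach(_Buffer(old.values))
        self._buf.values[i] = value

    def shares_memory_with(self, other):
        return self._buf is other._buf


parent = CowArray([1, 2, 3])
child = parent.view()
shared_before = parent.shares_memory_with(child)

child[0] = 99                     # triggers the copy inside `child`
shared_after = parent.shares_memory_with(child)

parent[1] = 5                     # parent is now the only owner: no copy
print(shared_before, shared_after, parent[0], child[0])
```

With the "single Series is a real view" variant, a write to one object
would instead have to find and copy the data of other objects, which is
what makes it trickier.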



From garcia.marc at gmail.com  Sat Jul 17 12:12:32 2021
From: garcia.marc at gmail.com (Marc Garcia)
Date: Sat, 17 Jul 2021 10:12:32 -0600
Subject: [Pandas-dev] Proposal for consistent,
 clear copy/view semantics in pandas with Copy-on-Write
In-Reply-To: <CALQtMBY=yCMvdUNE4yi_3Z1FG0vrzZT2CxSQYYN-9B+dHy2ruQ@mail.gmail.com>
References: <CALQtMBYqKg+HHb5pJZmYG6hbUg9M5f0b1ycPWYda_EvsWyHe=A@mail.gmail.com>
 <CALQtMBbY7t7F3eddhT5DN4UhNdr7GVzyKTWr4FxcdFJPiNw2PA@mail.gmail.com>
 <CAKf8g9RjaC8b2XHDELHa452pQ+JUVNg4aDDXTKPOJANBTWQZjQ@mail.gmail.com>
 <CAEQ_TvcZ55BXfh9J--6Su_iLrdy2BL=juzrkFZGhvf1tQCcrbQ@mail.gmail.com>
 <CALQtMBY=yCMvdUNE4yi_3Z1FG0vrzZT2CxSQYYN-9B+dHy2ruQ@mail.gmail.com>
Message-ID: <CAEk5N5uY0i+9vyGST62tpXZOmr5nJ0_UtaFZ4Ey0mpxQmq2zEA@mail.gmail.com>

Based on my experience (not sure how biased it is), modifying dataframes
with something like `df[col][1:3] = ...` (or the equivalent with `.loc`)
is rare, except with boolean arrays. In my experience, when the values of a
dataframe column are changed, it is far more common to use
`df[col] = df[col].str.upper()`, `df[col] = df[col].fillna(0)`...

While I'm personally happy with Joris's proposal, I see two other options
that could complement or replace it:

Option 1) Deprecate assigning to a subset of rows, and only allow assigning
to whole columns.  Something like `df[col][1:3] = ...` could be replaced by,
for example, `df[col] = df[col].mask(slice(1, 3), ...)`.  `mask` and
`where` already support boolean arrays, so support for slices would need to
be added, and they'd be the only way to replace a subset of values. I think
that makes the problem narrower, and easier to understand for users. The
main thing to decide and be clear about is what happens if the dataframe is
a subset of another one:

```
df2 = df[cond]
df2[col] = df2[col].str.upper()
```
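
Since `mask` does not accept slices today (the `slice(1, 3)` form above is
a suggested extension, not an existing signature), option 1 can be
approximated with a boolean array built from the positions. A minimal
sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col": ["a", "b", "c", "d"]})

# Boolean array that is True for positions 1 and 2 (i.e. 1:3).
rows = np.zeros(len(df), dtype=bool)
rows[1:3] = True

# Replace the selected values by assigning a whole new column,
# instead of writing into a subset of the existing one.
df["col"] = df["col"].mask(rows, "X")

print(df["col"].tolist())
```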

Option 2) If assigning with the current syntax (`df[col][1:3] = ...` or the
`.loc` equivalent) is something we want to keep (I wouldn't if we move in
this direction), maybe it could be moved to a `DataFrame` subclass. The
main dataframe class would then behave as in option 1, so expectations are
much easier to manage. But users who really want to assign with indexing can
still do so, knowing that having a mutable dataframe comes at a cost
(copies, more complex behavior...). The `MutableDataFrame` could be in
pandas, or a third-party extension.

```
df_mutable = df.to_mutable()
df_mutable[col][1:3] = ...
```


From jorisvandenbossche at gmail.com  Sat Jul 17 14:51:39 2021
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Sat, 17 Jul 2021 20:51:39 +0200
Subject: [Pandas-dev] Proposal for consistent,
 clear copy/view semantics in pandas with Copy-on-Write
Message-ID: <CALQtMBasVgZJzit2Q1TT2wttZN_M1VLifX0SxGD73aWF9aYjog@mail.gmail.com>

On Fri, 16 Jul 2021 at 20:50, Irv Lustig <irv at princeton.com> wrote:

>
> Tom Augspurger <tom.augspurger88 at gmail.com> wrote:
>
>> I wonder if we can validate what users (new and old) *actually* expect?
>> Users coming from R, which IIRC implements Copy on Write for matrices,
>> might be OK with indexing always being (behaving like) a copy.
>> I'm not sure what users coming from NumPy would expect, since I don't know
>> how many NumPy users really understand *a.)* when a NumPy slice is a
>> view or copy, and *b.)* how a pandas indexing operation translates to a
>> NumPy slice.
>>
>>
> IMHO, we should concentrate on the "new" users.  For my team, there is no
> numpy or R background.  They learn pandas, and what pandas does needs to be
> really clear in behavior and documentation.  I would also hazard a guess
> that most pandas users are like that - pandas is the first tool they see,
> not numpy or R.
>
> The places where I think confusion could happen are things like this with
> a DataFrame df :
>
>    1. s = df["a"]
>    2. s.iloc[3:5] = [1, 2, 3]
>    3. df["a"].iloc[3:5] = [1, 2, 3]
>    4. df["b"] = df["a"]
>    5. df["b"].iloc[3:5] = [4, 5, 6]
>    6. s2 = df["b"]
>    7. df["c"] = s2
>    8. s2.iloc[3:5] = [7, 8, 9]
>
> As I understand it (please correct me if I'm wrong), these lines would be
> interpreted as follows with the current proposal:
>

It's a bit different (to reiterate, with the *current* proposal, *any*
indexing operation (including series selection) behaves as a copy; and also
to be clear, this is one possible proposal, there are certainly other
possibilities). Answering case by case:


> 1. s = df["a"]
> Creates a view into the DataFrame df.  No copying is done at all
>

Indeed a view (but that's an implementation detail)

> 2. s.iloc[3:5] = [1, 2, 3]
> Modifies the series s and the underlying DataFrame df.  (copy-on-write)
>

Due to copy-on-write, it does *not* modify the DataFrame df. Copy-on-write
means that only when s is being written to, its data gets copied (at that
point breaking the view relation with the parent df).


> 3. df["a"].iloc[3:5] = [1, 2, 3]
> Modifies the dataframe
>

This is an example of chained assignment, which in the current proposal
never works (see the example in the google doc
<https://docs.google.com/document/d/1ZCQ9mx3LBMy-nhwRl33_jgcvWo9IWdEfxDNQ2thyTb0/edit#heading=h.dbqr0xsneytk>).
This is because chained assignment can always be written as:

temp = df["a"]
temp.iloc[3:5] = [1, 2, 3]

and `temp` uses copy-on-write (and then it is the same example as the one
above in 2.).

(what you describe is the current behaviour of pandas)


> 4. df["b"] = df["a"]
> Copies the series from "a" to "b"
>

It would indeed behave as a copy, but under the hood we can actually keep
this as a view (delay the copy thanks to copy-on-write).


> 5. df["b"].iloc[3:5] = [4, 5, 6]
> Modifies "b" in the DataFrame, but not "a"
>

Also doesn't modify "b" (see example 3. above), but indeed does not modify
"a"


> 6. s2 = df["b"]
> Create a view into the DataFrame df.  No copying is done at all.
>

Same as 1.


> 7. df["c"] = s2
> Copies the series from "b" to "c"
>

Same as 4.


> 8. s2.iloc[3:5] = [7, 8, 9]
> Modifies s2, which modifies "b", but NOT "c"
>

Doesn't modify "b" or "c". Similar to 3.

> I think the challenge is explaining the sequence 6,7,8 above in comparison
> to the other sequences.
>

So with the current proposal, the sequence 6, 7, 8 actually doesn't behave
differently. It is mainly 2 and 3 that would be quite different
compared to the current pandas behaviour.
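
For readers who want to try the sequence: the sketch below assumes a pandas
release newer than this thread that ships the opt-in `mode.copy_on_write`
option (pandas 1.5 and later), and assumes that option follows the rules
described above for these cases. The chained statements (3 and 5) are left
out, and the right-hand sides use two values because `3:5` selects two rows.

```python
import warnings
import pandas as pd

# Assumption: the option exists in this pandas version. Where
# copy-on-write is already the default, setting it may warn or be
# unavailable, hence the guard.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    try:
        pd.set_option("mode.copy_on_write", True)
    except Exception:
        pass

df = pd.DataFrame({"a": [0, 0, 0, 0, 0, 0]})

# 1. and 2.: writing to s copies its data first; df stays untouched.
s = df["a"]
s.iloc[3:5] = [1, 2]

# 4.: behaves as a copy of "a".
df["b"] = df["a"]

# 6. to 8.: once s2 is written to, it is detached from "b" and "c".
s2 = df["b"]
df["c"] = s2
s2.iloc[3:5] = [7, 8]

print(df)
print(s.tolist(), s2.tolist())
```

Under those assumptions only `s` and `s2` change, and all three columns
of `df` keep their original values.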



From adrin.jalali at gmail.com  Tue Jul 20 10:10:10 2021
From: adrin.jalali at gmail.com (Adrin)
Date: Tue, 20 Jul 2021 16:10:10 +0200
Subject: [Pandas-dev] Proposal for consistent,
 clear copy/view semantics in pandas with Copy-on-Write
In-Reply-To: <CAEk5N5uY0i+9vyGST62tpXZOmr5nJ0_UtaFZ4Ey0mpxQmq2zEA@mail.gmail.com>
References: <CALQtMBYqKg+HHb5pJZmYG6hbUg9M5f0b1ycPWYda_EvsWyHe=A@mail.gmail.com>
 <CALQtMBbY7t7F3eddhT5DN4UhNdr7GVzyKTWr4FxcdFJPiNw2PA@mail.gmail.com>
 <CAKf8g9RjaC8b2XHDELHa452pQ+JUVNg4aDDXTKPOJANBTWQZjQ@mail.gmail.com>
 <CAEQ_TvcZ55BXfh9J--6Su_iLrdy2BL=juzrkFZGhvf1tQCcrbQ@mail.gmail.com>
 <CALQtMBY=yCMvdUNE4yi_3Z1FG0vrzZT2CxSQYYN-9B+dHy2ruQ@mail.gmail.com>
 <CAEk5N5uY0i+9vyGST62tpXZOmr5nJ0_UtaFZ4Ey0mpxQmq2zEA@mail.gmail.com>
Message-ID: <CAEOrW4-=p6aB-A3oH4RFg=cd8xT5LqhnDxW5ENr9dCvjqwBqOA@mail.gmail.com>

I guess one question I have is what the memory and time performance
implications of the proposed change are. I guess I belong to the group of
users who think of a pandas DataFrame more as a numpy array with column names
attached to it, and hence I'd expect very similar semantics when
indexing, and I think copy-on-write semantics would have a significant
impact on our workflows.


From jorisvandenbossche at gmail.com  Fri Jul 23 15:31:33 2021
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Fri, 23 Jul 2021 21:31:33 +0200
Subject: [Pandas-dev] Proposal for consistent,
 clear copy/view semantics in pandas with Copy-on-Write
In-Reply-To: <CAEOrW4-=p6aB-A3oH4RFg=cd8xT5LqhnDxW5ENr9dCvjqwBqOA@mail.gmail.com>
References: <CALQtMBYqKg+HHb5pJZmYG6hbUg9M5f0b1ycPWYda_EvsWyHe=A@mail.gmail.com>
 <CALQtMBbY7t7F3eddhT5DN4UhNdr7GVzyKTWr4FxcdFJPiNw2PA@mail.gmail.com>
 <CAKf8g9RjaC8b2XHDELHa452pQ+JUVNg4aDDXTKPOJANBTWQZjQ@mail.gmail.com>
 <CAEQ_TvcZ55BXfh9J--6Su_iLrdy2BL=juzrkFZGhvf1tQCcrbQ@mail.gmail.com>
 <CALQtMBY=yCMvdUNE4yi_3Z1FG0vrzZT2CxSQYYN-9B+dHy2ruQ@mail.gmail.com>
 <CAEk5N5uY0i+9vyGST62tpXZOmr5nJ0_UtaFZ4Ey0mpxQmq2zEA@mail.gmail.com>
 <CAEOrW4-=p6aB-A3oH4RFg=cd8xT5LqhnDxW5ENr9dCvjqwBqOA@mail.gmail.com>
Message-ID: <CALQtMBYCf+mtvdq-SnMct2Yzc-pYhRHMct6JoS+Y81Aeggtamw@mail.gmail.com>

On Tue, 20 Jul 2021 at 16:10, Adrin <adrin.jalali at gmail.com> wrote:

> I guess one question I have is what are the memory and time performance
> implications of the proposed change.
>

Memory implications should be positive (less copying). The performance
impact of the additional logic (adding/checking of the weak references) is
something I haven't checked yet (it's on my to-do list), but I don't expect
it to be significant.

Based on your comment of "numpy array with column names", I think the
potential change of the ArrayManager is much more relevant for you than the
Copy-on-Write. And to be clear, the current proposal is not tied to the
ArrayManager (it's only the proof of concept that is implemented for that).
So I would prefer to keep the discussion focused on the copy/view
semantics, at least for now (it's only later, when discussing practical
ways to get this released, that we need to decide whether we want to
combine this with an ArrayManager refactor or not).


> I guess I belong to the group of users who think of a pandas DataFrame
> more as a numpy array with column names attached to them, and hence I'd
> expect very similar semantics when indexing, and I think copy on write
> semantics would have a significant impact on our workflows.
>

I assume you are also thinking of scikit-learn-like workflows? Can you
give an example of how you think copy-on-write would impact such (or
other) workflows?
In any case, thanks already for your feedback!


>
> On Sat, Jul 17, 2021 at 6:12 PM Marc Garcia <garcia.marc at gmail.com> wrote:
>
>> Based on my experience (not sure how biased it is), modifying dataframes
>> with something like `df[col][1:3] = ...` (or the equivalent with `.loc`)
>> is rare, except with boolean arrays. In my experience, when the values
>> of a dataframe column are changed, it is far more common to use
>> `df[col] = df[col].str.upper()`, `df[col] = df[col].fillna(0)`...
>>
>> While I'm personally happy with Joris's proposal, I see two other options
>> that could complement or replace it:
>>
>> Option 1) Deprecate assigning to a subset of rows, and only allow
>> assigning to whole columns.  Something like `df[col][1:3] = ...` could
>> be replaced by for example `df[col] = df[col].mask(slice(1, 3), ...)`.
>> Using `mask` and `where` is already supported for boolean arrays, so
>> slices should be added, and they'd be the only way to replace a subset of
>> values. I think that makes the problem narrower, and easier to understand
>> for users. The main thing to decide and be clear about is what happens if
>> the dataframe is a subset of another one:
>>
>> ```
>> df2 = df[cond]
>> df2[col] = df2[col].str.upper()
>> ```
>>
>> Option 2) If assigning with the current syntax (`df[col][1:3] = ...`  or
>> `.loc` equivalent) is something we want to keep (I wouldn't if we move
>> in this direction), maybe it could be moved to a `DataFrame` subclass.
>> So, the main dataframe class behaves like in option 1, so expectations are
>> much easier to manage. But users who really want to assign with indexing,
>> can still use it, knowing that having a mutable dataframe comes at a cost
>> (copies, more complex behavior...). The `MutableDataFrame` could be in
>> pandas, or a third-party extension.
>>
>> ```
>> df_mutable = df.to_mutable()
>> df_mutable[col][1:3] = ...
>> ```
>>
>> On Sat, Jul 17, 2021 at 9:16 AM Joris Van den Bossche <
>> jorisvandenbossche at gmail.com> wrote:
>>
>>>
>>> On Fri, 16 Jul 2021 at 18:58, Stephan Hoyer <shoyer at gmail.com> wrote:
>>>
>>>> On Fri, Jul 16, 2021 at 9:04 AM Brock Mendel <jbrockmendel at gmail.com>
>>>> wrote:
>>>>
>>>>> I do not like the fact that nothing can ever be "just a view" with
>>>>> these semantics, including series[::-1], frame[col], frame[:]. Users
>>>>> reasonably expect numpy semantics for these.
>>>>>
>>> I am personally not sure what "users" in general expect for those (as
>>> also mentioned by Tom and Irv already, depending on their background, they
>>> might expect different things).
>>> For example, for a user that knows basic Python, they could actually
>>> expect all those examples to give a copy since `a_list[:]` is a typical way
>>> to make a copy of a list.
>>>
>>> (it might be interesting to reach out to educators (who might have more
>>> experience with expectations/typical errors of novice users) or to do some
>>> kind of experiment on this topic)
>>>
>>> Personally, I cannot remember that I ever relied on the
>>> mutability-aspect of eg `series[1:3]` or `frame[:]` being a view. I think
>>> there are generally 2 reasons for users caring about a view: 1) for
>>> performance (less copying) and 2) for being able to mutate the view with
>>> the explicit goal to mutate the parent (and not as an irrelevant
>>> side-effect).
>>> I think the first reason is by far the most common one (but that's my
>>> subjective opinion from my experience using pandas, so that can certainly
>>> vary), and in the current proposal, all those mentioned examples will be
>>> actual views under the hood (and thus cover this first reason).
>>>
>>> The only case where I know I explicitly rely on this is with chained
>>> assignment (eg `frame[col][1:3] = ..`). That's certainly a very important
>>> use case (and probably the most impacted usage pattern with the current
>>> proposal), but it's also a case where 1) there is a clear alternative
>>> (don't use chained assignment, but do it in one step (e.g. `frame.loc[1:3,
>>> col] = ..`), some corner cases of mixed positional/label-based indexing
>>> aside, for which we should find an alternative) and 2) we might be able to
>>> detect this and raise an informative error message (specifically for
>>> chained assignment).
>>>
>>> I think it can be easier to explain "chained assignment never works"
>>> than "chained assignment only works if first selecting the column(s)"
>>> (depending on the exact rules).
>>>
>>>
>>>>> We should revisit the alternative "clear/simple rules" approach that is
>>>>> "indexing on columns always gives a view" (
>>>>> https://github.com/pandas-dev/pandas/pull/33597). This is simpler to
>>>>> explain/grok, simpler to implement
>>>>>
>>>>
>>>> I don't know if it is worth the trouble for complex multi-column
>>>> selections, but I do see the appeal here.
>>>>
>>>> A simpler variant would be to make indexing out a single Series from a
>>>> DataFrame return a view, with everything else doing copy on write. Then the
>>>> existing pattern df.column_one[:] = ... would still work.
>>>>
>>>
>>> I was initially thinking about this as well. In the end, I didn't (yet)
>>> try to implement this, because while thinking it through, it seemed that
>>> this might give quite some tricky cases. Consider the following example:
>>>
>>> df = pd.DataFrame(..)
>>> df_subset = df[["col1", "col2"]]
>>> s1 = df["col1"]
>>> s1_subset = s1[0:3]
>>> # modifying s1 should modify df, but not df_subset and s1_subset?
>>> s1[0] = 0
>>>
>>> If we take "only accessing a single Series from a DataFrame is a view,
>>> everything else uses copy-on-write", that gives rise to questions like the
>>> above where some parents/children get modified, and some not.
>>> This is both harder to explain to users and harder to implement. For the
>>> implementation of the proof-of-concept, the copy-on-write happens "locally"
>>> in the series/dataframe that gets modified (meaning: when modifying a given
>>> object, its internal array data first gets copied and replaced *if* the
>>> object is viewing another or is being viewed by another object). While in
>>> the above case, modifying a given object would need to trigger a copy in
>>> other (potentially many) objects, and not in the object being modified.
>>> It's probably possible to implement this, but certainly harder/trickier to
>>> do.
>>>
>>>
>>>>
>>>>
>>>>>
>>>>> On Fri, Jul 16, 2021 at 5:26 AM Joris Van den Bossche <
>>>>> jorisvandenbossche at gmail.com> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, 12 Jul 2021 at 00:58, Joris Van den Bossche <
>>>>>> jorisvandenbossche at gmail.com> wrote:
>>>>>>
>>>>>>> Short summary of the proposal:
>>>>>>>
>>>>>>>    1. The result of *any* indexing operation (subsetting a
>>>>>>>    DataFrame or Series in any way) or any method returning a new DataFrame,
>>>>>>>    always *behaves as if it were* a copy in terms of user API.
>>>>>>>
>>>>>> To explicitly call out the column-as-Series case (since this is a
>>>>>> typical case that right now *always* is a view): "any" indexing
>>>>>> operation thus also includes accessing a DataFrame column as a Series (or
>>>>>> slicing a Series).
>>>>>>
>>>>>> So something like s = df["col"] and then mutating s will no longer
>>>>>> update df. Similarly for series_subset = series[1:5], mutating
>>>>>> series_subset will no longer update series.

From jbrockmendel at gmail.com  Fri Jul 23 16:09:46 2021
From: jbrockmendel at gmail.com (Brock Mendel)
Date: Fri, 23 Jul 2021 13:09:46 -0700
Subject: [Pandas-dev] Proposal for consistent,
 clear copy/view semantics in pandas with Copy-on-Write
In-Reply-To: <CALQtMBYCf+mtvdq-SnMct2Yzc-pYhRHMct6JoS+Y81Aeggtamw@mail.gmail.com>
References: <CALQtMBYqKg+HHb5pJZmYG6hbUg9M5f0b1ycPWYda_EvsWyHe=A@mail.gmail.com>
 <CALQtMBbY7t7F3eddhT5DN4UhNdr7GVzyKTWr4FxcdFJPiNw2PA@mail.gmail.com>
 <CAKf8g9RjaC8b2XHDELHa452pQ+JUVNg4aDDXTKPOJANBTWQZjQ@mail.gmail.com>
 <CAEQ_TvcZ55BXfh9J--6Su_iLrdy2BL=juzrkFZGhvf1tQCcrbQ@mail.gmail.com>
 <CALQtMBY=yCMvdUNE4yi_3Z1FG0vrzZT2CxSQYYN-9B+dHy2ruQ@mail.gmail.com>
 <CAEk5N5uY0i+9vyGST62tpXZOmr5nJ0_UtaFZ4Ey0mpxQmq2zEA@mail.gmail.com>
 <CAEOrW4-=p6aB-A3oH4RFg=cd8xT5LqhnDxW5ENr9dCvjqwBqOA@mail.gmail.com>
 <CALQtMBYCf+mtvdq-SnMct2Yzc-pYhRHMct6JoS+Y81Aeggtamw@mail.gmail.com>
Message-ID: <CAKf8g9TJNdHHihSKP8CzwuXrsONPBEkjybOzdFNAsmFBZiu_WA@mail.gmail.com>

> Memory implications should be positive (less copying).

This is accurate _only_ in cases where we currently make copies.  In cases
where we currently make views, the perf effect goes the other way.

On the flip side, Always-Views improves perf in cases where we currently
make copies, but if you want a copy then you'll have to make one explicitly
which will claw back that gain.  (In the long-out-of-date proof of concept
https://github.com/pandas-dev/pandas/pull/33597  df[np.random.randint(0,
30, 30)] was ~92% faster than the status quo at the time)
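
For readers less familiar with the distinction, the "currently make views"
versus "currently make copies" split mirrors NumPy's own indexing rules. The
snippet below is only an illustration of those NumPy rules (pandas' current
behaviour has more cases); the variable names are made up for the example.

```python
import numpy as np

arr = np.arange(10)

sliced = arr[2:5]           # basic slicing returns a view
picked = arr[[2, 3, 4]]     # integer-array ("fancy") indexing returns a copy

slice_shares = np.shares_memory(arr, sliced)
fancy_shares = np.shares_memory(arr, picked)

sliced[0] = -1              # writes through to arr
picked[1] = -2              # does not touch arr
```

Under Copy-on-Write the first case would gain a copy on the first write;
under Always-Views the second case would stop copying on selection.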

> The performance impact of the additional logic (adding/checking of the
> weak references) is something I didn't yet check (on my to do list), but I
> suspect it to not be significant.

Agreed the CoW logic itself should be negligible outside of microbenchmarks.


From simonjayhawkins at gmail.com  Mon Jul 26 05:37:39 2021
From: simonjayhawkins at gmail.com (Simon Hawkins)
Date: Mon, 26 Jul 2021 10:37:39 +0100
Subject: [Pandas-dev] ANN: pandas v1.3.1
Message-ID: <CAFs9qGtPS6tJaVijhmcq-g8ZZx_OOPLzGFvB4C7Mk9+XFrmsjg@mail.gmail.com>

Hi all,

I'm pleased to announce the release of pandas v1.3.1.

This is the first patch release in the 1.3.x series and includes some
regression fixes and bug fixes. We recommend that all users upgrade to this
version.

See the release notes
<https://pandas.pydata.org/pandas-docs/version/1.3/whatsnew/v1.3.1.html> for
a list of all the changes.


The release can be installed from PyPI

    python -m pip install --upgrade pandas==1.3.1

Or from conda-forge

    conda install -c conda-forge pandas==1.3.1


Please report any issues with the release on the pandas issue tracker
<https://github.com/pandas-dev/pandas/issues>.

Thanks to all the contributors who made this release possible.

From adrin.jalali at gmail.com  Mon Jul 26 05:51:43 2021
From: adrin.jalali at gmail.com (Adrin)
Date: Mon, 26 Jul 2021 11:51:43 +0200
Subject: [Pandas-dev] Proposal for consistent,
 clear copy/view semantics in pandas with Copy-on-Write
In-Reply-To: <CAKf8g9TJNdHHihSKP8CzwuXrsONPBEkjybOzdFNAsmFBZiu_WA@mail.gmail.com>
References: <CALQtMBYqKg+HHb5pJZmYG6hbUg9M5f0b1ycPWYda_EvsWyHe=A@mail.gmail.com>
 <CALQtMBbY7t7F3eddhT5DN4UhNdr7GVzyKTWr4FxcdFJPiNw2PA@mail.gmail.com>
 <CAKf8g9RjaC8b2XHDELHa452pQ+JUVNg4aDDXTKPOJANBTWQZjQ@mail.gmail.com>
 <CAEQ_TvcZ55BXfh9J--6Su_iLrdy2BL=juzrkFZGhvf1tQCcrbQ@mail.gmail.com>
 <CALQtMBY=yCMvdUNE4yi_3Z1FG0vrzZT2CxSQYYN-9B+dHy2ruQ@mail.gmail.com>
 <CAEk5N5uY0i+9vyGST62tpXZOmr5nJ0_UtaFZ4Ey0mpxQmq2zEA@mail.gmail.com>
 <CAEOrW4-=p6aB-A3oH4RFg=cd8xT5LqhnDxW5ENr9dCvjqwBqOA@mail.gmail.com>
 <CALQtMBYCf+mtvdq-SnMct2Yzc-pYhRHMct6JoS+Y81Aeggtamw@mail.gmail.com>
 <CAKf8g9TJNdHHihSKP8CzwuXrsONPBEkjybOzdFNAsmFBZiu_WA@mail.gmail.com>
Message-ID: <CAEOrW48BKyv7NF1D8wzLvxhVhQfARCjpPo8eRVwNv3KrmmXyhw@mail.gmail.com>

There are two cases that I think are relevant here (as opposed to the
ArrayManager discussion), but I may be wrong.

The two cases I'm thinking of, in a simple, non-optimized form, are, in
pseudocode:

for column in columns_of(data):
    data[:, column] = (data[:, column] - mean(data[:, column])) / std(data[:, column])

And the other one is the same as above, but for rows.
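
As a runnable sketch of the pseudocode above, assuming plain NumPy arrays and
the population standard deviation (NumPy's default), with made-up example
data:

```python
import numpy as np

data = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

# Column-wise: data[:, j] is a view, and the assignment writes in place.
for j in range(data.shape[1]):
    col = data[:, j]
    data[:, j] = (col - col.mean()) / col.std()

col_means = data.mean(axis=0)
col_stds = data.std(axis=0)

# Row-wise variant on a separate example array.
rows = np.array([[1.0, 2.0, 3.0], [10.0, 20.0, 40.0]])
for i in range(rows.shape[0]):
    row = rows[i, :]
    rows[i, :] = (row - row.mean()) / row.std()

row_means = rows.mean(axis=1)
row_stds = rows.std(axis=1)
```

Both loops assign into a slice of the original array in a single indexing
step, so they modify `data` and `rows` themselves rather than a derived
object.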

Also, one issue I have is this: if we're doing copy-on-write, what does the
above mean? As in, if I do `df["column_A"] = ....`, where is that copy? How
do I access the new one as opposed to the old one?

On Fri, Jul 23, 2021 at 10:09 PM Brock Mendel <jbrockmendel at gmail.com>
wrote:

> > Memory implications should be positive (less copying).
>
> This is accurate _only_ in cases where we currently make copies.  In cases
> where we currently make views, the perf effect goes the other way.
>
> On the flip side, Always-Views improves perf in cases where we currently
> make copies, but if you want a copy then you'll have to make one explicitly
> which will claw back that gain.  (In the long-out-of-date proof of concept
> https://github.com/pandas-dev/pandas/pull/33597  df[np.random.randint(0,
> 30, 30)] was ~92% faster than the status quo at the time)
>
> > The performance impact of the additional logic (adding/checking of the
> weak references) is something I didn't yet check (on my to do list), but I
> suspect it to not be significant.
>
> Agreed the CoW logic itself should be negligible outside of
> microbenchmarks.
>
> On Fri, Jul 23, 2021 at 12:32 PM Joris Van den Bossche <
> jorisvandenbossche at gmail.com> wrote:
>
>> On Tue, 20 Jul 2021 at 16:10, Adrin <adrin.jalali at gmail.com> wrote:
>>
>>> I guess one question I have is what are the memory and time performance
>>> implications of the proposed change.
>>>
>>
>> Memory implications should be positive (less copying). The performance
>> impact of the additional logic (adding/checking of the weak references) is
>> something I didn't yet check (on my to do list), but I suspect it to not be
>> significant.
>>
>> Based on your comment of "numpy array with column names", I think the
>> potential change of the ArrayManager is much more relevant for you than the
>> Copy-on-Write. And to be clear, the current proposal is not tied to the
>> ArrayManager (it's only the proof of concept that is implemented for that).
>> So I would prefer to keep the discussion focused on the copy/view
>> semantics, at least for now (it's only later, when discussing practical
>> ways to get this released, that we need to decide whether we want to
>> combine this with an ArrayManager refactor or not).
>>
>>
>>> I guess I belong to the group of users who think of a pandas DataFrame
>>> more as a numpy array with column names attached to them, and hence I'd
>>> expect very similar semantics when indexing, and I think copy on write
>>> semantics would have a significant impact on our workflows.
>>>
>>
>> I assuming you are also thinking of scikit-learn like worflows? Can you
>> give an example of what your are thinking about how copy-on-write impacts
>> such (or other) workflows?
>> In any case thanks already for your feedback!
>>
>>
>>>
>>> On Sat, Jul 17, 2021 at 6:12 PM Marc Garcia <garcia.marc at gmail.com>
>>> wrote:
>>>
>>>> Based on my experience (not sure how biased it is), modifying
>>>> dataframes with something like `df[col][1:3] = ...` is rare (or the
>>>> equivalent with `.loc`) except for boolean arrays. From my experience,
>>>> when the values of a dataframe column are changed, what I think it's way
>>>> more common is to use `df[col] = df[col].str.upper()`, `df[col] =
>>>> df[col].fillna(0)`...
>>>>
>>>> While I'm personally happy with Joris proposal, I see two other options
>>>> that could complement or replace it:
>>>>
>>>> Option 1) Deprecate assigning to a subset of rows, and only allow
>>>> assigning to whole columns.  Something like `df[col][1:3] = ...` could
>>>> be replaced by for example `df[col] = df[col].mask(slice(1, 3), ...)`.
>>>> Using `mask` and `where` is already supported for boolean arrays, so
>>>> slices should be added, and they'd be the only way to replace a subset of
>>>> values. I think that makes the problem narrower, and easier to understand
>>>> for users. The main thing to decide and be clear about is what happens if
>>>> the dataframe is a subset of another one:
>>>>
>>>> ```
>>>> df2 = df[cond]
>>>> df2[col] = df2[col].str.upper()
>>>> ```
>>>>
>>>> Option 2) If assigning with the current syntax (`df[col][1:3] = ...`
>>>> or `.loc` equivalent) is something we want to keep (I wouldn't if we
>>>> move in this direction), maybe it could be moved to a `DataFrame`
>>>> subclass So, the main dataframe class behaves like in option 1, so
>>>> expectations are much easier to manage. But users who really want to assign
>>>> with indexing, can still use it, knowing that having a mutable dataframe
>>>> comes at a cost (copies, more complex behavior...). The `
>>>> MutableDataFrame` could be in pandas, or a third-party extension.
>>>>
>>>> ```
>>>> df_mutable = df.to_mutable()
>>>> df_mutable[col][1:3] = ...
>>>> ```
>>>>
>>>> On Sat, Jul 17, 2021 at 9:16 AM Joris Van den Bossche <
>>>> jorisvandenbossche at gmail.com> wrote:
>>>>
>>>>>
>>>>> On Fri, 16 Jul 2021 at 18:58, Stephan Hoyer <shoyer at gmail.com> wrote:
>>>>>
>>>>>> On Fri, Jul 16, 2021 at 9:04 AM Brock Mendel <jbrockmendel at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I do not like the fact that nothing can ever be "just a view" with
>>>>>>> these semantics, including series[::-1], frame[col], frame[:]. Users
>>>>>>> reasonably expect numpy semantics for these.
>>>>>>>
>>>>>>> I am personally not sure what "users" in general expect for those
>>>>> (as also mentioned by Tom and Irv already, depending on their background,
>>>>> they might expect different things).
>>>>> For example, for a user that knows basic Python, they could actually
>>>>> expect all those examples to give a copy since `a_list[:]` is a typical way
>>>>> to make a copy of a list.
>>>>>
>>>>> (it might be interesting to reach out to educators (who might have
>>>>> more experience with expectations/typical errors of novice users) or to do
>>>>> some kind of experiment on this topic)
>>>>>
>>>>> Personally, I cannot remember that I ever relied on the
>>>>> mutability-aspect of eg `series[1:3]` or `frame[:]` being a view. I think
>>>>> there are generally 2 reasons for users caring about a view: 1) for
>>>>> performance (less copying) and 2) for being able to mutate the view with
>>>>> the explicit goal to mutate the parent (and not as an irrelevant
>>>>> side-effect).
>>>>> I think the first reason is by far the most common one (but that's my
>>>>> subjective opinion from my experience using pandas, so that can certainly
>>>>> depend), and in the current proposal, all those mentioned example will be
>>>>> actual views under the hood (and thus cover this first reason).
>>>>>
>>>>> The only case where I know I explicitly rely on this is with chained
>>>>> assignment (eg `frame[col][1:3] = ..`). That's certainly a very important
>>>>> use case (and probably the most impacted usage pattern with the current
>>>>> proposal), but it's also a case where 1) there is a clear alternative
>>>>> (don't use chained assignment, but do it in one step (e.g. `frame.loc[1:3,
>>>>> col] = ..`), some corner cases of mixed positional/label-based indexing
>>>>> aside, for which we should find an alternative) and 2) we might be able to
>>>>> detect this and raise an informative error message (specifically for
>>>>> chained assignment).
>>>>>
>>>>> I think it can be easier to explain "chained assignment never works"
>>>>> than "chained assignment only works if first selecting the column(s)"
>>>>> (depending on the exact rules).
>>>>>
>>>>>
>>>>>> We should revisit the alternative "clear/simple rules" approach that
>>>>>>> is "indexing on columns always gives a view" (
>>>>>>> https://github.com/pandas-dev/pandas/pull/33597). This is simpler
>>>>>>> to explain/grok, simpler to implement
>>>>>>>
>>>>>>
>>>>>> I don't know if it is worth the trouble for complex multi-column
>>>>>> selections, but I do see the appeal here.
>>>>>>
>>>>>> A simpler variant would be to make indexing out a single Series from
>>>>>> a DataFrame return a view, with everything else doing copy on write. Then
>>>>>> the existing pattern df.column_one[:] = ... would still work.
>>>>>>
>>>>>
>>>>> I was initially thinking about this as well. In the end, I didn't
>>>>> (yet) try to implement this, because while thinking it through, it seemed
>>>>> that this might give quite some tricky cases. Consider the following
>>>>> example:
>>>>>
>>>>> df = pd.DataFrame(..)
>>>>> df_subset = df[["col1", "col2"]]
>>>>> s1 = df["col1"]
>>>>> s1_subset = s1[0:3]
>>>>> # modifying s1 should modify df, but not df_subset and s1_subset?
>>>>> s1[0] = 0
>>>>>
>>>>> If we take "only accessing a single Series from a DataFrame is a view,
>>>>> everything else uses copy-on-write", that gives rise to questions like the
>>>>> above where some parents/childs get modified, and some not.
>>>>> This is both harder to explain to users, as harder to implement. For
>>>>> the implementation of the proof-of-concept, the copy-on-write happens
>>>>> "locally" in the series/dataframe that gets modified (meaning: when
>>>>> modifying a given object, its internal array data first gets copied and
>>>>> replaced *if* the object is viewing another or is being viewed by another
>>>>> object). While in the above case, modifying a given object would need to
>>>>> trigger a copy in other (potentially many) objects, and not in the object
>>>>> being modified. It's probably possible to implement this, but certainly
>>>>> harder/trickier to do.
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> On Fri, Jul 16, 2021 at 5:26 AM Joris Van den Bossche <
>>>>>>> jorisvandenbossche at gmail.com> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, 12 Jul 2021 at 00:58, Joris Van den Bossche <
>>>>>>>> jorisvandenbossche at gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Short summary of the proposal:
>>>>>>>>>
>>>>>>>>>    1. The result of *any* indexing operation (subsetting a
>>>>>>>>>    DataFrame or Series in any way) or any method returning a new DataFrame,
>>>>>>>>>    always *behaves as if it were* a copy in terms of user API.
>>>>>>>>>
>>>>>>>>>  To explicitly call out the column-as-Series case (since this is a
>>>>>>>> typical case that right now *always* is a view): "any" indexing
>>>>>>>> operation thus also includes accessing a DataFrame column as a Series (or
>>>>>>>> slicing a Series).
>>>>>>>>
>>>>>>>> So something like s = df["col"] and then mutating s will no longer
>>>>>>>> update df. Similarly for series_subset = series[1:5], mutating
>>>>>>>> series_subset will no longer update series.
>>>>>>>> _______________________________________________
>>>>>>>> Pandas-dev mailing list
>>>>>>>> Pandas-dev at python.org
>>>>>>>> https://mail.python.org/mailman/listinfo/pandas-dev
>>>>>>>>
>

From jbrockmendel at gmail.com  Mon Jul 26 12:38:11 2021
From: jbrockmendel at gmail.com (Brock Mendel)
Date: Mon, 26 Jul 2021 09:38:11 -0700
Subject: [Pandas-dev] Proposal for consistent,
 clear copy/view semantics in pandas with Copy-on-Write
In-Reply-To: <CAEOrW48BKyv7NF1D8wzLvxhVhQfARCjpPo8eRVwNv3KrmmXyhw@mail.gmail.com>
References: <CALQtMBYqKg+HHb5pJZmYG6hbUg9M5f0b1ycPWYda_EvsWyHe=A@mail.gmail.com>
 <CALQtMBbY7t7F3eddhT5DN4UhNdr7GVzyKTWr4FxcdFJPiNw2PA@mail.gmail.com>
 <CAKf8g9RjaC8b2XHDELHa452pQ+JUVNg4aDDXTKPOJANBTWQZjQ@mail.gmail.com>
 <CAEQ_TvcZ55BXfh9J--6Su_iLrdy2BL=juzrkFZGhvf1tQCcrbQ@mail.gmail.com>
 <CALQtMBY=yCMvdUNE4yi_3Z1FG0vrzZT2CxSQYYN-9B+dHy2ruQ@mail.gmail.com>
 <CAEk5N5uY0i+9vyGST62tpXZOmr5nJ0_UtaFZ4Ey0mpxQmq2zEA@mail.gmail.com>
 <CAEOrW4-=p6aB-A3oH4RFg=cd8xT5LqhnDxW5ENr9dCvjqwBqOA@mail.gmail.com>
 <CALQtMBYCf+mtvdq-SnMct2Yzc-pYhRHMct6JoS+Y81Aeggtamw@mail.gmail.com>
 <CAKf8g9TJNdHHihSKP8CzwuXrsONPBEkjybOzdFNAsmFBZiu_WA@mail.gmail.com>
 <CAEOrW48BKyv7NF1D8wzLvxhVhQfARCjpPo8eRVwNv3KrmmXyhw@mail.gmail.com>
Message-ID: <CAKf8g9Redr8s1U4AwVQqnJHqk9+nqT8mRW6AE50EKVgA1z1New@mail.gmail.com>

> data.iloc[:, c] = (data.iloc[:, c] - data.iloc[:, c].mean()) / data.iloc[:, c].std()

This would not make any copies under any of the scenarios being discussed,
including the status quo.
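
For reference, a runnable form of the column-wise loop quoted above might look like the sketch below. The frame, its column names, and its values are made up for illustration and do not come from this thread; the loop itself only uses the `iloc`, `mean`, and `std` calls already shown in the quote.

```python
import pandas as pd

# Small made-up frame; column names and values are illustrative assumptions.
data = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0, 4.0], "y": [10.0, 20.0, 30.0, 50.0]}
)

for c in range(data.shape[1]):
    col = data.iloc[:, c]
    # Series.std() uses the sample standard deviation (ddof=1) by default.
    data.iloc[:, c] = (col - col.mean()) / col.std()
```

After the loop each column has mean 0 and sample standard deviation 1, up to floating-point error.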

> And the other one is the same as above, but for rows.

With ArrayManager, the `data.iloc[r]` will make a copy, but the CoW doesn't
affect that.  No copies with BlockManager, regardless of CoW.
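
As a rough illustration of the "local" copy-on-write rule discussed further down in this thread (an object copies its own data before a write if that data is shared with another object), here is a toy model in plain Python. `CowSeries` and its attributes are hypothetical names invented for this sketch; it is not the pandas implementation and it ignores slicing, dtypes, and everything else a real Series does.

```python
import weakref


class CowSeries:
    """Toy stand-in for a Series (hypothetical, not a pandas class)."""

    def __init__(self, values, _sharers=None):
        self._values = values
        # Weak set of all objects currently sharing self._values.
        self._sharers = weakref.WeakSet() if _sharers is None else _sharers
        self._sharers.add(self)

    def view(self):
        # No data is copied here: the new object shares the same list.
        return CowSeries(self._values, self._sharers)

    def __getitem__(self, i):
        return self._values[i]

    def __setitem__(self, i, value):
        if len(self._sharers) > 1:
            # Someone else shares the data: copy locally before writing,
            # then leave the old sharing group and start a new one.
            self._values = list(self._values)
            self._sharers.discard(self)
            self._sharers = weakref.WeakSet([self])
        self._values[i] = value


parent = CowSeries([1, 2, 3])
child = parent.view()   # shares data with parent, no copy yet
child[0] = 99           # triggers the copy inside child only
parent[1] = 7           # parent is the sole owner again, written in place
```

The copy happens only in the object being modified, which is the property that makes this variant comparatively simple; weak references mean that a view which has been garbage-collected no longer forces a copy.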


On Mon, Jul 26, 2021 at 2:51 AM Adrin <adrin.jalali at gmail.com> wrote:

> There are two cases that I think are relevant here (as opposed to the
> ArrayManager discussion), but I may be wrong.
>
> The two cases I'm thinking of, in a simple, non-optimized form, are, in
> pseudocode:
>
> for column in columns_of(data):
>     data[:, column] = (data[:, column] - mean(data[:, column])) / std(data[:, column])
>
> And the other one is the same as above, but for rows.
>
> Also, one issue I have, is that if we're doing copy-on-write, then what
> does the above mean? As in, if I do `df["column_A"] = ....`, where is that
> copy? How do I access the new one as opposed to the old one?
>
> On Fri, Jul 23, 2021 at 10:09 PM Brock Mendel <jbrockmendel at gmail.com>
> wrote:
>
>> > Memory implications should be positive (less copying).
>>
>> This is accurate _only_ in cases where we currently make copies.  In
>> cases where we currently make views, the perf effect goes the other way.
>>
>> On the flip side, Always-Views improves perf in cases where we currently
>> make copies, but if you want a copy then you'll have to make one explicitly
>> which will claw back that gain.  (In the long-out-of-date proof of concept
>> https://github.com/pandas-dev/pandas/pull/33597  df[np.random.randint(0,
>> 30, 30)] was ~92% faster than the status quo at the time)
>>
>> > The performance impact of the additional logic (adding/checking of the
>> weak references) is something I didn't yet check (on my to do list), but I
>> suspect it to not be significant.
>>
>> Agreed the CoW logic itself should be negligible outside of
>> microbenchmarks.
>>
>> On Fri, Jul 23, 2021 at 12:32 PM Joris Van den Bossche <
>> jorisvandenbossche at gmail.com> wrote:
>>
>>> On Tue, 20 Jul 2021 at 16:10, Adrin <adrin.jalali at gmail.com> wrote:
>>>
>>>> I guess one question I have is what are the memory and time performance
>>>> implications of the proposed change.
>>>>
>>>
>>> Memory implications should be positive (less copying). The performance
>>> impact of the additional logic (adding/checking of the weak references) is
>>> something I didn't yet check (on my to do list), but I suspect it to not be
>>> significant.
>>>
>>> Based on your comment of "numpy array with column names", I think the
>>> potential change of the ArrayManager is much more relevant for you than the
>>> Copy-on-Write. And to be clear, the current proposal is not tied to the
>>> ArrayManager (it's only the proof of concept that is implemented for that).
>>> So I would prefer to keep the discussion focused on the copy/view
>>> semantics, at least for now (it's only later, when discussing practical
>>> ways to get this released, that we need to decide whether we want to
>>> combine this with an ArrayManager refactor or not).
>>>
>>>
>>>> I guess I belong to the group of users who think of a pandas DataFrame
>>>> more as a numpy array with column names attached to them, and hence I'd
>>>> expect very similar semantics when indexing, and I think copy on write
>>>> semantics would have a significant impact on our workflows.
>>>>
>>>
>>> I assume you are also thinking of scikit-learn-like workflows? Can you
>>> give an example of how you think copy-on-write would impact such (or
>>> other) workflows?
>>> In any case thanks already for your feedback!
>>>
>>>
>>>>
>>>> On Sat, Jul 17, 2021 at 6:12 PM Marc Garcia <garcia.marc at gmail.com>
>>>> wrote:
>>>>
>>>>> Based on my experience (not sure how biased it is), modifying
>>>>> dataframes with something like `df[col][1:3] = ...` is rare (or the
>>>>> equivalent with `.loc`) except for boolean arrays. From my
>>>>> experience, when the values of a dataframe column are changed, what I think
>>>>> is way more common is to use `df[col] = df[col].str.upper()`,
>>>>> `df[col] = df[col].fillna(0)`...
>>>>>
>>>>> While I'm personally happy with Joris's proposal, I see two other
>>>>> options that could complement or replace it:
>>>>>
>>>>> Option 1) Deprecate assigning to a subset of rows, and only allow
>>>>> assigning to whole columns.  Something like `df[col][1:3] = ...`
>>>>> could be replaced by for example `df[col] = df[col].mask(slice(1, 3),
>>>>> ...)`.  Using `mask` and `where` is already supported for boolean
>>>>> arrays, so slices should be added, and they'd be the only way to replace a
>>>>> subset of values. I think that makes the problem narrower, and easier to
>>>>> understand for users. The main thing to decide and be clear about is what
>>>>> happens if the dataframe is a subset of another one:
>>>>>
>>>>> ```
>>>>> df2 = df[cond]
>>>>> df2[col] = df2[col].str.upper()
>>>>> ```
>>>>>
>>>>> Option 2) If assigning with the current syntax (`df[col][1:3] = ...`
>>>>> or `.loc` equivalent) is something we want to keep (I wouldn't if we
>>>>> move in this direction), maybe it could be moved to a `DataFrame`
>>>>> subclass. That way the main dataframe class behaves like in option 1, so
>>>>> expectations are much easier to manage, but users who really want to assign
>>>>> with indexing can still do so, knowing that having a mutable dataframe
>>>>> comes at a cost (copies, more complex behavior...). The
>>>>> `MutableDataFrame` could be in pandas, or a third-party extension.
>>>>>
>>>>> ```
>>>>> df_mutable = df.to_mutable()
>>>>> df_mutable[col][1:3] = ...
>>>>> ```
>>>>>
>>>>> On Sat, Jul 17, 2021 at 9:16 AM Joris Van den Bossche <
>>>>> jorisvandenbossche at gmail.com> wrote:
>>>>>
>>>>>>
>>>>>> On Fri, 16 Jul 2021 at 18:58, Stephan Hoyer <shoyer at gmail.com> wrote:
>>>>>>
>>>>>>> On Fri, Jul 16, 2021 at 9:04 AM Brock Mendel <jbrockmendel at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I do not like the fact that nothing can ever be "just a view" with
>>>>>>>> these semantics, including series[::-1], frame[col], frame[:]. Users
>>>>>>>> reasonably expect numpy semantics for these.
>>>>>>>>
>>>>>>>> I am personally not sure what "users" in general expect for those
>>>>>> (as also mentioned by Tom and Irv already, depending on their background,
>>>>>> they might expect different things).
>>>>>> For example, for a user that knows basic Python, they could actually
>>>>>> expect all those examples to give a copy since `a_list[:]` is a typical way
>>>>>> to make a copy of a list.
>>>>>>
>>>>>> (it might be interesting to reach out to educators (who might have
>>>>>> more experience with expectations/typical errors of novice users) or to do
>>>>>> some kind of experiment on this topic)
>>>>>>
>>>>>> Personally, I cannot remember that I ever relied on the
>>>>>> mutability-aspect of eg `series[1:3]` or `frame[:]` being a view. I think
>>>>>> there are generally 2 reasons for users caring about a view: 1) for
>>>>>> performance (less copying) and 2) for being able to mutate the view with
>>>>>> the explicit goal to mutate the parent (and not as an irrelevant
>>>>>> side-effect).
>>>>>> I think the first reason is by far the most common one (but that's my
>>>>>> subjective opinion from my experience using pandas, so that can certainly
>>>>>> vary), and in the current proposal, all those mentioned examples will be
>>>>>> actual views under the hood (and thus cover this first reason).
>>>>>>
>>>>>> The only case where I know I explicitly rely on this is with chained
>>>>>> assignment (eg `frame[col][1:3] = ..`). That's certainly a very important
>>>>>> use case (and probably the most impacted usage pattern with the current
>>>>>> proposal), but it's also a case where 1) there is a clear alternative
>>>>>> (don't use chained assignment, but do it in one step (e.g. `frame.loc[1:3,
>>>>>> col] = ..`), some corner cases of mixed positional/label-based indexing
>>>>>> aside, for which we should find an alternative) and 2) we might be able to
>>>>>> detect this and raise an informative error message (specifically for
>>>>>> chained assignment).
>>>>>>
>>>>>> I think it can be easier to explain "chained assignment never works"
>>>>>> than "chained assignment only works if first selecting the column(s)"
>>>>>> (depending on the exact rules).
>>>>>>
>>>>>>
>>>>>>> We should revisit the alternative "clear/simple rules" approach that
>>>>>>>> is "indexing on columns always gives a view" (
>>>>>>>> https://github.com/pandas-dev/pandas/pull/33597). This is simpler
>>>>>>>> to explain/grok, simpler to implement
>>>>>>>>
>>>>>>>
>>>>>>> I don't know if it is worth the trouble for complex multi-column
>>>>>>> selections, but I do see the appeal here.
>>>>>>>
>>>>>>> A simpler variant would be to make indexing out a single Series from
>>>>>>> a DataFrame return a view, with everything else doing copy on write. Then
>>>>>>> the existing pattern df.column_one[:] = ... would still work.
>>>>>>>
>>>>>>
>>>>>> I was initially thinking about this as well. In the end, I didn't
>>>>>> (yet) try to implement this, because while thinking it through, it seemed
>>>>>> that this might give quite some tricky cases. Consider the following
>>>>>> example:
>>>>>>
>>>>>> df = pd.DataFrame(..)
>>>>>> df_subset = df[["col1", "col2"]]
>>>>>> s1 = df["col1"]
>>>>>> s1_subset = s1[0:3]
>>>>>> # modifying s1 should modify df, but not df_subset and s1_subset?
>>>>>> s1[0] = 0
>>>>>>
>>>>>> If we take "only accessing a single Series from a DataFrame is a
>>>>>> view, everything else uses copy-on-write", that gives rise to questions
>>>>>> like the above where some parents/children get modified, and some not.
>>>>>> This is both harder to explain to users and harder to implement. For
>>>>>> the implementation of the proof-of-concept, the copy-on-write happens
>>>>>> "locally" in the series/dataframe that gets modified (meaning: when
>>>>>> modifying a given object, its internal array data first gets copied and
>>>>>> replaced *if* the object is viewing another or is being viewed by another
>>>>>> object). While in the above case, modifying a given object would need to
>>>>>> trigger a copy in other (potentially many) objects, and not in the object
>>>>>> being modified. It's probably possible to implement this, but certainly
>>>>>> harder/trickier to do.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Jul 16, 2021 at 5:26 AM Joris Van den Bossche <
>>>>>>>> jorisvandenbossche at gmail.com> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, 12 Jul 2021 at 00:58, Joris Van den Bossche <
>>>>>>>>> jorisvandenbossche at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Short summary of the proposal:
>>>>>>>>>>
>>>>>>>>>>    1. The result of *any* indexing operation (subsetting a
>>>>>>>>>>    DataFrame or Series in any way) or any method returning a new DataFrame,
>>>>>>>>>>    always *behaves as if it were* a copy in terms of user API.
>>>>>>>>>>
>>>>>>>>>>  To explicitly call out the column-as-Series case (since this is
>>>>>>>>> a typical case that right now *always* is a view): "any" indexing
>>>>>>>>> operation thus also includes accessing a DataFrame column as a Series (or
>>>>>>>>> slicing a Series).
>>>>>>>>>
>>>>>>>>> So something like s = df["col"] and then mutating s will no
>>>>>>>>> longer update df. Similarly for series_subset = series[1:5],
>>>>>>>>> mutating series_subset will no longer update series.
>>