From shishaozhong at gmail.com Wed Dec 1 08:37:41 2021 From: shishaozhong at gmail.com (Shaozhong SHI) Date: Wed, 1 Dec 2021 13:37:41 +0000 Subject: [Pandas-dev] Is there a pandas_read_gml available? Message-ID: How to read gml into pandas data frame? Is there a pandas_read_gml available? Regards, David -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Wed Dec 1 08:42:26 2021 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Wed, 1 Dec 2021 14:42:26 +0100 Subject: [Pandas-dev] Is there a pandas_read_gml available? In-Reply-To: References: Message-ID: Hi David, I think that the Fiona library should be able to read GML files ( https://stackoverflow.com/questions/53249561/is-it-possible-to-read-gml-or-kml-files-with-fiona), and if that's the case, you can use GeoPandas to read that directly into a dataframe ( https://geopandas.org/en/stable/docs/reference/api/geopandas.read_file.html) Best, Joris On Wed, 1 Dec 2021 at 14:38, Shaozhong SHI wrote: > How to read gml into pandas data frame? > > Is there a pandas_read_gml available? > > Regards, David > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shishaozhong at gmail.com Wed Dec 1 08:50:40 2021 From: shishaozhong at gmail.com (Shaozhong SHI) Date: Wed, 1 Dec 2021 13:50:40 +0000 Subject: [Pandas-dev] Is there a pandas_read_gml available? In-Reply-To: References: Message-ID: Hi, Joris, Many thanks. Which version of Python, Pandas and fiona should I use? At the moment, I got Python 3.6.5 and Pandas 1.1.5 on. Regards, David On Wed, 1 Dec 2021 at 13:42, Joris Van den Bossche < jorisvandenbossche at gmail.com> wrote: > Hi David, > > I think that the Fiona library should be able to read GML files ( > https://stackoverflow.com/questions/53249561/is-it-possible-to-read-gml-or-kml-files-with-fiona), > and if that's the case, you can use GeoPandas to read that directly into a > dataframe ( > https://geopandas.org/en/stable/docs/reference/api/geopandas.read_file.html > ) > > Best, > Joris > > On Wed, 1 Dec 2021 at 14:38, Shaozhong SHI wrote: > >> How to read gml into pandas data frame? >> >> Is there a pandas_read_gml available? >> >> Regards, David >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Wed Dec 1 08:56:00 2021 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Wed, 1 Dec 2021 14:56:00 +0100 Subject: [Pandas-dev] Is there a pandas_read_gml available? In-Reply-To: References: Message-ID: I don't know myself (didn't use it for reading GML), but based on the comments in the StackOverflow question, you need at minimum Fiona 1.8.4. I think the python and pandas version you mention will be fine. Best, Joris On Wed, 1 Dec 2021 at 14:50, Shaozhong SHI wrote: > Hi, Joris, > > Many thanks. Which version of Python, Pandas and fiona should I use? > > At the moment, I got Python 3.6.5 and Pandas 1.1.5 on. 
> > Regards, David > > > On Wed, 1 Dec 2021 at 13:42, Joris Van den Bossche < > jorisvandenbossche at gmail.com> wrote: > >> Hi David, >> >> I think that the Fiona library should be able to read GML files ( >> https://stackoverflow.com/questions/53249561/is-it-possible-to-read-gml-or-kml-files-with-fiona), >> and if that's the case, you can use GeoPandas to read that directly into a >> dataframe ( >> https://geopandas.org/en/stable/docs/reference/api/geopandas.read_file.html >> ) >> >> Best, >> Joris >> >> On Wed, 1 Dec 2021 at 14:38, Shaozhong SHI >> wrote: >> >>> How to read gml into pandas data frame? >>> >>> Is there a pandas_read_gml available? >>> >>> Regards, David >>> _______________________________________________ >>> Pandas-dev mailing list >>> Pandas-dev at python.org >>> https://mail.python.org/mailman/listinfo/pandas-dev >>> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From shishaozhong at gmail.com Thu Dec 2 05:00:17 2021 From: shishaozhong at gmail.com (Shaozhong SHI) Date: Thu, 2 Dec 2021 10:00:17 +0000 Subject: [Pandas-dev] Pandas_read_json produces one single columns of nested dictionary Message-ID: Pandas_read_json produces one single columns of nested dictionary. How best to convert this into a dataframe? Regards, David -------------- next part -------------- An HTML attachment was scrubbed... URL: From jldxgaoyu at 126.com Sat Dec 4 08:55:40 2021 From: jldxgaoyu at 126.com (=?utf-8?B?6auY546J?=) Date: Sat, 4 Dec 2021 21:55:40 +0800 Subject: [Pandas-dev] I need pandas pakge Message-ID: <6A9D76BF-A499-47F1-9D6F-1AC1801A6A7E@126.com> hi pandas -dev? I want this version of the pandas package with pandas1.3.4 for Mac osx_11_inter or Mac osx_11_x86_64. The package I downloaded does not comply with pandas-1.3.4-cp310-cp310-macosx. thank you!! From jorisvandenbossche at gmail.com Tue Dec 7 10:24:55 2021 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Tue, 7 Dec 2021 16:24:55 +0100 Subject: [Pandas-dev] December 2021 monthly community meeting (Wednesday December 8, UTC 18:00) Message-ID: Hi all, A reminder that the next monthly dev call is tomorrow (Wednesday, December 8) at 18:00 UTC (12am Central). Our calendar is at https://pandas.pydata.org/docs/development/meeting.html#calendar to check your local time. All are welcome to attend! Video Call: https://us06web.zoom.us/j/84484803210?pwd=TjUxNmcyNHcvcG9SNGJvbE53Y21GZz09 Minutes: https://docs.google.com/document/u/1/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit?ouid=102771015311436394588&usp=docs_home&ths=true Joris -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Tue Dec 7 11:30:02 2021 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Tue, 7 Dec 2021 17:30:02 +0100 Subject: [Pandas-dev] Decoupling type stubs for the public API from the pandas distribution In-Reply-To: References: Message-ID: Hi Irv, I am not very familiar with the typing space so some questions below. Can you explain a bit more what would be the consequence of the type annotations in pandas itself? I suppose we wouldn't remove those? (we also have type annotations for non-public APIs) Or how would those be kept in sync? Another question: what is the main advantage for doing so? I suppose this doesn't make it necessarily easier for the user, but is the goal the make the type stubs better maintainable? Would the type-stubs package be for a specific pandas version (and get somewhat synced releases?) 
Joris On Tue, 23 Nov 2021 at 17:22, Irv Lustig wrote: > I discovered this feature of typing: > https://www.python.org/dev/peps/pep-0561/#stub-only-packages > > The idea is that for a package like pandas, we can have a separate package > "pandas-stubs" that would contain the type stubs for pandas. We wouldn't > have to worry about including a `py.typed` file or `.pyi` files in our > standard pandas distribution - all typing for the public API would be in > the separate package. That would allow pandas typing for the public API to > be maintained separately (different GitHub repo). We could start by just > copying over what Microsoft created at > https://github.com/microsoft/python-type-stubs/tree/main/pandas and then > we maintain it as a separate repo, which could be installed via pip and > conda. > > Any thoughts on whether we should consider doing this? > > -Irv > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From irv at princeton.com Tue Dec 7 11:59:13 2021 From: irv at princeton.com (Irv Lustig) Date: Tue, 7 Dec 2021 11:59:13 -0500 Subject: [Pandas-dev] Decoupling type stubs for the public API from the pandas distribution In-Reply-To: References: Message-ID: > > Can you explain a bit more what would be the consequence of the type > annotations in pandas itself? We would keep the type annotations in pandas for maintaining the pandas code (i.e., type checking the code that is written by pandas developers), but not have to worry about typing the public API in conjunction with maintaining the internal typing. They could evolve separately, if needed. > I suppose we wouldn't remove those? (we also have type annotations for > non-public APIs) Or how would those be kept in sync? > That's not entirely clear to me, but I would say that whenever the public API changes, then the pandas-stubs project would get updated. > Another question: what is the main advantage for doing so? I suppose this > doesn't make it necessarily easier for the user, but is the goal the make > the type stubs better maintainable? > > To me, the advantages are: 1. Maintainability - we just have to publish stubs for the public API and not any internal routines, and in some sense, the published stubs are a check for that API 2. Tests - we can develop a set of tests that test the type stubs independent of all the other tests we do 3. Reconciling Issues - with a separate project, any issues with the type stubs for the public API would be in a different GitHub project, which people who consume the API could contribute to, without having to worry about dealing with the full pandas code base, setting up a dev environment, etc. 4. Faster release schedule - because the type stubs code base would be small, as issues/PRs are reconciled, it could be released on a more regular basis, rather than waiting for a full pandas release. Regarding my comments (3) and (4) - I have been regularly contributing PRs to the Microsoft stubs that are included with Visual Studio Code https://github.com/microsoft/python-type-stubs/tree/main/pandas when I find issues with code that I write or members of my team write that doesn't pass the VS Code pyright basic type checks. Being able to do so without waiting for a full pandas release is very helpful! 
Since pylance in VS Code gets updated every week or two, that means that any changes in the type stubs that were approved by the maintainers end up getting released pretty quickly (and automatically updated). Would the type-stubs package be for a specific pandas version (and get > somewhat synced releases?) I think we would sync it with minor releases, but not patch releases, since the public API shouldn't change in a patch release. I'd like to discuss this in the pandas dev meeting. Marco also pointed me to another set of stubs at https://github.com/VirtusLab/pandas-stubs . That latter project has a nice blog about how they created their stubs here: https://medium.com/virtuslab/pandas-stubs-how-we-enhanced-pandas-with-type-annotations-1f69ecf1519e There is also https://github.com/predictive-analytics-lab/data-science-types/tree/master/pandas-stubs -Irv On Tue, Dec 7, 2021 at 11:30 AM Joris Van den Bossche < jorisvandenbossche at gmail.com> wrote: > Hi Irv, > > I am not very familiar with the typing space so some questions below. > > Can you explain a bit more what would be the consequence of the type > annotations in pandas itself? I suppose we wouldn't remove those? (we also > have type annotations for non-public APIs) Or how would those be kept in > sync? > > Another question: what is the main advantage for doing so? I suppose this > doesn't make it necessarily easier for the user, but is the goal the make > the type stubs better maintainable? > Would the type-stubs package be for a specific pandas version (and get > somewhat synced releases?) > > Joris > > On Tue, 23 Nov 2021 at 17:22, Irv Lustig wrote: > >> I discovered this feature of typing: >> https://www.python.org/dev/peps/pep-0561/#stub-only-packages >> >> The idea is that for a package like pandas, we can have a separate >> package "pandas-stubs" that would contain the type stubs for pandas. We >> wouldn't have to worry about including a `py.typed` file or `.pyi` files in >> our standard pandas distribution - all typing for the public API would be >> in the separate package. That would allow pandas typing for the public API >> to be maintained separately (different GitHub repo). We could start by >> just copying over what Microsoft created at >> https://github.com/microsoft/python-type-stubs/tree/main/pandas and then >> we maintain it as a separate repo, which could be installed via pip and >> conda. >> >> Any thoughts on whether we should consider doing this? >> >> -Irv >> >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Tue Dec 7 13:01:30 2021 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Tue, 7 Dec 2021 19:01:30 +0100 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: <807A8451-2547-4891-95F3-B1211496CEA6@xs4all.nl> <1462664690.2963025.1591907360318@mail.yahoo.com> Message-ID: Another update on this topic: over the last weeks I have been updating the status of this project (and fixing some regressions), and rerunning the benchmarks. You can find an overview of the results of our ASV benchmarks at https://github.com/pandas-dev/pandas/issues/39146#issuecomment-988002256. 
Some general points about those benchmark results: - The cases that show big slowdows are mostly related with cases where we do `df.values` or equivalent, i.e. converting the DataFrame to a single 2D array (`.values`, `to_numpy`, `transpose`, ..). Another subset of cases involve row-wise operations (reductions with axis=1, selecting a single row as a Series). I think those are the expected cases where a 1D-column store will always be slower. - Many of our ASV benchmarks use wide dataframes (eg an often-used shape is (1000, 1000), so a square dataframe). While it's of course important to cover this, I also think this is not the most common shape of dataframes, and in any case is giving a bit a biased view. - Our ASV benchmarks are mostly micro-benchmarks, or at least benchmarks that at most take up to 1 to 100 ms in general (by using small enough data to limit the runtime to this). While this is important to keep this benchmark suite usable, it also has the consequence that many of those benchmarks are partly or largely measuring "overhead" which doesn't necessarily increase while increasing the data size (more rows). The ArrayManager will typically increase this overhead, but as long as this overhead is in the "milliseconds" range, it does not necessarily have much influence on larger data workflows (depending on the exact workflow of course). Overall, I find the results quite reassuring: it identifies the cases where a slowdown is to be expected (and we will need to judge whether we find this acceptable), highlight some areas that can use improvement, and also shows that many of the benchmarks are not (or not much) impacted. But I think it also shows that we will need to seek more real-world feedback, either by constructing some macro benchmarks, or by getting user feedback from their real-world workflows. For the first option (macro benchmarks), I quickly cleaned up and pushed an experiment I did over a year ago, which is to run one query of one of the industry-standard benchmark suites (TPC) using pandas ( https://nbviewer.org/github/jorisvandenbossche/pandas-benchmarks/blob/main/tpc-ds/query-1.ipynb#Time-the-full-query). This shows basically no difference between BlockManager vs ArrayManager. This if of course also only one single workflow (with narrow long dataframes, doing mostly groupby and merge, and the overall time is dominated by eg the factorize algos, which isn't affected by the dataframe layout), but this is something we could maybe expand with other benchmark cases. --- We now have a prototype implementation people can experiment with + we have an overview of ASV benchmark results. Given this, I think it is a good point to discuss again how we want to move forward with this, and whether we want to communicate the _intent_ to make this the default in some next pandas version (emphasizing "intent", since it will always depend on the feedback we get). Joris On Wed, 7 Apr 2021 at 16:28, Joris Van den Bossche < jorisvandenbossche at gmail.com> wrote: > And to give another update on this topic: the development branch of pandas > now contains an experimental version of this "columnar store" (using an > ArrayManager class instead of the BlockManager under the hood, which stores > the columns as a list of 1D arrays), which is almost feature-complete (the > biggest missing links are JSON and PyTables IO). 
> > At the moment, there is an option to enable it for experimenting with it > (not yet documented, as it might still see behaviour changes): > > # set the default manager to ArrayManager > pd.options.mode.data_manager = "array" > > # when creating a DataFrame, you will now get one with an ArrayManager > instead of BlockManager > df = pd.DataFrame(...) > df = pd.read_csv(...) > > There are still some remaining work items (more IO, ironing out some known > bugs/todo's, checking performance), see > https://github.com/pandas-dev/pandas/issues/39146 to keep track of this. > > Best, > Joris > > On Tue, 9 Feb 2021 at 19:17, Joris Van den Bossche < > jorisvandenbossche at gmail.com> wrote: > >> >> On Mon, 31 Aug 2020 at 16:20, Joris Van den Bossche < >> jorisvandenbossche at gmail.com> wrote: >> >>> >>> >>> On Fri, 12 Jun 2020 at 22:34, Joris Van den Bossche < >>> jorisvandenbossche at gmail.com> wrote: >>> >>>> On Thu, 11 Jun 2020 at 23:35, Brock Mendel >>>> wrote: >>>> >>>>> > We actually *have* prototypes: the prototype of the split-policy >>>>> discussed >>>>> >>>>> AFAICT that is a 5 year old branch. Is there a version of this based >>>>> off of master that you can show asv results for? >>>>> >>>>> A correction here: that branch has been updated several times over the >>>> last 5 years, and a last time two weeks ago when I started this thread, as >>>> I explained in the github issue comment I linked to: >>>> https://github.com/pandas-dev/pandas/issues/10556#issuecomment-633703160 >>>> >>>> >>>>> > Also, if performance is in the end the decisive criterion, I repeat >>>>> my earlier remark in this thread: we need to be clearer about what we want >>>>> / expect. >>>>> >>>>> In principle, this is pretty much exactly what the asvs are supposed >>>>> to represent. >>>>> >>>> >>>> Well, I am repeating myself .. but I already mentioned that I am not >>>> sure ASV is fully useful for this, as that requires a complete working >>>> replacement, which is IMO too much to ask for an initial prototype. >>>> >>>> But OK, the message is clear: we need a more concrete implementation / >>>> prototype. So let's put this discussion aside for a moment, and focus on >>>> that instead. I will try to look at that in the coming weeks, but any help >>>> is welcome (and I will try to get it running with ASV, or at least a part >>>> of it). >>>> >>>> >>> To come back to this: I cleaned up a proof-of-concept implementation >>> that I started after the above discussed, and put it in a PR to >>> view/discuss: https://github.com/pandas-dev/pandas/pull/36010 >>> >>> >> >> Another follow-up: the proof-of-concept now is merged in the master >> branch, and I am currently working on making it more feature complete (see >> https://github.com/pandas-dev/pandas/issues/39146 for an overview issue) >> >> Joris >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mszymkiewicz at gmail.com Tue Dec 7 14:47:06 2021 From: mszymkiewicz at gmail.com (Maciej) Date: Tue, 7 Dec 2021 20:47:06 +0100 Subject: [Pandas-dev] Decoupling type stubs for the public API from the pandas distribution In-Reply-To: References: Message-ID: <2b98f319-ea1c-8643-970c-b03971019725@gmail.com> Hi all, Just my two cents On 12/7/21 17:59, Irv Lustig wrote: > > Can you explain a bit more what would be the consequence of the > type annotations in pandas itself? 
> > > We would keep the type annotations in pandas for maintaining the > pandas code (i.e., type checking the code that is written by pandas > developers), but not have to worry about typing the public API in > conjunction with maintaining the internal typing.? They could evolve > separately, if needed. Meaningful type checking of internals typically requires well annotated public API ? every method which is not properly annotated, especially on the return side, introduces quickly escalating gaps in coverage. Unfortunately, there is no easy and robust way to combine multiple sources of annotations, so it is likely you'll still have to keep "public" API annotated alongside with "internal" parts. That introduces another problem ? if annotations for the public API diverge from external stubs, it is likely to be a source of confusion for the end users. > ? > > I suppose we wouldn't remove those? (we also have type annotations > for non-public APIs) Or how would those be kept in sync? > > > That's not entirely clear to me, but I would say that whenever the > public API changes, then the pandas-stubs project would get updated.? Keeping things in sync in a long run is quite hard (speaking as a long term maintainer of PySpark stubs), and some parts can be automated (i.e. checking for changes in automatically extracted signatures) and it is easy to miss changes that are not immediately visible in the signatures (i.e. subtle changes in types of accepted arguments and return type). Furthermore, (that observations is based mostly on some proprietary work) relationship between annotated code and annotations is not unidirectional ? how we annotate (and the same code can be annotated in different, but still valid ways) affects how you design your APIs. It is also not hard to create functions with signatures that are impossible to annotate. > Another question: what is the main advantage for doing so? I > suppose this doesn't make it necessarily easier for the user, but > is the goal the make the type stubs better maintainable? > > > To me, the advantages are: > 1. Maintainability - we just have to publish stubs for the public API > and not any internal routines, and in some sense, the published stubs > are a check for that API > 2. Tests - we can develop a set of tests that test the type stubs > independent of all the other tests we do > 3.? Reconciling Issues - with a separate project, any issues with the > type stubs for the public API would be in a different GitHub project, > which people who consume the API could contribute to, without having > to worry about dealing with the full pandas code base, setting up a > dev environment, etc. > 4.? Faster release schedule - because the type stubs code base would > be small, as issues/PRs are reconciled, it could be released on a more > regular basis, rather than waiting for a full pandas release. These are really good points, especially when annotation effort is new. However, once annotations mature, there is really not much added value here.? What's worse, if upstream API is evolving, you'll likely to face a problem of versioning ? which version of stubs is matching which version of the upstream package. That might require parallel versioning with version branches in the worst case scenario. 
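(To make the coverage point from the top of this mail concrete, here is a toy example rather than actual pandas code, with made-up names:

import pandas as pd

def load_table(n):  # a "public" helper that is missing a return annotation
    return pd.DataFrame({"a": range(n)})

def total(n: int) -> int:
    df = load_table(n)  # mypy will treat df as Any here...
    return df.sum()     # ...so this body is effectively unchecked, and the
                        # bug that sum() returns a Series, not an int, is missed

Every such gap in the public surface silently disables checking further down the call chain, which is why the "internal" and "public" annotations are hard to keep fully separate.)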
> Regarding my comments (3) and (4) - I have been regularly contributing > PRs to the Microsoft stubs that are included with Visual Studio > Code?https://github.com/microsoft/python-type-stubs/tree/main/pandas > when I find issues with code that I write or members of my team write > that doesn't pass the VS Code pyright basic type checks.? Being able > to do so without waiting for a full pandas release is very helpful!? > Since pylance in VS Code gets updated every week or two, that means > that any changes in the type stubs that were approved by the > maintainers end up getting released pretty quickly (and automatically > updated). > > Would the type-stubs package be for a specific pandas version (and > get somewhat synced releases?) > > > I think we would sync it with minor releases, but not patch releases, > since the public API shouldn't change in a patch release. > > I'd like to discuss this in the pandas dev meeting.? Marco also > pointed me to another set of stubs > at?https://github.com/VirtusLab/pandas-stubs .? That latter project > has a nice blog about how they created their stubs here:? > https://medium.com/virtuslab/pandas-stubs-how-we-enhanced-pandas-with-type-annotations-1f69ecf1519e > > There is > also?https://github.com/predictive-analytics-lab/data-science-types/tree/master/pandas-stubs > > -Irv > > > > On Tue, Dec 7, 2021 at 11:30 AM Joris Van den Bossche > wrote: > > Hi Irv, > > I am not very familiar with the typing space so some questions below. > > Can you explain a bit more what would be the consequence of the > type annotations in pandas itself? I suppose we wouldn't remove > those? (we also have type annotations for non-public APIs) Or how > would those be kept in sync? > > Another question: what is the main advantage for doing so? I > suppose this doesn't make it necessarily easier for the user, but > is the goal the make the type stubs better maintainable? > Would the type-stubs package be for a specific pandas version (and > get somewhat synced releases?) > > Joris > > On Tue, 23 Nov 2021 at 17:22, Irv Lustig wrote: > > I discovered this feature of typing: > https://www.python.org/dev/peps/pep-0561/#stub-only-packages > > The idea is that for a package like pandas, we can have a > separate package "pandas-stubs" that would contain the type > stubs for pandas.? We wouldn't have to worry about including a > `py.typed` file or `.pyi` files in our standard pandas > distribution - all typing for the public API would be in the > separate package.? That would allow pandas typing?for the > public API to be maintained separately (different GitHub > repo).? We could start by just copying over what Microsoft > created > at?https://github.com/microsoft/python-type-stubs/tree/main/pandas > and then we maintain it as a separate repo, which could be > installed via pip and conda. > > Any thoughts on whether we should consider doing this? > > -Irv > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev -- Best regards, Maciej Szymkiewicz Web: https://zero323.net Keybase: https://keybase.io/zero323 Gigs: https://www.codementor.io/@zero323 PGP: A30CEF0C31A501EC -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: OpenPGP_signature Type: application/pgp-signature Size: 840 bytes Desc: OpenPGP digital signature URL: From irv at princeton.com Thu Dec 9 18:16:21 2021 From: irv at princeton.com (Irv Lustig) Date: Thu, 9 Dec 2021 18:16:21 -0500 Subject: [Pandas-dev] pandas typing stubs meeting January 7, 2022, Noon Eastern Message-ID: I've connected with the primary author of the Microsoft supplied pandas typing stubs ( https://github.com/microsoft/python-type-stubs/tree/main/pandas) and the primary maintainers of the pandas-stubs package ( https://github.com/VirtusLab/pandas-stubs) and due to the holidays and their schedules, we will meet in the new year on Friday, January 7, 2022, Noon Eastern Time, 5PM UTC. The meeting has been added to the pandas development calendar visible at https://pandas.pydata.org/docs/development/meeting.html Join Zoom Meeting https://us02web.zoom.us/j/89675119530?pwd=emhQQXhuVy9KeHFjZTZhQ0plN3JpUT09 Meeting ID: 896 7511 9530 Passcode: 136973 One tap mobile +13017158592,,89675119530# US (Washington DC) +13126266799,,89675119530# US (Chicago) I will send another reminder out a couple of days before the meeting. -Irv Lustig (Dr-Irv) -------------- next part -------------- An HTML attachment was scrubbed... URL: From simeon.simeonov.s at gmail.com Mon Dec 13 07:14:35 2021 From: simeon.simeonov.s at gmail.com (Simeon Simeonov) Date: Mon, 13 Dec 2021 12:14:35 +0000 Subject: [Pandas-dev] Pandas astype() changes the class type Message-ID: Hi all, I saw this behaviour and I don't know if this is a bug or feature. I don't have much experience with directly inheriting from pandas.DataFrame as I've always preferred aggregation rather than inheritance there. A working sample is pasted below. Notice how *df.astype(dtypes)* changes the type to pandas.DataFrame. Any suggestions if this is intended behaviour? import pandas as pd class DF(pd.DataFrame): @property def _constructor(self): return self.__class__ df = DF({ 'A': [1,2,3], 'B': [10,20,30], 'C': [100,200,300], }) # Type is DF a = df['A'] # type is Series ab = df[['A', 'B']] # type is DF dtypes = {'A': 'float64', 'B': 'float64', 'C': 'float64'} x = df.astype(dtypes) type(x) # type is pd.DataFrame Regards, Simeon -------------- next part -------------- An HTML attachment was scrubbed... URL: From simonjayhawkins at gmail.com Tue Dec 14 07:37:45 2021 From: simonjayhawkins at gmail.com (Simon Hawkins) Date: Tue, 14 Dec 2021 12:37:45 +0000 Subject: [Pandas-dev] ANN: pandas v1.3.5 Message-ID: Hi all, I'm pleased to announce the release of pandas v1.3.5. This is a patch release in the 1.3.x series and includes some regression fixes. We recommend that all users upgrade to this version. See the release notes for a list of all the changes. The release can be installed from PyPI python -m pip install --upgrade pandas==1.3.5 Or from conda-forge conda install -c conda-forge pandas==1.3.5 Please report any issues with the release on the pandas issue tracker . Thanks to all the contributors who made this release possible. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Wed Dec 15 04:30:27 2021 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Wed, 15 Dec 2021 10:30:27 +0100 Subject: [Pandas-dev] Pandas astype() changes the class type In-Reply-To: References: Message-ID: Hi Simeon, This is a somewhat known issue with astype(), and more in general related to the behaviour of concat dealing with subclasses. 
For example, in GeoPandas, we override astype() for this reason to ensure a proper return type: https://github.com/geopandas/geopandas/blob/ee8adfb27659e9f982ba8cdadbf62c6b36dcc053/geopandas/geodataframe.py#L1694-L1718 When using astype with a dictionary of column name -> dtype, the underlying implementation casts every column separately and then uses concat to combine the columns (Series objects) back into a dataframe. However, without doing anything special in astype(), that means it relies on the logic of concat to determine the output class (which is to use the _constructor_expanddim of the first object, i.e. of the first column / Series). See https://github.com/pandas-dev/pandas/issues/35415 for some discussion about this. I think that we could add some extra logic to the astype method implementation to try to preserve the original class (by using its _constructor) after doing the concat, similarly as was done recently for the convert_dtypes() method (https://github.com/pandas-dev/pandas/pull/44249). I think a contribution (pull request) for that would certainly be welcome! Best, Joris On Mon, 13 Dec 2021 at 13:17, Simeon Simeonov wrote: > Hi all, > > I saw this behaviour and I don't know if this is a bug or feature. I don't > have much experience with directly inheriting from pandas.DataFrame as I've > always preferred aggregation rather than inheritance there. A working > sample is pasted below. Notice how *df.astype(dtypes)* changes the type > to pandas.DataFrame. Any suggestions if this is intended behaviour? > > > import pandas as pd > class DF(pd.DataFrame): @property > def _constructor(self): > return self.__class__ > > > df = DF({ > 'A': [1,2,3], > 'B': [10,20,30], > 'C': [100,200,300], > }) # Type is DF > > > a = df['A'] # type is Series > ab = df[['A', 'B']] # type is DF > > dtypes = {'A': 'float64', 'B': 'float64', 'C': 'float64'} > x = df.astype(dtypes) > type(x) # type is pd.DataFrame > > > Regards, > > Simeon > > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From irv at princeton.com Wed Dec 15 09:54:18 2021 From: irv at princeton.com (Irv Lustig) Date: Wed, 15 Dec 2021 09:54:18 -0500 Subject: [Pandas-dev] Proposal for consistent, clear copy/view semantics in pandas with Copy-on-Write Message-ID: Joris: I finally had some time to study our conversation from July, reread the Google docs proposal, and I tried out the PR as well. What I'm struggling with is how we document where behavior will change. As an example, the following sequence will give different results: Current behavior: >>> df = pd.DataFrame({"a":[10,11,12,13,14], "b": [100,101,102,103,104]}) >>> df["a"].loc[2] = 112 >>> df a b 0 10 100 1 11 101 2 112 102 3 13 103 4 14 104 New behavior: (from the PR): >>> df = pd.DataFrame({"a":[10,11,12,13,14], "b": [100,101,102,103,104]}) >>> df["a"].loc[2] = 112 >>> df a b 0 10 100 1 11 101 2 12 102 3 13 103 4 14 104 But in both cases, the following works: >>> df.loc[3,"b"] = 999 >>> df a b 0 10 100 1 11 101 2 12 102 3 13 999 4 14 104 So my concern is that if you had existing code that used the pattern df["a"].loc[2] = 112 , you'd get no warning that the behavior had changed. What I don't know is how much of code in the wild assumes the current behavior. So my questions are now: 1. How will we document, in a clean and concise way, the new behavior for people with existing pandas code? 
2. How can people find pandas code where the behavior will change? Can we list all patterns that would produce different results? Can we detect chained indexing with setitem calls? 3. I'm guessing there is lots of code where people use DataFrame.copy() to avoid the SettingWithCopy warning. Can they just remove those copies now and their code will work? I agree that for new users, this new way of doing things makes sense. I'm worried about how we make the transition easier for people with large code bases that use pandas. -Irv >> On Sat, 17 Jul 2021 at 20:51, Joris Van den Bossche < jorisvandenbossche at gmail.com> wrote: >>> >>> On Fri, 16 Jul 2021 at 20:50, Irv Lustig wrote: >>>> >>>> >>>> Tom Augspurger wrote: >>>> >>>>> I wonder if we can validate what users (new and old) *actually* expect? >>>>> Users coming from R, which IIRC implements Copy on Write for matrices, >>>>> might be OK with indexing always being (behaving like) a copy. >>>>> I'm not sure what users coming from NumPy would expect, since I don't know >>>>> how many NumPy users really understand *a**.)* when a NumPy slice is a view >>>>> or copy, and *b.) *how a pandas indexing operation translates to a NumPy >>>>> slice. >>>>> >>>> >>>> IMHO, we should concentrate on the "new" users. For my team, there is no numpy or R background. They learn pandas, and what pandas does needs to be really clear in behavior and documentation. I would also hazard a guess that most pandas users are like that - pandas is the first tool they see, not numpy or R. >>>> >>>> The places where I think confusion could happen are things like this with a DataFrame df : >>>> >>>> s = df["a"] >>>> s.iloc[3:5] = [1, 2, 3] >>>> df["a"].iloc[3:5] = [1, 2, 3] >>>> df["b"] = df["a"] >>>> df["b"].iloc[3:5] = [4, 5, 6] >>>> s2 = df["b"] >>>> df["c"] = s2 >>>> s2.iloc[3:5] = [7, 8, 9] >>>> >>>> As I understand it (please correct me if I'm wrong), these lines would be interpreted as follows with the current proposal: >>> >>> >>> It's a bit different (to reiterate, with the *current* proposal, *any* indexing operation (including series selection) behaves as a copy; and also to be clear, this is one possible proposal, there are certainly other possibilities). Answering case by case: >>> >>>> >>>> 1. s = df["a"] >>>> Creates a view into the DataFrame df. No copying is done at all >>> >>> >>> Indeed a view (but that's an implementation detail) >>> >>>> 2. s.iloc[3:5] = [1, 2, 3] >>>> Modifies the series s and the underlying DataFrame df. (copy-on-write) >>> >>> >>> Due to copy-on-write, it does *not* modify the DataFrame df. Copy-on-write means that only when s is being written to, its data get copied (so at that point breaking the view-relation with the parent df) >>> >>>> >>>> 3. df["a"].iloc[3:5] = [1, 2, 3] >>>> Modifies the dataframe >>> >>> >>> This is an example of chained assignment, which in the current proposal never works (see the example in the google doc). This is because chained assignment can always be written as: >>> >>> temp = df["a"] >>> temp.iloc[3:5] = [1, 2, 3] >>> >>> and `temp` uses copy-on-write (and then it is the same example as the one above in 2.). >>> >>> (what you describe is the current behaviour of pandas) >>> >>>> >>>> 4. df["b"] = df["a"] >>>> Copies the series from "a" to "b" >>> >>> >>> It would indeed behave as a copy, but under the hood we can actually keep this as a view (delay the copy thanks to copy-on-write). >>> >>>> >>>> 5. 
df["b"].iloc[3:5] = [4, 5, 6] >>>> Modifies "b" in the DataFrame, but not "a" >>> >>> >>> Also doesn't modify "b" (see example 3. above), but indeed does not modify "a" >>> >>>> >>>> 6. s2 = df["b"] >>>> Create a view into the DataFrame df. No copying is done at all. >>> >>> >>> Same as 1. >>> >>>> >>>> 7. df["c"] = s2 >>>> Copies the series from "b" to "c" >>> >>> >>> Same as 4. >>> >>>> >>>> 8. s2.iloc[3:5] = [7, 8, 9] >>>> Modifies s2, which modifies "b", but NOT "c" >>> >>> >>> Doesn't modify "b" and "c". Similar as 3. >>> >>>> I think the challenge is explaining the sequence 6,7,8 above in comparison to the other sequences. >>> >>> >>> So with the current proposal, the sequece 6, 7, 8 actually doesn't behave differently. But it is mainly 2 and 3 that would be quite different compared to the current pandas behaviour. >>> >>>> >>>> >>>> -Irv -------------- next part -------------- An HTML attachment was scrubbed... URL: From degilreath at tva.gov Thu Dec 16 22:39:05 2021 From: degilreath at tva.gov (Gilreath, Dalton E) Date: Fri, 17 Dec 2021 03:39:05 +0000 Subject: [Pandas-dev] Pandas Support For TVA Message-ID: Pandas support, You are receiving this message because a cybersecurity vulnerability has been identified and TVA is taking precautions to remedy any applications which may be affected. Pandas is currently being used by TVA. TVA is requesting confirmation on whether the Apache Log4j utility is present as part of this application. If Log4j is being utilized please provide the version and remediation steps your company plans on taking. Please respond to this email within 48 hours and advise if this file is present. If it is not, no further action is necessary except responding with confirmation that it does not exist. If it is present, please advise on if you have a patch available to remediate this issue. We must have a patch to continue use of this application. If no patch is available, the application must be turned off by December 24, 2021 to ensure TVA system security. Your information should be sent to me at degilreath at tva.gov. Feel free to contact me with any questions. Dalton Gilreath Manager, Product Delivery - Maximo Technology and Innovation [TVA logo] W. 423-751-8029 M. 423-785-7215 E. degilreath at tva.gov 1101 Market Street, Chattanooga, TN 37402 NOTICE: This electronic message transmission contains information that may be TVA SENSITIVE, TVA RESTRICTED, or TVA CONFIDENTIAL. Any misuse or unauthorized disclosure can result in both civil and criminal penalties. If you are not the intended recipient, be aware that any disclosure, copying, distribution, or use of the content of this information is prohibited. If you have received this communication in error, please notify me immediately by email and delete the original message. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 22996 bytes Desc: image001.png URL: From garcia.marc at gmail.com Fri Dec 17 06:20:34 2021 From: garcia.marc at gmail.com (Marc Garcia) Date: Fri, 17 Dec 2021 12:20:34 +0100 Subject: [Pandas-dev] Pandas Support For TVA In-Reply-To: References: Message-ID: Thanks for your email Dalton. We only offer this sort of support via Tidelift. If you are a subscriptor, please send us this via Tidelift. If you are not, the only options we can offer is to become one, use pandas without support, or stop using pandas. I hope this helps, cheers! 
On Fri, 17 Dec 2021, 08:18 Gilreath, Dalton E via Pandas-dev, < pandas-dev at python.org> wrote: > Pandas support, > > > > You are receiving this message because a cybersecurity vulnerability has > been identified and TVA is taking precautions to remedy any applications > which may be affected. Pandas is currently being used by TVA. TVA is > requesting confirmation on whether the Apache Log4j utility is present as > part of this application. If Log4j is being utilized please provide the > version and remediation steps your company plans on taking. > > > > *Please respond to this email within 48 hours and advise if this file is > present.* If it is not, no further action is necessary except responding > with confirmation that it does not exist. If it is present, please advise > on if you have a patch available to remediate this issue. We must have a > patch to continue use of this application. If no patch is available, the > application must be turned off by December 24, 2021 to ensure TVA system > security. > > > > Your information should be sent to me at degilreath at tva.gov. Feel free to > contact me with any questions. > > > > *Dalton Gilreath * > Manager, Product Delivery ? Maximo > Technology and Innovation > > [image: TVA logo] > > *W.* 423-751-8029 *M.* 423-785-7215 *E.* degilreath at tva.gov > 1101 Market Street, Chattanooga, TN 37402 > > > > *NOTICE: *This electronic message transmission contains information that > may be TVA SENSITIVE, TVA RESTRICTED, or TVA CONFIDENTIAL. Any misuse or > unauthorized disclosure can result in both civil and criminal penalties. If > you are not the intended recipient, be aware that any disclosure, copying, > distribution, or use of the content of this information is prohibited. If > you have received this communication in error, please notify me immediately > by email and delete the original message. > > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 22996 bytes Desc: not available URL: From wesmckinn at gmail.com Fri Dec 17 09:59:48 2021 From: wesmckinn at gmail.com (Wes McKinney) Date: Fri, 17 Dec 2021 08:59:48 -0600 Subject: [Pandas-dev] Pandas Support For TVA In-Reply-To: References: Message-ID: FWIW, I do not believe that pandas contains or depends on any Java code (log4j is a Java component). On Fri, Dec 17, 2021 at 5:20 AM Marc Garcia wrote: > > Thanks for your email Dalton. We only offer this sort of support via Tidelift. If you are a subscriptor, please send us this via Tidelift. If you are not, the only options we can offer is to become one, use pandas without support, or stop using pandas. > > I hope this helps, cheers! > > On Fri, 17 Dec 2021, 08:18 Gilreath, Dalton E via Pandas-dev, wrote: >> >> Pandas support, >> >> >> >> You are receiving this message because a cybersecurity vulnerability has been identified and TVA is taking precautions to remedy any applications which may be affected. Pandas is currently being used by TVA. TVA is requesting confirmation on whether the Apache Log4j utility is present as part of this application. If Log4j is being utilized please provide the version and remediation steps your company plans on taking. >> >> >> >> Please respond to this email within 48 hours and advise if this file is present. 
If it is not, no further action is necessary except responding with confirmation that it does not exist. If it is present, please advise on if you have a patch available to remediate this issue. We must have a patch to continue use of this application. If no patch is available, the application must be turned off by December 24, 2021 to ensure TVA system security. >> >> >> >> Your information should be sent to me at degilreath at tva.gov. Feel free to contact me with any questions. >> >> >> >> Dalton Gilreath >> Manager, Product Delivery ? Maximo >> Technology and Innovation >> >> W. 423-751-8029 M. 423-785-7215 E. degilreath at tva.gov >> 1101 Market Street, Chattanooga, TN 37402 >> >> >> >> NOTICE: This electronic message transmission contains information that may be TVA SENSITIVE, TVA RESTRICTED, or TVA CONFIDENTIAL. Any misuse or unauthorized disclosure can result in both civil and criminal penalties. If you are not the intended recipient, be aware that any disclosure, copying, distribution, or use of the content of this information is prohibited. If you have received this communication in error, please notify me immediately by email and delete the original message. >> >> >> >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev From jorisvandenbossche at gmail.com Fri Dec 17 15:04:05 2021 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Fri, 17 Dec 2021 21:04:05 +0100 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: <807A8451-2547-4891-95F3-B1211496CEA6@xs4all.nl> <1462664690.2963025.1591907360318@mail.yahoo.com> Message-ID: We have planned a video meeting about this topic next week Wednesday, December 22, at 19:00 UTC. The meeting has been added to the pandas development calendar visible at https://pandas.pydata.org/docs/development/meeting.html, and the zoom meeting link is https://us06web.zoom.us/j/81798190900?pwd=ZEo4SnlGMGZxZkVNRkpOLzg0dld3dz09 Joris On Tue, 7 Dec 2021 at 19:01, Joris Van den Bossche < jorisvandenbossche at gmail.com> wrote: > Another update on this topic: over the last weeks I have been updating the > status of this project (and fixing some regressions), and rerunning the > benchmarks. > > You can find an overview of the results of our ASV benchmarks at > https://github.com/pandas-dev/pandas/issues/39146#issuecomment-988002256. > Some general points about those benchmark results: > > - The cases that show big slowdows are mostly related with cases where we > do `df.values` or equivalent, i.e. converting the DataFrame to a single 2D > array (`.values`, `to_numpy`, `transpose`, ..). Another subset of cases > involve row-wise operations (reductions with axis=1, selecting a single row > as a Series). I think those are the expected cases where a 1D-column store > will always be slower. > - Many of our ASV benchmarks use wide dataframes (eg an often-used shape > is (1000, 1000), so a square dataframe). While it's of course important to > cover this, I also think this is not the most common shape of dataframes, > and in any case is giving a bit a biased view. 
> - Our ASV benchmarks are mostly micro-benchmarks, or at least benchmarks > that at most take up to 1 to 100 ms in general (by using small enough data > to limit the runtime to this). While this is important to keep this > benchmark suite usable, it also has the consequence that many of those > benchmarks are partly or largely measuring "overhead" which doesn't > necessarily increase while increasing the data size (more rows). The > ArrayManager will typically increase this overhead, but as long as this > overhead is in the "milliseconds" range, it does not necessarily have much > influence on larger data workflows (depending on the exact workflow of > course). > > Overall, I find the results quite reassuring: it identifies the cases > where a slowdown is to be expected (and we will need to judge whether we > find this acceptable), highlight some areas that can use improvement, and > also shows that many of the benchmarks are not (or not much) impacted. > But I think it also shows that we will need to seek more real-world > feedback, either by constructing some macro benchmarks, or by getting user > feedback from their real-world workflows. > > For the first option (macro benchmarks), I quickly cleaned up and pushed > an experiment I did over a year ago, which is to run one query of one of > the industry-standard benchmark suites (TPC) using pandas ( > https://nbviewer.org/github/jorisvandenbossche/pandas-benchmarks/blob/main/tpc-ds/query-1.ipynb#Time-the-full-query). > This shows basically no difference between BlockManager vs ArrayManager. > This if of course also only one single workflow (with narrow long > dataframes, doing mostly groupby and merge, and the overall time is > dominated by eg the factorize algos, which isn't affected by the dataframe > layout), but this is something we could maybe expand with other benchmark > cases. > > --- > > We now have a prototype implementation people can experiment with + we > have an overview of ASV benchmark results. Given this, I think it is a good > point to discuss again how we want to move forward with this, and whether > we want to communicate the _intent_ to make this the default in some next > pandas version (emphasizing "intent", since it will always depend on the > feedback we get). > > Joris > > > On Wed, 7 Apr 2021 at 16:28, Joris Van den Bossche < > jorisvandenbossche at gmail.com> wrote: > >> And to give another update on this topic: the development branch of >> pandas now contains an experimental version of this "columnar store" (using >> an ArrayManager class instead of the BlockManager under the hood, which >> stores the columns as a list of 1D arrays), which is almost >> feature-complete (the biggest missing links are JSON and PyTables IO). >> >> At the moment, there is an option to enable it for experimenting with it >> (not yet documented, as it might still see behaviour changes): >> >> # set the default manager to ArrayManager >> pd.options.mode.data_manager = "array" >> >> # when creating a DataFrame, you will now get one with an ArrayManager >> instead of BlockManager >> df = pd.DataFrame(...) >> df = pd.read_csv(...) >> >> There are still some remaining work items (more IO, ironing out some >> known bugs/todo's, checking performance), see >> https://github.com/pandas-dev/pandas/issues/39146 to keep track of this. 
>> >> Best, >> Joris >> >> On Tue, 9 Feb 2021 at 19:17, Joris Van den Bossche < >> jorisvandenbossche at gmail.com> wrote: >> >>> >>> On Mon, 31 Aug 2020 at 16:20, Joris Van den Bossche < >>> jorisvandenbossche at gmail.com> wrote: >>> >>>> >>>> >>>> On Fri, 12 Jun 2020 at 22:34, Joris Van den Bossche < >>>> jorisvandenbossche at gmail.com> wrote: >>>> >>>>> On Thu, 11 Jun 2020 at 23:35, Brock Mendel >>>>> wrote: >>>>> >>>>>> > We actually *have* prototypes: the prototype of the split-policy >>>>>> discussed >>>>>> >>>>>> AFAICT that is a 5 year old branch. Is there a version of this based >>>>>> off of master that you can show asv results for? >>>>>> >>>>>> A correction here: that branch has been updated several times over >>>>> the last 5 years, and a last time two weeks ago when I started this thread, >>>>> as I explained in the github issue comment I linked to: >>>>> https://github.com/pandas-dev/pandas/issues/10556#issuecomment-633703160 >>>>> >>>>> >>>>>> > Also, if performance is in the end the decisive criterion, I repeat >>>>>> my earlier remark in this thread: we need to be clearer about what we want >>>>>> / expect. >>>>>> >>>>>> In principle, this is pretty much exactly what the asvs are supposed >>>>>> to represent. >>>>>> >>>>> >>>>> Well, I am repeating myself .. but I already mentioned that I am not >>>>> sure ASV is fully useful for this, as that requires a complete working >>>>> replacement, which is IMO too much to ask for an initial prototype. >>>>> >>>>> But OK, the message is clear: we need a more concrete implementation / >>>>> prototype. So let's put this discussion aside for a moment, and focus on >>>>> that instead. I will try to look at that in the coming weeks, but any help >>>>> is welcome (and I will try to get it running with ASV, or at least a part >>>>> of it). >>>>> >>>>> >>>> To come back to this: I cleaned up a proof-of-concept implementation >>>> that I started after the above discussed, and put it in a PR to >>>> view/discuss: https://github.com/pandas-dev/pandas/pull/36010 >>>> >>>> >>> >>> Another follow-up: the proof-of-concept now is merged in the master >>> branch, and I am currently working on making it more feature complete (see >>> https://github.com/pandas-dev/pandas/issues/39146 for an overview issue) >>> >>> Joris >>> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Sun Dec 19 16:37:30 2021 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Sun, 19 Dec 2021 22:37:30 +0100 Subject: [Pandas-dev] Proposal for consistent, clear copy/view semantics in pandas with Copy-on-Write In-Reply-To: References: Message-ID: Thanks for testing the branch and the feedback, Irv! Related to your concern about how users will know or get notified about behaviour that will change: the branch you tested is a proof-of-concept for the *final* behaviour, and so I didn't (yet) add warnings for such cases. So that's the simple reason why a case like df["a"].loc[2] = 112 didn't trigger a warning. But I agree that this is important, and it's certainly the idea that we will have a pandas release (before actually changing the behaviour) where the cases like above that will change behaviour trigger a deprecation warning about this. We will need to see a bit how to implement this, though, and it might become quite complex. But if we are convinced that the final behaviour is better, I think this is certainly worth it (and only temporary). 
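As a rough sketch of what such a warning would need to catch, reusing your example (illustrative only, this is not something the proof-of-concept branch does today):

import pandas as pd

df = pd.DataFrame({"a": [10, 11, 12, 13, 14], "b": [100, 101, 102, 103, 104]})

# chained assignment: modifies df with the current behaviour, but is a
# no-op on df under copy-on-write, so the transition release should emit
# a deprecation warning for this statement
df["a"].loc[2] = 112

# single-step .loc assignment: behaves the same before and after the
# change, and is the recommended replacement
df.loc[2, "a"] = 112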
On Wed, 15 Dec 2021 at 15:54, Irv Lustig wrote: > Joris: > I finally had some time to study our conversation from July, reread the > Google docs proposal, and I tried out the PR as well. > > What I'm struggling with is how we document where behavior will change. > As an example, the following sequence will give different results: > > Current behavior: > >>> df = pd.DataFrame({"a":[10,11,12,13,14], "b": [100,101,102,103,104]}) > >>> df["a"].loc[2] = 112 > >>> df > a b > 0 10 100 > 1 11 101 > 2 112 102 > 3 13 103 > 4 14 104 > > > New behavior: (from the PR): > >>> df = pd.DataFrame({"a":[10,11,12,13,14], "b": [100,101,102,103,104]}) > >>> df["a"].loc[2] = 112 > >>> df > a b > 0 10 100 > 1 11 101 > 2 12 102 > 3 13 103 > 4 14 104 > > But in both cases, the following works: > > >>> df.loc[3,"b"] = 999 > >>> df > a b > 0 10 100 > 1 11 101 > 2 12 102 > 3 13 999 > 4 14 104 > > So my concern is that if you had existing code that used the pattern df["a"].loc[2] > = 112 , you'd get no warning that the behavior had changed. What I don't > know is how much of code in the wild assumes the current behavior. > > So my questions are now: > 1. How will we document, in a clean and concise way, the new behavior for > people with existing pandas code? > Given that the new behaviour makes more sense than the current behaviour (in my opinion, and I think yours as well based on your email), it should be actually be easier to properly document it :) But joking aside, yes, we will certainly need to put effort in creating a very good set of documentation on this topic (the google doc could be a starting point). > 2. How can people find pandas code where the behavior will change? Can we > list all patterns that would produce different results? Can we detect > chained indexing with setitem calls? > The documentation can certainly list lots of patterns, but is of course always based on examples. As mentioned above, I think we should be able to catch most / all cases in setitem where behaviour will change, and trigger a warning about this. This will be quite some work (probably even more than the actual implementation that I currently did), but I am convinced this is possible and worth it. > 3. I'm guessing there is lots of code where people use DataFrame.copy() to > avoid the SettingWithCopy warning. Can they just remove those copies now > and their code will work? > Yes, I think so. Especially if you did "copy" for avoiding the warning, you were never modifying the original parent dataframe, which will become the default/automatic behaviour with the proposal. > I agree that for new users, this new way of doing things makes sense. I'm > worried about how we make the transition easier for people with large code > bases that use pandas. > It's indeed a big change, that will impact quite some people, and can be a big task to update for large code bases. So I think we need to take care about this and really put effort in this aspect: ensuring we have good deprecation warnings, a very good migration guide, reach out to (big) users to check how the migration goes so we can improve this migration path, etc. This is a lot of work of course, but I think a necessity if we want this to be a success, and we also have some funding from the CZI grant specifically for this aspect of the larger roadmap items. 
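And to make the answer to your question 3 a bit more concrete, this is the kind of pattern I have in mind (again only a sketch against the proposal, the column names are made up):

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# today: an explicit .copy() after a selection, added purely to silence
# the SettingWithCopyWarning
sub = df[df["a"] > 1].copy()
sub["c"] = sub["a"] * 10

# with copy-on-write the selection already behaves as a copy, so the
# defensive .copy() can simply be dropped; modifying sub never propagates
# back to df, and no warning is needed
sub = df[df["a"] > 1]
sub["c"] = sub["a"] * 10  # df is unchanged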
Joris > > -Irv > > > > > >> On Sat, 17 Jul 2021 at 20:51, Joris Van den Bossche < > jorisvandenbossche at gmail.com> wrote: > >>> > >>> On Fri, 16 Jul 2021 at 20:50, Irv Lustig wrote: > >>>> > >>>> > >>>> Tom Augspurger wrote: > >>>> > >>>>> I wonder if we can validate what users (new and old) *actually* > expect? > >>>>> Users coming from R, which IIRC implements Copy on Write for > matrices, > >>>>> might be OK with indexing always being (behaving like) a copy. > >>>>> I'm not sure what users coming from NumPy would expect, since I > don't know > >>>>> how many NumPy users really understand *a**.)* when a NumPy slice is > a view > >>>>> or copy, and *b.) *how a pandas indexing operation translates to a > NumPy > >>>>> slice. > >>>>> > >>>> > >>>> IMHO, we should concentrate on the "new" users. For my team, there > is no numpy or R background. They learn pandas, and what pandas does needs > to be really clear in behavior and documentation. I would also hazard a > guess that most pandas users are like that - pandas is the first tool they > see, not numpy or R. > >>>> > >>>> The places where I think confusion could happen are things like this > with a DataFrame df : > >>>> > >>>> s = df["a"] > >>>> s.iloc[3:5] = [1, 2, 3] > >>>> df["a"].iloc[3:5] = [1, 2, 3] > >>>> df["b"] = df["a"] > >>>> df["b"].iloc[3:5] = [4, 5, 6] > >>>> s2 = df["b"] > >>>> df["c"] = s2 > >>>> s2.iloc[3:5] = [7, 8, 9] > >>>> > >>>> As I understand it (please correct me if I'm wrong), these lines > would be interpreted as follows with the current proposal: > >>> > >>> > >>> It's a bit different (to reiterate, with the *current* proposal, *any* > indexing operation (including series selection) behaves as a copy; and also > to be clear, this is one possible proposal, there are certainly other > possibilities). Answering case by case: > >>> > >>>> > >>>> 1. s = df["a"] > >>>> Creates a view into the DataFrame df. No copying is done at all > >>> > >>> > >>> Indeed a view (but that's an implementation detail) > >>> > >>>> 2. s.iloc[3:5] = [1, 2, 3] > >>>> Modifies the series s and the underlying DataFrame df. > (copy-on-write) > >>> > >>> > >>> Due to copy-on-write, it does *not* modify the DataFrame df. > Copy-on-write means that only when s is being written to, its data get > copied (so at that point breaking the view-relation with the parent df) > >>> > >>>> > >>>> 3. df["a"].iloc[3:5] = [1, 2, 3] > >>>> Modifies the dataframe > >>> > >>> > >>> This is an example of chained assignment, which in the current > proposal never works (see the example in the google doc). This is because > chained assignment can always be written as: > >>> > >>> temp = df["a"] > >>> temp.iloc[3:5] = [1, 2, 3] > >>> > >>> and `temp` uses copy-on-write (and then it is the same example as the > one above in 2.). > >>> > >>> (what you describe is the current behaviour of pandas) > >>> > >>>> > >>>> 4. df["b"] = df["a"] > >>>> Copies the series from "a" to "b" > >>> > >>> > >>> It would indeed behave as a copy, but under the hood we can actually > keep this as a view (delay the copy thanks to copy-on-write). > >>> > >>>> > >>>> 5. df["b"].iloc[3:5] = [4, 5, 6] > >>>> Modifies "b" in the DataFrame, but not "a" > >>> > >>> > >>> Also doesn't modify "b" (see example 3. above), but indeed does not > modify "a" > >>> > >>>> > >>>> 6. s2 = df["b"] > >>>> Create a view into the DataFrame df. No copying is done at all. > >>> > >>> > >>> Same as 1. > >>> > >>>> > >>>> 7. df["c"] = s2 > >>>> Copies the series from "b" to "c" > >>> > >>> > >>> Same as 4. 
> >>> > >>>> > >>>> 8. s2.iloc[3:5] = [7, 8, 9] > >>>> Modifies s2, which modifies "b", but NOT "c" > >>> > >>> > >>> Doesn't modify "b" and "c". Similar as 3. > >>> > >>>> I think the challenge is explaining the sequence 6,7,8 above in > comparison to the other sequences. > >>> > >>> > >>> So with the current proposal, the sequece 6, 7, 8 actually doesn't > behave differently. But it is mainly 2 and 3 that would be quite different > compared to the current pandas behaviour. > >>> > >>>> > >>>> > >>>> -Irv > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: