From jorisvandenbossche at gmail.com Fri Apr 2 13:57:19 2021
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Fri, 2 Apr 2021 19:57:19 +0200
Subject: [Pandas-dev] Index Constructor Performance
In-Reply-To:
References:
Message-ID:

Hi Brock,

I mentioned it when we were chatting about this as well, but I would first
ensure that we are actually benchmarking something sensible.
As you say yourself, the current benchmark is a "worst-case" (the applied
function doesn't do anything useful, but just returns a constant), in the
sense that moving away from libreduction would give the biggest slowdown,
simply because this benchmark is the fastest. And IMO a 10% improvement on
this fastest benchmark is not necessarily worth it.

So if we instead benchmark a function that actually does something, e.g.
calculating the mean of one of the columns (which is still a rather simple
and fast function, I think), the benchmark now actually takes 9-10x as
long. And if I profile this (with the libreduction code path disabled, so
using the Python apply code path), Index._simple_new is now responsible
for only 1-2% of the overall time. Optimizing this won't give much of an
improvement for groupby.apply.

It may well be that there are use cases for which it could be useful to
further micro-optimize the Index constructors, but before making that
effort, I would first try to identify such use cases (as IMO groupby.apply
is not such a case).

Joris

On Sat, 27 Mar 2021 at 16:44, Brock Mendel wrote:

> In optimizing the non-cython groupby.apply (
> https://github.com/pandas-dev/pandas/issues/40263,
> https://github.com/pandas-dev/pandas/pull/40171#issuecomment-789116039)
> code I'm finding that an awful lot of overhead is coming from
> Index._simple_new*. This email is about what it would take to get rid of
> that overhead.
>
> * Note that the particular code snippet being profiled is chosen to be
> worst-case for the non-cython path. It ends up creating a _lot_ of very
> small Index objects. We don't particularly care about this case, but I'm
> thinking about this as micro-optimization of code that affects just about
> every use case under the sun.
>
> All of the options I have in mind involve moving some of the constructors
> to cython. There is a tradeoff in how invasive that is vs how much perf
> benefit we gain from it.
>
> For a baseline, we can trim 10-13% off the benchmark linked above by
> implementing in cython and mixing into NumericIndex (implementation
> abbreviated for brevity; the full implementation is 65 lines in cython):
>
> ```
> @cython.freelist(32)
> cdef class NumpyIndex:
>     cdef:
>         public ndarray _data
>
>     @classmethod
>     def _simple_new(cls, values, name=None): ...
>
>     cpdef NumpyIndex _getitem_slice(cls, slice slobj): ...
> ```
>
> 10-13% is pretty good, but this only affects Int64Index, UInt64Index, and
> Float64Index. See Appendix 1 for discussion of what it would take to
> extend this to other subclasses.
>
> To get much further than this would require using __cinit__, which (absent
> some gymnastics) would require the FooIndex.__new__ methods to behave a lot
> more like the existing FooIndex._simple_new methods. TL;DR: this really
> isn't feasible absent a) refactoring RangeIndex to not subclass Int64Index
> (easy) and b) breaking API changes on the constructors for affected Index
> subclasses (hard).
>
>
> Appendix 1: Extending to Other Subclasses
> a) mixing libindex.NumpyIndex into pd.Index doesn't work because
> ExtensionIndex._data is not an ndarray. AFAICT to get the performance
> benefit for object-dtype would require implementing a separate subclass
> e.g. ObjectIndex.
>
> b) RangeIndex would not benefit, but something similar could be done for
> it following https://github.com/cython/cython/issues/4040 (or if we
> basically re-implement range ourselves in cython)
>
> c) MultiIndex could be made to benefit from this by changing ._codes to be
> a 2D ndarray instead of a FrozenList of ndarrays. This actually would
> allow for some nice cleanups in MultiIndex. The downside is that the
> memory footprint may be bigger with mismatched level sizes.
>
> d) With modest additional effort, this can be extended to
> DTI/TDI/PI/CategoricalIndex.
>
> Appendix 2: __cinit__
> __cinit__ gets called implicitly before __init__ or __new__, and with
> whatever arguments are passed to init/new, i.e. we can't do validation
> before passing arguments like we could with an explicit
> super().__init__(...) call.
>
> For NumpyIndex we _could_ define __cinit__ without breaking the world, but
> we wouldn't get much use out of it unless we also tightened what we accept
> in the constructor
>
> Appendix 3: Notes on cython-related constraints
> - We cannot mix a cython cdef class into pd.Index because that will break
> 3rd party subclasses that use object.__new__(cls) (in particular I'm
> thinking of xarray's CFDatetimeIndex)
> - a python class cannot inherit from two separate cython cdef classes.
> i.e. if we mix something into NumericIndex, that precludes mixing something
> else into Int64Index
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev
>

From jorisvandenbossche at gmail.com Wed Apr 7 10:28:07 2021
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Wed, 7 Apr 2021 16:28:07 +0200
Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks
In-Reply-To:
References:
<807A8451-2547-4891-95F3-B1211496CEA6@xs4all.nl>
<1462664690.2963025.1591907360318@mail.yahoo.com>
Message-ID:

And to give another update on this topic: the development branch of pandas
now contains an experimental version of this "columnar store" (using an
ArrayManager class instead of the BlockManager under the hood, which stores
the columns as a list of 1D arrays), which is almost feature-complete (the
biggest missing links are JSON and PyTables IO).

At the moment, there is an option to enable it for experimenting with it
(not yet documented, as it might still see behaviour changes):

    # set the default manager to ArrayManager
    pd.options.mode.data_manager = "array"

    # when creating a DataFrame, you will now get one with an ArrayManager
    # instead of BlockManager
    df = pd.DataFrame(...)
    df = pd.read_csv(...)

There are still some remaining work items (more IO, ironing out some known
bugs/todo's, checking performance), see
https://github.com/pandas-dev/pandas/issues/39146 to keep track of this.

Best,
Joris

On Tue, 9 Feb 2021 at 19:17, Joris Van den Bossche <
jorisvandenbossche at gmail.com> wrote:

>
> On Mon, 31 Aug 2020 at 16:20, Joris Van den Bossche <
> jorisvandenbossche at gmail.com> wrote:
>
>>
>>
>> On Fri, 12 Jun 2020 at 22:34, Joris Van den Bossche <
>> jorisvandenbossche at gmail.com> wrote:
>>
>>> On Thu, 11 Jun 2020 at 23:35, Brock Mendel
>>> wrote:
>>>
>>>> > We actually *have* prototypes: the prototype of the split-policy
>>>> discussed
>>>>
>>>> AFAICT that is a 5 year old branch. Is there a version of this based
>>>> off of master that you can show asv results for?
>>>>
>>>
>>> A correction here: that branch has been updated several times over the
>>> last 5 years, and a last time two weeks ago when I started this thread, as
>>> I explained in the github issue comment I linked to:
>>> https://github.com/pandas-dev/pandas/issues/10556#issuecomment-633703160
>>>
>>>
>>>> > Also, if performance is in the end the decisive criterion, I repeat
>>>> my earlier remark in this thread: we need to be clearer about what we want
>>>> / expect.
>>>>
>>>> In principle, this is pretty much exactly what the asvs are supposed to
>>>> represent.
>>>>
>>>
>>> Well, I am repeating myself .. but I already mentioned that I am not
>>> sure ASV is fully useful for this, as that requires a complete working
>>> replacement, which is IMO too much to ask for an initial prototype.
>>>
>>> But OK, the message is clear: we need a more concrete implementation /
>>> prototype. So let's put this discussion aside for a moment, and focus on
>>> that instead. I will try to look at that in the coming weeks, but any help
>>> is welcome (and I will try to get it running with ASV, or at least a part
>>> of it).
>>>
>>>
>> To come back to this: I cleaned up a proof-of-concept implementation that
>> I started after the above discussion, and put it in a PR to view/discuss:
>> https://github.com/pandas-dev/pandas/pull/36010
>>
>>
>
> Another follow-up: the proof-of-concept now is merged in the master
> branch, and I am currently working on making it more feature complete (see
> https://github.com/pandas-dev/pandas/issues/39146 for an overview issue)
>
> Joris
>

From jorisvandenbossche at gmail.com Tue Apr 13 16:18:52 2021
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Tue, 13 Apr 2021 22:18:52 +0200
Subject: [Pandas-dev] April 2021 monthly community meeting (Wednesday April 14, UTC 18:00)
In-Reply-To:
References:
Message-ID:

Hi all,

A reminder that the next monthly dev call is tomorrow (Wednesday, April
14th) at 18:00 UTC (1 pm Central). Our calendar is at
https://pandas.pydata.org/docs/development/meeting.html#calendar to check
your local time.
All are welcome to attend!

Video Call: (I will send around the link tomorrow)
Minutes:
https://docs.google.com/document/u/1/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit?ouid=102771015311436394588&usp=docs_home&ths=true

Joris

From simonjayhawkins at gmail.com Wed Apr 14 10:33:25 2021
From: simonjayhawkins at gmail.com (Simon Hawkins)
Date: Wed, 14 Apr 2021 15:33:25 +0100
Subject: [Pandas-dev] ANN: Pandas 1.2.4 Released
Message-ID:

Hi all,

I'm pleased to announce the release of pandas 1.2.4.

This is a patch release in the 1.2.x series and includes some regression
fixes. We recommend that all users upgrade to this version.

See the release notes for a list of all the changes.

The release can be installed from PyPI

    python -m pip install --upgrade pandas==1.2.4

Or from conda-forge

    conda install -c conda-forge pandas==1.2.4

Please report any issues with the release on the pandas issue tracker.

Thanks to all the contributors who made this release possible.

From jorisvandenbossche at gmail.com Wed Apr 14 12:13:30 2021
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Wed, 14 Apr 2021 18:13:30 +0200
Subject: [Pandas-dev] April 2021 monthly community meeting (Wednesday April 14, UTC 18:00)
In-Reply-To:
References:
Message-ID:

The meeting link for today:
https://zoom.us/j/96753852910?pwd=OEgwbUkwOE9kejcwOGdLd09TallTdz09

Joris

On Tue, 13 Apr 2021 at 22:18, Joris Van den Bossche <
jorisvandenbossche at gmail.com> wrote:

> Hi all,
>
> A reminder that the next monthly dev call is tomorrow (Wednesday, April
> 14th) at 18:00 UTC (1 pm Central). Our calendar is at
> https://pandas.pydata.org/docs/development/meeting.html#calendar to check
> your local time.
> All are welcome to attend!
>
> Video Call: (I will send around the link tomorrow)
> Minutes:
> https://docs.google.com/document/u/1/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit?ouid=102771015311436394588&usp=docs_home&ths=true
>
> Joris
>

From niranda.perera at gmail.com Thu Apr 15 18:04:12 2021
From: niranda.perera at gmail.com (Niranda Perera)
Date: Thu, 15 Apr 2021 18:04:12 -0400
Subject: [Pandas-dev] Using a phrase "Similar to Pandas DataFrame" in other projects
Message-ID:

Hi all,

We are working on a distributed memory parallel runtime, "Cylon" (
https://github.com/cylondata/cylon). Cylon has a different execution
model compared to Dask Distributed or Ray DF. It is using the bulk
synchronous parallel model for execution. So, essentially, Cylon provides a
distributed DataFrame for MPI-like environments.

In our upcoming release, we have added a Python DataFrame API "Similar to
Pandas DataFrame". When we checked the Pandas License it has the clause,
"prohibits others from using the name of the project or its contributors
to promote derived products without written consent."
On the other hand, we have seen third party projects listed under
"Ecosystem" on the Pandas website (
https://pandas.pydata.org/community/ecosystem.html)

So, we were wondering,
1. What is the process to get the "written consent" to use Pandas name in
Cylon project material (documentation, GitHub, etc)
2. What are the requirements a project needs to fulfill, to be listed
under the Pandas Ecosystem?

Thank you
Best

--
Niranda Perera
@n1r44
+1 812 558 8884 / +94 71 554 8430
https://niranda.dev/

From garcia.marc at gmail.com Fri Apr 16 06:58:28 2021
From: garcia.marc at gmail.com (Marc Garcia)
Date: Fri, 16 Apr 2021 05:58:28 -0500
Subject: [Pandas-dev] Using a phrase "Similar to Pandas DataFrame" in other projects
In-Reply-To:
References:
Message-ID:

In short:

1. You can use the pandas name without written consent for what in law is
called "fair use", which, from what you describe, is what you want to do
and what other projects do. The idea is that you can use the pandas
"brand" to say "we are compatible with pandas", "it's similar to a pandas
dataframe"... But not "we're the pandas consultancy", "we're pandas
approved"... You can search for "fair use" for more details.

2. If it's an open source project, and relevant to the pandas community,
just open a PR to be included in the Ecosystem page.

Thanks!

On Fri, 16 Apr 2021, 03:56 Niranda Perera, wrote:

> Hi all,
>
> We are working on a distributed memory parallel runtime, "Cylon" (
> https://github.com/cylondata/cylon). Cylon has a different execution
> model compared to Dask Distributed or Ray DF. It is using the bulk
> synchronous parallel model for execution. So, essentially, Cylon provides a
> distributed DataFrame for MPI-like environments.
>
> In our upcoming release, we have added a Python DataFrame API "Similar to
> Pandas DataFrame". When we checked the Pandas License it has the clause,
> "prohibits others from using the name of the project or its contributors
> to promote derived products without written consent."
> On the other hand, we have seen third party projects listed under
> "Ecosystem" on the Pandas website (
> https://pandas.pydata.org/community/ecosystem.html)
>
> So, we were wondering,
> 1. What is the process to get the "written consent" to use Pandas name in
> Cylon project material (documentation, GitHub, etc)
> 2. What are the requirements a project needs to fulfill, to be listed
> under the Pandas Ecosystem?
>
> Thank you
> Best
>
> --
> Niranda Perera
> @n1r44
> +1 812 558 8884 / +94 71 554 8430
> https://niranda.dev/
>

From niranda.perera at gmail.com Fri Apr 16 08:55:05 2021
From: niranda.perera at gmail.com (Niranda Perera)
Date: Fri, 16 Apr 2021 08:55:05 -0400
Subject: [Pandas-dev] Using a phrase "Similar to Pandas DataFrame" in other projects
In-Reply-To:
References:
Message-ID:

Thank you very much Marc. This is very helpful.

On Fri, Apr 16, 2021 at 6:58 AM Marc Garcia wrote:

> In short:
>
> 1. You can use the pandas name without written consent for what in law is
> called "fair use", which, from what you describe, is what you want to do
> and what other projects do. The idea is that you can use the pandas
> "brand" to say "we are compatible with pandas", "it's similar to a pandas
> dataframe"... But not "we're the pandas consultancy", "we're pandas
> approved"... You can search for "fair use" for more details.
>
> 2. If it's an open source project, and relevant to the pandas community,
> just open a PR to be included in the Ecosystem page.
>
> Thanks!
>
> On Fri, 16 Apr 2021, 03:56 Niranda Perera,
> wrote:
>
>> Hi all,
>>
>> We are working on a distributed memory parallel runtime, "Cylon" (
>> https://github.com/cylondata/cylon). Cylon has a different execution
>> model compared to Dask Distributed or Ray DF. It is using the bulk
>> synchronous parallel model for execution. So, essentially, Cylon provides a
>> distributed DataFrame for MPI-like environments.
>>
>> In our upcoming release, we have added a Python DataFrame API "Similar to
>> Pandas DataFrame". When we checked the Pandas License it has the clause,
>> "prohibits others from using the name of the project or its contributors
>> to promote derived products without written consent."
>> On the other hand, we have seen third party projects listed under
>> "Ecosystem" on the Pandas website (
>> https://pandas.pydata.org/community/ecosystem.html)
>>
>> So, we were wondering,
>> 1. What is the process to get the "written consent" to use Pandas name in
>> Cylon project material (documentation, GitHub, etc)
>> 2. What are the requirements a project needs to fulfill, to be listed
>> under the Pandas Ecosystem?
>>
>> Thank you
>> Best
>>
>> --
>> Niranda Perera
>> @n1r44
>> +1 812 558 8884 / +94 71 554 8430
>> https://niranda.dev/
>>
>

--
Niranda Perera
@n1r44
+1 812 558 8884 / +94 71 554 8430
https://www.linkedin.com/in/niranda