From jorisvandenbossche at gmail.com Mon Jun 1 05:43:53 2020 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Mon, 1 Jun 2020 11:43:53 +0200 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: <807A8451-2547-4891-95F3-B1211496CEA6@xs4all.nl> Message-ID: On Sat, 30 May 2020 at 23:55, Adrin wrote: > Although 1 x 5000 may sound an edge case, my whole 4 years of research was > on 500 x 450000 data. Those usecases are probably more common than we may > think. > It's still a lower column/rows ratio as 1x5000 ;) (although not that much) (it is this ratio that mostly determines whether the overhead of performing column by column starts to dominate) But joking aside: yes, that those use cases are more common than I think is quite probable. I never have really used that myself, and therefore again: such feedback is very useful! Also in our user survey from last year, a majority indicated that they occasionally use wide dataframes (although "wide" was described as "100s of columns or more", which is not necessarily that wide). Now, to reiterate: - You will still be able to use pandas with wide dataframes, you only might "pay a price" for using a flexible data structure like a dataframe (that allows heterogenous dtypes, allows inserting columns cheaply, ..) for a use case that might not need that flexibility. And again, with some optimization effort, I think we can keep this "cost" at a minimum. - It might actually be that a different data model fits your use case better, such as xarray (Adrin, since you are a bit familiar with xarray, would you in hindsight rather have used that for your research?) - I think that by simplifying the pandas internals, it would actually *become easier* to better support the wide dataframe use case as well. Jeff mentioned it before as the "DataMatrix", also Stephan mentioned it on twitter. If we can simplify the internals, it would become more realistic to have a DataFrame-version that is for example backed by a single ndarray but supports the familiar DataFrame-API (or at least a subset of it without converting to a columnar DataFrame). On twitter I said "pandas doesn't need to be the best solution for a variety of use cases". But I should probably have said: "pandas *cannot* be the best solution for different use case *at the same time*". Supporting wide dataframes optimally right now comes at the cost of not supporting heterogeneous dataframes as good as we could. But again, if there appears to be enough interest and there are people who want to contribute to this effort, I think we should investigate how we can actually support both cases (my last point in the above list). Joris > > On Sat., May 30, 2020, 21:03 Joris Van den Bossche, < > jorisvandenbossche at gmail.com> wrote: > >> Hi Maarten, >> >> Thanks a lot for the feedback! >> >> On Fri, 29 May 2020 at 20:31, Maarten Ballintijn >> wrote: >> >>> >>> Hi Joris, >>> >>> You said: >>> >>> But I also deliberately choose a dataframe where n_rows >> n_columns, >>> because I personally would be fine if operations on wide dataframes (n_rows >>> < n_columns) show a slowdown. But that is of course something to discuss / >>> agree upon (eg up to which dataframe size or n_columns/n_rows ratio do we >>> care about a performance degradation?). >>> >>> >>> This is an (the) important use case for us and probably for a lot of use >>> in finance in general. 
I can easily imagine many other >>> areas where storing data for 1000s of elements (sensors, items, people) >>> on a grid of time scales of minutes or more. >>> (n*1000 x m*1000 data with n, m ~ 10 .. 100) >>> >>> Why do you think this use case is no longer important? >>> >> >> To be clear up front: I think wide dataframes are still an important use >> case. >> >> But to put my comment from above in more context: we had a performance >> regression reported (#24990 >> , which Brock >> referenced in his last mail) which was about a DataFrame with 1 row and >> 5000 columns. >> And yes, for *such* a case, I think it will basically be impossible to >> preserve exact performance, even with a lot of optimizations, compared to >> storing this as a single, consolidated (1, 5000) array as is done now. And >> it is for such a case, that I indeed say: I am willing to accept a limited >> slowdown for this, *if* it at the same time gives us improved memory >> usage, performance improvements for more common cases, simplified internals >> making it easier to contribute to and further optimize pandas, etc. >> >> But, I am also quite convinced that, with some optimization effort, we >> can at least preserve the current performance even for relatively wide >> dataframes (see eg this >> >> notebook >> >> for some quick experiments). >> And to be clear: doing such optimizations to ensure good performance for >> a variety of use cases is part of the proposal. Also, I think that having a >> simplified pandas internals should actually also make it easier to further >> explore ways to specifically optimize the "homogeneous-dtype wide >> dataframe" use case. >> >> Now, it is always difficult to make such claims in the abstract. >> So what I personally think would be very valuable, is if you could give >> some example use cases that you care about (eg a notebook creating some >> dummy data with similar characteristics as the data you are working with >> (or using real data, if openly available, and a few typical operations you >> do on those). >> >> Best, >> Joris >> >> >>> >>> We already have to drop into numpy on occasion to make the performance >>> sufficient. I would really prefer for Pandas to >>> improve in this area not slide back. >>> >>> Have a great weekend, >>> Maarten >>> >>> >>> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.christensen at gmail.com Mon Jun 1 10:02:06 2020 From: aaron.christensen at gmail.com (Aaron) Date: Mon, 1 Jun 2020 10:02:06 -0400 Subject: [Pandas-dev] Create an aggregate/summary pandas dataframe based on overlapping dates derived from a more specific dataframe? Message-ID: Hello, Given a dataframe with trips made by employees of different companies, I am trying to generate a new dataframe with only the company names. There are trips made by employees from different companies. I am looking to combine the overlapping travel times from employees of the SAME company into a single row. If there are no overlapping travel times, then that row just transfers over as-is.
When there are overlapping travel times, then the following will happen: the Name field is removed because it is no longer relevant (the company name stays), the Depart date will be the earliest Depart date of any of the trips regardless of the employee, the Return date will be the latest Return date of any of the trips regardless of the employee, and the charges for the trips will be summed. For example, if trips had dates 01/01/20 - 01/31/20, 01/15/20 - 02/15/20, 02/01/20 - 02/28/20, then all three would be combined: the starting date will be 1/1/20 and the ending date 2/28/20. Basically, the company was on that trip from start to finish, kinda like a relay run handing off the baton. Also, the charges will be summed for each of those trips and transferred over to the single row. I was playing around with timedelta, hierarchical indices, grouping, and sorting but had a really hard time since I am looking at date ranges instead of specific dates. Here is the starting dataframe code/output:

import pandas as pd

emp_trips = {'Name':    ['Bob', 'Joe', 'Sue', 'Jack', 'Henry', 'Frank', 'Lee', 'Jack'],
             'Company': ['ABC', 'ABC', 'ABC', 'HIJ', 'HIJ', 'DEF', 'DEF', 'DEF'],
             'Depart':  ['01/01/2020', '01/01/2020', '01/06/2020', '01/01/2020', '05/01/2020', '01/13/2020', '01/12/2020', '01/14/2020'],
             'Return':  ['01/31/2020', '02/15/2020', '02/20/2020', '03/01/2020', '05/05/2020', '01/15/2020', '01/30/2020', '02/02/2020'],
             'Charges': [10.10, 20.25, 30.32, 40.00, 50.01, 60.32, 70.99, 80.87]}
df = pd.DataFrame(emp_trips, columns=['Name', 'Company', 'Depart', 'Return', 'Charges'])

# Convert to date format
df['Return'] = pd.to_datetime(df['Return'])
df['Depart'] = pd.to_datetime(df['Depart'])

    Name Company     Depart     Return  Charges
0    Bob     ABC 2020-01-01 2020-01-31    10.10
1    Joe     ABC 2020-01-01 2020-02-15    20.25
2    Sue     ABC 2020-01-06 2020-02-20    30.32
3   Jack     HIJ 2020-01-01 2020-03-01    40.00
4  Henry     HIJ 2020-05-01 2020-05-05    50.01
5  Frank     DEF 2020-01-13 2020-01-15    60.32
6    Lee     DEF 2020-01-12 2020-01-30    70.99
7   Jack     DEF 2020-01-14 2020-02-02    80.87

And, here is the desired/generated dataframe:

  Company     Depart     Return  Charges
0     ABC 01/01/2020 02/20/2020    60.67
1     HIJ 01/01/2020 03/01/2020    40.00
2     HIJ 05/01/2020 05/05/2020    50.01
3     DEF 01/12/2020 02/02/2020   212.18

Thank you in advance! Aaron -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbrockmendel at gmail.com Mon Jun 1 14:07:01 2020 From: jbrockmendel at gmail.com (Brock Mendel) Date: Mon, 1 Jun 2020 11:07:01 -0700 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: <807A8451-2547-4891-95F3-B1211496CEA6@xs4all.nl> Message-ID: Joris and I accidentally took part of the discussion off-thread. My suggestion boils down to: Let's 1) Identify pieces of this that we want to do regardless of whether we do the rest of it (e.g. consolidate only in internals, view-only indexing on columns). 2) Beef up the asvs to give a closer-to-full measure of the tradeoffs when an eventual proof of concept/PR is made. On Mon, Jun 1, 2020 at 2:44 AM Joris Van den Bossche < jorisvandenbossche at gmail.com> wrote: > On Sat, 30 May 2020 at 23:55, Adrin wrote: > >> Although 1 x 5000 may sound an edge case, my whole 4 years of research >> was on 500 x 450000 data. Those usecases are probably more common than we >> may think.
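[Coming back to Aaron's question above: one way to approach it is to treat each company's trips as date intervals and merge the overlapping ones using a running maximum of the Return dates. The following is a rough, untested sketch, assuming pandas >= 0.25 (for named aggregation) and reusing the `df` built from `emp_trips` in Aaron's message; the row order of the output may differ from the desired example.]

import pandas as pd

# df is the frame built from emp_trips above, after the pd.to_datetime conversions.
df = df.sort_values(['Company', 'Depart']).reset_index(drop=True)

# Latest Return seen so far within each company, shifted down one row so each
# trip only sees the trips that departed before it.
prev_max_return = df.groupby('Company')['Return'].transform(
    lambda s: s.cummax().shift())

# A new trip group starts when a trip departs after every earlier trip of the
# same company has already returned.  Comparisons against NaT are False, so the
# first trip of each company stays in group 0.
new_group = df['Depart'] > prev_max_return
group_id = new_group.astype(int).groupby(df['Company']).cumsum()

result = (df.groupby(['Company', group_id], sort=False)
            .agg(Depart=('Depart', 'min'),
                 Return=('Return', 'max'),
                 Charges=('Charges', 'sum'))
            .reset_index(level=0)
            .reset_index(drop=True))
print(result)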
>> > > It's still a lower column/rows ratio as 1x5000 ;) (although not that much) > (it is this ratio that mostly determines whether the overhead of > performing column by column starts to dominate) > > But joking aside: yes, that those use cases are more common than I think > is quite probable. I never have really used that myself, and therefore > again: such feedback is very useful! > Also in our user survey from last year, a majority indicated that they > occasionally use wide dataframes (although "wide" was described as "100s of > columns or more", which is not necessarily that wide). > > Now, to reiterate: > > - You will still be able to use pandas with wide dataframes, you only > might "pay a price" for using a flexible data structure like a dataframe > (that allows heterogenous dtypes, allows inserting columns cheaply, ..) for > a use case that might not need that flexibility. And again, with some > optimization effort, I think we can keep this "cost" at a minimum. > - It might actually be that a different data model fits your use case > better, such as xarray (Adrin, since you are a bit familiar with xarray, > would you in hindsight rather have used that for your research?) > - I think that by simplifying the pandas internals, it would actually *become > easier* to better support the wide dataframe use case as well. Jeff > mentioned it before as the "DataMatrix", also Stephan mentioned it on > twitter. If we can simplify the internals, it would become more realistic > to have a DataFrame-version that is for example backed by a single ndarray > but supports the familiar DataFrame-API (or at least a subset of it without > converting to a columnar DataFrame). > > On twitter I said "pandas doesn't need to be the best solution for a > variety of use cases". But I should probably have said: "pandas *cannot* > be the best solution for different use case *at the same time*". > Supporting wide dataframes optimally right now comes at the cost of not > supporting heterogeneous dataframes as good as we could. > But again, if there appears to be enough interest and there are people who > want to contribute to this effort, I think we should investigate how we can > actually support both cases (my last point in the above list). > > Joris > > >> >> On Sat., May 30, 2020, 21:03 Joris Van den Bossche, < >> jorisvandenbossche at gmail.com> wrote: >> >>> Hi Maarten, >>> >>> Thanks a lot for the feedback! >>> >>> On Fri, 29 May 2020 at 20:31, Maarten Ballintijn >>> wrote: >>> >>>> >>>> Hi Joris, >>>> >>>> You said: >>>> >>>> But I also deliberately choose a dataframe where n_rows >> n_columns, >>>> because I personally would be fine if operations on wide dataframes (n_rows >>>> < n_columns) show a slowdown. But that is of course something to discuss / >>>> agree upon (eg up to which dataframe size or n_columns/n_rows ratio do we >>>> care about a performance degradation?). >>>> >>>> >>>> This is an (the) important use case for us and probably for a lot of >>>> use in finance in general. I can easily imagine many other >>>> areas where storing data for 1000?s of elements (sensors, items, >>>> people) on grid of time scales of minutes or more. >>>> (n*1000 x m*1000 data with n, m ~ 10 .. 100) >>>> >>>> Why do you think this use case is no longer important? >>>> >>> >>> To be clear up front: I think wide dataframes are still an important use >>> case. 
>>> >>> But to put my comment from above in more context: we had a performance >>> regression reported (#24990 >>> , which Brock >>> referenced in his last mail) which was about a DataFrame with 1 row and >>> 5000 columns. >>> And yes, for *such* a case, I think it will basically be impossible to >>> preserve exact performance, even with a lot of optimizations, compared to >>> storing this as a single, consolidated (1, 5000) array as is done now. And >>> it is for such a case, that I indeed say: I am willing to accept a limited >>> slowdown for this, *if* it at the same time gives us improved memory >>> usage, performance improvements for more common cases, simplified internals >>> making it easier to contribute to and further optimize pandas, etc. >>> >>> But, I am also quite convinced that, with some optimization effort, we >>> can at least preserve the current performance even for relatively wide >>> dataframes (see eg this >>> >>> notebook >>> >>> for some quick experiments). >>> And to be clear: doing such optimizations to ensure good performance for >>> a variety of use cases is part of the proposal. Also, I think that having a >>> simplified pandas internals should actually also make it easier to further >>> explore ways to specifically optimize the "homogeneous-dtype wide >>> dataframe" use case. >>> >>> Now, it is always difficult to make such claims in the abstract. >>> So what I personally think would be very valuable, is if you could give >>> some example use cases that you care about (eg a notebook creating some >>> dummy data with similar characteristics as the data you are working with >>> (or using real data, if openly available, and a few typical operations you >>> do on those). >>> >>> Best, >>> Joris >>> >>> >>>> >>>> We already have to drop into numpy on occasion to make the performance >>>> sufficient. I would really prefer for Pandas to >>>> improve in this area not slide back. >>>> >>>> Have a great weekend, >>>> Maarten >>>> >>>> >>>> _______________________________________________ >>> Pandas-dev mailing list >>> Pandas-dev at python.org >>> https://mail.python.org/mailman/listinfo/pandas-dev >>> >> _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jeffreback at gmail.com Mon Jun 1 14:16:13 2020 From: jeffreback at gmail.com (Jeff Reback) Date: Mon, 1 Jun 2020 14:16:13 -0400 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: Message-ID: +1 on Brock's suggestions here. Currently -1 on moving to add a lazy block manager; I see this as simply increasing complexity. > On Jun 1, 2020, at 2:07 PM, Brock Mendel wrote: > > > Joris and I accidentally took part of the discussion off-thread. My suggestion boils down to: Let's > > 1) Identify pieces of this that we want to do regardless of whether we do the rest of it (e.g. consolidate only in internals, view-only indexing on columns). > 2) Beef up the asvs to give a closer-to-full measure of the tradeoffs when an eventual proof of concept/PR is made. > >> On Mon, Jun 1, 2020 at 2:44 AM Joris Van den Bossche wrote: >>> On Sat, 30 May 2020 at 23:55, Adrin wrote: >>> Although 1 x 5000 may sound an edge case, my whole 4 years of research was on 500 x 450000 data. Those usecases are probably more common than we may think.
>> >> It's still a lower column/rows ratio as 1x5000 ;) (although not that much) >> (it is this ratio that mostly determines whether the overhead of performing column by column starts to dominate) >> >> But joking aside: yes, that those use cases are more common than I think is quite probable. I never have really used that myself, and therefore again: such feedback is very useful! >> Also in our user survey from last year, a majority indicated that they occasionally use wide dataframes (although "wide" was described as "100s of columns or more", which is not necessarily that wide). >> >> Now, to reiterate: >> >> - You will still be able to use pandas with wide dataframes, you only might "pay a price" for using a flexible data structure like a dataframe (that allows heterogenous dtypes, allows inserting columns cheaply, ..) for a use case that might not need that flexibility. And again, with some optimization effort, I think we can keep this "cost" at a minimum. >> - It might actually be that a different data model fits your use case better, such as xarray (Adrin, since you are a bit familiar with xarray, would you in hindsight rather have used that for your research?) >> - I think that by simplifying the pandas internals, it would actually become easier to better support the wide dataframe use case as well. Jeff mentioned it before as the "DataMatrix", also Stephan mentioned it on twitter. If we can simplify the internals, it would become more realistic to have a DataFrame-version that is for example backed by a single ndarray but supports the familiar DataFrame-API (or at least a subset of it without converting to a columnar DataFrame). >> >> On twitter I said "pandas doesn't need to be the best solution for a variety of use cases". But I should probably have said: "pandas cannot be the best solution for different use case at the same time". Supporting wide dataframes optimally right now comes at the cost of not supporting heterogeneous dataframes as good as we could. >> But again, if there appears to be enough interest and there are people who want to contribute to this effort, I think we should investigate how we can actually support both cases (my last point in the above list). >> >> Joris >> >>> >>>> On Sat., May 30, 2020, 21:03 Joris Van den Bossche, wrote: >>>> Hi Maarten, >>>> >>>> Thanks a lot for the feedback! >>>> >>>>> On Fri, 29 May 2020 at 20:31, Maarten Ballintijn wrote: >>>>> >>>>> Hi Joris, >>>>> >>>>> You said: >>>>> >>>>>> But I also deliberately choose a dataframe where n_rows >> n_columns, because I personally would be fine if operations on wide dataframes (n_rows < n_columns) show a slowdown. But that is of course something to discuss / agree upon (eg up to which dataframe size or n_columns/n_rows ratio do we care about a performance degradation?). >>>>> >>>>> This is an (the) important use case for us and probably for a lot of use in finance in general. I can easily imagine many other >>>>> areas where storing data for 1000?s of elements (sensors, items, people) on grid of time scales of minutes or more. >>>>> (n*1000 x m*1000 data with n, m ~ 10 .. 100) >>>>> >>>>> Why do you think this use case is no longer important? >>>> >>>> To be clear up front: I think wide dataframes are still an important use case. >>>> >>>> But to put my comment from above in more context: we had a performance regression reported (#24990, which Brock referenced in his last mail) which was about a DataFrame with 1 row and 5000 columns. 
>>>> And yes, for such a case, I think it will basically be impossible to preserve exact performance, even with a lot of optimizations, compared to storing this as a single, consolidated (1, 5000) array as is done now. And it is for such a case, that I indeed say: I am willing to accept a limited slowdown for this, if it at the same time gives us improved memory usage, performance improvements for more common cases, simplified internals making it easier to contribute to and further optimize pandas, etc. >>>> >>>> But, I am also quite convinced that, with some optimization effort, we can at least preserve the current performance even for relatively wide dataframes (see eg this notebook for some quick experiments). >>>> And to be clear: doing such optimizations to ensure good performance for a variety of use cases is part of the proposal. Also, I think that having a simplified pandas internals should actually also make it easier to further explore ways to specifically optimize the "homogeneous-dtype wide dataframe" use case. >>>> >>>> Now, it is always difficult to make such claims in the abstract. >>>> So what I personally think would be very valuable, is if you could give some example use cases that you care about (eg a notebook creating some dummy data with similar characteristics as the data you are working with (or using real data, if openly available, and a few typical operations you do on those). >>>> >>>> Best, >>>> Joris >>>> >>>>> >>>>> We already have to drop into numpy on occasion to make the performance sufficient. I would really prefer for Pandas to >>>>> improve in this area not slide back. >>>>> >>>>> Have a great weekend, >>>>> Maarten >>>>> >>>>> >>>> _______________________________________________ >>>> Pandas-dev mailing list >>>> Pandas-dev at python.org >>>> https://mail.python.org/mailman/listinfo/pandas-dev >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbrockmendel at gmail.com Mon Jun 1 19:36:44 2020 From: jbrockmendel at gmail.com (Brock Mendel) Date: Mon, 1 Jun 2020 16:36:44 -0700 Subject: [Pandas-dev] tslibs 2.0 and non-nanosecond datetime64/timedelta64 In-Reply-To: References: Message-ID: Before responding to questions, one topic I forgot to include in the OP: The performance of Timestamp, Timedelta, and Period could be improved (i do not have an estimate of how much) if they were cdef (cython) classes. This is not viable at the moment because they each have `__new__` methods, which are needed because the constructors can return pd.NaT. If we had dtype-specific NaTs (xref #24983 ) that would allow us to make these cdef classes. --------- > Will this [casting non-nano timestamps to nano to use existing tz-conversion code] cause issues if the original datetime isn't in the bounds of a ns-precision timestamp? Both technically and conceptually, yes. [note to self, expand on this before hitting send] > [...] since it represents a point in time rather than a span. >From an implementations standpoint, that distinction is meaningless; the same conversion code (the hard part) is used for both. 
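[As a concrete illustration of the bounds issue raised in the question above: the hard-coded nanosecond resolution limits Timestamp to roughly 1677-2262, while numpy's coarser datetime64 units can represent a much wider range. A small sketch of current behaviour (the printed values are approximate):]

import numpy as np
import pandas as pd

print(pd.Timestamp.min)   # ~1677-09-21
print(pd.Timestamp.max)   # ~2262-04-11

# Representable with numpy's second resolution ...
d = np.datetime64("1000-01-01", "s")

# ... but not convertible to a nanosecond Timestamp.
try:
    pd.Timestamp(d)
except pd.errors.OutOfBoundsDatetime as exc:
    print("out of bounds:", exc)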
Conceptually, I think of `datetime64[minute]` as representing the same thing as `Period[minute]` (both can be used to represent the "4:32" in the corner of my screen). Or for Timestamp[D] we can just call that a Date dtype instead of re-implementing it (xref #34441 ) --------- > Personally, I don't think we necessarily need to add all units that are supported by numpy's datetime64/timedelta64 dtypes. I have a strong preference against using the Year or Month units, as the conversions of those to/from the others is not just multiplication/division. The others I don't feel as strongly about; once nanos is no longer hard-coded, the marginal cost of adding more should be relatively small. On Sat, May 30, 2020 at 12:18 PM Joris Van den Bossche < jorisvandenbossche at gmail.com> wrote: > Thanks for starting this discussion, Brock! > > On Fri, 29 May 2020 at 21:03, Tom Augspurger > wrote: > >> On Fri, May 29, 2020 at 11:37 AM Brock Mendel >> wrote: >> >>> >>> We could then consider de-duplication. Tick is already redundant with >>> Timedelta, and Timestamp[H] would render Period[H] redundant. With >>> appropriate deprecation cycle, we could rip out a bunch of code. >>> >> >> What would the user facing changes that warrant deprecation? For me, >> `Period` represents a span of time. It would make sense to implement >> something like `pd.Timestamp("2000-01-01") in pd.Period("2000-01-01", >> freq="H")`. But something checking whether that timestamp is in a >> `Timestamp[H]` doesn't seem natural, since it represents a point in time >> rather than a span. >> >> > Personally, I don't think we necessarily need to add all units that are > supported by numpy's datetime64/timedelta64 dtypes. First, because I don't > think it is an important use case (people mostly want to be able to have > dates outside of the range limits that nanosecond resolution gives us), and > also because it makes it conceptually a lot more difficult. For example, > what is a "Timestamp[H]" value? Does it represent the beginning or the end > of the hour? That are questions that are already handled by the Period > dtype, and I think it is a good thing to keep those concepts separated (you > can of course ask the same question with a millisecond resolution, but I > think generally people don't do that). > Further, all the resolutions from nanosecond up to second are "just" > multiplications x1000, keeping the implementation more simple (compared to > resolutions of hours, months, ..). > > So for a timestamp dtype, we could maybe only support ns / ?s / ms / s > resolutions? > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Tue Jun 2 15:42:30 2020 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Tue, 2 Jun 2020 21:42:30 +0200 Subject: [Pandas-dev] tslibs 2.0 and non-nanosecond datetime64/timedelta64 In-Reply-To: References: Message-ID: On Tue, 2 Jun 2020 at 01:36, Brock Mendel wrote: > Before responding to questions, one topic I forgot to include in the OP: > > The performance of Timestamp, Timedelta, and Period could be improved (i > do not have an estimate of how much) if they were cdef (cython) classes. > This is not viable at the moment because they each have `__new__` methods, > which are needed because the constructors can return pd.NaT. If we had > dtype-specific NaTs (xref #24983 > ) that would allow us > to make these cdef classes. 
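[A tiny illustration of the `__new__` point in the paragraph quoted above, using only current public behaviour: the scalar constructors can hand back the NaT singleton instead of an instance of the class, which is what currently prevents making them cdef classes.]

import pandas as pd

print(pd.Timestamp("NaT") is pd.NaT)     # True: the constructor returned NaT, not a Timestamp
print(pd.Timedelta("NaT") is pd.NaT)     # True
print(type(pd.Timestamp("2020-06-01")))  # a Timestamp instance otherwise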
> > --------- > > Will this [casting non-nano timestamps to nano to use existing > tz-conversion code] cause issues if the original datetime isn't in the > bounds of a ns-precision timestamp? > > Both technically and conceptually, yes. [note to self, expand on this > before hitting send] > > > [...] since it represents a point in time rather than a span. > > From an implementations standpoint, that distinction is meaningless; the > same conversion code (the hard part) is used for both. Conceptually, I > think of `datetime64[minute]` as representing the same thing as > `Period[minute]` (both can be used to represent the "4:32" in the corner of > my screen). > Implementation wise it's maybe the same, but I think it's useful to keep those concepts separated towards users in the API. Timestamps are points in time, Periods are time spans. I think it is good to keep this distinction. And for timestamps, I think users should mostly not care / need to think about the resolution (the main reason they need to care now is when their dates might not fit in the range supported by nanoseconds, but by having a different default resolution, that issue should also be mostly gone). So it's in this light that I don't think it is needed to support resolutions above seconds. > > Or for Timestamp[D] we can just call that a Date dtype instead of > re-implementing it (xref #34441 > ) > > --------- > > Personally, I don't think we necessarily need to add all units that are > supported by numpy's datetime64/timedelta64 dtypes. > > I have a strong preference against using the Year or Month units, as the > conversions of those to/from the others is not just > multiplication/division. The others I don't feel as strongly about; once > nanos is no longer hard-coded, the marginal cost of adding more should be > relatively small. > > > On Sat, May 30, 2020 at 12:18 PM Joris Van den Bossche < > jorisvandenbossche at gmail.com> wrote: > >> Thanks for starting this discussion, Brock! >> >> On Fri, 29 May 2020 at 21:03, Tom Augspurger >> wrote: >> >>> On Fri, May 29, 2020 at 11:37 AM Brock Mendel >>> wrote: >>> >>>> >>>> We could then consider de-duplication. Tick is already redundant with >>>> Timedelta, and Timestamp[H] would render Period[H] redundant. With >>>> appropriate deprecation cycle, we could rip out a bunch of code. >>>> >>> >>> What would the user facing changes that warrant deprecation? For me, >>> `Period` represents a span of time. It would make sense to implement >>> something like `pd.Timestamp("2000-01-01") in pd.Period("2000-01-01", >>> freq="H")`. But something checking whether that timestamp is in a >>> `Timestamp[H]` doesn't seem natural, since it represents a point in time >>> rather than a span. >>> >>> >> Personally, I don't think we necessarily need to add all units that are >> supported by numpy's datetime64/timedelta64 dtypes. First, because I don't >> think it is an important use case (people mostly want to be able to have >> dates outside of the range limits that nanosecond resolution gives us), and >> also because it makes it conceptually a lot more difficult. For example, >> what is a "Timestamp[H]" value? Does it represent the beginning or the end >> of the hour? That are questions that are already handled by the Period >> dtype, and I think it is a good thing to keep those concepts separated (you >> can of course ask the same question with a millisecond resolution, but I >> think generally people don't do that). 
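[As a small example of the point-versus-span distinction being discussed (current API only, nothing new assumed): a Timestamp is a single instant, while a Period already answers the "beginning or end of the hour" question through its start_time/end_time.]

import pandas as pd

ts = pd.Timestamp("2000-01-01 05:30")        # a point in time
p = pd.Period("2000-01-01 05:00", freq="H")  # the whole 05:00 hour

print(p.start_time)                      # 2000-01-01 05:00:00
print(p.end_time)                        # 2000-01-01 05:59:59.999999999
print(p.start_time <= ts <= p.end_time)  # True: the point lies inside the span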
>> Further, all the resolutions from nanosecond up to second are "just" >> multiplications x1000, keeping the implementation more simple (compared to >> resolutions of hours, months, ..). >> >> So for a timestamp dtype, we could maybe only support ns / ?s / ms / s >> resolutions? >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From maartenb at xs4all.nl Wed Jun 3 12:43:19 2020 From: maartenb at xs4all.nl (Maarten Ballintijn) Date: Wed, 3 Jun 2020 12:43:19 -0400 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: <807A8451-2547-4891-95F3-B1211496CEA6@xs4all.nl> Message-ID: <4D698DC9-502E-4BC7-A676-10C762A28C85@xs4all.nl> Joris, Thanks very much for your reply. I can?t provide exact data or code, but I?ll try to come up with a sample of simulated data and operations that relatively closely matches our use cases. Cheers, Maarten > On May 30, 2020, at 3:03 PM, Joris Van den Bossche wrote: > > Hi Maarten, > > Thanks a lot for the feedback! > > On Fri, 29 May 2020 at 20:31, Maarten Ballintijn > wrote: > > Hi Joris, > > You said: > >> But I also deliberately choose a dataframe where n_rows >> n_columns, because I personally would be fine if operations on wide dataframes (n_rows < n_columns) show a slowdown. But that is of course something to discuss / agree upon (eg up to which dataframe size or n_columns/n_rows ratio do we care about a performance degradation?). > > This is an (the) important use case for us and probably for a lot of use in finance in general. I can easily imagine many other > areas where storing data for 1000?s of elements (sensors, items, people) on grid of time scales of minutes or more. > (n*1000 x m*1000 data with n, m ~ 10 .. 100) > > Why do you think this use case is no longer important? > > To be clear up front: I think wide dataframes are still an important use case. > > But to put my comment from above in more context: we had a performance regression reported (#24990 , which Brock referenced in his last mail) which was about a DataFrame with 1 row and 5000 columns. > And yes, for such a case, I think it will basically be impossible to preserve exact performance, even with a lot of optimizations, compared to storing this as a single, consolidated (1, 5000) array as is done now. And it is for such a case, that I indeed say: I am willing to accept a limited slowdown for this, if it at the same time gives us improved memory usage, performance improvements for more common cases, simplified internals making it easier to contribute to and further optimize pandas, etc. > > But, I am also quite convinced that, with some optimization effort, we can at least preserve the current performance even for relatively wide dataframes (see eg this ?notebook for some quick experiments). > And to be clear: doing such optimizations to ensure good performance for a variety of use cases is part of the proposal. Also, I think that having a simplified pandas internals should actually also make it easier to further explore ways to specifically optimize the "homogeneous-dtype wide dataframe" use case. > > Now, it is always difficult to make such claims in the abstract. > So what I personally think would be very valuable, is if you could give some example use cases that you care about (eg a notebook creating some dummy data with similar characteristics as the data you are working with (or using real data, if openly available, and a few typical operations you do on those). 
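[As a starting point for the kind of simulated data and typical operations asked for above, a rough sketch; the shape and the operations are placeholders rather than Maarten's actual workload, scaled down so it fits comfortably in memory:]

import numpy as np
import pandas as pd

# One day of minute data for a few thousand instruments/sensors (illustrative only).
n_rows, n_cols = 24 * 60, 5_000
index = pd.date_range("2020-01-01", periods=n_rows, freq="T")
wide = pd.DataFrame(np.random.randn(n_rows, n_cols), index=index,
                    columns=[f"s{i}" for i in range(n_cols)])

# A few typical wide-frame operations to time (e.g. with %timeit in a notebook):
wide.mean()                  # column-wise reduction
wide - wide.mean()           # broadcast arithmetic
wide.resample("5T").mean()   # downsample the time grid
wide.rolling(30).mean()      # rolling statistics per column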
> > Best, > Joris > > > We already have to drop into numpy on occasion to make the performance sufficient. I would really prefer for Pandas to > improve in this area not slide back. > > Have a great weekend, > Maarten -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Tue Jun 9 11:46:06 2020 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Tue, 9 Jun 2020 17:46:06 +0200 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: <807A8451-2547-4891-95F3-B1211496CEA6@xs4all.nl> Message-ID: On Mon, 1 Jun 2020 at 20:07, Brock Mendel wrote: > Joris and I accidentally took part of the discussion off-thread. My > suggestion boils down to: Let's > > 1) Identify pieces of this that we want to do regardless of whether we do > the rest of it (e.g. consolidate only in internals, view-only indexing on > columns). > Personally I am not sure it is worth trying to change consolidation policies (moving to internals is certainly fine of course, but I mean eg delaying) or copy/view semantics for the *current*, consolidated BlockManager. But there are certainly pieces in the internals that can be changed which are useful regardless. I opened https://github.com/pandas-dev/pandas/issues/34669 to have a more concrete discussion about this on github. > 2) Beef up the asvs to be give closer-to-full measure of the tradeoffs > when an eventual proof of concept/PR is made. > > We probably won't have a "one big PR" that is going to implement a simplified block manager, so it's not really clear to me how ASV will help with making a decision on this? (it will for sure be very useful *along the way* to keep track of where we need to optimize things to preserve performance) Joris -------------- next part -------------- An HTML attachment was scrubbed... URL: From tom.augspurger88 at gmail.com Thu Jun 11 10:55:51 2020 From: tom.augspurger88 at gmail.com (Tom Augspurger) Date: Thu, 11 Jun 2020 09:55:51 -0500 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: <807A8451-2547-4891-95F3-B1211496CEA6@xs4all.nl> Message-ID: We discussed this on the call yesterday ( https://docs.google.com/document/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit?usp=sharing ). I'll attempt a summary for the mailing list, and a proposed course of action. In general, there was agreement with the goal of simplifying pandas' internals, and making DataFrame a column-store seems to be the best way to achieve that. The primary arguments against were implementation costs and possible performance slowdowns for very short and wide dataframes. It was generally agreed that the change will need to be toggleable, perhaps by a parameter to the DataFrame constructor and a global option. This will make it easier to implement the new behavior and test it against existing behavior, both for us developers and users. We are keeping in mind the scikit-learn style usecase of boxing and unboxing a (homogenous) array in a DataFrame. We're committed to keeping that 0-copy and avoiding creating one Python object per column. Does this summary accurately capture the discussion? --- Going forward, there are many pieces that can be done, some in parallel. Let's keep that discussion on concrete details in https://github.com/pandas-dev/pandas/issues/34669. I do want to highlight one overlapping area though. 
We have some PRs up (most from Brock) that affect consolidation today. Mostly disabling consolidation in specific places. (e.g. https://github.com/pandas-dev/pandas/pull/34683). My question: do we want to continue pursuing reduced consolidation *in the current block manager*? IMO, that's a tricky question to answer. The performance implications of consolidation are hard, in part because it's so workload-dependent. Sometimes, it's completely avoided so it's a win. Other times, it's merely delayed until an operation that needs consolidated blocks, and so is a wash. And given 1. The unclear impact changing consolidation has on views vs. copies, and our unclear *policy* on when things are views vs. copies 2. The real possibility of a non-consolidating, all-1D "Block" manager in the next year or two 3. The unclear extent to which non-consolidated data is tested by our unit tests. Certainly, fixing bugs is a worthy goal on its own. So to the extent where (non)consolidation causes buggy behavior we'll want to fix that. But overall, I think the project's efforts would be better focused elsewhere (ideally on progressing to the all 1-D block manager, but wherever we think is highest-value). Do others have thoughts on what changes should be made to the "pandas 1.x BlockManager" while we work towards the "2.x BlockManager"? - Tom On Tue, Jun 9, 2020 at 10:46 AM Joris Van den Bossche < jorisvandenbossche at gmail.com> wrote: > On Mon, 1 Jun 2020 at 20:07, Brock Mendel wrote: > >> Joris and I accidentally took part of the discussion off-thread. My >> suggestion boils down to: Let's >> >> 1) Identify pieces of this that we want to do regardless of whether we do >> the rest of it (e.g. consolidate only in internals, view-only indexing on >> columns). >> > > Personally I am not sure it is worth trying to change consolidation > policies (moving to internals is certainly fine of course, but I mean eg > delaying) or copy/view semantics for the *current*, consolidated > BlockManager. > > But there are certainly pieces in the internals that can be changed which > are useful regardless. I opened > https://github.com/pandas-dev/pandas/issues/34669 to have a more concrete > discussion about this on github. > > >> 2) Beef up the asvs to be give closer-to-full measure of the tradeoffs >> when an eventual proof of concept/PR is made. >> >> > We probably won't have a "one big PR" that is going to implement a > simplified block manager, so it's not really clear to me how ASV will help > with making a decision on this? > (it will for sure be very useful *along the way* to keep track of where > we need to optimize things to preserve performance) > > Joris > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbrockmendel at gmail.com Thu Jun 11 11:51:25 2020 From: jbrockmendel at gmail.com (Brock Mendel) Date: Thu, 11 Jun 2020 08:51:25 -0700 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: <807A8451-2547-4891-95F3-B1211496CEA6@xs4all.nl> Message-ID: > Does this summary accurately capture the discussion? Not quite. > there was agreement with the goal of simplifying pandas' internals, Yes. > and making DataFrame a column-store seems to be the best way to achieve that. No. We will not know this until we see an implementation. Nor will we know the performance impact. 
My expectation is that the performance impact will lead to a bunch of workarounds that cut against the simplification. I strongly object to committing to this before having this information. --- I have tried to avoid bringing up 2D EAs in this conversation, but the term "best way" requires a discussion of alternatives. Allowing 2D EAs will allow for a large fraction of the same simplifications (grep for "TODO(EA2D)"), and will _improve_ performance (in eg reshape, arithmetic operations) instead of hurting it. It means removing workarounds rather than adding new ones. It also allows for an incremental upgrade path: opt-in for 1.X, then if we like it, required for 2.X. ---- > Going forward, there are many pieces that can be done, some in parallel Related to but not identical to consolidation is the views vs copies on column indexing, GH#33780 , discussed on the previous call without a solid conclusion. The FUD largely boiled down to "some users could be relying on the current behavior and there isnt a nice way to deprecate it". On further reflection, this seems like an impossible standard to meet for _any_ change in not-tested/not-documented behavior. We should move to having column indexing being copy-free. On Thu, Jun 11, 2020 at 7:56 AM Tom Augspurger wrote: > We discussed this on the call yesterday > ( > https://docs.google.com/document/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit?usp=sharing > ). > I'll attempt a summary for the mailing list, and a proposed course of > action. > > In general, there was agreement with the goal of simplifying pandas' > internals, > and making DataFrame a column-store seems to be the best way to achieve > that. > The primary arguments against were implementation costs and possible > performance > slowdowns for very short and wide dataframes. > > It was generally agreed that the change will need to be toggleable, > perhaps by a > parameter to the DataFrame constructor and a global option. This will make > it > easier to implement the new behavior and test it against existing > behavior, both > for us developers and users. > > We are keeping in mind the scikit-learn style usecase of boxing and > unboxing a > (homogenous) array in a DataFrame. We're committed to keeping that 0-copy > and > avoiding creating one Python object per column. > > Does this summary accurately capture the discussion? > > --- > > Going forward, there are many pieces that can be done, some in parallel. > Let's > keep that discussion on concrete details in > https://github.com/pandas-dev/pandas/issues/34669. > > I do want to highlight one overlapping area though. We have some PRs up > (most > from Brock) that affect consolidation today. Mostly disabling > consolidation in specific places. (e.g. > https://github.com/pandas-dev/pandas/pull/34683). My question: do we want > to > continue pursuing reduced consolidation *in the current block manager*? > > IMO, that's a tricky question to answer. The performance implications of > consolidation are hard, in part because it's so workload-dependent. > Sometimes, > it's completely avoided so it's a win. Other times, it's merely delayed > until an > operation that needs consolidated blocks, and so is a wash. And given > > 1. The unclear impact changing consolidation has on views vs. copies, and > our > unclear *policy* on when things are views vs. copies > 2. The real possibility of a non-consolidating, all-1D "Block" manager in > the > next year or two > 3. The unclear extent to which non-consolidated data is tested by our unit > tests. 
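[For the views-versus-copies point above (GH#33780), a small sketch of what current pandas does with a single-dtype frame; this reflects an implementation detail of the consolidated BlockManager, not a documented guarantee, and is exactly the behaviour under discussion in that issue.]

import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((3, 2)), columns=["a", "b"])

ser = df["a"]          # single-column selection: currently a view on the block
ser.values[0] = 1.0
print(df.loc[0, "a"])  # 1.0 -- the mutation is visible in df

sub = df[["a"]]        # list-based column selection: currently a copy
sub.values[0, 0] = 99.0
print(df.loc[0, "a"])  # still 1.0 -- df is untouched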
> > Certainly, fixing bugs is a worthy goal on its own. So to the extent where > (non)consolidation > causes buggy behavior we'll want to fix that. But overall, I think the > project's efforts would be > better focused elsewhere (ideally on progressing to the all 1-D block > manager, but wherever > we think is highest-value). > > Do others have thoughts on what changes should be made to the "pandas 1.x > BlockManager" while we work towards the "2.x BlockManager"? > > - Tom > > On Tue, Jun 9, 2020 at 10:46 AM Joris Van den Bossche < > jorisvandenbossche at gmail.com> wrote: > >> On Mon, 1 Jun 2020 at 20:07, Brock Mendel wrote: >> >>> Joris and I accidentally took part of the discussion off-thread. My >>> suggestion boils down to: Let's >>> >>> 1) Identify pieces of this that we want to do regardless of whether we >>> do the rest of it (e.g. consolidate only in internals, view-only indexing >>> on columns). >>> >> >> Personally I am not sure it is worth trying to change consolidation >> policies (moving to internals is certainly fine of course, but I mean eg >> delaying) or copy/view semantics for the *current*, consolidated >> BlockManager. >> >> But there are certainly pieces in the internals that can be changed which >> are useful regardless. I opened >> https://github.com/pandas-dev/pandas/issues/34669 to have a more >> concrete discussion about this on github. >> >> >>> 2) Beef up the asvs to be give closer-to-full measure of the tradeoffs >>> when an eventual proof of concept/PR is made. >>> >>> >> We probably won't have a "one big PR" that is going to implement a >> simplified block manager, so it's not really clear to me how ASV will help >> with making a decision on this? >> (it will for sure be very useful *along the way* to keep track of where >> we need to optimize things to preserve performance) >> >> Joris >> >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tom.augspurger88 at gmail.com Thu Jun 11 12:01:12 2020 From: tom.augspurger88 at gmail.com (Tom Augspurger) Date: Thu, 11 Jun 2020 11:01:12 -0500 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: <807A8451-2547-4891-95F3-B1211496CEA6@xs4all.nl> Message-ID: On Thu, Jun 11, 2020 at 10:51 AM Brock Mendel wrote: > > Does this summary accurately capture the discussion? > > Not quite. > > > there was agreement with the goal of simplifying pandas' internals, > > Yes. > > > and making DataFrame a column-store seems to be the best way to achieve > that. > > No. > > We will not know this until we see an implementation. Nor will we know > the performance impact. My expectation is that the performance impact will > lead to a bunch of workarounds that cut against the simplification. > > I strongly object to committing to this before having this information. > It'd be good to clarify exactly what you object to committing to. Changing the Block Manager is a large task, made especially difficult by us being an open-source project with many stake-holders and limited funding. 
I think that we as a project can say "We as a project think that making DataFrame a column store is best", while still acknowledging that it's an uncertain goal that may be abandoned if it turns out to be a bad idea. So to make sure: You're objecting to a column-store in principle, or you're objecting to the project saying we think it's a good idea, or...? > --- > I have tried to avoid bringing up 2D EAs in this conversation, but the > term "best way" requires a discussion of alternatives. > > Allowing 2D EAs will allow for a large fraction of the same > simplifications (grep for "TODO(EA2D)"), and will _improve_ performance (in > eg reshape, arithmetic operations) instead of hurting it. It means > removing workarounds rather than adding new ones. > > It also allows for an incremental upgrade path: opt-in for 1.X, then if we > like it, required for 2.X. > Will have thoughts on this later. > ---- > > Going forward, there are many pieces that can be done, some in parallel > > Related to but not identical to consolidation is the views vs copies on > column indexing, GH#33780 > , discussed on the > previous call without a solid conclusion. The FUD largely boiled down to > "some users could be relying on the current behavior and there isnt a nice > way to deprecate it". On further reflection, this seems like an impossible > standard to meet for _any_ change in not-tested/not-documented behavior. > We should move to having column indexing being copy-free. > I think I disagree with that, at least to a degree. But it's primarily about views vs. copies so I'll take it to https://github.com/pandas-dev/pandas/issues/33780. > On Thu, Jun 11, 2020 at 7:56 AM Tom Augspurger > wrote: > >> We discussed this on the call yesterday >> ( >> https://docs.google.com/document/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit?usp=sharing >> ). >> I'll attempt a summary for the mailing list, and a proposed course of >> action. >> >> In general, there was agreement with the goal of simplifying pandas' >> internals, >> and making DataFrame a column-store seems to be the best way to achieve >> that. >> The primary arguments against were implementation costs and possible >> performance >> slowdowns for very short and wide dataframes. >> >> It was generally agreed that the change will need to be toggleable, >> perhaps by a >> parameter to the DataFrame constructor and a global option. This will >> make it >> easier to implement the new behavior and test it against existing >> behavior, both >> for us developers and users. >> >> We are keeping in mind the scikit-learn style usecase of boxing and >> unboxing a >> (homogenous) array in a DataFrame. We're committed to keeping that 0-copy >> and >> avoiding creating one Python object per column. >> >> Does this summary accurately capture the discussion? >> >> --- >> >> Going forward, there are many pieces that can be done, some in parallel. >> Let's >> keep that discussion on concrete details in >> https://github.com/pandas-dev/pandas/issues/34669. >> >> I do want to highlight one overlapping area though. We have some PRs up >> (most >> from Brock) that affect consolidation today. Mostly disabling >> consolidation in specific places. (e.g. >> https://github.com/pandas-dev/pandas/pull/34683). My question: do we >> want to >> continue pursuing reduced consolidation *in the current block manager*? >> >> IMO, that's a tricky question to answer. The performance implications of >> consolidation are hard, in part because it's so workload-dependent. 
>> Sometimes, >> it's completely avoided so it's a win. Other times, it's merely delayed >> until an >> operation that needs consolidated blocks, and so is a wash. And given >> >> 1. The unclear impact changing consolidation has on views vs. copies, and >> our >> unclear *policy* on when things are views vs. copies >> 2. The real possibility of a non-consolidating, all-1D "Block" manager in >> the >> next year or two >> 3. The unclear extent to which non-consolidated data is tested by our >> unit tests. >> >> Certainly, fixing bugs is a worthy goal on its own. So to the extent >> where (non)consolidation >> causes buggy behavior we'll want to fix that. But overall, I think the >> project's efforts would be >> better focused elsewhere (ideally on progressing to the all 1-D block >> manager, but wherever >> we think is highest-value). >> >> Do others have thoughts on what changes should be made to the "pandas 1.x >> BlockManager" while we work towards the "2.x BlockManager"? >> >> - Tom >> >> On Tue, Jun 9, 2020 at 10:46 AM Joris Van den Bossche < >> jorisvandenbossche at gmail.com> wrote: >> >>> On Mon, 1 Jun 2020 at 20:07, Brock Mendel >>> wrote: >>> >>>> Joris and I accidentally took part of the discussion off-thread. My >>>> suggestion boils down to: Let's >>>> >>>> 1) Identify pieces of this that we want to do regardless of whether we >>>> do the rest of it (e.g. consolidate only in internals, view-only indexing >>>> on columns). >>>> >>> >>> Personally I am not sure it is worth trying to change consolidation >>> policies (moving to internals is certainly fine of course, but I mean eg >>> delaying) or copy/view semantics for the *current*, consolidated >>> BlockManager. >>> >>> But there are certainly pieces in the internals that can be changed >>> which are useful regardless. I opened >>> https://github.com/pandas-dev/pandas/issues/34669 to have a more >>> concrete discussion about this on github. >>> >>> >>>> 2) Beef up the asvs to be give closer-to-full measure of the tradeoffs >>>> when an eventual proof of concept/PR is made. >>>> >>>> >>> We probably won't have a "one big PR" that is going to implement a >>> simplified block manager, so it's not really clear to me how ASV will help >>> with making a decision on this? >>> (it will for sure be very useful *along the way* to keep track of where >>> we need to optimize things to preserve performance) >>> >>> Joris >>> >>> _______________________________________________ >>> Pandas-dev mailing list >>> Pandas-dev at python.org >>> https://mail.python.org/mailman/listinfo/pandas-dev >>> >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbrockmendel at gmail.com Thu Jun 11 12:28:50 2020 From: jbrockmendel at gmail.com (Brock Mendel) Date: Thu, 11 Jun 2020 09:28:50 -0700 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: <807A8451-2547-4891-95F3-B1211496CEA6@xs4all.nl> Message-ID: > So to make sure: You're objecting to a column-store in principle, or you're objecting to the project saying we think it's a good idea, or...? Not at all. I look forward to seeing an implementation so that we can actually make an informed decision as to whether or not we want to use it. 
I object to a) declaring ex-ante that we intend to replace the existing BlockManager with it and b) effectively declaring a moratorium on improvements to the existing code. On Thu, Jun 11, 2020 at 9:01 AM Tom Augspurger wrote: > > > On Thu, Jun 11, 2020 at 10:51 AM Brock Mendel > wrote: > >> > Does this summary accurately capture the discussion? >> >> Not quite. >> >> > there was agreement with the goal of simplifying pandas' internals, >> >> Yes. >> >> > and making DataFrame a column-store seems to be the best way to achieve >> that. >> >> No. >> >> We will not know this until we see an implementation. Nor will we know >> the performance impact. My expectation is that the performance impact will >> lead to a bunch of workarounds that cut against the simplification. >> >> I strongly object to committing to this before having this information. >> > > It'd be good to clarify exactly what you object to committing to. Changing > the Block Manager is a large task, made especially difficult by us being an > open-source project with many stake-holders and limited funding. I think > that we as a project can say "We as a project think that making DataFrame a > column store is best", while still acknowledging that it's an uncertain > goal that may be abandoned if it turns out to be a bad idea. > > So to make sure: You're objecting to a column-store in principle, or > you're objecting to the project saying we think it's a good idea, or...? > > >> --- >> I have tried to avoid bringing up 2D EAs in this conversation, but the >> term "best way" requires a discussion of alternatives. >> >> Allowing 2D EAs will allow for a large fraction of the same >> simplifications (grep for "TODO(EA2D)"), and will _improve_ performance (in >> eg reshape, arithmetic operations) instead of hurting it. It means >> removing workarounds rather than adding new ones. >> >> It also allows for an incremental upgrade path: opt-in for 1.X, then if >> we like it, required for 2.X. >> > > Will have thoughts on this later. > > >> ---- >> > Going forward, there are many pieces that can be done, some in parallel >> >> Related to but not identical to consolidation is the views vs copies on >> column indexing, GH#33780 >> , discussed on the >> previous call without a solid conclusion. The FUD largely boiled down to >> "some users could be relying on the current behavior and there isnt a nice >> way to deprecate it". On further reflection, this seems like an impossible >> standard to meet for _any_ change in not-tested/not-documented behavior. >> We should move to having column indexing being copy-free. >> > > I think I disagree with that, at least to a degree. But it's primarily > about views vs. copies so I'll take it to > https://github.com/pandas-dev/pandas/issues/33780. > > >> On Thu, Jun 11, 2020 at 7:56 AM Tom Augspurger < >> tom.augspurger88 at gmail.com> wrote: >> >>> We discussed this on the call yesterday >>> ( >>> https://docs.google.com/document/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit?usp=sharing >>> ). >>> I'll attempt a summary for the mailing list, and a proposed course of >>> action. >>> >>> In general, there was agreement with the goal of simplifying pandas' >>> internals, >>> and making DataFrame a column-store seems to be the best way to achieve >>> that. >>> The primary arguments against were implementation costs and possible >>> performance >>> slowdowns for very short and wide dataframes. 
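[For reference, the short-and-wide case mentioned above (the 1 row x 5000 columns shape from #24990) can be sketched as a micro-benchmark along these lines; this is an illustrative snippet, not one of the existing asv benchmarks.]

import numpy as np
import pandas as pd
from timeit import timeit

wide = pd.DataFrame(np.random.randn(1, 5000))  # today: one consolidated float block

# Row-wise reduction and elementwise arithmetic: with a single 2D block each is
# one vectorized call; done column-by-column, per-column overhead dominates at
# this shape, which is the slowdown being debated.
print(timeit(lambda: wide.sum(axis=1), number=100))
print(timeit(lambda: wide + 1, number=100))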
>>> >>> It was generally agreed that the change will need to be toggleable, >>> perhaps by a >>> parameter to the DataFrame constructor and a global option. This will >>> make it >>> easier to implement the new behavior and test it against existing >>> behavior, both >>> for us developers and users. >>> >>> We are keeping in mind the scikit-learn style usecase of boxing and >>> unboxing a >>> (homogenous) array in a DataFrame. We're committed to keeping that >>> 0-copy and >>> avoiding creating one Python object per column. >>> >>> Does this summary accurately capture the discussion? >>> >>> --- >>> >>> Going forward, there are many pieces that can be done, some in parallel. >>> Let's >>> keep that discussion on concrete details in >>> https://github.com/pandas-dev/pandas/issues/34669. >>> >>> I do want to highlight one overlapping area though. We have some PRs up >>> (most >>> from Brock) that affect consolidation today. Mostly disabling >>> consolidation in specific places. (e.g. >>> https://github.com/pandas-dev/pandas/pull/34683). My question: do we >>> want to >>> continue pursuing reduced consolidation *in the current block manager*? >>> >>> IMO, that's a tricky question to answer. The performance implications of >>> consolidation are hard, in part because it's so workload-dependent. >>> Sometimes, >>> it's completely avoided so it's a win. Other times, it's merely delayed >>> until an >>> operation that needs consolidated blocks, and so is a wash. And given >>> >>> 1. The unclear impact changing consolidation has on views vs. copies, >>> and our >>> unclear *policy* on when things are views vs. copies >>> 2. The real possibility of a non-consolidating, all-1D "Block" manager >>> in the >>> next year or two >>> 3. The unclear extent to which non-consolidated data is tested by our >>> unit tests. >>> >>> Certainly, fixing bugs is a worthy goal on its own. So to the extent >>> where (non)consolidation >>> causes buggy behavior we'll want to fix that. But overall, I think the >>> project's efforts would be >>> better focused elsewhere (ideally on progressing to the all 1-D block >>> manager, but wherever >>> we think is highest-value). >>> >>> Do others have thoughts on what changes should be made to the "pandas 1.x >>> BlockManager" while we work towards the "2.x BlockManager"? >>> >>> - Tom >>> >>> On Tue, Jun 9, 2020 at 10:46 AM Joris Van den Bossche < >>> jorisvandenbossche at gmail.com> wrote: >>> >>>> On Mon, 1 Jun 2020 at 20:07, Brock Mendel >>>> wrote: >>>> >>>>> Joris and I accidentally took part of the discussion off-thread. My >>>>> suggestion boils down to: Let's >>>>> >>>>> 1) Identify pieces of this that we want to do regardless of whether we >>>>> do the rest of it (e.g. consolidate only in internals, view-only indexing >>>>> on columns). >>>>> >>>> >>>> Personally I am not sure it is worth trying to change consolidation >>>> policies (moving to internals is certainly fine of course, but I mean eg >>>> delaying) or copy/view semantics for the *current*, consolidated >>>> BlockManager. >>>> >>>> But there are certainly pieces in the internals that can be changed >>>> which are useful regardless. I opened >>>> https://github.com/pandas-dev/pandas/issues/34669 to have a more >>>> concrete discussion about this on github. >>>> >>>> >>>>> 2) Beef up the asvs to be give closer-to-full measure of the tradeoffs >>>>> when an eventual proof of concept/PR is made. 
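The scikit-learn style boxing/unboxing use case from the summary above can be made concrete as follows. Whether the round trip is actually zero-copy depends on dtypes, the default copy flags and the pandas version, so this is a check to run rather than a guarantee:

    import numpy as np
    import pandas as pd

    arr = np.random.randn(10_000, 50)     # homogeneous float data, scikit-learn style

    df = pd.DataFrame(arr)                # "boxing": wrap the 2D array in a DataFrame
    out = df.to_numpy()                   # "unboxing": get the 2D array back

    # Typically True on pandas 1.x for a single consolidated block, i.e. the round
    # trip is zero-copy; keeping it that way is the commitment described above.
    print(np.shares_memory(arr, out))
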
>>>>> >>>>> >>>> We probably won't have a "one big PR" that is going to implement a >>>> simplified block manager, so it's not really clear to me how ASV will help >>>> with making a decision on this? >>>> (it will for sure be very useful *along the way* to keep track of >>>> where we need to optimize things to preserve performance) >>>> >>>> Joris >>>> >>>> _______________________________________________ >>>> Pandas-dev mailing list >>>> Pandas-dev at python.org >>>> https://mail.python.org/mailman/listinfo/pandas-dev >>>> >>> _______________________________________________ >>> Pandas-dev mailing list >>> Pandas-dev at python.org >>> https://mail.python.org/mailman/listinfo/pandas-dev >>> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Thu Jun 11 16:10:21 2020 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Thu, 11 Jun 2020 22:10:21 +0200 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: <807A8451-2547-4891-95F3-B1211496CEA6@xs4all.nl> Message-ID: On Thu, 11 Jun 2020 at 18:29, Brock Mendel wrote: > > So to make sure: You're objecting to a column-store in principle, or > you're objecting to the project saying we think it's a good idea, or...? > > Not at all. I look forward to seeing an implementation so that we can > actually make an informed decision as to whether or not we want to use it. > I object to a) declaring ex-ante that we intend to replace the existing > BlockManager with it and b) effectively declaring a moratorium on > improvements to the existing code. > We actually *have* prototypes: the prototype of the split-policy discussed in GH-10556 and for which I made a notebook benchmarking a few common operations as mentioned in my initial post (notebook ), and the prototype of using all integer extension arrays (and float with my PR). In the linked notebook, I show that, for the given dataframe, a set of common operations are not slower, or if slower, that there is a clear path towards optimizing this. I welcome a critical evaluation of this notebook. And since those are based on the current BlockManager, any version of a BlockManager specifically tailored to storing the columns separately, will only do a better job. I think that based on those prototypes we already can make an informed decision right now (or with some additional benchmarks based on those prototypes). For sure, if it turns out that we were wrong, we can later again abandon the idea, but I think we can already be *confident* that there is a high probability it will work out. Also, if performance is in the end the decisive criterion, I repeat my earlier remark in this thread: we need to be clearer about what we want / expect. Because with benchmarks you can prove anything you want, depending on what you choose to benchmark. So: what size of dataframe, which set of operations, .. do we care about? On Thu, Jun 11, 2020 at 10:51 AM Brock Mendel >> wrote: >> >> >>> --- >>> I have tried to avoid bringing up 2D EAs in this conversation, but the >>> term "best way" requires a discussion of alternatives. >>> >>> Allowing 2D EAs will allow for a large fraction of the same >>> simplifications (grep for "TODO(EA2D)"), and will _improve_ performance (in >>> eg reshape, arithmetic operations) instead of hurting it. It means >>> removing workarounds rather than adding new ones. 
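As a reminder of what those TODO(EA2D) workarounds deal with: the current manager already mixes 2D numpy-backed blocks with 1D extension blocks inside one DataFrame. A quick look, again via private, version-dependent attributes (exact block class names differ across versions):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(
        {
            "a": np.arange(3, dtype="float64"),          # backed by a 2D numpy block
            "b": pd.array([1, 2, None], dtype="Int64"),  # backed by a 1D extension array
        }
    )
    mgr = df._mgr if hasattr(df, "_mgr") else df._data   # private, version-dependent
    for blk in mgr.blocks:
        print(type(blk).__name__, blk.values.ndim)
    # e.g. FloatBlock 2 / ExtensionBlock 1 on pandas 1.x -- the 1D/2D mixture that
    # either 2D extension arrays or all-1D blocks would eliminate
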
>>> >>> It also allows for an incremental upgrade path: opt-in for 1.X, then if >>> we like it, required for 2.X. >>> >> >> In my original mail, I explicitly didn't mention 1D vs 2D extension arrays, but rather 1D vs 2D *blocks*. As for me, that is the core of the proposal. It is this column-store that will give additional simplifications by not having to care about 2D blocks (on top of getting rid of 1D/2D mixture, which could in itself also be solved by all 2D blocks), that will make it possible to get clearer copy/view semantics, that will make it easier to look into other improvements (like copy-on-write to avoid many copies in pandas operations, lazy selection filters, ..). So if we decide in the end to keep the consolidating blockmanager with 2D blocks, we certainly should consider 2D extension arrays, I fully agree on that. And indeed, the consolidating block manager is the alternative to consider. But: - "It means removing workarounds rather than adding new ones." -> whether we go all 1D or all 2D, we will initially need to keep workarounds for the other option in both cases, anyway, that's no different for all 1D or all 2D. But I am convinced that after we can remove the workarounds (eg in 2.0), the end result will be simpler in the all 1D case. - "It also allows for an incremental upgrade path: opt-in for 1.X, then if we like it, required for 2.X." -> we can perfectly provide an opt-in, incremental upgrade path for the all 1D case as well, I don't see why that would be different. Joris -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbrockmendel at gmail.com Thu Jun 11 17:34:52 2020 From: jbrockmendel at gmail.com (Brock Mendel) Date: Thu, 11 Jun 2020 14:34:52 -0700 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: <1462664690.2963025.1591907360318@mail.yahoo.com> References: <807A8451-2547-4891-95F3-B1211496CEA6@xs4all.nl> <1462664690.2963025.1591907360318@mail.yahoo.com> Message-ID: > We actually *have* prototypes: the prototype of the split-policy discussed AFAICT that is a 5 year old branch. Is there a version of this based off of master that you can show asv results for? > Also, if performance is in the end the decisive criterion, I repeat my earlier remark in this thread: we need to be clearer about what we want / expect. In principle, this is pretty much exactly what the asvs are supposed to represent. ---- You have demonstrated that you are willing to repeat yourself more than I am, to the point that I find pandas interactions more frustrating than fulfilling. I'm going to step away for a little while. On Thu, Jun 11, 2020 at 1:29 PM Daniel Scott wrote: > > > Sent from Yahoo Mail on Android > > > On Thu, Jun 11, 2020 at 3:10 PM, Joris Van den Bossche > wrote: > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jorisvandenbossche at gmail.com Fri Jun 12 16:34:18 2020 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Fri, 12 Jun 2020 22:34:18 +0200 Subject: [Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks In-Reply-To: References: <807A8451-2547-4891-95F3-B1211496CEA6@xs4all.nl> <1462664690.2963025.1591907360318@mail.yahoo.com> Message-ID: On Thu, 11 Jun 2020 at 23:35, Brock Mendel wrote: > > We actually *have* prototypes: the prototype of the split-policy > discussed > > AFAICT that is a 5 year old branch. Is there a version of this based off > of master that you can show asv results for? > > A correction here: that branch has been updated several times over the last 5 years, and a last time two weeks ago when I started this thread, as I explained in the github issue comment I linked to: https://github.com/pandas-dev/pandas/issues/10556#issuecomment-633703160 > > Also, if performance is in the end the decisive criterion, I repeat my > earlier remark in this thread: we need to be clearer about what we want / > expect. > > In principle, this is pretty much exactly what the asvs are supposed to > represent. > Well, I am repeating myself .. but I already mentioned that I am not sure ASV is fully useful for this, as that requires a complete working replacement, which is IMO too much to ask for an initial prototype. But OK, the message is clear: we need a more concrete implementation / prototype. So let's put this discussion aside for a moment, and focus on that instead. I will try to look at that in the coming weeks, but any help is welcome (and I will try to get it running with ASV, or at least a part of it). > ---- > You have demonstrated that you are willing to repeat yourself more than I > am, to the point that I find pandas interactions more frustrating than > fulfilling. I'm going to step away for a little while. > > I am sincerely sorry that you find this discussion frustrating. Healthy disagreement and discussion are an essential part of (open source) collaborative projects, but we also need to avoid getting tired of it. So maybe we should evaluate at some point the way this discussion went (including my own interactions) or how to improve our discussions in general. Joris -------------- next part -------------- An HTML attachment was scrubbed... URL: From simonjayhawkins at gmail.com Thu Jun 18 08:19:23 2020 From: simonjayhawkins at gmail.com (Simon Hawkins) Date: Thu, 18 Jun 2020 13:19:23 +0100 Subject: [Pandas-dev] ANN: Pandas 1.0.5 Released Message-ID: Hi all, I'm pleased to announce that pandas 1.0.5 is now available. This is a minor bug-fix release in the 1.0.x series and includes some regression fixes and bug fixes. We recommend that all users upgrade to this version. See the full whatsnew for a list of all the changes. The release will be available on the defaults and conda-forge channels: conda install pandas Or via PyPI: python3 -m pip install --upgrade pandas Please report any issues with the release on the pandas issue tracker . Thanks to all the contributors who made this release possible. - Simon -------------- next part -------------- An HTML attachment was scrubbed... URL:
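To confirm the upgrade took effect, a quick check from a Python prompt:

    import pandas as pd

    print(pd.__version__)     # should report '1.0.5' after upgrading
    pd.show_versions()        # full build/dependency summary, useful when reporting issues
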