[Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks

Jeff Reback jeffreback at gmail.com
Mon Jun 1 14:16:13 EDT 2020


+1 on Brock's suggestions here

currently -1 on moving to add a lazy block manager

i see this as simply increasing complexity

> On Jun 1, 2020, at 2:07 PM, Brock Mendel <jbrockmendel at gmail.com> wrote:
> 
> 
> Joris and I accidentally took part of the discussion off-thread.  My suggestion boils down to: Let's
> 
> 1) Identify pieces of this that we want to do regardless of whether we do the rest of it (e.g. consolidate only in internals, view-only indexing on columns).
> 2) Beef up the ASVs to give a closer-to-full measure of the tradeoffs when an eventual proof of concept/PR is made.
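> 
> To make (2) a bit more concrete, here is a rough sketch of the kind of asv benchmark I have in mind (the class name, shapes, and operations are just placeholders, not an existing benchmark in asv_bench):
> 
>     import numpy as np
>     import pandas as pd
> 
>     class WideVsTall:
>         # shapes chosen to cover a very wide, a square-ish, and a tall frame
>         params = ["1x5000", "1000x1000", "1000000x10"]
>         param_names = ["shape"]
> 
>         def setup(self, shape):
>             nrows, ncols = (int(n) for n in shape.split("x"))
>             self.df = pd.DataFrame(np.random.randn(nrows, ncols))
> 
>         def time_sum_axis1(self, shape):
>             # row-wise reduction: worst case for column-by-column storage
>             self.df.sum(axis=1)
> 
>         def time_getitem_column(self, shape):
>             # single-column access: should become a cheap view
>             self.df[0]
> 
>         def time_copy(self, shape):
>             self.df.copy()
> 
> Running something like this against both master and an eventual proof of concept would give a more complete picture of the tradeoffs than the current suite.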
> 
>> On Mon, Jun 1, 2020 at 2:44 AM Joris Van den Bossche <jorisvandenbossche at gmail.com> wrote:
>>> On Sat, 30 May 2020 at 23:55, Adrin <adrin.jalali at gmail.com> wrote:
>>> Although 1 x 5000 may sound like an edge case, my whole 4 years of research were on 500 x 450000 data. Those use cases are probably more common than we may think.
>> 
>> It's still a lower column/row ratio than 1x5000 ;) (although not by that much)
>> (it is this ratio that mostly determines whether the overhead of operating column by column starts to dominate)
>> 
>> But joking aside: yes, it is quite probable that those use cases are more common than I think. I have never really worked with such data myself, so again: this kind of feedback is very useful!
>> Also in our user survey from last year, a majority indicated that they occasionally use wide dataframes (although "wide" was described as "100s of columns or more", which is not necessarily that wide).
>> 
>> Now, to reiterate: 
>> 
>> - You will still be able to use pandas with wide dataframes; you just might "pay a price" for using a flexible data structure like a dataframe (one that allows heterogeneous dtypes, allows inserting columns cheaply, ...) for a use case that might not need that flexibility. And again, with some optimization effort, I think we can keep this "cost" to a minimum.
>> - It might actually be that a different data model fits your use case better, such as xarray (Adrin, since you are a bit familiar with xarray, would you in hindsight rather have used that for your research?)
>> - I think that by simplifying the pandas internals, it would actually become easier to better support the wide dataframe use case as well. Jeff mentioned it before as the "DataMatrix", and Stephan also mentioned it on Twitter. If we can simplify the internals, it would become more realistic to have a DataFrame variant that is, for example, backed by a single ndarray but supports the familiar DataFrame API (or at least a subset of it) without converting to a columnar DataFrame (a rough sketch of what I mean follows below).
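>> 
>> To make that last point a bit more concrete, here is a very rough, purely hypothetical sketch of what such a single-ndarray-backed object could look like (the class and method names are made up; nothing like this exists in pandas today):
>> 
>>     import numpy as np
>>     import pandas as pd
>> 
>>     class DataMatrix:
>>         """Homogeneous-dtype 2D data with a small DataFrame-like API (sketch only)."""
>> 
>>         def __init__(self, values, columns):
>>             self._values = np.asarray(values)   # one consolidated 2D block
>>             self._columns = list(columns)
>> 
>>         def __getitem__(self, key):
>>             # column access is just a view into the 2D array
>>             return self._values[:, self._columns.index(key)]
>> 
>>         def sum(self, axis=0):
>>             # reductions operate on the whole block at once,
>>             # avoiding any per-column overhead
>>             return self._values.sum(axis=axis)
>> 
>>         def to_frame(self):
>>             # convert to a "real" columnar DataFrame when the flexibility is needed
>>             return pd.DataFrame(self._values, columns=self._columns)
>> 
>> The point is not this particular API, but that with simpler internals such a specialized, wide-data-friendly container becomes much easier to build and maintain alongside the columnar DataFrame.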
>> 
>> On Twitter I said "pandas doesn't need to be the best solution for a variety of use cases". But I should probably have said: "pandas cannot be the best solution for different use cases at the same time". Supporting wide dataframes optimally right now comes at the cost of not supporting heterogeneous dataframes as well as we could.
>> But again, if there appears to be enough interest and there are people who want to contribute to this effort, I think we should investigate how we can actually support both cases (my last point in the above list).
>> 
>> Joris
>>  
>>> 
>>>> On Sat., May 30, 2020, 21:03 Joris Van den Bossche, <jorisvandenbossche at gmail.com> wrote:
>>>> Hi Maarten,
>>>> 
>>>> Thanks a lot for the feedback!
>>>> 
>>>>> On Fri, 29 May 2020 at 20:31, Maarten Ballintijn <maartenb at xs4all.nl> wrote:
>>>>> 
>>>>> Hi Joris,
>>>>> 
>>>>> You said:
>>>>> 
>>>>>> But I also deliberately chose a dataframe where n_rows >> n_columns, because I personally would be fine if operations on wide dataframes (n_rows < n_columns) show a slowdown. But that is of course something to discuss / agree upon (e.g. up to which dataframe size or n_columns/n_rows ratio do we care about a performance degradation?).
>>>>> 
>>>>> This is an (the) important use case for us, and probably for a lot of users in finance in general. I can easily imagine many other
>>>>> areas that store data for 1000s of elements (sensors, items, people) on a grid with time scales of minutes or more
>>>>> (n*1000 x m*1000 data with n, m ~ 10 .. 100).
>>>>> 
>>>>> Why do you think this use case is no longer important? 
>>>> 
>>>> To be clear up front: I think wide dataframes are still an important use case. 
>>>> 
>>>> But to put my comment from above in more context: we had a performance regression reported (#24990, which Brock referenced in his last mail) about a DataFrame with 1 row and 5000 columns. 
>>>> And yes, for such a case, I think it will basically be impossible to preserve exact performance, even with a lot of optimizations, compared to storing this as a single, consolidated (1, 5000) array as is done now. And it is for such a case that I indeed say: I am willing to accept a limited slowdown for this, if it at the same time gives us improved memory usage, performance improvements for more common cases, simplified internals making it easier to contribute to and further optimize pandas, etc.
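>>>> 
>>>> For reference, a toy reproduction of that shape (my own illustrative snippet, not the exact code from the issue):
>>>> 
>>>>     import numpy as np
>>>>     import pandas as pd
>>>> 
>>>>     # 1 row x 5000 columns: almost no data, but a lot of columns
>>>>     df = pd.DataFrame(np.random.randn(1, 5000))
>>>> 
>>>>     # with a consolidated (1, 5000) block these are single ndarray operations;
>>>>     # with 5000 separate one-element columns, the fixed per-column overhead
>>>>     # dominates the (negligible) actual computation
>>>>     df.sum(axis=1)
>>>>     df + 1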
>>>> 
>>>> But I am also quite convinced that, with some optimization effort, we can at least preserve the current performance even for relatively wide dataframes (see e.g. this notebook for some quick experiments). 
>>>> And to be clear: doing such optimizations to ensure good performance for a variety of use cases is part of the proposal. Also, I think that simplified pandas internals should make it easier to further explore ways to specifically optimize the "homogeneous-dtype wide dataframe" use case.
>>>> 
>>>> Now, it is always difficult to make such claims in the abstract. 
>>>> So what I personally think would be very valuable is if you could give some example use cases that you care about (e.g. a notebook creating some dummy data with similar characteristics to the data you are working with, or using real data if openly available, and a few typical operations you do on it). 
>>>> 
>>>> Best,
>>>> Joris
>>>>  
>>>>> 
>>>>> We already have to drop into numpy on occasion to make the performance sufficient. I would really prefer for pandas to
>>>>> improve in this area, not slide back.
>>>>> 
>>>>> Have a great weekend,
>>>>> Maarten
>>>>> 
>>>>> 
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev

