[Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks

Maarten Ballintijn maartenb at xs4all.nl
Wed Jun 3 12:43:19 EDT 2020


Joris,

Thanks very much for your reply.

I can’t provide our exact data or code, but I’ll try to put together a sample of simulated data and operations
that closely matches our use cases.
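
As a very rough first sketch of the kind of data and operations I have in mind
(the sizes and operations below are illustrative assumptions, not our actual workload):

    import numpy as np
    import pandas as pd

    # Wide, homogeneous-dtype frame: ~2,000 "instruments" on a minute grid.
    # Our real frames are larger; this is only to fix ideas.
    rng = np.random.default_rng(0)
    index = pd.date_range("2020-01-01", periods=10_000, freq="T")
    columns = [f"c{i}" for i in range(2_000)]
    df = pd.DataFrame(rng.standard_normal((len(index), len(columns))),
                      index=index, columns=columns)

    # A few typical operations on such a frame:
    daily = df.resample("D").mean()          # time-based aggregation
    zscores = (df - df.mean()) / df.std()    # column-wise normalisation
    rolling = df.rolling(60).mean()          # 60-minute rolling window

I’ll follow up with something closer to our real use cases.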


Cheers,
Maarten


> On May 30, 2020, at 3:03 PM, Joris Van den Bossche <jorisvandenbossche at gmail.com> wrote:
> 
> Hi Maarten,
> 
> Thanks a lot for the feedback!
> 
> On Fri, 29 May 2020 at 20:31, Maarten Ballintijn <maartenb at xs4all.nl> wrote:
> 
> Hi Joris,
> 
> You said:
> 
>> But I also deliberately choose a dataframe where n_rows >> n_columns, because I personally would be fine if operations on wide dataframes (n_rows < n_columns) show a slowdown. But that is of course something to discuss / agree upon (eg up to which dataframe size or n_columns/n_rows ratio do we care about a performance degradation?).
> 
> This is an (the) important use case for us, and probably for many users in finance in general. I can easily imagine many other
> areas where data for thousands of elements (sensors, items, people) is stored on a time grid with a resolution of minutes or coarser
> (i.e. roughly n*1000 x m*1000 data, with n, m ~ 10 .. 100).
> 
> Why do you think this use case is no longer important? 
> 
> To be clear up front: I think wide dataframes are still an important use case. 
> 
> But to put my comment from above in more context: we had a performance regression reported (#24990 <https://github.com/pandas-dev/pandas/issues/24990>, which Brock referenced in his last mail) which was about a DataFrame with 1 row and 5000 columns. 
> And yes, for such a case I think it will basically be impossible to match the current performance exactly, even with a lot of optimizations, compared to storing this as a single, consolidated (1, 5000) array as is done now. And it is for such a case that I indeed say: I am willing to accept a limited slowdown here, if at the same time it gives us improved memory usage, performance improvements for more common cases, simplified internals that make pandas easier to contribute to and further optimize, etc.
> 
> But, I am also quite convinced that, with some optimization effort, we can at least preserve the current performance even for relatively wide dataframes (see e.g. this notebook <https://gist.github.com/jorisvandenbossche/25f240a221583002720b2edf0886d609> for some quick experiments). 
> And to be clear: doing such optimizations to ensure good performance for a variety of use cases is part of the proposal. Also, I think that simplified pandas internals should make it easier to further explore ways to specifically optimize the "homogeneous-dtype wide dataframe" use case.
> 
> Now, it is always difficult to make such claims in the abstract. 
> So what I personally think would be very valuable is if you could give some example use cases that you care about: e.g. a notebook creating some dummy data with characteristics similar to the data you are working with (or using real data, if openly available), plus a few typical operations you do on that data. 
> 
> Best,
> Joris
>  
> 
> We already have to drop into numpy on occasion to get sufficient performance. I would really prefer for pandas to
> improve in this area, not slide back.
> 
> Have a great weekend,
> Maarten
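
To make the consolidation point in the quoted (1, 5000) example concrete, here is a
small sketch poking at the internals. Note that _mgr / _data and .blocks are private,
version-dependent pandas internals, so this is for illustration only, not a stable API:

    import numpy as np
    import pandas as pd

    # The 1-row, 5000-column frame from the referenced issue (GH 24990).
    df = pd.DataFrame(np.random.randn(1, 5000))

    # Private, version-dependent internals -- illustration only.
    mgr = df._mgr if hasattr(df, "_mgr") else df._data
    for blk in mgr.blocks:
        print(type(blk).__name__, blk.values.shape)

    # With the current consolidating BlockManager this shows a single
    # float64 block holding a (5000, 1) array; with non-consolidating
    # 1D (column) blocks it would instead be 5000 length-1 arrays.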

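And on the "drop into numpy" remark quoted above, the pattern is roughly the following
(a simplified, hypothetical example, not our actual code):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.randn(10_000, 2_000))

    # Column-wise demeaning expressed in pandas ...
    demeaned_pd = df - df.mean()

    # ... versus dropping to numpy and re-wrapping the result, which avoids
    # some per-column overhead on wide frames.
    values = df.to_numpy()
    demeaned_np = pd.DataFrame(values - values.mean(axis=0),
                               index=df.index, columns=df.columns)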

