[Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks

Fri May 29 14:31:44 EDT 2020

Hi Joris,

You said:

> But I also deliberately chose a dataframe where n_rows >> n_columns, because I personally would be fine if operations on wide dataframes (n_rows < n_columns) show a slowdown. But that is of course something to discuss / agree upon (eg up to which dataframe size or n_columns/n_rows ratio do we care about a performance degradation?).

This is an (the) important use case for us, and probably for a lot of users in finance in general. I can easily imagine many other
areas where data for 1000s of elements (sensors, items, people) is stored on a time grid with steps of minutes or more
(n*1000 x m*1000 data with n, m ~ 10 .. 100).

Why do you think this use case is no longer important? 

We already have to drop into numpy on occasion to make the performance sufficient. I would really prefer for Pandas to
improve in this area, not slide back.
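
As a rough illustration of the kind of comparison behind this concern, here is a minimal sketch that times the same elementwise addition on a wide DataFrame and on the underlying NumPy array. The shape (500 x 2000), the `time_op` helper and the choice of `df + df` as the operation are assumptions made for illustration only; they are not taken from the notebook or from any real workload, and no timings are quoted because they depend on the machine and the pandas version.

```python
import time

import numpy as np
import pandas as pd


def time_op(fn, repeat=3):
    """Return the best wall-clock time in seconds over `repeat` calls (hypothetical helper)."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


# Assumed wide shape: fewer rows (timestamps) than columns (instruments, sensors, ...).
n_rows, n_cols = 500, 2000
values = np.random.default_rng(0).standard_normal((n_rows, n_cols))
df = pd.DataFrame(values)

# Same arithmetic, once through pandas and once directly on the array.
t_pandas = time_op(lambda: df + df)
t_numpy = time_op(lambda: values + values)

print(f"pandas: {t_pandas:.6f} s, numpy: {t_numpy:.6f} s")
```

Swapping `n_rows` and `n_cols` gives the tall case, so the same script can be used to compare both shapes on a given pandas build.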

Have a great weekend,
Maarten

> On May 29, 2020, at 1:34 PM, Joris Van den Bossche <jorisvandenbossche at gmail.com> wrote:
> 
> On Wed, 27 May 2020 at 23:07, Brock Mendel <jbrockmendel at gmail.com> wrote:
> 
> The main upsides I see are a) internal complexity reduction, b) downstream library upsides, c) clearer view vs copy semantics, d) perf improvements from making fewer copies, e) clear "dict of Series" data model.
> 
> The main downside is potential performance degradation (at the extreme end e.g. 3000x <https://github.com/pandas-dev/pandas/issues/24990> for arithmetic). As Wes commented, some of that can be ameliorated with compiled code, but that cuts against the complexity reduction.
> 
> I am looking for ways to quantify these tradeoffs so we can make an informed decision.
> 
> Can you try to explain a bit more what kind of quantification you are looking for? 
> 
> - Complexity: I think we agree a non-consolidating block manager can be simpler? (and it's not only the internals, also eg the algos become simpler). But I am not sure this can be expressed in a number.
> - Clearer view vs copy semantics: this is partly an issue of making pandas easier to understand (both as developer and user), which again seems hard to quantify. And partly an issue of performance / memory usage. This is something that could potentially be measured (eg the memory usage of some typical workflows). But this is probably also something that might only show effect after a refactor / implementation of new semantics.
> - Potential performance degradation: here you can measure things, and I actually did that for some cases, see https://gist.github.com/jorisvandenbossche/b8ae071ab7823f7547567b1ab9d4c20c (the notebook that I posted in #10556 <https://github.com/pandas-dev/pandas/issues/10556> a few days ago).
> 
> However: 1) a lot depends on what kind of dataframe you take for your benchmarks (number of rows vs number of columns), 2) there are of course a lot of potential operations to test, 3) there will be a set of operations that will always be slower with a columnar dataframe, whatever the optimization, and 4) we would be testing with current pandas, which is often not yet optimized for column-wise operations.
> 
> I would be fine with choosing a set of example datasets with example operations, on which we can have some comparisons. 
> My notebook linked above is already something like that (in a limited form), I think. From this set of timings, I personally don't see any insurmountable performance degradations. 
> 
> But I also deliberately chose a dataframe where n_rows >> n_columns, because I personally would be fine if operations on wide dataframes (n_rows < n_columns) show a slowdown. But that is of course something to discuss / agree upon (eg up to which dataframe size or n_columns/n_rows ratio do we care about a performance degradation?).
> 
> Joris
>  
