[Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks

Joris Van den Bossche jorisvandenbossche at gmail.com
Fri May 29 13:34:01 EDT 2020


On Wed, 27 May 2020 at 23:07, Brock Mendel <jbrockmendel at gmail.com> wrote:

>
> The main upsides I see are a) internal complexity reduction, b) downstream
> library upsides, c) clearer view vs copy semantics, d) perf improvements
> from making fewer copies, e) clear "dict of Series" data model.
>
> The main downside is potential performance degradation (at the extreme end
> e.g. 3000x <https://github.com/pandas-dev/pandas/issues/24990> for
> arithmetic).  As Wes commented some of that can be ameliorated with
> compiled code but that cuts against the complexity reduction.
>
> I am looking for ways to quantify these tradeoffs so we can make an
> informed decision.

Can you try to explain a bit more what kind of quantification you are
looking for?

- Complexity: I think we agree a non-consolidating block manager *can* be
simpler? (And it's not only the internals; eg the algos also become
simpler.) But I am not sure this can be expressed in a number.
- Clearer view vs copy semantics: this is partly an issue of making pandas
easier to understand (both as a developer and as a user), which again seems
hard to quantify, and partly an issue of performance / memory usage. The
latter could potentially be measured (eg the memory usage of some typical
workflows), but it will probably only show its effect after a refactor /
implementation of the new semantics. (A small sketch of the current
ambiguity follows after this list.)
- Potential performance degradation: here you can measure things, and I
actually did that for some cases, see
https://gist.github.com/jorisvandenbossche/b8ae071ab7823f7547567b1ab9d4c20c
(the notebook that I posted in #10556
<https://github.com/pandas-dev/pandas/issues/10556> a few days ago).
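
To illustrate the view-vs-copy point above, a minimal sketch of today's
(mid-2020, pre-copy-on-write) behaviour; the data and column names are of
course just made up:

    import pandas as pd

    df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

    # Selecting a column currently gives a Series that views the parent's
    # data, so an in-place modification leaks back into the DataFrame.
    col = df["a"]
    col.iloc[0] = 100.0
    print(df.loc[0, "a"])    # 100.0 -> the parent changed as well

    # Boolean indexing returns a copy, so the same pattern typically emits
    # a SettingWithCopyWarning and leaves the parent untouched.
    sub = df[df["b"] > 4.0]
    sub["b"] = 0.0
    print(df["b"].tolist())  # [4.0, 5.0, 6.0] -> unchanged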

However: 1) a lot depends on what kind of dataframe you take for your
benchmarks (number of rows vs number of columns), 2) there are of course a
lot of potential operations to test, 3) there will be a set of operations
that will always be slower with a columnar dataframe, whatever the
optimization, and 4) we would be testing with current pandas, which is
often not yet optimized for column-wise operations.
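
For concreteness, something along these lines is what I have in mind as a
small benchmark matrix (the shapes and operations below are just
illustrative placeholders, not the ones from the gist):

    import numpy as np
    import pandas as pd
    from timeit import timeit

    # Same number of cells, very different shapes: which one you pick
    # largely determines the conclusions you draw.
    long_df = pd.DataFrame(np.random.randn(1_000_000, 10))
    wide_df = pd.DataFrame(np.random.randn(1_000, 10_000))

    for name, df in [("long", long_df), ("wide", wide_df)]:
        arith = timeit(lambda: df + 1, number=10)
        reduce_time = timeit(lambda: df.sum(axis=0), number=10)
        print(f"{name}: arithmetic {arith:.3f}s, reduction {reduce_time:.3f}s")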

I would be fine with choosing a set of example datasets and example
operations on which we can run such comparisons.
My notebook linked above is already something like that, I think, albeit in
a limited form. From that set of timings, I personally don't see any
insurmountable performance degradations.

But I also deliberately chose a dataframe where n_rows >> n_columns,
because I personally would be fine if operations on wide dataframes (n_rows
< n_columns) showed a slowdown. But that is of course something to discuss /
agree upon (eg up to which dataframe size or n_columns/n_rows ratio do we
care about a performance degradation?).
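
As a rough way to get a feeling for that per-column overhead with current
pandas, one can force an operation to run column by column (only a crude
stand-in for what an unoptimized columnar manager would do; the shape is
arbitrary):

    import numpy as np
    import pandas as pd
    from timeit import timeit

    wide = pd.DataFrame(np.random.randn(100, 5_000))

    # One vectorized call on the consolidated 2D block ...
    block_wise = timeit(lambda: wide + 1, number=20)
    # ... versus an explicit Python-level loop over the 5000 columns.
    column_wise = timeit(
        lambda: pd.DataFrame({c: wide[c] + 1 for c in wide.columns}),
        number=20,
    )
    print(f"block-wise: {block_wise:.3f}s  column-wise: {column_wise:.3f}s")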

Joris