[Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks

Mon May 25 17:39:13 EDT 2020

Hi list,

Rewriting the BlockManager based on a simpler collection of 1D-arrays is
actually on our roadmap (see here
<https://pandas.pydata.org/docs/dev/development/roadmap.html#block-manager-rewrite>),
and I also touched on it in a mailing list discussion about pandas 2.0
earlier this year (see here
<https://mail.python.org/pipermail/pandas-dev/2020-February/001180.html>).

But the topic came up again at the last online dev meeting, and Uwe Korn
wrote a nice blog post
<https://uwekorn.com/2020/05/24/the-one-pandas-internal.html> about it
yesterday, so I thought I would write up why I think we should actually
move towards a simpler, non-consolidating BlockManager with 1D blocks.

*Simplification of the internals*

The many special cases for 1D ExtensionArrays (EAs) in the internals are
regularly brought up as a reason to have 2D EAs. But to be clear: the
additional complexity does not come from 1D EAs themselves, it comes from
the fact that we have a mixture of 2D and 1D blocks.
Solving this requires a consistent block dimension, which can be achieved
in two ways: have all 1D blocks, or have all 2D blocks.
So IMO, this is not an argument in favor of 2D blocks / consolidation.

Moreover, when going with all 1D blocks, we can not only remove the added
complexity from dealing with the mixture of 1D/2D blocks, we will *also* be
able to reduce the complexity of dealing with 2D blocks. A BlockManager
with 2D blocks is inherently more complex than with 1D blocks, as one needs
to deal with proper alignment of the blocks, a more complex "placement"
logic of the blocks, etc.

I think we would be able to simplify the internals a lot by going with a
BlockManager as a store of 1D arrays.
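
To make the difference concrete, here is a toy sketch of the two layouts.
It is not the actual pandas internals: the class names `ColumnStore` and
`ConsolidatedStore` are made up for illustration, and only NumPy is used.
The point is just that the consolidated variant has to carry a "placement"
mapping from column position to (block, row in block), which the 1D variant
does not need.

```python
import numpy as np


class ColumnStore:
    """Hypothetical non-consolidating store: one 1D array per column."""

    def __init__(self, columns):
        self.names = list(columns)
        self.arrays = [np.asarray(columns[name]) for name in self.names]

    def get(self, name):
        # Column position maps directly to one array.
        return self.arrays[self.names.index(name)]


class ConsolidatedStore:
    """Hypothetical consolidating store: one 2D block per dtype."""

    def __init__(self, columns):
        self.names = list(columns)
        by_dtype = {}
        for position, name in enumerate(self.names):
            arr = np.asarray(columns[name])
            by_dtype.setdefault(arr.dtype, []).append((position, arr))
        # Each block is (placement, values): placement lists the column
        # positions stored in the rows of the 2D values array.
        self.blocks = []
        for items in by_dtype.values():
            placement = [pos for pos, _ in items]
            values = np.vstack([arr for _, arr in items])
            self.blocks.append((placement, values))

    def get(self, name):
        # Lookup has to go through the placement of every block.
        position = self.names.index(name)
        for placement, values in self.blocks:
            if position in placement:
                return values[placement.index(position)]
        raise KeyError(name)


data = {"a": [1, 2, 3], "b": [1.0, 2.0, 3.0], "c": [4, 5, 6]}
simple = ColumnStore(data)
consolidated = ConsolidatedStore(data)
```

Both stores return the same values for a column, but the consolidated one
ends up with two blocks (integer and float) and the bookkeeping to find a
column inside them.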

*Performance*

Performance is typically given as a reason to have consolidated, 2D blocks.
And of course, certain operations (especially row-wise operations, or on
dataframes with more columns than rows) will always be faster when done on a
2D numpy array under the hood.
However, based on recent experimentation with this (e.g. triggered by the
block-wise frame ops PR
<https://github.com/pandas-dev/pandas/pull/32779>, and see also some
benchmarks I just posted in #10556
<https://github.com/pandas-dev/pandas/issues/10556#issuecomment-633703160> /
this gist
<https://gist.github.com/jorisvandenbossche/b8ae071ab7823f7547567b1ab9d4c20c>),
I also think that for many operations and with decent-sized dataframes,
this performance penalty is actually quite acceptable.
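
As a small illustration of the row-wise case mentioned above, the sketch
below computes the same row sums from a single 2D array and from a list of
1D arrays. It uses plain NumPy, not pandas internals, and makes no timing
claims; the variable names are only for this example. It shows why the 2D
layout is a single call while the 1D layout needs a loop over the columns.

```python
import numpy as np

n_rows, n_cols = 5, 3

# Consolidated layout: one 2D array, one row of the array per column.
block = np.arange(n_rows * n_cols, dtype="float64").reshape(n_cols, n_rows)
row_sums_2d = block.sum(axis=0)  # a single reduction over the block

# Non-consolidated layout: a separate 1D array per column.
columns = [block[i].copy() for i in range(n_cols)]
row_sums_1d = np.zeros(n_rows)
for col in columns:
    row_sums_1d += col  # one vectorized add per column
```

With few columns and many rows the loop runs only a handful of times; with
more columns than rows the per-column overhead starts to dominate.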

Further, there are also operations that will *benefit* from 1D blocks.
First, operations that now involve aligning/splitting blocks,
re-consolidation, etc. will benefit (e.g. a large part of the slowdown when
doing frame/frame operations column-wise is currently due to the
consolidation at the end). And operations like adding a column,
concatenating (with axis=1) or merging dataframes will be much faster when
no consolidation is needed.

Personally, I am convinced that with some effort, we can get on-par or
sometimes even better performance with 1D blocks compared to the
performance we have now for those cases that 90+% of our users care about:

   - With limited effort optimizing the column-wise code paths in the
   internals, we can get a long way.
   - After that, if needed, we can still consider if parts of the internals
   could be cythonized to further improve certain bottlenecks (and actually
   cythonizing this will also be simpler for a simpler non-consolidating block
   manager).

*Possibility to get better copy/view semantics*

Pandas is notorious for how much it copies ("you need 10x the memory
available as the size of your dataframe"), and having 1D blocks will allow
us to address part of those concerns.

*No consolidation = less copying.* Regularly consolidating introduces
copies, and thus removing consolidation will mean fewer copies. For example,
this would make it possible to add a single column to a dataframe without
having to copy the full dataframe.
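
A minimal NumPy sketch of that last point (again an illustration, not the
pandas code path): adding a column to a consolidated 2D block means
allocating a new block and copying the existing data into it, while with a
list of 1D arrays the existing columns are reused as-is.

```python
import numpy as np

# Consolidated: 2 columns x 3 rows stored in one 2D block.
block = np.arange(6, dtype="float64").reshape(2, 3)
new_col = np.array([10.0, 11.0, 12.0])

# Adding a column to the block allocates a new array and copies everything.
new_block = np.vstack([block, new_col])
block_was_copied = not np.shares_memory(new_block, block)

# Non-consolidated: each column is its own 1D array.
columns = [block[0].copy(), block[1].copy()]
new_columns = columns + [new_col]

# The existing column arrays are the very same objects, nothing was copied.
columns_reused = all(new is old for new, old in zip(new_columns, columns))
```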

*Copy / view semantics* Recently there has been discussion again around
whether selecting columns should be a copy or a view, and some other issues
were opened with questions about views/copies when slicing columns. In the
consolidated 2D block layout this will always be inherently messy and
unpredictable (it depends on the actual block layout, which in practice
makes it unpredictable for a user unaware of that layout).
Going with a non-consolidated BlockManager should at least allow us to get
better / more understandable semantics around this.
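
The layout dependence can be seen with plain NumPy (a sketch of the
underlying array behaviour, not a statement about what any particular
pandas version returns): selecting adjacent rows of a 2D block with a slice
gives a view, selecting scattered rows needs fancy indexing and gives a
copy. With separate 1D arrays, picking any subset of columns can always
hand back the original arrays.

```python
import numpy as np

# One 2D block holding 4 columns (one per row of the block) of 3 values.
block = np.arange(12).reshape(4, 3)

adjacent = block[0:2]      # basic slicing: a view on the block
scattered = block[[0, 2]]  # fancy indexing: always a copy

adjacent_is_view = np.shares_memory(adjacent, block)
scattered_is_view = np.shares_memory(scattered, block)

# Separate 1D arrays: any selection of columns can reuse the arrays.
columns = [block[i].copy() for i in range(4)]
picked = [columns[i] for i in (0, 2)]
picked_are_views = all(
    np.shares_memory(p, columns[i]) for p, i in zip(picked, (0, 2))
)
```

So whether the user gets a view or a copy depends, in the 2D case, on which
columns happen to sit next to each other in a block.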

------------------------------

*So what are the reasons to have 2D blocks?*

I personally don't directly see reasons to have 2D blocks *for pandas
itself* (apart from performance in certain row-wise use cases, and except
for the fact that we have "always done it like this"). But quite likely I
am missing reasons, so please bring them up.

I do think there are certainly use cases where 2D blocks can be useful,
but these are typically "external" (though nonetheless important) use
cases: conversion to/from numpy, xarray, etc. A typical example that has
recently come up is scikit-learn, where they want to have a cheap
dataframe <-> numpy array roundtrip for use in their pipelines.
However, I personally think there are ways we can still accommodate those
use cases, with some effort, while still having 1D blocks in pandas
itself. So IMO this is not sufficient to warrant the complexity of 2D
blocks in pandas.
(But I will stop here, as this mail is already getting long.)

Joris