[Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks

Joris Van den Bossche jorisvandenbossche at gmail.com
Tue May 26 04:55:08 EDT 2020


On Tue, 26 May 2020 at 00:46, Brock Mendel <jbrockmendel at gmail.com> wrote:

> Thanks for writing this up, Joris.  Assuming we go down this path, do you
> have an idea of how we get from here to there incrementally?  i.e.
> presumably this won't just be one massive PR
>

Yes, this is certainly not a one-PR change. I think there are multiple
options for working towards this that are worth discussing.

But personally, I would first like to focus on the "assuming we go down
this path" part. Let's discuss the pros, cons and trade-offs, and try to
turn assumptions into an agreed-upon roadmap.
(And of course, being on our roadmap does not mean that something can't
be questioned and discussed again in the future, as we are doing now.)

---

Some thoughts on possible options:

- We briefly discussed before the idea of using (nullable) extension dtypes
for all dtypes by default in pandas 2.0. If we strive towards that, and
assuming we keep the current 1D-restriction on ExtensionBlock, then we
would "automatically" get a BlockManager with 1D blocks. And we could then
focus on optimizing some code paths (e.g. constructing a new block)
specifically for the case of 1D ExtensionBlocks.
- A "consolidation policy" option, similar to the one in the branch discussed in
https://github.com/pandas-dev/pandas/issues/10556. Right now, that branch
still uses 2D blocks (but separate 2D blocks of shape (1, n) per column)
and not actually 1D blocks. So we could add 1D versions of our numeric
blocks as well. But that would probably add a lot of complexity, although
temporary, to the Blocks, so maybe not an ideal path forward.
- Add a version of the ExtensionBlock that can work with numpy arrays
instead of extension arrays, or actually use "PandasArrays" to store
them in the existing ExtensionBlock (to already start using the existing
1D blocks without requiring all dtypes to be extension dtypes).

Those are all about BlockManager with 1D blocks. Once we only have 1D
Blocks, I suppose there are many things we could simplify in the current
BlockManager. The intermediate step of the current BlockManager with 1D
blocks might not be an optimal situation, but it seems the easiest
intermediate goal in practice.
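
A rough illustration of the kind of simplification meant here, as a toy
sketch in plain numpy. This is not the actual pandas internals: the names
`blocks`, `placement`, `get_column_consolidated` and `get_column_1d` are
made up for the example. With consolidated 2D blocks, looking up a column
needs a placement mapping from the position in the frame to a row within
some block; with a store of 1D arrays, the position is just the list index.

```python
import numpy as np

# Consolidated layout: one 2D block per dtype. Each block carries a
# "placement" list saying which frame column each block row represents.
float_block = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])  # frame columns 0 and 2
int_block = np.array([[7, 8, 9]])                            # frame column 1
blocks = [(float_block, [0, 2]), (int_block, [1])]


def get_column_consolidated(blocks, position):
    """Find the block holding `position`, then the row inside that block."""
    for values, placement in blocks:
        if position in placement:
            return values[placement.index(position)]
    raise IndexError(position)


# Non-consolidated layout: one 1D array per column, no placement needed.
columns = [
    np.array([1.0, 2.0, 3.0]),
    np.array([7, 8, 9]),
    np.array([4.0, 5.0, 6.0]),
]


def get_column_1d(columns, position):
    """The frame position is simply the index into the list."""
    return columns[position]
```

Both functions return the same data for every position; the difference is
only in how much bookkeeping is needed to get there.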

It probably also depends on how much "backwards compatibility" or
"transition period" we want to provide.


> On Mon, May 25, 2020 at 2:39 PM Joris Van den Bossche <
> jorisvandenbossche at gmail.com> wrote:
>
>> Hi list,
>>
>> Rewriting the BlockManager based on a simpler collection of 1D-arrays is
>> actually on our roadmap (see here
>> <https://pandas.pydata.org/docs/dev/development/roadmap.html#block-manager-rewrite>),
>> and I also touched on it in a mailing list discussion about pandas 2.0
>> earlier this year (see here
>> <https://mail.python.org/pipermail/pandas-dev/2020-February/001180.html>
>> ).
>>
>> But since the topic came up again recently at the last online dev meeting
>> (and also Uwe Korn who wrote a nice blog post
>> <https://uwekorn.com/2020/05/24/the-one-pandas-internal.html> about this
>> yesterday), I thought to do a write-up of my thoughts on why I think we
>> should actually move towards a simpler, non-consolidating BlockManager with
>> 1D blocks.
>>
>>
>> *Simplification of the internals*
>>
>> A regularly cited reason to have 2D ExtensionArrays (EAs) is that
>> right now we have a lot of special cases for 1D EAs in the
>> internals. But to be clear: the additional complexity does not come from 1D
>> EAs in themselves, it comes from the fact that we have a mixture of 2D and 1D
>> blocks.
>> Solving this would require a consistent block dimension, and thus
>> removing this added complexity can be done in two ways: have all 1D blocks,
>> or have all 2D blocks.
>> Just to say: IMO, this is not an argument in favor of 2D blocks /
>> consolidation.
>>
>> Moreover, when going with all 1D blocks, we can not only remove the added
>> complexity from dealing with the mixture of 1D/2D blocks, we will *also* be
>> able to reduce the complexity of dealing with 2D blocks. A BlockManager
>> with 2D blocks is inherently more complex than with 1D blocks, as one needs
>> to deal with proper alignment of the blocks, a more complex "placement"
>> logic of the blocks, etc.
>>
>> I think we would be able to simplify the internals a lot by going with a
>> BlockManager as a store of 1D arrays.
>>
>>
>> *Performance*
>>
>> Performance is typically given as a reason to have consolidated, 2D
>> blocks. And of course, certain operations (especially row-wise operations,
>> or on dataframes with more columns than rows) will always be faster when done
>> on a 2D numpy array under the hood.
>> However, based on recent experimentation with this (e.g. triggered by the block-wise
>> frame ops PR <https://github.com/pandas-dev/pandas/pull/32779>, and see
>> also some benchmarks I just posted in #10556
>> <https://github.com/pandas-dev/pandas/issues/10556#issuecomment-633703160>
>>  / this gist
>> <https://gist.github.com/jorisvandenbossche/b8ae071ab7823f7547567b1ab9d4c20c>),
>> I also think that for many operations and with decent-sized dataframes,
>> this performance penalty is actually quite OK.
>>
>> Further, there are also operations that will *benefit* from 1D blocks.
>> First, operations that now involve aligning/splitting blocks,
>> re-consolidation, etc. will benefit (e.g. a large part of the slowdown when doing
>> frame/frame operations column-wise is currently due to the consolidation at
>> the end). And operations like adding a column, concatenating (with axis=1) or
>> merging dataframes will be much faster when no consolidation is needed.
>>
>> Personally, I am convinced that with some effort, we can get on-par or
>> sometimes even better performance with 1D blocks compared to the
>> performance we have now for those cases that 90+% of our users care about:
>>
>>    - With limited effort optimizing the column-wise code paths in the
>>    internals, we can get a long way.
>>    - After that, if needed, we can still consider if parts of the
>>    internals could be cythonized to further improve certain bottlenecks (and
>>    actually cythonizing this will also be simpler for a simpler
>>    non-consolidating block manager).
>>
>>
>> *Possibility to get better copy/view semantics*
>>
>> Pandas is notorious for how much it copies ("you need 10x the memory
>> available as the size of your dataframe"), and having 1D blocks will allow
>> us to address part of those concerns.
>>
>> *No consolidation = less copying.* Regularly consolidating introduces
>> copies, and thus removing consolidation will mean fewer copies. For example,
>> this would make it possible to add a single column to a dataframe
>> without having to copy the full dataframe.
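
A numpy-level illustration of this point, as a toy sketch and not the
pandas internals (`add_column_consolidated` and `add_column_1d` are
hypothetical helpers made up for the example). Adding a column to a
consolidated 2D block means allocating a new block and copying all
existing data into it, while a list of 1D arrays only gains one more
reference.

```python
import numpy as np


def add_column_consolidated(block, new_values):
    """Append a column to a 2D block of shape (n_columns, n_rows).

    np.vstack allocates a new array, so all existing columns are copied.
    """
    return np.vstack([block, new_values])


def add_column_1d(columns, new_values):
    """Append a column to a list of 1D arrays; existing arrays are reused."""
    return columns + [new_values]


block = np.arange(6, dtype="float64").reshape(2, 3)
new = np.array([10.0, 20.0, 30.0])

# Consolidated: the result does not share memory with the old block.
bigger_block = add_column_consolidated(block, new)
consolidated_copied = not np.shares_memory(bigger_block, block)

# Non-consolidated: the existing column arrays are the very same objects.
columns = [block[0].copy(), block[1].copy()]
more_columns = add_column_1d(columns, new)
columns_reused = all(a is b for a, b in zip(columns, more_columns))
```

The cost of the consolidated version grows with the size of the whole
block, whereas the 1D version does not touch the existing data at all.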
>>
>> *Copy / view semantics.* Recently there has been discussion again around
>> whether selecting columns should be a copy or a view, and some other issues
>> were opened with questions about views/copies when slicing columns. In the
>> consolidated 2D block layout this will always be inherently messy, and
>> unpredictable (meaning: depending on the actual block layout, which means
>> in practice unpredictable for the user unaware of the block layout).
>> Going with a non-consolidated BlockManager should at least allow us to
>> get better / more understandable semantics around this.
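
To make the layout dependence concrete, here is a small numpy-only sketch
(an illustration under simplifying assumptions, not how pandas selects
columns internally). Within one 2D block, columns that happen to sit next
to each other can be selected with a slice, which gives a view, while any
other selection needs integer-array indexing, which gives a copy. With
separate 1D arrays, the result does not depend on which columns are picked.

```python
import numpy as np

# One consolidated block holding three float columns as rows.
block = np.arange(12, dtype="float64").reshape(3, 4)

# Columns that happen to be adjacent in the block can be taken with a
# slice, which numpy returns as a view on the same memory.
adjacent = block[0:2]
# Non-adjacent columns need integer-array indexing, which always copies.
non_adjacent = block[[0, 2]]

slice_is_view = np.shares_memory(adjacent, block)
fancy_is_view = np.shares_memory(non_adjacent, block)

# With separate 1D arrays, selecting columns picks existing array
# objects, independent of which columns are selected.
columns = [block[i].copy() for i in range(3)]
picked = (0, 2)
selected = [columns[i] for i in picked]
selection_shares = all(
    np.shares_memory(s, columns[i]) for s, i in zip(selected, picked)
)
```

In the 1D case, whether a selection should then be exposed as a view or
as a copy becomes an explicit choice rather than an accident of the layout.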
>>
>>
>> ------------------------------
>>
>> *So what are the reasons to have 2D blocks?*
>>
>> I personally don't directly see reasons to have 2D blocks *for pandas
>> itself* (apart from performance in certain row-wise use cases, and
>> except for the fact that we have "always done it like this"). But quite
>> likely I am missing reasons, so please bring them up.
>>
>> I do think there are certainly use cases where 2D blocks can be useful,
>> but these are typically "external" (but nonetheless important) use cases: conversion
>> to/from numpy, xarray, etc. A typical example that has recently come up is
>> scikit-learn, where they want to have a cheap dataframe <-> numpy array
>> roundtrip for use in their pipelines.
>> However, I personally think there are possible ways that we can still
>> accommodate those use cases, with some effort, while still having 1D
>> Blocks in pandas itself. So IMO this is not sufficient to warrant the
>> complexity of 2D blocks in pandas.
>> (But I will stop here, as this mail is already getting long ...)
>>
>> Joris