[Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks

Mon May 25 18:45:57 EDT 2020

Thanks for writing this up, Joris.  Assuming we go down this path, do you
have an idea of how we get from here to there incrementally?  Presumably
this won't just be one massive PR.

On Mon, May 25, 2020 at 2:39 PM Joris Van den Bossche <
jorisvandenbossche at gmail.com> wrote:

> Hi list,
>
> Rewriting the BlockManager based on a simpler collection of 1D-arrays is
> actually on our roadmap (see here
> <https://pandas.pydata.org/docs/dev/development/roadmap.html#block-manager-rewrite>),
> and I also touched on it in a mailing list discussion about pandas 2.0
> earlier this year (see here
> <https://mail.python.org/pipermail/pandas-dev/2020-February/001180.html>).
>
> But since the topic came up again at the last online dev meeting (and Uwe
> Korn wrote a nice blog post
> <https://uwekorn.com/2020/05/24/the-one-pandas-internal.html> about it
> yesterday), I thought I would write up why I think we should actually move
> towards a simpler, non-consolidating BlockManager with 1D blocks.
>
>
> *Simplification of the internals*
>
> The fact that we currently have a lot of special cases for 1D
> ExtensionArrays (EAs) in the internals is regularly brought up as a reason
> to have 2D EAs. But to be clear: the additional complexity does not come
> from 1D EAs in themselves, it comes from the fact that we have a mixture of
> 2D and 1D blocks.
> Solving this requires a consistent block dimension, which can be achieved
> in two ways: have all 1D blocks, or have all 2D blocks.
> So, IMO, this is not an argument in favor of 2D blocks / consolidation.
>
> Moreover, when going with all 1D blocks, we can not only remove the added
> complexity from dealing with the mixture of 1D/2D blocks, we will *also* be
> able to reduce the complexity of dealing with 2D blocks. A BlockManager
> with 2D blocks is inherently more complex than with 1D blocks, as one needs
> to deal with proper alignment of the blocks, a more complex "placement"
> logic of the blocks, etc.
>
> I think we would be able to simplify the internals a lot by going with a
> BlockManager as a store of 1D arrays.
>
>
> *Performance*
>
> Performance is typically given as a reason to have consolidated, 2D
> blocks. And of course, certain operations (especially row-wise operations,
> or operations on dataframes with more columns than rows) will always be
> faster when done on a 2D numpy array under the hood.
> However, based on recent experimentation with this (e.g. triggered by the
> block-wise frame ops PR <https://github.com/pandas-dev/pandas/pull/32779>,
> and see also some benchmarks I just posted in #10556
> <https://github.com/pandas-dev/pandas/issues/10556#issuecomment-633703160>
>  / this gist
> <https://gist.github.com/jorisvandenbossche/b8ae071ab7823f7547567b1ab9d4c20c>),
> I also think that for many operations and with decent-sized dataframes,
> the performance penalty of 1D blocks is actually quite acceptable.
>
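
To make the row-wise case concrete, here is a minimal NumPy-only sketch. It
is a toy stand-in and does not use pandas internals; the names block_2d and
columns are made up for illustration. It shows that a row-wise sum is one
vectorized call on a 2D array, but one call per column on a collection of
1D arrays, which is where the extra overhead for wide frames comes from.

```python
from functools import reduce

import numpy as np

n_rows, n_cols = 5, 4
block_2d = np.arange(n_rows * n_cols, dtype="float64").reshape(n_rows, n_cols)

# Consolidated: a row-wise sum is a single vectorized call on the 2D array.
row_sums_2d = block_2d.sum(axis=1)

# Non-consolidated: the same data held as independent 1D columns
# (copies, so nothing is shared with block_2d).
columns = [block_2d[:, i].copy() for i in range(n_cols)]

# The row-wise sum now needs one vectorized add per column, so the
# Python-level overhead grows with the number of columns.
row_sums_1d = reduce(np.add, columns)

same_result = np.array_equal(row_sums_2d, row_sums_1d)
```

Both layouts give the same values; only the number of Python-level calls
differs, which matters mostly when there are many more columns than rows.
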
> Further, there are also operations that will *benefit* from 1D blocks.
> First, operations that now involve aligning/splitting blocks or
> re-consolidation will benefit (e.g. a large part of the slowdown when doing
> frame/frame operations column-wise is currently due to the consolidation at
> the end). Second, operations like adding a column, concatenating (with
> axis=1) or merging dataframes will be much faster when no consolidation is
> needed.
>
> Personally, I am convinced that with some effort, we can get on-par or
> sometimes even better performance with 1D blocks compared to the
> performance we have now for those cases that 90+% of our users care about:
>
>    - With limited effort optimizing the column-wise code paths in the
>    internals, we can get a long way.
>    - After that, if needed, we can still consider if parts of the
>    internals could be cythonized to further improve certain bottlenecks (and
>    actually cythonizing this will also be simpler for a simpler
>    non-consolidating block manager).
>
>
> *Possibility to get better copy/view semantics*
>
> Pandas is notorious for how much it copies ("you need 10x the memory
> available as the size of your dataframe"), and having 1D blocks will allow
> us to address part of those concerns.
>
> *No consolidation = less copying.* Regularly consolidating introduces
> copies, and thus removing consolidation will mean fewer copies. For example,
> this would make it possible to add a single column to a dataframe without
> having to copy the full dataframe.
>
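
As a rough illustration of the copying point, here is a NumPy-only sketch.
It assumes a toy model: one 2D array stands in for a consolidated block, and
a plain dict of 1D arrays stands in for the non-consolidated store. Neither
is pandas' actual internal structure, and all names are hypothetical.

```python
import numpy as np

n_rows = 1_000

# Consolidated layout: one 2D block holding three float64 columns.
block_2d = np.zeros((n_rows, 3))
new_col = np.ones(n_rows)

# Adding a column to the consolidated block allocates a new 2D array
# and copies every existing value into it.
bigger_2d = np.column_stack([block_2d, new_col])
copied_existing = not np.shares_memory(bigger_2d, block_2d)

# Non-consolidated layout: a plain dict of independent 1D arrays
# (a stand-in for the proposed store, not pandas' actual internals).
columns = {name: np.zeros(n_rows) for name in "abc"}
before = dict(columns)  # shallow copy: same array objects
columns["d"] = new_col  # no existing array is touched or copied
kept_existing = all(columns[k] is before[k] for k in before)
```

In the 2D case the cost of adding one column scales with the size of the
whole block; in the 1D case it is independent of the existing columns.
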
> *Copy / view semantics* Recently there has been discussion again around
> whether selecting columns should be a copy or a view, and some other issues
> were opened with questions about views/copies when slicing columns. In the
> consolidated 2D block layout this will always be inherently messy and
> unpredictable: the result depends on the actual block layout, which in
> practice makes it unpredictable for a user unaware of that layout.
> Going with a non-consolidated BlockManager should at least allow us to get
> better / more understandable semantics around this.
>
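
A small NumPy-only sketch of why the layout matters for views vs copies.
This is an illustration under assumptions: the 2D array is a toy
"consolidated block" whose axis orientation is chosen for readability rather
than to mirror pandas' real block layout, and the dict called store is a
hypothetical stand-in for a collection of 1D arrays.

```python
import numpy as np

# One consolidated 2D block holding three float64 columns (toy layout).
block = np.arange(12, dtype="float64").reshape(4, 3)

# Selecting columns from the block: whether you get a view or a copy
# depends on which columns you ask for.
adjacent = block[:, 0:2]        # basic slice: a view on the block
scattered = block[:, [0, 2]]    # integer-list indexing: a copy

adjacent_is_view = np.shares_memory(adjacent, block)
scattered_is_view = np.shares_memory(scattered, block)

# A store of independent 1D arrays (hypothetical stand-in): selecting
# any subset of columns just reuses the same array objects.
store = {name: block[:, i].copy() for i, name in enumerate("abc")}
subset = {name: store[name] for name in ("a", "c")}
subset_is_view = all(subset[k] is store[k] for k in subset)
```

With the 2D block, two very similar-looking column selections differ in
whether they share memory with the parent; with 1D arrays the answer is the
same for any subset of columns.
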
>
> ------------------------------
>
> *So what are the reasons to have 2D blocks?*
>
> I personally don't directly see reasons to have 2D blocks *for pandas
> itself* (apart from performance in certain row-wise use cases, and except
> for the fact that we have "always done it like this"). But quite likely I
> am missing reasons, so please bring them up.
>
> I do think there are use cases where 2D blocks can be useful, but these
> are typically "external" (though nonetheless important) use cases:
> conversion to/from numpy, xarray, etc. A typical example that has recently
> come up is scikit-learn, where they want to have a cheap dataframe <->
> numpy array roundtrip for use in their pipelines.
> However, I personally think there are possible ways to still accommodate
> those use cases, with some effort, while having 1D blocks in pandas itself.
> So IMO this is not sufficient to warrant the complexity of 2D blocks in
> pandas.
> (But I will stop here, as this mail is already getting long ...)
>
> Joris