[Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks

Jeff Reback jeffreback at gmail.com
Tue May 26 08:16:53 EDT 2020


A little historical perspective

10 years ago the standard input to a DataFrame was a single-dtype 2D numpy array. This provided the following nice properties (a short sketch follows the list):

- zero-cost construction: you can simply wrap a DataFrame around the input with very little overhead. This provided a labeled-array interface, which gained pandas users
- very fast reductions: the block is passed directly to numpy, which can then reduce over contiguous, aligned memory
- almost all pandas operations coerced to float64, so the single-dtype layout was preserved
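To make the first two points concrete, a minimal sketch (the array shape is arbitrary; construction from a 2D ndarray is zero-copy by default):

    import numpy as np
    import pandas as pd

    arr = np.random.randn(1_000_000, 20)   # single-dtype 2D input
    df = pd.DataFrame(arr)                 # wraps the array: no copy by default

    arr[0, 0] = 99.0
    print(df.iloc[0, 0])                   # 99.0 -> the frame is viewing arr

    # a row-wise reduction hands the whole 2D block to numpy in one call
    row_sums = df.sum(axis=1)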

The BlockManager is optimized for this case, as this was the original DataMatrix. It serves that purpose pretty well.

In the last few years things have changed in the following ways:

- a dict of 1D numpy arrays is by far the most common construction
- heterogeneous dtypes have grown quite a bit, e.g. it's now very common to use int8 or float32; these are also preserved pretty well by pandas operations
- non-numpy-backed dtypes are increasingly common

To me, removing the block manager is not about performance but about simplifying the code and the mental model. That said, we should be mindful that construction from 2D inputs will require splitting and thus will not be cheap (note that you can view the 1D slices without copying, but those views are strided rather than contiguous in memory). This is a typical trap folks fall into: 1D looks all rosy, but it all depends on the use case.
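To make the trap concrete, a quick illustration (sizes are made up; the memory layout is the point):

    import numpy as np

    arr = np.ones((1_000_000, 10))      # C-ordered: each row is contiguous
    col = arr[:, 0]                     # a zero-copy view of the first column...
    print(col.flags['C_CONTIGUOUS'])    # False: elements sit 10 floats apart

    # reducing the strided view walks memory with a large stride,
    # so it is slower than reducing a contiguous copy of the same data
    col.sum()
    col.copy().sum()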

I think it would be OK for pandas to move to a dict of columns and simply document the non-performant cases (e.g. very wide single-dtype frames, or construction from 2D arrays).

I suppose it's also possible to reinvent the DataMatrix in a limited form, but that of course adds complexity, and I would like to see it only after a refactor.

my 3c

Jeff

On May 26, 2020, at 7:22 AM, Tom Augspurger <tom.augspurger88 at gmail.com> wrote:
> 
>> On Tue, May 26, 2020 at 3:35 AM Joris Van den Bossche <jorisvandenbossche at gmail.com> wrote:
>> Thanks for those links!
>> 
>> Personally, I see "roundtrip conversion to/from sparse matrices" as being in much the same bucket as conversion to/from a 2D numpy array. 
>> Yes, both are important use cases. But the question we need to ask ourselves is still: are they important enough to hugely complicate pandas' internals and block several other improvements? It's a trade-off that we need to make. 
>> 
>> Moreover, I think that we could accommodate the important part of those use cases with a column-store DataFrame as well, with some effort (but with less complexity than a consolidated BlockManager). 
>> 
>> Focusing on scikit-learn: in the end, you mostly care about cheap roundtripping of a 2D numpy array or sparse matrix to/from a pandas DataFrame, to carry feature labels between steps of a pipeline, correct?
>> Such cheap roundtripping is only possible anyway if you have a single dtype for all columns (which is typically the case after some transformation step). So you don't necessarily need consolidated blocks specifically, but rather the ability to store a *single* 2D array/matrix in a DataFrame (so, kind of a single 2D block).
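>> To make that concrete, a small sketch (it relies on a single-dtype frame handing its data back without a copy; whether .values is truly zero-copy depends on the pandas version):
>>
>>     import numpy as np
>>     import pandas as pd
>>
>>     X = np.random.randn(100, 3)
>>     df = pd.DataFrame(X, columns=["a", "b", "c"])   # one float64 block
>>     print(np.shares_memory(X, df.values))           # True: roundtrip is cheap
>>
>>     df["label"] = ["x"] * 100                       # mixed dtypes now
>>     print(np.shares_memory(X, df.values))           # False: must interleave/copy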
>> 
>> Thinking out loud here, didn't try anything in code:
>> 
>> - We could make DataFrame construction from a 2D array/matrix kind of "lazy" (or have an option to do it like this): upon construction, just store the 2D array as-is, and only convert to a columnar store once you perform an actual operation on it. That would make it possible to still get the 2D array back with zero copy, if all you did was pass the DataFrame to the next step of the pipeline (see the sketch after this list).
>> - We could take the above a step further and try to preserve the 2D array under the hood in some "easy" operations (but again, limited to a single 2D block/array, not multiple consolidated blocks). This is actually similar to the DataMatrix that pandas had a very long time ago. Of course this adds back complexity, so it would need some more exploration to see how it could be done (without duplicating a lot), and some buy-in from people interested in this.
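>> A very rough sketch of the first idea, just to make it concrete (LazyFrame and everything in it is hypothetical; a real version would live inside the internals rather than wrap a DataFrame):
>>
>>     import numpy as np
>>     import pandas as pd
>>
>>     class LazyFrame:
>>         def __init__(self, arr, columns):
>>             self._arr = arr          # keep the 2D block as-is on construction
>>             self._df = None          # columnar conversion is deferred
>>             self._columns = list(columns)
>>
>>         def _materialize(self):
>>             # the first actual operation splits the array into columns
>>             if self._df is None:
>>                 self._df = pd.DataFrame(
>>                     {c: self._arr[:, i].copy()
>>                      for i, c in enumerate(self._columns)}
>>                 )
>>             return self._df
>>
>>         def mean(self):              # any real operation triggers the split
>>             return self._materialize().mean()
>>
>>         def to_numpy(self):
>>             if self._df is None:     # untouched: hand back the array, zero-copy
>>                 return self._arr
>>             return self._df.to_numpy()
>>
>>     X = np.random.randn(1000, 3)
>>     lf = LazyFrame(X, ["a", "b", "c"])
>>     assert lf.to_numpy() is X        # the pipeline roundtrip stays free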
>> 
>> I think the first option should be fairly easy to do, and should solve a large part of the concerns for scikit-learn (I think?). 
> 
> I think the first option would solve that use case for scikit-learn. It sounds feasible, but I'm not sure how easy it would be.
>  
>> I think the second idea is also interesting: IMO such a data structure would be useful to have somewhere in the PyData ecosystem, and where it could fit is a worthwhile discussion. Maybe the answer is simply: use xarray for this use case (although there are still differences)? Those are interesting discussions, but personally I would not complicate the core pandas data model for heterogeneous dataframes to accommodate the single-dtype, fixed-number-of-columns use case.
> 
> The current prototype[1] accepts and preserves both xarray and pandas data structures.
> 
> [1]: https://github.com/scikit-learn/scikit-learn/pull/16772
>  
>> Joris
>> 
>>> On Tue, 26 May 2020 at 09:50, Adrin <adrin.jalali at gmail.com> wrote:
>>> Hi Joris,
>>> 
>>> Thanks for the summary. I think another missing point is the roundtrip conversion to/from sparse matrices.
>>> There are some benchmarks and discussion here: https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-615423097
>>> and here's some discussion on the pandas issue tracker: https://github.com/pandas-dev/pandas/issues/33182
>>> and some benchmark by Tom, assuming pandas would accept a 2D sparse array: https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-615440896
>>> 
>>> What do you think of these usecases?
>>> 
>>> Thanks,
>>> Adrin
>>> 
>>>> On Mon, May 25, 2020 at 11:39 PM Joris Van den Bossche <jorisvandenbossche at gmail.com> wrote:
>>>> Hi list,
>>>> 
>>>> Rewriting the BlockManager as a simpler collection of 1D arrays is actually on our roadmap (see here), and I also touched on it in a mailing list discussion about pandas 2.0 earlier this year (see here).
>>>> 
>>>> But since the topic came up again recently at the last online dev meeting (and Uwe Korn also wrote a nice blog post about this yesterday), I thought I would do a write-up of my thoughts on why I think we should actually move towards a simpler, non-consolidating BlockManager with 1D blocks.
>>>> 
>>>> 
>>>> 
>>>> Simplification of the internals
>>>> 
>>>> The fact that we currently have a lot of special cases for 1D ExtensionArrays (EAs) in the internals is regularly brought up as a reason to have 2D EAs. But to be clear: the additional complexity does not come from the 1D EAs themselves, it comes from the fact that we have a mixture of 2D and 1D blocks.
>>>> Solving this requires a consistent block dimension, and thus removing this added complexity can be done in two ways: have all 1D blocks, or have all 2D blocks.
>>>> Just to say: IMO, this is not an argument in favor of 2D blocks / consolidation.
>>>> 
>>>> Moreover, when going with all 1D blocks, we not only remove the complexity of dealing with the mixture of 1D/2D blocks, we will also be able to reduce the complexity of dealing with 2D blocks themselves. A BlockManager with 2D blocks is inherently more complex than one with 1D blocks, as it needs to deal with proper alignment of the blocks, more complex "placement" logic for the blocks, etc.
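>>>> To illustrate that placement bookkeeping (this pokes at private internals, so attribute names vary by version; df._data is the older spelling of df._mgr):
>>>>
>>>>     import pandas as pd
>>>>
>>>>     df = pd.DataFrame({"a": [1.0], "b": [1], "c": [2.0]})
>>>>     for blk in df._mgr.blocks:
>>>>         print(blk.dtype, blk.mgr_locs.as_array)
>>>>     # float64 [0 2]  <- "a" and "c" consolidated into one 2D block
>>>>     # int64   [1]    <- the manager tracks where each block's rows belong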
>>>> 
>>>> I think we would be able to simplify the internals a lot by going with a BlockManager as a store of 1D arrays.
>>>> 
>>>> 
>>>> 
>>>> Performance
>>>> 
>>>> Performance is typically given as a reason to have consolidated, 2D blocks. And of course, certain operations (especially row-wise operations, or operations on dataframes with more columns than rows) will always be faster when done on a 2D numpy array under the hood.
>>>> However, based on recent experimentation with this (e.g. triggered by the block-wise frame ops PR, and see also some benchmarks I just posted in #10556 / this gist), I also think that for many operations and with decent-sized dataframes, this performance penalty is actually quite OK.
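>>>> A quick way to see both sides of this trade-off yourself, as a standalone numpy sketch (sizes are arbitrary and the absolute numbers are machine-dependent):
>>>>
>>>>     import timeit
>>>>     import numpy as np
>>>>
>>>>     arr = np.random.randn(100_000, 50)              # one consolidated 2D block
>>>>     cols = [arr[:, i].copy() for i in range(50)]    # simulated columnar store
>>>>
>>>>     # row-wise reduction: one call on the 2D block vs combining 50 columns
>>>>     print(timeit.timeit(lambda: arr.sum(axis=1), number=20))
>>>>     print(timeit.timeit(lambda: np.sum(cols, axis=0), number=20))
>>>>
>>>>     # column-wise reduction: the 1D layout is on par
>>>>     print(timeit.timeit(lambda: arr.sum(axis=0), number=20))
>>>>     print(timeit.timeit(lambda: np.array([c.sum() for c in cols]), number=20))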
>>>> 
>>>> Further, there are also operations that will benefit from 1D blocks. First, operations that now involve aligning/splitting blocks, re-consolidation, etc. will benefit (e.g. a large part of the slowdown in doing frame/frame operations column-wise currently comes from the consolidation at the end). And operations like adding a column, concatenating (with axis=1), or merging dataframes will be much faster when no consolidation is needed.
>>>> 
>>>> Personally, I am convinced that with some effort we can get on-par, or sometimes even better, performance with 1D blocks compared to what we have now, for the cases that 90+% of our users care about:
>>>> 
>>>> - With limited effort optimizing the column-wise code paths in the internals, we can get a long way.
>>>> - After that, if needed, we can still consider whether parts of the internals could be cythonized to further improve certain bottlenecks (and cythonizing will actually be simpler for a simple, non-consolidating block manager).
>>>> 
>>>> Possibility to get better copy/view semantics
>>>> 
>>>> Pandas is notorious for how much it copies ("you need 10x the memory available as the size of your dataframe"), and having 1D blocks will allow us to address part of those concerns.
>>>> 
>>>> No consolidation = less copying. Regularly consolidating introduces copies, and thus removing consolidation will mean fewer copies. For example, this would make it possible to add a single column to a dataframe without having to copy the full dataframe.
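>>>> A small demonstration of that cost (this uses the private _consolidate helper to force explicitly what many operations currently trigger implicitly, so it is version-dependent):
>>>>
>>>>     import numpy as np
>>>>     import pandas as pd
>>>>
>>>>     arr = np.ones((1_000_000, 10))
>>>>     df = pd.DataFrame(arr)          # a single 2D block, viewing arr
>>>>     df["new"] = 2.0                 # stored as its own block: no copy yet
>>>>     print(np.shares_memory(arr, df[0].to_numpy()))   # True
>>>>
>>>>     df = df._consolidate()          # merge the float64 blocks back together
>>>>     print(np.shares_memory(arr, df[0].to_numpy()))   # False: all columns copied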
>>>> 
>>>> Copy / view semantics: recently there has been discussion again around whether selecting columns should be a copy or a view, and some other issues were opened with questions about views/copies when slicing columns. In the consolidated 2D block layout this will always be inherently messy and unpredictable (meaning: depending on the actual block layout, which in practice means unpredictable for any user unaware of the block layout).
>>>> Going with a non-consolidated BlockManager should at least allow us to get better / more understandable semantics around this.
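>>>> A sketch of that unpredictability with today's behavior (two frames that look identical to the user but have different hidden block layouts; exact view/copy behavior depends on the pandas version):
>>>>
>>>>     import numpy as np
>>>>     import pandas as pd
>>>>
>>>>     df1 = pd.DataFrame(np.zeros((3, 2)), columns=["a", "b"])  # one 2D block
>>>>     df2 = pd.DataFrame({"a": np.zeros(3)})
>>>>     df2["b"] = np.zeros(3)            # two separate float64 blocks
>>>>
>>>>     v1 = df1.values                   # single block: .values is a view
>>>>     v2 = df2.values                   # two blocks: .values must copy
>>>>     v1[0, 0] = 99.0
>>>>     v2[0, 0] = 99.0
>>>>     print(df1.iloc[0, 0], df2.iloc[0, 0])   # 99.0 0.0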
>>>> 
>>>> 
>>>> 
>>>> So what are the reasons to have 2D blocks?
>>>> 
>>>> I personally don't directly see reasons to have 2D blocks for pandas itself (apart from performance in certain row-wise use cases, and the fact that we have "always done it like this"). But quite likely I am missing reasons, so please bring them up.
>>>> 
>>>> But I think there are certainly use cases where 2D blocks can be useful, though typically "external" (but nonetheless important) ones: conversion to/from numpy, xarray, etc. A typical example that has recently come up is scikit-learn, which wants a cheap dataframe <-> numpy array roundtrip for use in its pipelines.
>>>> However, I personally think there are ways we can still accommodate those use cases, with some effort, while having 1D blocks in pandas itself. So IMO this is not sufficient to warrant the complexity of 2D blocks in pandas. 
>>>> (But I will stop here, as this mail is already getting long ...)
>>>> 
>>>> 
>>>> Joris
>>>> 