[Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks

Tue May 26 04:35:17 EDT 2020

Thanks for those links!

Personally, I see the "roundtrip conversion to/from sparse matrices" as
being in the same bucket as conversion to/from a 2D numpy array.
Yes, both are important use cases. But the question we need to ask
ourselves is still: is this important enough to hugely complicate the
pandas internals and block several other improvements? It's a trade-off
that we need to make.

Moreover, I think that we could accommodate the important part of those use
cases also with a column-store DataFrame, with some effort (but with less
complexity than a consolidated BlockManager).

Focusing on scikit-learn: in the end, you mostly care about cheap
roundtripping of a 2D numpy array or sparse matrix to/from a pandas
DataFrame to carry feature labels between steps of a pipeline, correct?
Such cheap roundtripping is only possible anyway if you have a single dtype
for all columns (which is typically the case after some transformation
step). So you don't necessarily need consolidated blocks specifically, but
rather the ability to store a *single* 2D array/matrix in a DataFrame (so
kind of a single 2D block).
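
To illustrate the single-dtype point with current pandas, here is a minimal
sketch (the column names and values are made up for illustration, and it says
nothing about whether a copy is made, which depends on the pandas version):

```python
import numpy as np
import pandas as pd

# Single dtype: the frame maps back onto one homogeneous 2D array.
X = np.arange(6, dtype="float64").reshape(3, 2)
df = pd.DataFrame(X, columns=["f0", "f1"])
roundtrip = df.to_numpy()
print(roundtrip.dtype, roundtrip.shape)

# Mixed dtypes: there is no common numeric dtype, so the 2D result
# falls back to object dtype and cannot be a cheap roundtrip.
mixed = pd.DataFrame({"f0": [1.0, 2.0, 3.0], "label": ["a", "b", "c"]})
print(mixed.to_numpy().dtype)
```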

Thinking out loud here, didn't try anything in code:

- We could make the DataFrame construction from a 2D array/matrix kind of
"lazy" (or have an option to do it like this): upon construction just store
the 2D array as is, and only once you perform an actual operation on it,
convert to a columnar store. That would make it possible to still get the
2D array back with zero-copy, if all you did was pass this DataFrame to the
next step of the pipeline.
- We could take the above a step further and try to preserve the 2D array
under the hood in some "easy" operations (but again, limited to a single 2D
block/array, not multiple consolidated blocks). This is actually similar to
the DataMatrix that pandas had a very long time ago. Of course this adds
back complexity, so this would need some more exploration to see if and how
this would be possible (without duplicating a lot), and some buy-in from
people interested in this.
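
A rough sketch of what the first ("lazy") option could look like, using only
numpy. To be clear, LazyFrame2D and its methods are hypothetical names
invented for this illustration; nothing like this exists in pandas, and a
real implementation would need to handle much more:

```python
import numpy as np


class LazyFrame2D:
    """Hypothetical wrapper (not a pandas class): keeps a 2D array as-is
    until a column-level operation forces a columnar layout."""

    def __init__(self, values, columns):
        values = np.asarray(values)
        if values.ndim != 2 or values.shape[1] != len(columns):
            raise ValueError("need a 2D array with one label per column")
        self._values_2d = values  # stored untouched at construction
        self._columns = None      # dict of 1D arrays, built lazily
        self.labels = list(columns)

    def to_numpy(self):
        # Untouched frame: hand back the original array, zero-copy.
        if self._values_2d is not None:
            return self._values_2d
        # Columnar frame: rebuilding a 2D array requires a copy.
        return np.column_stack([self._columns[c] for c in self.labels])

    def _materialize(self):
        # Switch to a column store: one independent 1D array per column.
        if self._columns is None:
            self._columns = {
                c: self._values_2d[:, i].copy()
                for i, c in enumerate(self.labels)
            }
            self._values_2d = None

    def assign(self, name, column):
        # An "actual operation": converts first, then only touches one column.
        self._materialize()
        if name not in self._columns:
            self.labels.append(name)
        self._columns[name] = np.asarray(column)
        return self


X = np.arange(6.0).reshape(3, 2)
frame = LazyFrame2D(X, ["f0", "f1"])
same = frame.to_numpy()           # still the original 2D array
frame.assign("f2", np.ones(3))    # forces the columnar layout
rebuilt = frame.to_numpy()        # now a freshly built 2D array
```

Passing the frame straight to the next pipeline step hits only the first
branch of to_numpy(), which is the zero-copy case described above.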

I think the first option should be fairly easy to do, and should solve a
large part of the concerns for scikit-learn (I think?).

I think the second idea is also interesting: IMO such a data structure
would be useful to have somewhere in the PyData ecosystem, and it would be
worthwhile to discuss where this could fit. Maybe the answer is simply:
use xarray for this use case (although there are still differences)? Those
are interesting discussions, but personally I would not complicate the core
pandas data model for heterogeneous dataframes to accommodate the
single-dtype + fixed number of columns use case.

Joris

On Tue, 26 May 2020 at 09:50, Adrin <adrin.jalali at gmail.com> wrote:

> Hi Joris,
>
> Thanks for the summary. I think another missing point is the roundtrip
> conversion to/from sparse matrices.
> There are some benchmarks and discussion here;
> https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-615423097
> and here's some discussion on the pandas issue tracker:
> https://github.com/pandas-dev/pandas/issues/33182
> and some benchmark by Tom, assuming pandas would accept a 2D sparse array:
> https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-615440896
>
> What do you think of these usecases?
>
> Thanks,
> Adrin
>
> On Mon, May 25, 2020 at 11:39 PM Joris Van den Bossche <
> jorisvandenbossche at gmail.com> wrote:
>
>> Hi list,
>>
>> Rewriting the BlockManager based on a simpler collection of 1D-arrays is
>> actually on our roadmap (see here
>> <https://pandas.pydata.org/docs/dev/development/roadmap.html#block-manager-rewrite>),
>> and I also touched on it in a mailing list discussion about pandas 2.0
>> earlier this year (see here
>> <https://mail.python.org/pipermail/pandas-dev/2020-February/001180.html>
>> ).
>>
>> But since the topic came up again recently at the last online dev meeting
>> (and Uwe Korn also wrote a nice blog post
>> <https://uwekorn.com/2020/05/24/the-one-pandas-internal.html> about this
>> yesterday), I thought I would write up my thoughts on why I think we
>> should actually move towards a simpler, non-consolidating BlockManager with
>> 1D blocks.
>>
>>
>> *Simplification of the internals*
>>
>> It's regularly brought up as a reason to have 2D ExtensionArrays (EAs)
>> because right now we have a lot of special cases for 1D EAs in the
>> internals. But to be clear: the additional complexity does not come from 1D
>> EAs in themselves, it comes from the fact that we have a mixture of 2D and
>> 1D blocks.
>> Solving this would require a consistent block dimension, and thus
>> removing this added complexity can be done in two ways: have all 1D blocks,
>> or have all 2D blocks.
>> Just to say: IMO, this is not an argument in favor of 2D blocks /
>> consolidation.
>>
>> Moreover, when going with all 1D blocks, we can not only remove the added
>> complexity from dealing with the mixture of 1D/2D blocks, we will *also* be
>> able to reduce the complexity of dealing with 2D blocks. A BlockManager
>> with 2D blocks is inherently more complex than with 1D blocks, as one needs
>> to deal with proper alignment of the blocks, a more complex "placement"
>> logic of the blocks, etc.
>>
>> I think we would be able to simplify the internals a lot by going with a
>> BlockManager as a store of 1D arrays.
>>
>>
>> *Performance*
>>
>> Performance is typically given as a reason to have consolidated, 2D
>> blocks. And of course, certain operations (especially row-wise operations,
>> or on dataframes with more columns than rows) will always be faster when
>> done on a 2D numpy array under the hood.
>> However, based on recent experimentation with this (eg triggered by the block-wise
>> frame ops PR <https://github.com/pandas-dev/pandas/pull/32779>, and see
>> also some benchmarks I just posted in #10556
>> <https://github.com/pandas-dev/pandas/issues/10556#issuecomment-633703160>
>>  / this gist
>> <https://gist.github.com/jorisvandenbossche/b8ae071ab7823f7547567b1ab9d4c20c>),
>> I also think that for many operations and with decent-sized dataframes,
>> this performance penalty is actually quite OK.
>>
>> Further, there are also operations that will *benefit* from 1D blocks.
>> First, operations that now involve aligning/splitting blocks,
>> re-consolidation, etc. will benefit (e.g. a large part of the slowdown when
>> doing frame/frame operations column-wise is currently due to the
>> consolidation at the end). And operations like adding a column, concatting
>> (with axis=1) or merging dataframes will be much faster when no
>> consolidation is needed.
>>
>> Personally, I am convinced that with some effort, we can get on-par or
>> sometimes even better performance with 1D blocks compared to the
>> performance we have now for those cases that 90+% of our users care about:
>>
>>    - With limited effort optimizing the column-wise code paths in the
>>    internals, we can get a long way.
>>    - After that, if needed, we can still consider if parts of the
>>    internals could be cythonized to further improve certain bottlenecks (and
>>    actually cythonizing this will also be simpler for a simpler
>>    non-consolidating block manager).
>>
>>
>> *Possibility to get better copy/view semantics*
>>
>> Pandas is notorious for how much it copies ("you need 10x the memory
>> available as the size of your dataframe"), and having 1D blocks will allow
>> us to address part of those concerns.
>>
>> *No consolidation = less copying.* Regularly consolidating introduces
>> copies, and thus removing consolidation will mean fewer copies. For example,
>> this would make it possible to add a single column to a dataframe without
>> having to copy the full dataframe.
>>
>> *Copy / view semantics* Recently there has been discussion again around
>> whether selecting columns should be a copy or a view, and some other issues
>> were opened with questions about views/copies when slicing columns. In the
>> consolidated 2D block layout this will always be inherently messy, and
>> unpredictable (meaning: depending on the actual block layout, which means
>> in practice unpredictable for the user unaware of the block layout).
>> Going with a non-consolidated BlockManager should at least allow us to
>> get better / more understandable semantics around this.
>>
>>
>> ------------------------------
>>
>> *So what are the reasons to have 2D blocks?*
>>
>> I personally don't directly see reasons to have 2D blocks *for pandas
>> itself* (apart from performance in certain row-wise use cases, and
>> except for the fact that we have "always done it like this"). But quite
>> likely I am missing reasons, so please bring them up.
>>
>> I do think there are certainly use cases where 2D blocks can be useful,
>> but these are typically "external" (but nonetheless important) use cases:
>> conversion to/from numpy, xarray, etc. A typical example that has recently
>> come up is scikit-learn, where they want to have a cheap dataframe <->
>> numpy array roundtrip for use in their pipelines.
>> However, I personally think there are possible ways that we can still
>> accommodate those use cases, with some effort, while still having 1D
>> blocks in pandas itself. So IMO this is not sufficient to warrant the
>> complexity of 2D blocks in pandas.
>> (but I will stop here, as this mail is already getting long ..).
>>
>> Joris
>> _______________________________________________
>> Pandas-dev mailing list
>> Pandas-dev at python.org
>> https://mail.python.org/mailman/listinfo/pandas-dev
>>
>