[Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks

Tue May 26 10:13:44 EDT 2020

>> Assuming we go down this path, do you have an idea of how we get from
here to there incrementally?  i.e. presumably this won't just be one massive
PR
>  [...] I would first like to focus on the "assuming we go down this path"
part. Let's discuss the pros and cons and trade-offs, and try to turn
assumptions into an agreed-upon roadmap. [...]

I think understanding the difficulty/feasibility of the implementation is a
pretty important part of the pros/cons.

Looking back at #10556, I'm wondering if we could disable _most_
consolidation, e.g. only consolidate when making copies anyway, which might
be a never-break-views policy.  From a user standpoint would that achieve
much/most of the benefits here?
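
To make the "never-break-views" concern concrete, here is a minimal NumPy-only sketch. It does not use pandas internals; `col_a`, `col_b`, `user_view` and `block` are made-up names, and `np.vstack` merely stands in for whatever consolidation does. It only illustrates that stacking separately stored columns into one 2D block copies them, so a view taken earlier stops tracking the data:

```python
import numpy as np

# Two separately stored 1D "columns" (hypothetical stand-ins for 1D blocks).
col_a = np.array([0.0, 1.0, 2.0])
col_b = np.array([3.0, 4.0, 5.0])

# A view the user obtained earlier; it shares memory with col_a.
user_view = col_a[:]
assert np.shares_memory(user_view, col_a)

# "Consolidation": stack the columns into one 2D block. np.vstack allocates
# a new array, so the block no longer shares memory with col_a.
block = np.vstack([col_a, col_b])
assert not np.shares_memory(block, col_a)

# A write into the consolidated block is invisible through the old view.
block[0, 0] = 99.0
view_still_sees_old_value = bool(user_view[0] == 0.0)
```

If consolidation only happened at points where a copy is made anyway, no pre-existing view would be silently detached in this way.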

On Tue, May 26, 2020 at 5:17 AM Jeff Reback <jeffreback at gmail.com> wrote:

> A little historical perspective
>
> 10 years ago the standard input to a Dataframe was a single dtype 2D numpy
> array. This provides the following nice properties:
>
> - 0 cost construction, you can simply wrap Dataframe around the input with
> very little overhead. This provides a labeled array interface, gaining
> pandas users
> - very fast reductions; the block is passed to numpy directly for the
> reductions; numpy can then reduce with aligned memory access
> - almost all operations in pandas coerced to float64 anyway
>
> The block manager is optimized for this case as this was the original
> DataMatrix. It serves its purpose pretty well.
>
> In the last few years things have changed in the following ways:
>
> - dict of 1D numpy arrays is by far the most common construction
> - heterogeneous dtypes have grown quite a bit, e.g. it’s now very common to
> use int8, float32; these are also preserved pretty well by pandas
> operations
> - non numpy backed dtypes are increasingly common
>
> To me removing the block manager is not about performance, rather about
> simplifying the code and mental model, though we should be mindful that
> construction from 2D inputs will require splitting and thus will not be cheap
> (note that you can view the 1D slices, but these are not memory aligned).
> This is a typical trap that folks get into: 1D looks all rosy, but it all
> depends on the use case.
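
The point about viewing 1D slices of a 2D input can be checked with plain NumPy. This is a small sketch, assuming a C-ordered (row-major) float64 input and reading "not memory aligned" as "strided, i.e. not contiguous"; the variable names are made up for illustration:

```python
import numpy as np

# A C-ordered (row-major) 2D input: 4 rows, 3 columns of float64.
values = np.arange(12, dtype="float64").reshape(4, 3)

# Splitting into columns without copying: each column is a strided view.
columns = [values[:, j] for j in range(values.shape[1])]
first = columns[0]
assert np.shares_memory(first, values)
assert not first.flags["C_CONTIGUOUS"]
assert first.strides == (24,)  # consecutive elements are 3 * 8 bytes apart

# Copying yields a contiguous column, at the cost of an allocation.
first_copy = np.ascontiguousarray(first)
assert first_copy.strides == (8,)
```

So a columnar store built from a 2D input has to choose between cheap but strided views and contiguous columns that cost a copy.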
>
> I think it would be ok for pandas to move to dict of columns and simply
> document the non-performant cases (e.g. very wide single dtypes or 2D
> construction);
>
> I suppose it’s also possible to reinvent the DataMatrix in a limited form,
> but that of course adds complexity, and I would like to see that only after
> a refactor.
>
> my 3c
>
> Jeff
>
> On May 26, 2020, at 7:22 AM, Tom Augspurger <tom.augspurger88 at gmail.com>
> wrote:
>
> On Tue, May 26, 2020 at 3:35 AM Joris Van den Bossche <
> jorisvandenbossche at gmail.com> wrote:
>
>> Thanks for those links!
>>
>> Personally, I see the "roundtrip conversion to/from sparse matrices" a
>> bit as in the same bucket as conversion to/from a 2D numpy array.
>> Yes, both are important use cases. But the question we need to ask
>> ourselves is still: is this important enough to hugely complicate the
>> pandas' internals and block several other improvements? It's a trade-off
>> that we need to make.
>>
>> Moreover, I think that we could accommodate the important part of those
>> use cases also with a column-store DataFrame, with some effort (but with
>> less complexity than with a consolidated BlockManager).
>>
>> Focusing on scikit-learn: in the end, you mostly care about cheap
>> roundtripping of 2D numpy array or sparse matrix to/from a pandas DataFrame
>> to carry feature labels in between steps of a pipeline, correct?
>> Such cheap roundtripping is only possible anyway if you have a single
>> dtype for all columns (which is typically the case after some
>> transformation step). So you don't necessarily need consolidated blocks
>> specifically, but rather the ability to store a *single* 2D array/matrix in
>> a DataFrame (so kind of a single 2D block).
>>
>> Thinking out loud here, didn't try anything in code:
>>
>> - We could make the DataFrame construction from a 2D array/matrix kind of
>> "lazy" (or have an option to do it like this): upon construction just store
>> the 2D array as is, and only once you perform an actual operation on it,
>> convert to a columnar store. And that would make it possible to still get
>> the 2D array back with zero-copy, if all you did was passing this DataFrame
>> to the next step of the pipeline.
>> - We could take the above a step further and try to preserve the 2D array
>> under the hood in some "easy" operations (but again, limited to a single 2D
>> block/array, not multiple consolidated blocks). This is actually similar to
>> the DataMatrix that pandas had a very long time ago. Of course this adds
>> back complexity, so this would need some more exploration to see if and how
>> this would be possible (without duplicating a lot), and some buy-in from
>> people interested in this.
>>
>> I think the first option should be fairly easy to do, and should solve a
>> large part of the concerns for scikit-learn (I think?).
>>
>
> I think the first option would solve that use case for scikit-learn. It
> sounds feasible, but I'm not sure how easy it would be.
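
As a rough illustration of the first ("lazy") option, here is a toy sketch. `LazyFrame` and its methods are entirely hypothetical and are not a proposed pandas API; the sketch assumes a single-dtype 2D NumPy input and treats column access as the first "actual operation":

```python
import numpy as np


class LazyFrame:
    """Toy container: keeps a 2D input as-is until columns are needed."""

    def __init__(self, values, columns):
        self._values_2d = values      # stored without copying
        self._columns = list(columns)
        self._store = None            # dict of 1D arrays, built on demand

    def _ensure_columnar(self):
        if self._store is None:
            # Split into independent 1D arrays; this is where the copy happens.
            self._store = {
                name: self._values_2d[:, i].copy()
                for i, name in enumerate(self._columns)
            }
            self._values_2d = None

    def to_numpy(self):
        if self._values_2d is not None:
            return self._values_2d    # zero-copy roundtrip
        return np.column_stack([self._store[c] for c in self._columns])

    def __getitem__(self, name):
        self._ensure_columnar()
        return self._store[name]


X = np.ones((3, 2))

# Passed straight through a pipeline step: the same array comes back.
untouched = LazyFrame(X, ["a", "b"])
roundtrip_is_zero_copy = untouched.to_numpy() is X

# After a first real operation, the data lives in the columnar store.
touched = LazyFrame(X, ["a", "b"])
touched["a"]
roundtrip_after_op_is_copy = not np.shares_memory(touched.to_numpy(), X)
```

The hard part in practice is presumably deciding which operations count as "actual" ones and keeping the two internal states from leaking into every code path, which this toy ignores.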
>
>
>> I think the second idea is also interesting: IMO such a data structure
>> would be useful to have somewhere in the PyData ecosystem, and a worthwhile
>> discussion to think about where this could fit. Maybe the answer is simply:
>> use xarray for this use case (although there are still differences)? Those
>> are interesting discussions, but personally I would not complicate the core
>> pandas data model for heterogeneous dataframes to accommodate the
>> single-dtype + fixed number of columns use case.
>>
>
> The current prototype[1] accepts and preserves both xarray and pandas data
> structures.
>
> [1]: https://github.com/scikit-learn/scikit-learn/pull/16772
>
>
>> Joris
>>
>> On Tue, 26 May 2020 at 09:50, Adrin <adrin.jalali at gmail.com> wrote:
>>
>>> Hi Joris,
>>>
>>> Thanks for the summary. I think another missing point is the roundtrip
>>> conversion to/from sparse matrices.
>>> There are some benchmarks and discussion here:
>>> https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-615423097
>>> and here's some discussion on the pandas issue tracker:
>>> https://github.com/pandas-dev/pandas/issues/33182
>>> and some benchmarks by Tom, assuming pandas would accept a 2D sparse
>>> array:
>>> https://github.com/scikit-learn/scikit-learn/pull/16772#issuecomment-615440896
>>>
>>> What do you think of these usecases?
>>>
>>> Thanks,
>>> Adrin
>>>
>>> On Mon, May 25, 2020 at 11:39 PM Joris Van den Bossche <
>>> jorisvandenbossche at gmail.com> wrote:
>>>
>>>> Hi list,
>>>>
>>>> Rewriting the BlockManager based on a simpler collection of 1D-arrays
>>>> is actually on our roadmap (see here
>>>> <https://pandas.pydata.org/docs/dev/development/roadmap.html#block-manager-rewrite>),
>>>> and I also touched on it in a mailing list discussion about pandas 2.0
>>>> earlier this year (see here
>>>> <https://mail.python.org/pipermail/pandas-dev/2020-February/001180.html>
>>>> ).
>>>>
>>>> But since the topic came up again recently at the last online dev
>>>> meeting (and Uwe Korn also wrote a nice blog post
>>>> <https://uwekorn.com/2020/05/24/the-one-pandas-internal.html> about
>>>> this yesterday), I thought to do a write-up of my thoughts on why I think
>>>> we should actually move towards a simpler, non-consolidating BlockManager
>>>> with 1D blocks.
>>>>
>>>>
>>>> *Simplification of the internals*
>>>>
>>>> It's regularly brought up as a reason to have 2D ExtensionArrays (EAs)
>>>> that right now we have a lot of special cases for 1D EAs in the
>>>> internals. But to be clear: the additional complexity does not come from 1D
>>>> EAs themselves, it comes from the fact that we have a mixture of 2D and 1D
>>>> blocks.
>>>> Solving this would require a consistent block dimension, and thus
>>>> removing this added complexity can be done in two ways: have all 1D blocks,
>>>> or have all 2D blocks.
>>>> Just to say: IMO, this is not an argument in favor of 2D blocks /
>>>> consolidation.
>>>>
>>>> Moreover, when going with all 1D blocks, we can not only remove the
>>>> added complexity from dealing with the mixture of 1D/2D blocks, we will
>>>>  *also* be able to reduce the complexity of dealing with 2D blocks. A
>>>> BlockManager with 2D blocks is inherently more complex than with 1D blocks,
>>>> as one needs to deal with proper alignment of the blocks, a more complex
>>>> "placement" logic of the blocks, etc.
>>>>
>>>> I think we would be able to simplify the internals a lot by going with
>>>> a BlockManager as a store of 1D arrays.
>>>>
>>>>
>>>> *Performance*
>>>>
>>>> Performance is typically given as a reason to have consolidated, 2D
>>>> blocks. And of course, certain operations (especially row-wise operations,
>>>> or on dataframes with more columns than rows) will always be faster when done
>>>> on a 2D numpy array under the hood.
>>>> However, based on recent experimentation with this (eg triggered by the
>>>>  block-wise frame ops PR
>>>> <https://github.com/pandas-dev/pandas/pull/32779>, and see also some
>>>> benchmarks I just posted in #10556
>>>> <https://github.com/pandas-dev/pandas/issues/10556#issuecomment-633703160>
>>>>  / this gist
>>>> <https://gist.github.com/jorisvandenbossche/b8ae071ab7823f7547567b1ab9d4c20c>),
>>>> I also think that for many operations and with decent-sized dataframes,
>>>> this performance penalty is actually quite OK.
>>>>
>>>> Further, there are also operations that will *benefit* from 1D blocks.
>>>> First, operations that now involve aligning/splitting blocks,
>>>> re-consolidation, etc. will benefit (e.g. a large part of the slowdown doing
>>>> frame/frame operations column-wise is currently due to the consolidation at
>>>> the end). And operations like adding a column, concatenating (with axis=1) or
>>>> merging dataframes will be much faster when no consolidation is needed.
>>>>
>>>> Personally, I am convinced that with some effort, we can get on-par or
>>>> sometimes even better performance with 1D blocks compared to the
>>>> performance we have now for those cases that 90+% of our users care about:
>>>>
>>>>    - With limited effort optimizing the column-wise code paths in the
>>>>    internals, we can get a long way.
>>>>    - After that, if needed, we can still consider if parts of the
>>>>    internals could be cythonized to further improve certain bottlenecks (and
>>>>    actually cythonizing this will also be simpler for a simpler
>>>>    non-consolidating block manager).
>>>>
>>>>
>>>> *Possibility to get better copy/view semantics*
>>>>
>>>> Pandas is notorious for how much it copies ("you need 10x the memory
>>>> available as the size of your dataframe"), and having 1D blocks will allow
>>>> us to address part of those concerns.
>>>>
>>>> *No consolidation = less copying.* Regularly consolidating introduces
>>>> copies, and thus removing consolidation will mean fewer copies. For example,
>>>> this would make it possible to actually add a single column to a dataframe
>>>> without having to copy the full dataframe.
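
The add-a-column case can be modelled in a few lines of NumPy. This is a toy model, not pandas code: a plain dict stands in for a store of 1D arrays, and a single `(n_columns, n_rows)` array stands in for one consolidated block. When and how pandas actually consolidates is not modelled here.

```python
import numpy as np

n_rows = 1_000
existing = {f"c{i}": np.zeros(n_rows) for i in range(5)}
new_col = np.ones(n_rows)

# Store of 1D arrays: adding a column is a dict insertion, and the existing
# arrays are reused as-is.
column_store = dict(existing)
column_store["new"] = new_col
existing_untouched = all(column_store[k] is existing[k] for k in existing)

# Consolidated layout: one 2D float64 block holding all columns. Getting the
# new column into the block means allocating a bigger block and copying
# every existing value into it.
block = np.vstack(list(existing.values()))
bigger_block = np.vstack([block, new_col])
block_was_copied = not np.shares_memory(bigger_block, block)
```

In the first layout the cost of adding a column does not depend on how much data is already in the frame; in the second it grows with the size of the block.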
>>>>
>>>> *Copy / view semantics* Recently there has been discussion again
>>>> around whether selecting columns should be a copy or a view, and some other
>>>> issues were opened with questions about views/copies when slicing columns.
>>>> In the consolidated 2D block layout this will always be inherently messy,
>>>> and unpredictable (meaning: depending on the actual block layout, which
>>>> means in practice unpredictable for the user unaware of the block layout).
>>>> Going with a non-consolidated BlockManager should at least allow us to
>>>> get better / more understandable semantics around this.
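
Why the layout leaks into copy/view behaviour can be seen with NumPy alone. In this sketch (made-up names, one single-dtype block with one column per row of the array), whether a column selection shares memory with the block depends on which columns are asked for, whereas a store of 1D arrays can hand out the same array objects for any subset:

```python
import numpy as np

# One 2D block holding 3 "columns" (one per row of the array), 4 rows each.
block = np.arange(12.0).reshape(3, 4)

adjacent = block[0:2]        # basic slicing: a view on the block
scattered = block[[0, 2]]    # integer-list indexing: always a copy
assert np.shares_memory(adjacent, block)
assert not np.shares_memory(scattered, block)

# Store of 1D arrays: selecting any subset just reuses the array objects.
store = {"a": block[0].copy(), "b": block[1].copy(), "c": block[2].copy()}
subset = {k: store[k] for k in ("a", "c")}
subset_reuses_arrays = all(subset[k] is store[k] for k in subset)
```

Whether such a subset should then behave as a view or as a copy is still a design decision, but it no longer depends on how the columns happen to be laid out.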
>>>>
>>>>
>>>> ------------------------------
>>>>
>>>> *So what are the reasons to have 2D blocks?*
>>>>
>>>> I personally don't directly see reasons to have 2D blocks *for pandas
>>>> itself* (apart from performance in certain row-wise use cases, and
>>>> except for the fact that we have "always done it like this"). But quite
>>>> likely I am missing reasons, so please bring them up.
>>>>
>>>> I do think there are certainly use cases where 2D blocks can be
>>>> useful, but these are typically "external" (but nonetheless important) use cases:
>>>> conversion to/from numpy, xarray, etc. A typical example that has recently
>>>> come up is scikit-learn, where they want to have a cheap dataframe <->
>>>> numpy array roundtrip for use in their pipelines.
>>>> However, I personally think there are possible ways that we can still
>>>> accommodate those use cases, with some effort, while still having 1D
>>>> Blocks in pandas itself. So IMO this is not sufficient to warrant the
>>>> complexity of 2D blocks in pandas.
>>>> (but I will stop here, as this mail is already getting long ..).
>>>>
>>>> Joris
>>>> _______________________________________________
>>>> Pandas-dev mailing list
>>>> Pandas-dev at python.org
>>>> https://mail.python.org/mailman/listinfo/pandas-dev
>>>>