[Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks

Tue May 26 16:58:17 EDT 2020

On Tue, May 26, 2020 at 3:50 PM Brock Mendel <jbrockmendel at gmail.com> wrote:

> > It allows creating a "DataFrame" from an ndarray without creating a
> > BlockManager, and it allows accessing this original ndarray:
>
> This is a neat proof of concept, but it cuts against the "decreases
> complexity" argument.  Is there a viable way to quantify (even very
> roughly) the complexity effect of going all-1D?
>

That complexity is at least localized to a single attribute. That's quite
different from the 1D & 2D blocks situation, where many methods (though
fewer than a year ago) need to be concerned with whether the array in a
block is 1D or 2D, or whether the DataFrame is consolidated, homogeneous, ...
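
As a rough illustration of what "localized to a single attribute" could look
like, here is a minimal sketch using only NumPy. `LazyFrame` and all of its
attribute names are hypothetical, made up for this example; this is not the
code from the proof-of-concept commit and not pandas API.

```python
import numpy as np


class LazyFrame:
    """Hypothetical stand-in for a lazily initialized frame (not pandas API)."""

    def __init__(self, values, columns):
        self._values = values        # the 2D input, stored as is
        self._names = list(columns)
        self._store = None           # columnar store, built on first use

    @property
    def initialized(self):
        return self._store is not None

    def to_numpy(self):
        # Until something forces initialization, hand back the original
        # array object itself: no copy, no manager-like structure.
        if not self.initialized:
            return self._values
        return np.column_stack([self._store[name] for name in self._names])

    def _columns(self):
        # The only place that needs to know about the deferred state.
        if self._store is None:
            self._store = {
                name: self._values[:, i] for i, name in enumerate(self._names)
            }
        return self._store

    def mean(self):
        return {name: float(col.mean()) for name, col in self._columns().items()}


arr = np.arange(12.0).reshape(4, 3)
frame = LazyFrame(arr, ["a", "b", "c"])
same_object = frame.to_numpy() is arr   # still the untouched input
means = frame.mean()                    # first real operation initializes
```

Every method other than `_columns` and `to_numpy` only ever sees the columnar
store, which is the sense in which the extra state stays in one place.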

> A couple ideas for ways to simplify this decision-making problem:
>
> 1) ATM there are a handful of places outside of core.internals where we
> call consolidate/consolidate_inplace.  If we can refactor those away, we
> can focus on the BlockManager in (closer-to-)isolation.
>

If possible, isolating consolidation to `core.internals` sounds like a
generally useful cleanup, regardless of whether we pursue the larger
changes.
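
For context on why the call sites matter, consolidation is essentially a copy
of the column data into one 2D array. A minimal NumPy sketch of that cost,
assuming same-dtype columns; `consolidate` is a hypothetical helper written
for this example, not a function in `core.internals`:

```python
import numpy as np


def consolidate(arrays):
    """Hypothetical helper: stack same-dtype 1D columns into one 2D block."""
    # np.column_stack allocates a fresh (n_rows, n_columns) array, so the
    # data briefly exists twice: once in the inputs, once in the block.
    return np.column_stack(arrays)


cols = [np.zeros(1_000), np.ones(1_000)]
block = consolidate(cols)

input_bytes = sum(c.nbytes for c in cols)
extra_bytes = block.nbytes
shares = any(np.shares_memory(block, c) for c in cols)
```

Here `extra_bytes` equals `input_bytes` and `shares` is false, which is the
doubling that an unexpected consolidation triggers.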

> 2) IIUC going all-1D will cause column indexing to always return views.
> Elsewhere you have noted that this is a breaking API change which merited
> discussion in its own right.  xref #33780
> <https://github.com/pandas-dev/pandas/issues/33780>.  My takeaway from
> this part of the last dev call was that people were generally positive on
> the all-views idea, but were wary of how to handle the potential
> deprecation.
>

This type of change would merit a major version bump. Ideally we'd have
some kind of option to disable consolidation / enable splitting, which
would allow users to test their code against the new behavior on older
versions.
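
The kind of difference such an option would let users test for can be sketched
with plain NumPy. This only mimics the two storage layouts; it does not use
pandas, and the variable names are made up for the example:

```python
import numpy as np

# Consolidated layout: three float columns stored in one 2D block.
block = np.arange(12.0).reshape(4, 3)
# Selecting two columns needs fancy indexing, which copies in NumPy.
picked = block[:, [0, 2]]
picked[0, 0] = -1.0
parent_unchanged = block[0, 0] == 0.0

# All-1D layout: each column is its own array.
columns = {name: np.arange(4.0) for name in ("a", "b", "c")}
# Selecting columns just reuses the existing arrays, so the result is a view.
selected = {name: columns[name] for name in ("a", "c")}
selected["a"][0] = -1.0
parent_changed = columns["a"][0] == -1.0
```

Code that relies on the selection being a copy keeps working in the first
case and silently modifies the parent in the second.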

> On Tue, May 26, 2020 at 12:49 PM Wes McKinney <wesmckinn at gmail.com> wrote:
>
>> Something to add here (in favor of removing the BM) -- and apologies
>> if it's already mentioned in a different form:
>>
>> It is very, very difficult for third party code to construct
>> heterogeneously-typed DataFrames without triggering a memory doubling.
>> To give you an example what I mean, in Apache Arrow, we painstakingly
>> implemented block consolidation in C++ [1] so that we can construct a
>> DataFrame that won't suddenly double memory the first time that a user
>> interacts with it. So the possibility of users having an OOM on their
>> first interaction with an object they created is not great. If
>> avoiding it for library developers were easy then perhaps it would be
>> less of an issue, but avoiding the doubling requires advanced
>> knowledge of pandas's internals.
>>
>> Looking back 9-10 years, the primary motivations I had for creating
>> the BlockManager in the first place don't persuade me anymore:
>>
>> * pandas's success was still very much coupled to vectorized
>> operations on wide row-major data (e.g. as present in certain sectors
>> of the financial industry). I don't think this represents the majority
>> of pandas users now
>> * In 2011 I was uncomfortable writing significant compiled code. Many
>> of the performance issues that the BM tried to ameliorate are
>> non-issues if you're OK writing non-trivial C/C++ code to deal with
>> row-level interactions. Even if there were a 50% performance
>> regression on some of these operations that are faster with 2D blocks
>> because of row-major vs. column-major memory layout, that still seems
>> worth it for the vast code simplification and the
>> memory-use-predictability benefits that others have articulated
>> already.
>>
>> - Wes
>>
>> [1]:
>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/python/arrow_to_pandas.cc
>>
>> On Tue, May 26, 2020 at 2:35 PM Joris Van den Bossche
>> <jorisvandenbossche at gmail.com> wrote:
>> >
>> > On Tue, 26 May 2020 at 13:21, Tom Augspurger <tom.augspurger88 at gmail.com> wrote:
>> >>
>> >> On Tue, May 26, 2020 at 3:35 AM Joris Van den Bossche <jorisvandenbossche at gmail.com> wrote:
>> >>>
>> >>> - We could make the DataFrame construction from a 2D array/matrix
>> >>> kind of "lazy" (or have an option to do it like this): upon construction
>> >>> just store the 2D array as is, and only once you perform an actual
>> >>> operation on it, convert to a columnar store. And that would make it
>> >>> possible to still get the 2D array back with zero-copy, if all you did was
>> >>> passing this DataFrame to the next step of the pipeline.
>> >>>
>> >>> I think the first option should be fairly easy to do, and should
>> >>> solve a large part of the concerns for scikit-learn (I think?).
>> >>
>> >>
>> >> I think the first option would solve that use case for scikit-learn.
>> >> It sounds feasible, but I'm not sure how easy it would be.
>> >>
>> >
>> > A quick, ugly proof-of-concept:
>> > https://github.com/pandas-dev/pandas/commit/cf387dced4803b81ec8709eeaf624369abca1188
>> >
>> > It allows creating a "DataFrame" from an ndarray without creating a
>> > BlockManager, and it allows accessing this original ndarray:
>> >
>> > In [1]: df = pd.DataFrame._init_lazy(np.random.randn(4, 3), (pd.RangeIndex(4), pd.RangeIndex(3)))
>> >
>> > In [2]: df._mgr_data
>> > Out[2]:
>> > (array([[ 1.52971972e-01, -5.69204971e-01,  5.54430115e-01],
>> >         [-1.09916133e+00, -1.16315362e+00, -1.51071081e+00],
>> >         [ 7.05185110e-01, -1.53009348e-03,  1.54260335e+00],
>> >         [-4.60590231e-01, -3.85364427e-01,  1.80760103e+00]]),
>> >  RangeIndex(start=0, stop=4, step=1),
>> >  RangeIndex(start=0, stop=3, step=1))
>> >
>> > Only once you do something with the dataframe, such as printing it or
>> > calculating something, does the BlockManager get created:
>> >
>> > In [3]: df
>> > Out[3]: Initializing !!!
>> >
>> >           0         1         2
>> > 0  0.152972 -0.569205  0.554430
>> > 1 -1.099161 -1.163154 -1.510711
>> > 2  0.705185 -0.001530  1.542603
>> > 3 -0.460590 -0.385364  1.807601
>> >
>> > In [4]: df = pd.DataFrame._init_lazy(np.random.randn(4, 3), (pd.RangeIndex(4), pd.RangeIndex(3)))
>> >
>> > In [5]: df.mean()
>> > Initializing !!!
>> > Out[5]:
>> > 0    0.397243
>> > 1    0.269996
>> > 2   -0.454929
>> > dtype: float64
>> >
>> > There are of course many things missing (validation of the input to
>> > _init_lazy, potentially being able to access df.index/df.columns without
>> > initializing the block manager, hooking this up in __array__, what to do
>> > about pickling, ...).
>> > But this is just to illustrate the idea.
>> > _______________________________________________
>> > Pandas-dev mailing list
>> > Pandas-dev at python.org
>> > https://mail.python.org/mailman/listinfo/pandas-dev