[Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks

Tue May 26 15:42:44 EDT 2020

Thanks for verifying the feasibility. Validation is a bit tricky, but I'd
hope that we can delay everything except the splitting / forming of blocks.
That may result in some non-obvious performance quirks, but at least for the
simple case of `data` being an ndarray and index / columns not forcing any
reindexing, I'm hopeful that it's not too bad.

On Tue, May 26, 2020 at 2:35 PM Joris Van den Bossche <
jorisvandenbossche at gmail.com> wrote:

> On Tue, 26 May 2020 at 13:21, Tom Augspurger <tom.augspurger88 at gmail.com>
> wrote:
>
>>
>> On Tue, May 26, 2020 at 3:35 AM Joris Van den Bossche <
>> jorisvandenbossche at gmail.com> wrote:
>>
>>> - We could make the DataFrame construction from a 2D array/matrix kind
>>> of "lazy" (or have an option to do it like this): upon construction just
>>> store the 2D array as is, and only once you perform an actual operation on
>>> it, convert to a columnar store. And that would make it possible to still
>>> get the 2D array back with zero-copy, if all you did was pass this
>>> DataFrame to the next step of the pipeline.
>>>
>>> I think the first option should be fairly easy to do, and should solve a
>>> large part of the concerns for scikit-learn (I think?).
>>>
>>
>> I think the first option would solve that use case for scikit-learn. It
>> sounds feasible, but I'm not sure how easy it would be.
>>
>>
> A quick, ugly proof-of-concept:
> https://github.com/pandas-dev/pandas/commit/cf387dced4803b81ec8709eeaf624369abca1188
>
> It allows creating a "DataFrame" from an ndarray without creating a
> BlockManager, and it allows accessing the original ndarray:
>
> In [1]: df = pd.DataFrame._init_lazy(np.random.randn(4, 3),
> (pd.RangeIndex(4), pd.RangeIndex(3)))
>
> In [2]: df._mgr_data
> Out[2]:
> (array([[ 1.52971972e-01, -5.69204971e-01,  5.54430115e-01],
>         [-1.09916133e+00, -1.16315362e+00, -1.51071081e+00],
>         [ 7.05185110e-01, -1.53009348e-03,  1.54260335e+00],
>         [-4.60590231e-01, -3.85364427e-01,  1.80760103e+00]]),
>  RangeIndex(start=0, stop=4, step=1),
>  RangeIndex(start=0, stop=3, step=1))
>
> And once you do something with the dataframe, such as printing it or
> calculating something, the BlockManager is only created at that step:
>
> In [3]: df
> Out[3]: Initializing !!!
>
>           0         1         2
> 0  0.152972 -0.569205  0.554430
> 1 -1.099161 -1.163154 -1.510711
> 2  0.705185 -0.001530  1.542603
> 3 -0.460590 -0.385364  1.807601
>
> In [4]: df = pd.DataFrame._init_lazy(np.random.randn(4, 3),
> (pd.RangeIndex(4), pd.RangeIndex(3)))
>
> In [5]: df.mean()
> Initializing !!!
> Out[5]:
> 0    0.397243
> 1    0.269996
> 2   -0.454929
> dtype: float64
>
> There are of course many things missing (validation of the input to
> _init_lazy, potentially being able to access df.index/df.columns without
> initializing the block manager, hooking this up in __array__, how to handle
> pickling, ...).
> This is just to illustrate the idea.
>
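
For readers without the proof-of-concept branch checked out, the lazy
construction idea can be roughly sketched outside of pandas. This is not the
code from the linked commit: `LazyFrame`, `_raw`, `_df` and `_materialize` are
hypothetical names made up for illustration, and the eager shape check is an
assumption about which validation is cheap enough to keep up front. Only
public pandas / NumPy calls are used.

```python
import numpy as np
import pandas as pd


class LazyFrame:
    """Hypothetical wrapper (not pandas API): hold the 2D ndarray as-is
    and only build a real DataFrame on first use."""

    def __init__(self, values, index, columns):
        # Assumption: only cheap shape validation is done eagerly.
        if values.ndim != 2:
            raise ValueError("values must be 2D")
        if values.shape != (len(index), len(columns)):
            raise ValueError("values shape does not match index/columns")
        self._raw = (values, index, columns)
        self._df = None  # the real DataFrame, created on demand

    @property
    def index(self):
        # Available without materializing the DataFrame.
        return self._raw[1]

    @property
    def columns(self):
        return self._raw[2]

    def to_numpy(self):
        # Zero-copy: hand back the original array if nothing touched it yet.
        if self._df is None:
            return self._raw[0]
        return self._df.to_numpy()

    def _materialize(self):
        if self._df is None:
            values, index, columns = self._raw
            self._df = pd.DataFrame(values, index=index, columns=columns)
        return self._df

    def __getattr__(self, name):
        # Only called when normal lookup fails; private names are not
        # forwarded, which also avoids recursion on _raw / _df.
        if name.startswith("_"):
            raise AttributeError(name)
        return getattr(self._materialize(), name)


arr = np.arange(12.0).reshape(4, 3)
lazy = LazyFrame(arr, pd.RangeIndex(4), pd.RangeIndex(3))

same_object = lazy.to_numpy() is arr      # array passed through untouched
untouched = lazy._df is None              # no DataFrame built so far
col_means = lazy.mean()                   # first real operation materializes
```

Forwarding everything through `__getattr__` keeps the sketch short, but it
does not cover special methods such as `__repr__` or `__array__`, nor
pickling, which are the same gaps listed in the quoted message.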