[Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks

Tue May 26 15:34:52 EDT 2020

On Tue, 26 May 2020 at 13:21, Tom Augspurger <tom.augspurger88 at gmail.com>
wrote:

>
> On Tue, May 26, 2020 at 3:35 AM Joris Van den Bossche <
> jorisvandenbossche at gmail.com> wrote:
>
>> - We could make the DataFrame construction from a 2D array/matrix kind of
>> "lazy" (or have an option to do it like this): upon construction just store
>> the 2D array as is, and only once you perform an actual operation on it,
>> convert to a columnar store. And that would make it possible to still get
>> the 2D array back with zero-copy, if all you did was passing this DataFrame
>> to the next step of the pipeline.
>>
>> I think the first option should be fairly easy to do, and should solve a
>> large part of the concerns for scikit-learn (I think?).
>>
>
> I think the first option would solve that use case for scikit-learn. It
> sounds feasible, but I'm not sure how easy it would be.
>
>
A quick, ugly proof-of-concept:
https://github.com/pandas-dev/pandas/commit/cf387dced4803b81ec8709eeaf624369abca1188

It makes it possible to create a "DataFrame" from an ndarray without
creating a BlockManager, while still giving access to the original ndarray:

In [1]: df = pd.DataFrame._init_lazy(
   ...:     np.random.randn(4, 3), (pd.RangeIndex(4), pd.RangeIndex(3)))

In [2]: df._mgr_data
Out[2]:
(array([[ 1.52971972e-01, -5.69204971e-01,  5.54430115e-01],
        [-1.09916133e+00, -1.16315362e+00, -1.51071081e+00],
        [ 7.05185110e-01, -1.53009348e-03,  1.54260335e+00],
        [-4.60590231e-01, -3.85364427e-01,  1.80760103e+00]]),
 RangeIndex(start=0, stop=4, step=1),
 RangeIndex(start=0, stop=3, step=1))

The BlockManager only gets created once you do something with the
DataFrame, such as printing it or calculating something:

In [3]: df
Out[3]: Initializing !!!

          0         1         2
0  0.152972 -0.569205  0.554430
1 -1.099161 -1.163154 -1.510711
2  0.705185 -0.001530  1.542603
3 -0.460590 -0.385364  1.807601

In [4]: df = pd.DataFrame._init_lazy(
   ...:     np.random.randn(4, 3), (pd.RangeIndex(4), pd.RangeIndex(3)))

In [5]: df.mean()
Initializing !!!
Out[5]:
0    0.397243
1    0.269996
2   -0.454929
dtype: float64

There are of course many things missing (validation of the input to
_init_lazy, potentially being able to access df.index/df.columns without
initializing the block manager, hooking this up in __array__, how to
handle pickling, ...).
But this is just to illustrate the idea.
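
For readers who want to play with the pattern outside pandas, here is a
minimal, self-contained sketch of the same idea. It is not the code from the
commit above: the class LazyFrame and its attributes (_raw, _store,
initialized) are hypothetical names made up for illustration, and a plain
dict of 1D arrays stands in for the BlockManager. It assumes only NumPy.

```python
import numpy as np


class LazyFrame:
    """Toy stand-in for a lazily initialized DataFrame (not pandas code)."""

    def __init__(self, values, index, columns):
        values = np.asarray(values)
        # Minimal input validation (one of the "missing things" listed above).
        if values.ndim != 2:
            raise ValueError("expected a 2D array")
        if values.shape != (len(index), len(columns)):
            raise ValueError("axes do not match the shape of the values")
        self._raw = (values, index, columns)  # stored as is, no copy
        self._store = None  # columnar store, built on first real operation

    # The axes are available without building the columnar store.
    @property
    def index(self):
        return self._raw[1]

    @property
    def columns(self):
        return self._raw[2]

    @property
    def initialized(self):
        return self._store is not None

    def _get_store(self):
        if self._store is None:
            values, _, columns = self._raw
            # Convert to a columnar layout: one independent 1D array per column.
            self._store = {
                name: values[:, i].copy() for i, name in enumerate(columns)
            }
        return self._store

    def mean(self):
        # Any "actual operation" goes through the columnar store.
        return {name: float(col.mean()) for name, col in self._get_store().items()}

    def __array__(self, dtype=None, copy=None):
        if self._store is None:
            out = self._raw[0]  # untouched: hand back the original 2D array
        else:
            out = np.column_stack(list(self._store.values()))
        if dtype is not None:
            out = out.astype(dtype, copy=False)
        if copy:
            out = out.copy()
        return out


values = np.arange(12.0).reshape(4, 3)
lf = LazyFrame(values, range(4), range(3))
roundtrip = np.asarray(lf)  # no columnar store has been built at this point
column_means = lf.mean()    # this call triggers the conversion
```

As long as nothing but the axes has been touched, np.asarray(lf) shares memory
with the input array; after the first operation it is rebuilt from the
per-column arrays and is therefore a copy.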