[Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks

Wed May 27 17:07:41 EDT 2020

> I don't think this "lazy _mgr attribute" is comparable in complexity with
> the consolidated BlockManager

Not on its own, no.  But my prior is that this isn't the last thing that
will merit its own special case.

> I think it is clear that a BlockManager with only 1D arrays/blocks *can* be
> simpler than one with interleaved/consolidated blocks.

Absolutely agree.  I've spent a big chunk of the last year dealing with
BlockManager code and have no great love for it.

> But this is also only one of the arguments. Complexity alone is not a
> reason to not do something; it's the general trade-off with what you gain
> or lose with it.

The main upsides I see are a) internal complexity reduction, b) downstream
library upsides, c) clearer view vs copy semantics, d) perf improvements
from making fewer copies, e) clear "dict of Series" data model.

The main downside is potential performance degradation (at the extreme end,
e.g. 3000x <https://github.com/pandas-dev/pandas/issues/24990> for
arithmetic).  As Wes commented, some of that can be ameliorated with
compiled code, but that cuts against the complexity reduction.

I am looking for ways to quantify these tradeoffs so we can make an
informed decision.

On Wed, May 27, 2020 at 12:57 PM Joris Van den Bossche <
jorisvandenbossche at gmail.com> wrote:

> On Tue, 26 May 2020 at 23:00, Tom Augspurger <tom.augspurger88 at gmail.com>
> wrote:
>
>>
>> On Tue, May 26, 2020 at 3:50 PM Brock Mendel <jbrockmendel at gmail.com>
>> wrote:
>>
>>> > It allows creating a "DataFrame" from an ndarray without creating a
>>> > BlockManager, and it allows accessing this original ndarray:
>>>
>>> This is a neat proof of concept, but it cuts against the "decreases
>>> complexity" argument.  Is there a viable way to quantify (even very
>>> roughly) the complexity effect of going all-1D?
>>>
>>
>> That complexity is at least localized to a single attribute. That's quite
>> different from the 1D & 2D blocks situation, where many methods (though
>> fewer than a year ago) need to be concerned with whether the array in a
>> block is 1D or 2D, or whether the DataFrame is consolidated, homogeneous, ...
>>
>>
> I don't think this "lazy _mgr attribute" is comparable in complexity with
> the consolidated BlockManager. Furthermore: it's targeted at a very
> specific and limited use case (and e.g. also doesn't need to be the default,
> I think).
> Now, exactly quantifying the effect of going all-1D, that's of course
> hard. But just one example: all code that deals with blknos/blklocs (the
> mapping between the position in the consolidated blocks and the position in
> the dataframe), which is a significant part of managers.py, could be
> simplified considerably.
>
> But anyway: I think it is clear that a BlockManager with only 1D
> arrays/blocks *can* be simpler than one with interleaved/consolidated
> blocks. But this is also only one of the arguments. Complexity alone is not
> a reason to not do something; it's the general trade-off with what you gain
> or lose with it.
>
>
>> A couple ideas for ways to simplify this decision-making problem:
>>>
>>
>>
>>> 2) IIUC going all-1D will cause column indexing to always return views.
>>> Elsewhere you have noted that this is a breaking API change which merited
>>> discussion in its own right.  xref #33780
>>> <https://github.com/pandas-dev/pandas/issues/33780>.  My takeaway from
>>> this part of the last dev call was that people were generally positive on
>>> the all-views idea, but were wary of how to handle the potential
>>> deprecation.
>>>
>>
>> This type of change would merit a major version bump. If possible, we'd
>> ideally have some kind of option to disable consolidation / enable
>> splitting, which would allow for users to test their code on older versions.
>>
>
> Yes, going to an all-1D BlockManager would be something for a major
> version bump, e.g. pandas 2.0. So I think that is the perfect opportunity to
> also make column selections always return views.
>
>
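
For readers unfamiliar with the blknos/blklocs bookkeeping mentioned in the
quoted message above, here is a rough pure-Python sketch of the idea. It is
not the pandas implementation: build_block_maps is a hypothetical helper, the
block numbering is only illustrative (real pandas may order blocks
differently), and it assumes the only consolidation rule is "one block per
dtype". It only shows why the mapping becomes trivial when every column lives
in its own 1D block.

```python
def build_block_maps(dtypes, consolidate=True):
    """Return (blknos, blklocs) for a list of column dtypes.

    blknos[i]  -> number of the block that holds column i
    blklocs[i] -> position of column i inside that block

    Hypothetical helper for illustration only; assumes columns are
    grouped into one block per dtype when consolidate=True.
    """
    blknos, blklocs = [], []
    block_of_dtype = {}  # dtype -> block number (consolidated case only)
    block_sizes = []     # how many columns each block holds so far
    for dtype in dtypes:
        if consolidate and dtype in block_of_dtype:
            # Reuse the existing block for this dtype.
            blkno = block_of_dtype[dtype]
        else:
            # Start a new block (always the case when not consolidating).
            blkno = len(block_sizes)
            block_sizes.append(0)
            if consolidate:
                block_of_dtype[dtype] = blkno
        blknos.append(blkno)
        blklocs.append(block_sizes[blkno])
        block_sizes[blkno] += 1
    return blknos, blklocs


dtypes = ["int64", "float64", "int64", "object", "float64"]
consolidated = build_block_maps(dtypes, consolidate=True)
all_1d = build_block_maps(dtypes, consolidate=False)
print(consolidated)
print(all_1d)
```

In the non-consolidating case blknos is just 0..n-1 and blklocs is all zeros,
so under these assumptions the two arrays carry no information and the code
that maintains them could in principle go away.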