[Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks

Tom Augspurger tom.augspurger88 at gmail.com
Thu Jun 11 10:55:51 EDT 2020


We discussed this on the call yesterday
(https://docs.google.com/document/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit?usp=sharing).
I'll attempt a summary for the mailing list, along with a proposed course of
action.

In general, there was agreement with the goal of simplifying pandas'
internals, and making DataFrame a column-store seems to be the best way to
achieve that. The primary arguments against were implementation costs and
possible performance slowdowns for very short and wide dataframes.

It was generally agreed that the change will need to be toggleable, perhaps
via a parameter to the DataFrame constructor and a global option. This will
make it easier to implement the new behavior and test it against the existing
behavior, both for us developers and for users.
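
To make the "constructor parameter plus global option" idea concrete, here is
a toy sketch in plain Python. Everything in it is hypothetical: the option key
"mode.manager", the `manager` keyword, the values "block" and "column", and
the `ToyFrame` class are made up for illustration and are not pandas APIs.

```python
# Hypothetical sketch: none of these names are real pandas APIs.
_OPTIONS = {"mode.manager": "block"}  # global default: current behavior

VALID_MANAGERS = ("block", "column")


def set_manager_option(value):
    """Set the global default manager (hypothetical option)."""
    if value not in VALID_MANAGERS:
        raise ValueError(f"unknown manager: {value!r}")
    _OPTIONS["mode.manager"] = value


class ToyFrame:
    """Stand-in for DataFrame that only records which manager it would use."""

    def __init__(self, data, manager=None):
        # An explicit constructor argument wins; otherwise fall back to
        # the global option.
        if manager is None:
            manager = _OPTIONS["mode.manager"]
        if manager not in VALID_MANAGERS:
            raise ValueError(f"unknown manager: {manager!r}")
        self.manager = manager
        self.data = dict(data)


default_frame = ToyFrame({"a": [1, 2]})
explicit_frame = ToyFrame({"a": [1, 2]}, manager="column")

set_manager_option("column")
after_option_frame = ToyFrame({"a": [1, 2]})
set_manager_option("block")  # restore the default
```

Having both knobs would let a test suite flip the global option to run
everything against the new behavior, while a single test can still pin one
frame to a specific manager.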

We are keeping in mind the scikit-learn style use case of boxing and unboxing
a (homogeneous) array in a DataFrame. We're committed to keeping that
zero-copy and avoiding creating one Python object per column.
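
As a rough illustration of those two constraints (no copy when boxing or
unboxing, and no per-column object created up front), here is a toy sketch
using only the standard library. `ToyHomogeneousFrame` is a hypothetical
stand-in and says nothing about how pandas would actually implement this; it
just wraps one row-major buffer and hands out strided views on demand.

```python
from array import array


class ToyHomogeneousFrame:
    """Toy stand-in (not pandas): wraps one row-major buffer of floats."""

    def __init__(self, buffer, n_rows, n_cols):
        if len(buffer) != n_rows * n_cols:
            raise ValueError("buffer length does not match shape")
        self._view = memoryview(buffer)  # boxing: a view, not a copy
        self.n_rows = n_rows
        self.n_cols = n_cols

    def column(self, j):
        # Column views are created on demand, so a wide frame does not
        # pay for one Python object per column at construction time.
        if not 0 <= j < self.n_cols:
            raise IndexError(j)
        return self._view[j::self.n_cols]

    def unbox(self):
        # Unboxing hands back a view of the same memory.
        return self._view


values = array("d", [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # 2 rows x 3 columns
frame = ToyHomogeneousFrame(values, n_rows=2, n_cols=3)

second_column = frame.column(1)
second_column[0] = 20.0  # write through the column view
```

Because the column is a view, the write shows up in the original `values`
buffer, which is the behavior the boxing/unboxing use case relies on.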

Does this summary accurately capture the discussion?

---

Going forward, there are many pieces that can be done, some in parallel.
Let's keep that discussion on concrete details in
https://github.com/pandas-dev/pandas/issues/34669.

I do want to highlight one overlapping area, though. We have some PRs up
(most from Brock) that affect consolidation today, mostly by disabling
consolidation in specific places (e.g.
https://github.com/pandas-dev/pandas/pull/34683). My question: do we want to
continue pursuing reduced consolidation *in the current block manager*?

IMO, that's a tricky question to answer. The performance implications of
consolidation are hard to pin down, in part because they are so
workload-dependent. Sometimes consolidation is avoided completely, so it's a
win. Other times it's merely delayed until an operation that needs
consolidated blocks, and so is a wash. We also have to weigh:

1. The unclear impact changing consolidation has on views vs. copies, and our
   unclear *policy* on when things are views vs. copies
2. The real possibility of a non-consolidating, all-1D "Block" manager in the
   next year or two
3. The unclear extent to which non-consolidated data is tested by our unit
   tests

Certainly, fixing bugs is a worthy goal on its own. So to the extent that
(non)consolidation causes buggy behavior, we'll want to fix that. But
overall, I think the project's efforts would be better focused elsewhere
(ideally on progressing to the all-1D block manager, but wherever we think is
highest-value).
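
To illustrate why the payoff of skipping consolidation is workload-dependent,
here is a toy model in plain Python. It is not the pandas BlockManager:
`ToyManager`, its methods, and the `row_sums` operation are all hypothetical,
and "consolidation" here just means merging same-dtype blocks into one.

```python
class ToyManager:
    """Toy model (not pandas internals): columns live in per-dtype blocks."""

    def __init__(self):
        self.blocks = []         # each block: (dtype name, {column: values})
        self.consolidations = 0  # how many times blocks were actually merged

    def add_column(self, name, values, dtype):
        # New columns arrive as their own single-column block.
        self.blocks.append((dtype, {name: list(values)}))

    def is_consolidated(self):
        dtypes = [dtype for dtype, _ in self.blocks]
        return len(dtypes) == len(set(dtypes))

    def consolidate(self):
        if self.is_consolidated():
            return
        merged = {}
        for dtype, columns in self.blocks:
            merged.setdefault(dtype, {}).update(columns)
        self.blocks = list(merged.items())
        self.consolidations += 1

    def get_column(self, name):
        # Column access works on any layout, so it never consolidates.
        for _, columns in self.blocks:
            if name in columns:
                return columns[name]
        raise KeyError(name)

    def row_sums(self):
        # Stand-in for an operation that wants one block per dtype.
        self.consolidate()
        rows = zip(*(v for _, cols in self.blocks for v in cols.values()))
        return [sum(row) for row in rows]


def build():
    mgr = ToyManager()
    for i in range(3):
        mgr.add_column(f"x{i}", [i, i + 1], dtype="int64")
    return mgr


# Workload A: only column access, so consolidation never happens (a win).
column_only = build()
column_only.get_column("x1")

# Workload B: a block-wise operation follows, so the merge still happens,
# just later (a wash).
needs_blocks = build()
sums = needs_blocks.row_sums()
```

In workload A the merge is skipped entirely; in workload B the same merge
runs once, only deferred to the first operation that needs it.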

Do others have thoughts on what changes should be made to the "pandas 1.x
BlockManager" while we work towards the "2.x BlockManager"?

- Tom

On Tue, Jun 9, 2020 at 10:46 AM Joris Van den Bossche <
jorisvandenbossche at gmail.com> wrote:

> On Mon, 1 Jun 2020 at 20:07, Brock Mendel <jbrockmendel at gmail.com> wrote:
>
>> Joris and I accidentally took part of the discussion off-thread.  My
>> suggestion boils down to: Let's
>>
>> 1) Identify pieces of this that we want to do regardless of whether we do
>> the rest of it (e.g. consolidate only in internals, view-only indexing on
>> columns).
>>
>
> Personally I am not sure it is worth trying to change consolidation
> policies (moving to internals is certainly fine of course, but I mean eg
> delaying) or copy/view semantics for the *current*, consolidated
> BlockManager.
>
> But there are certainly pieces in the internals that can be changed which
> are useful regardless. I opened
> https://github.com/pandas-dev/pandas/issues/34669 to have a more concrete
> discussion about this on github.
>
>
>> 2) Beef up the asvs to give a closer-to-full measure of the tradeoffs
>> when an eventual proof of concept/PR is made.
>>
>>
> We probably won't have a "one big PR" that is going to implement a
> simplified block manager, so it's not really clear to me how ASV will help
> with making a decision on this?
> (it will for sure be very useful *along the way* to keep track of where
> we need to optimize things to preserve performance)
>
> Joris