[Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks

Thu Jun 11 11:51:25 EDT 2020

> Does this summary accurately capture the discussion?

Not quite.

> there was agreement with the goal of simplifying pandas' internals,

Yes.

> and making DataFrame a column-store seems to be the best way to achieve
> that.

No.

We will not know this until we see an implementation.  Nor will we know the
performance impact.  My expectation is that the performance impact will
lead to a bunch of workarounds that cut against the simplification.

I strongly object to committing to this before having this information.
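
To make the kind of measurement concrete, here is a rough sketch in plain
NumPy. It is not pandas internals, and every name in it is made up for
illustration. It times one elementwise operation on a single 2D array
against the same operation applied to each of many 1D arrays, for a "very
short and wide" shape. Timings depend on machine and shape, so it prints
them rather than asserting which side wins.

```python
import timeit

import numpy as np

# Hypothetical "very short and wide" shape, chosen only for illustration.
n_rows, n_cols = 10, 2_000

rng = np.random.default_rng(0)
block_2d = rng.standard_normal((n_rows, n_cols))

# Stand-in for a column store: one independent 1D array per column.
columns_1d = [block_2d[:, i].copy() for i in range(n_cols)]


def add_one_2d():
    # One vectorized call over the whole 2D array.
    return block_2d + 1.0


def add_one_per_column():
    # One NumPy call per column, driven by a Python-level loop.
    return [col + 1.0 for col in columns_1d]


result_2d = add_one_2d()
result_cols = add_one_per_column()

# Both layouts must give the same values.
assert np.allclose(result_2d, np.column_stack(result_cols))

t_2d = timeit.timeit(add_one_2d, number=20)
t_cols = timeit.timeit(add_one_per_column, number=20)
print(f"single 2D array:  {t_2d:.4f}s for 20 runs")
print(f"{n_cols} 1D arrays: {t_cols:.4f}s for 20 runs")
```

This only measures raw array arithmetic. A real comparison would have to go
through an actual implementation and the asv suite.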

---
I have tried to avoid bringing up 2D EAs in this conversation, but the term
"best way" requires a discussion of alternatives.

Supporting 2D EAs will allow for a large fraction of the same simplifications
(grep for "TODO(EA2D)"), and will _improve_ performance (e.g. in reshape and
arithmetic operations) instead of hurting it.  It means removing
workarounds rather than adding new ones.

It also allows for an incremental upgrade path: opt-in for 1.X, then if we
like it, required for 2.X.

----
> Going forward, there are many pieces that can be done, some in parallel

Related to, but not identical to, consolidation is the question of views vs.
copies on column indexing, GH#33780
<https://github.com/pandas-dev/pandas/issues/33780>, discussed on the
previous call without a solid conclusion.  The FUD largely boiled down to
"some users could be relying on the current behavior and there isn't a nice
way to deprecate it".  On further reflection, this seems like an impossible
standard to meet for _any_ change in not-tested/not-documented behavior.
We should move to making column indexing copy-free.
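
For anyone who wants to check what their installed version currently does,
a minimal sketch follows. The helper name is made up for illustration, and
it assumes that `Series.to_numpy()` does not itself copy for a plain
NumPy-backed float column. The answer may differ between pandas versions
and between consolidated and non-consolidated frames, which is the point.

```python
import numpy as np
import pandas as pd


def column_selection_shares_memory(df: pd.DataFrame, col: str) -> bool:
    """Hypothetical helper: True if two selections of `col` overlap in memory.

    If df[col] hands back a view on the frame's data, both selections point
    at the same buffer. If it copies, each selection gets its own buffer.
    """
    first = df[col].to_numpy()
    second = df[col].to_numpy()
    return bool(np.shares_memory(first, second))


df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
shares = column_selection_shares_memory(df, "a")
print(f"df['a'] selections share memory: {shares}")
```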

On Thu, Jun 11, 2020 at 7:56 AM Tom Augspurger <tom.augspurger88 at gmail.com>
wrote:

> We discussed this on the call yesterday
> (https://docs.google.com/document/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit?usp=sharing).
> I'll attempt a summary for the mailing list, and a proposed course of
> action.
>
> In general, there was agreement with the goal of simplifying pandas'
> internals, and making DataFrame a column-store seems to be the best way to
> achieve that. The primary arguments against were implementation costs and
> possible performance slowdowns for very short and wide dataframes.
>
> It was generally agreed that the change will need to be toggleable,
> perhaps by a parameter to the DataFrame constructor and a global option.
> This will make it easier to implement the new behavior and test it against
> existing behavior, both for us developers and users.
>
> We are keeping in mind the scikit-learn style usecase of boxing and
> unboxing a (homogeneous) array in a DataFrame. We're committed to keeping
> that 0-copy and avoiding creating one Python object per column.
>
> Does this summary accurately capture the discussion?
>
> ---
>
> Going forward, there are many pieces that can be done, some in parallel.
> Let's keep that discussion on concrete details in
> https://github.com/pandas-dev/pandas/issues/34669.
>
> I do want to highlight one overlapping area though. We have some PRs up
> (most from Brock) that affect consolidation today, mostly disabling
> consolidation in specific places (e.g.
> https://github.com/pandas-dev/pandas/pull/34683). My question: do we want
> to continue pursuing reduced consolidation *in the current block manager*?
>
> IMO, that's a tricky question to answer. The performance implications of
> consolidation are hard, in part because it's so workload-dependent.
> Sometimes, it's completely avoided so it's a win. Other times, it's merely
> delayed until an operation that needs consolidated blocks, and so is a
> wash. And given
>
> 1. The unclear impact changing consolidation has on views vs. copies, and
>    our unclear *policy* on when things are views vs. copies
> 2. The real possibility of a non-consolidating, all-1D "Block" manager in
>    the next year or two
> 3. The unclear extent to which non-consolidated data is tested by our unit
>    tests.
>
> Certainly, fixing bugs is a worthy goal on its own. So to the extent where
> (non)consolidation causes buggy behavior we'll want to fix that. But
> overall, I think the project's efforts would be better focused elsewhere
> (ideally on progressing to the all 1-D block manager, but wherever we think
> is highest-value).
>
> Do others have thoughts on what changes should be made to the "pandas 1.x
> BlockManager" while we work towards the "2.x BlockManager"?
>
> - Tom
>
> On Tue, Jun 9, 2020 at 10:46 AM Joris Van den Bossche <
> jorisvandenbossche at gmail.com> wrote:
>
>> On Mon, 1 Jun 2020 at 20:07, Brock Mendel <jbrockmendel at gmail.com> wrote:
>>
>>> Joris and I accidentally took part of the discussion off-thread.  My
>>> suggestion boils down to: Let's
>>>
>>> 1) Identify pieces of this that we want to do regardless of whether we
>>> do the rest of it (e.g. consolidate only in internals, view-only indexing
>>> on columns).
>>>
>>
>> Personally I am not sure it is worth trying to change consolidation
>> policies (moving to internals is certainly fine of course, but I mean e.g.
>> delaying) or copy/view semantics for the *current*, consolidated
>> BlockManager.
>>
>> But there are certainly pieces in the internals that can be changed which
>> are useful regardless. I opened
>> https://github.com/pandas-dev/pandas/issues/34669 to have a more
>> concrete discussion about this on github.
>>
>>
>>> 2) Beef up the asvs to give a closer-to-full measure of the tradeoffs
>>> when an eventual proof of concept/PR is made.
>>>
>>>
>> We probably won't have a "one big PR" that is going to implement a
>> simplified block manager, so it's not really clear to me how ASV will help
>> with making a decision on this?
>> (it will for sure be very useful *along the way* to keep track of where
>> we need to optimize things to preserve performance)
>>
>> Joris
>>
>> _______________________________________________
>> Pandas-dev mailing list
>> Pandas-dev at python.org
>> https://mail.python.org/mailman/listinfo/pandas-dev
>>