[Pandas-dev] A case for a simplified (non-consolidating) BlockManager with 1D blocks

Thu Jun 11 12:01:12 EDT 2020

On Thu, Jun 11, 2020 at 10:51 AM Brock Mendel <jbrockmendel at gmail.com>
wrote:

> > Does this summary accurately capture the discussion?
>
> Not quite.
>
> > there was agreement with the goal of simplifying pandas' internals,
>
> Yes.
>
> > and making DataFrame a column-store seems to be the best way to achieve
> > that.
>
> No.
>
> We will not know this until we see an implementation.  Nor will we know
> the performance impact.  My expectation is that the performance impact will
> lead to a bunch of workarounds that cut against the simplification.
>
> I strongly object to committing to this before having this information.
>

It'd be good to clarify exactly what you object to committing to. Changing
the Block Manager is a large task, made especially difficult by us being an
open-source project with many stake-holders and limited funding. I think
that we as a project can say "We as a project think that making DataFrame a
column store is best", while still acknowledging that it's an uncertain
goal that may be abandoned if it turns out to be a bad idea.

So to make sure: You're objecting to a column-store in principle, or you're
objecting to the project saying we think it's a good idea, or...?
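
For readers less familiar with the terms, the contrast between a consolidated block and a column store can be sketched roughly as follows. This is a toy model using only the standard library, not pandas internals: `ConsolidatedBlock` and `ColumnStore` are hypothetical names, and real blocks are 2D NumPy arrays rather than flat `array` buffers.

```python
from array import array

# Toy model only: these classes are hypothetical and are not pandas internals.


class ConsolidatedBlock:
    """All same-dtype columns packed into one buffer, one column after another."""

    def __init__(self, columns):
        self.nrows = len(columns[0])
        self.buf = array("d")
        for col in columns:
            self.buf.extend(col)  # consolidation copies every column

    def column(self, i):
        # A zero-copy window onto the shared buffer.
        start = i * self.nrows
        return memoryview(self.buf)[start:start + self.nrows]


class ColumnStore:
    """One independent 1D array per column; nothing is packed together."""

    def __init__(self, columns):
        self.cols = list(columns)  # keeps references, copies no data

    def column(self, i):
        return memoryview(self.cols[i])


a = array("d", [1.0, 2.0, 3.0])
b = array("d", [4.0, 5.0, 6.0])

block = ConsolidatedBlock([a, b])
store = ColumnStore([a, b])

a[0] = 99.0  # mutate the caller's original array

block_first = block.column(0)[0]  # the block holds its own copy of `a`
store_first = store.column(0)[0]  # the store still points at `a`
```

In this sketch the consolidated block no longer sees changes to `a`, because packing copied the data, while the column store still does. That difference is one reason consolidation and view/copy semantics are hard to separate.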

> ---
> I have tried to avoid bringing up 2D EAs in this conversation, but the
> term "best way" requires a discussion of alternatives.
>
> Allowing 2D EAs will allow for a large fraction of the same
> simplifications (grep for "TODO(EA2D)"), and will _improve_ performance (in
> eg reshape, arithmetic operations) instead of hurting it.  It means
> removing workarounds rather than adding new ones.
>
> It also allows for an incremental upgrade path: opt-in for 1.X, then if we
> like it, required for 2.X.
>

Will have thoughts on this later.

> ----
> > Going forward, there are many pieces that can be done, some in parallel
>
> Related to but not identical to consolidation is the views vs copies on
> column indexing, GH#33780
> <https://github.com/pandas-dev/pandas/issues/33780>, discussed on the
> previous call without a solid conclusion.  The FUD largely boiled down to
> "some users could be relying on the current behavior and there isn't a nice
> way to deprecate it".  On further reflection, this seems like an impossible
> standard to meet for _any_ change in not-tested/not-documented behavior.
> We should move to having column indexing being copy-free.
>

I think I disagree with that, at least to a degree. But it's primarily
about views vs. copies so I'll take it to
https://github.com/pandas-dev/pandas/issues/33780.
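
As a minimal illustration of why the view-versus-copy choice is visible to users, here is a standard-library sketch. It is an analogy under stated assumptions, not pandas behavior: `memoryview` stands in for a view-returning column lookup and `array(...)` for a copying one.

```python
from array import array

column = array("d", [1.0, 2.0, 3.0])

as_view = memoryview(column)  # shares memory with `column`
as_copy = array("d", column)  # independent buffer holding the same values

as_view[0] = -1.0  # writes through to `column`
as_copy[1] = -2.0  # leaves `column` untouched

result = column.tolist()
```

Code that writes into the selected column changes the parent in one case and not in the other, so switching between the two can silently change results for anyone relying on the current behavior.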

> On Thu, Jun 11, 2020 at 7:56 AM Tom Augspurger <tom.augspurger88 at gmail.com>
> wrote:
>
>> We discussed this on the call yesterday
>> (https://docs.google.com/document/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit?usp=sharing).
>> I'll attempt a summary for the mailing list, and a proposed course of
>> action.
>>
>> In general, there was agreement with the goal of simplifying pandas'
>> internals, and making DataFrame a column-store seems to be the best way
>> to achieve that. The primary arguments against were implementation costs
>> and possible performance slowdowns for very short and wide dataframes.
>>
>> It was generally agreed that the change will need to be toggleable,
>> perhaps by a parameter to the DataFrame constructor and a global option.
>> This will make it easier to implement the new behavior and test it
>> against existing behavior, both for us developers and users.
>>
>> We are keeping in mind the scikit-learn style usecase of boxing and
>> unboxing a (homogeneous) array in a DataFrame. We're committed to keeping
>> that 0-copy and avoiding creating one Python object per column.
>>
>> Does this summary accurately capture the discussion?
>>
>> ---
>>
>> Going forward, there are many pieces that can be done, some in parallel.
>> Let's keep that discussion on concrete details in
>> https://github.com/pandas-dev/pandas/issues/34669.
>>
>> I do want to highlight one overlapping area though. We have some PRs up
>> (most from Brock) that affect consolidation today, mostly disabling
>> consolidation in specific places (e.g.
>> https://github.com/pandas-dev/pandas/pull/34683). My question: do we want
>> to continue pursuing reduced consolidation *in the current block manager*?
>>
>> IMO, that's a tricky question to answer. The performance implications of
>> consolidation are hard, in part because it's so workload-dependent.
>> Sometimes, it's completely avoided so it's a win. Other times, it's
>> merely delayed until an operation that needs consolidated blocks, and so
>> is a wash. And given
>>
>> 1. The unclear impact changing consolidation has on views vs. copies,
>>    and our unclear *policy* on when things are views vs. copies
>> 2. The real possibility of a non-consolidating, all-1D "Block" manager
>>    in the next year or two
>> 3. The unclear extent to which non-consolidated data is tested by our
>>    unit tests.
>>
>> Certainly, fixing bugs is a worthy goal on its own. So to the extent
>> where (non)consolidation causes buggy behavior we'll want to fix that.
>> But overall, I think the project's efforts would be better focused
>> elsewhere (ideally on progressing to the all 1-D block manager, but
>> wherever we think is highest-value).
>>
>> Do others have thoughts on what changes should be made to the "pandas 1.x
>> BlockManager" while we work towards the "2.x BlockManager"?
>>
>> - Tom
>>
>> On Tue, Jun 9, 2020 at 10:46 AM Joris Van den Bossche <
>> jorisvandenbossche at gmail.com> wrote:
>>
>>> On Mon, 1 Jun 2020 at 20:07, Brock Mendel <jbrockmendel at gmail.com>
>>> wrote:
>>>
>>>> Joris and I accidentally took part of the discussion off-thread.  My
>>>> suggestion boils down to: Let's
>>>>
>>>> 1) Identify pieces of this that we want to do regardless of whether we
>>>> do the rest of it (e.g. consolidate only in internals, view-only indexing
>>>> on columns).
>>>>
>>>
>>> Personally I am not sure it is worth trying to change consolidation
>>> policies (moving to internals is certainly fine of course, but I mean eg
>>> delaying) or copy/view semantics for the *current*, consolidated
>>> BlockManager.
>>>
>>> But there are certainly pieces in the internals that can be changed
>>> which are useful regardless. I opened
>>> https://github.com/pandas-dev/pandas/issues/34669 to have a more
>>> concrete discussion about this on github.
>>>
>>>
>>>> 2) Beef up the asvs to give a closer-to-full measure of the tradeoffs
>>>> when an eventual proof of concept/PR is made.
>>>>
>>>>
>>> We probably won't have a "one big PR" that is going to implement a
>>> simplified block manager, so it's not really clear to me how ASV will help
>>> with making a decision on this?
>>> (it will for sure be very useful *along the way* to keep track of where
>>> we need to optimize things to preserve performance)
>>>
>>> Joris
>>>
>>> _______________________________________________
>>> Pandas-dev mailing list
>>> Pandas-dev at python.org
>>> https://mail.python.org/mailman/listinfo/pandas-dev
>>>
>