[Pandas-dev] Arithmetic Proposal

Wed Jun 12 10:46:37 EDT 2019

TL;DR:

> So while I agree that Blocks being backed by a maybe 1D / maybe 2D array
> causes no end of headaches

For readers who don't find the performance issue compelling, the bugs and
complexity this addresses should be compelling.

--------

> Could we instead dispatch both Series and DataFrame ops to Block ops
> (which then do the op on the ndarray or dispatch to the EA)?

@TomAugspurger Yes, though as mentioned in the OP, my attempts so far to
make this work have failed.

This suggestion boils down to effectively implementing these ops on Block,
which is the opposite of the direction we want to be taking the Block
classes.  In terms of Separation of Concerns it makes much more sense for
the array-like operations to be defined on a dedicated array class, in this
case PandasArray.

Moreover, implementing them on PandasArray gives us "for free" consistency
between Series/DataFrame, Index, and PandasArray ops, whereas implementing
them on Block gives only Series/DataFrame consistency.
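
As a minimal sketch of that argument (toy classes only; `ToyArray`,
`ToySeries`, and `ToyIndex` are hypothetical names, not pandas internals),
defining each op once on the array class means every wrapper that delegates
to it gets the same behavior:

```python
import operator


class ToyArray:
    """Hypothetical stand-in for PandasArray: the only place ops are defined."""

    def __init__(self, values):
        self.values = list(values)

    def _arith(self, other, op):
        # Scalar operands only, to keep the sketch short.
        return ToyArray(op(v, other) for v in self.values)

    def __add__(self, other):
        return self._arith(other, operator.add)

    def __mul__(self, other):
        return self._arith(other, operator.mul)


class _Delegating:
    """Wrapper that holds a ToyArray and passes ops straight through to it."""

    def __init__(self, array):
        self.array = array

    def __add__(self, other):
        return type(self)(self.array + other)

    def __mul__(self, other):
        return type(self)(self.array * other)


class ToySeries(_Delegating):
    pass


class ToyIndex(_Delegating):
    pass


ser = ToySeries(ToyArray([1, 2, 3])) + 10
idx = ToyIndex(ToyArray([1, 2, 3])) + 10
# Both wrappers went through ToyArray.__add__, so the results cannot diverge.
print(ser.array.values == idx.array.values)
```

A bug fixed in `ToyArray._arith` is fixed for every wrapper at once, which is
the property I'm after for Series, DataFrame, and Index.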

> 10% performance boost for those (I’m taking that figure from one of your
> comments in #24990).

@WillAyd that comment referred to the cost of instantiating the DataFrame,
not the arithmetic op.  Earlier in that same comment I refer to the
arithmetic op as being 10x slower, not 10% slower.
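
For intuition on why the gap grows with the number of columns, here is a toy
count of dispatches. It uses plain Python lists and hypothetical function
names, models only the fixed per-call overhead, and is not a benchmark of
pandas:

```python
def add_columnwise(columns, scalar):
    """One dispatch per column, as in the column-by-column path."""
    dispatches = 0
    out = []
    for col in columns:
        dispatches += 1
        out.append([v + scalar for v in col])
    return out, dispatches


def add_blockwise(block, scalar):
    """One dispatch for the whole 2D block."""
    dispatches = 1
    out = [[v + scalar for v in col] for col in block]
    return out, dispatches


# A "wide" frame: 1000 columns of 3 rows each.
wide = [[0, 1, 2] for _ in range(1000)]
by_col, n_col = add_columnwise(wide, 1)
by_block, n_block = add_blockwise(wide, 1)
print(by_col == by_block, n_col, n_block)
```

The results are identical; only the number of dispatches differs, so any
fixed cost per dispatch is paid once per column in the first path.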

> I’ve done zero work with blocks and I think they definitely come at an
> extra development / maintenance cost.

I've done a bunch of work with blocks, mostly trying to get code _out_ of
them.  Ignore the entire performance issue: allowing EA to be 2D (heck,
even restricted to (1, N) and (N, 1) would be enough!) would let us rip out
so much (buggy) code I'll shed tears of joy.
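
To make "allowing EA to be 2D" concrete, here is a rough sketch of an array
that defines `shape` and derives everything else from it, along the lines of
the implementation strategy in the original proposal quoted below. The class
`Toy2DArray` and its flat row-major storage are assumptions for illustration,
not the actual ExtensionArray interface:

```python
class Toy2DArray:
    """Hypothetical 2D array: `shape` is primary, other attrs derive from it."""

    def __init__(self, data, shape):
        data = list(data)
        rows, cols = shape
        if rows * cols != len(data):
            raise ValueError("shape does not match number of elements")
        self._data = data          # flat, row-major storage
        self.shape = (rows, cols)

    @property
    def ndim(self):
        return len(self.shape)

    @property
    def size(self):
        rows, cols = self.shape
        return rows * cols

    def __len__(self):
        # Defined in terms of shape rather than the other way around.
        return self.shape[0]

    def reshape(self, rows, cols):
        return Toy2DArray(self._data, (rows, cols))

    def ravel(self):
        return list(self._data)

    def transpose(self):
        rows, cols = self.shape
        flipped = [self._data[r * cols + c]
                   for c in range(cols) for r in range(rows)]
        return Toy2DArray(flipped, (cols, rows))

    @property
    def T(self):
        return self.transpose()


col = Toy2DArray([1, 2, 3], (3, 1))   # the (N, 1) case
row = col.T                            # the (1, N) case
print(col.shape, row.shape, len(col), len(row))
```

For the (N, 1) and (1, N) cases the transpose never has to move data, which
is part of why even that restricted form would be enough.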

On Wed, Jun 12, 2019 at 7:53 AM William Ayd via Pandas-dev <
pandas-dev at python.org> wrote:

> I’m wary to expand operations done at the Block level. As a core developer
> for over a year now, I’ve done zero work with blocks and I think they
> definitely come at an extra development / maintenance cost.
>
> I think wide DataFrames are the exception rather than the norm so it’s
> probably not worth code to eke out a 10% performance boost for those (I’m
> taking that figure from one of your comments in #24990).
>
> - Will
>
> On Jun 11, 2019, at 10:08 PM, Tom Augspurger <tom.augspurger88 at gmail.com>
> wrote:
>
> One general question, motivated by Joris' same concern about the future
> simplified BlockManager: why do block-based, rather than column-based,
> ops require 2D Extension Arrays? You say
>
> > by making DataFrame arithmetic ops operate column-by-column, dispatching
> > to the Series implementations.
>
> Could we instead dispatch both Series and DataFrame ops to Block ops (which
> then do the op on the ndarray or dispatch to the EA)? If I understand your
> proposal correctly, then you still have the general DataFrame -> Block ->
> Array nesting doll. It seems like that should work equally well with our
> current mix of 2-D and 1-D blocks.
>
> So while I agree that Blocks being backed by a maybe 1D / maybe 2D array
> causes no end of headaches, I don't see why block-based ops need 2D EAs
> (though I'm not especially familiar with this area; I could easily be
> missing something basic).
>
> - Tom
>
> On Tue, Jun 11, 2019 at 5:25 PM Stephan Hoyer <shoyer at gmail.com> wrote:
>
>> Indeed, it's worth considering if perhaps it would be OK to have a
>> performance regression for very wide dataframes instead.
>>
>> With regards to xarray, 2D extension arrays are interesting but still not
>> particularly helpful. We would still need a wrapper to make them fully N-D,
>> which we need for our data model.
>>
>> On Tue, Jun 11, 2019 at 6:18 PM Joris Van den Bossche <
>> jorisvandenbossche at gmail.com> wrote:
>>
>>> Hi Brock,
>>>
>>> Thanks a lot for starting this discussion and the detailed proposal!
>>>
>>> I will try to look at it in more detail tomorrow, but one general
>>> remark: from time to time, we talked about "getting rid of the
>>> BlockManager" or "simplifying the BlockManager" (although I am not sure if
>>> there is any specific github issue about it, might be from in-person
>>> discussions). One of the interpretations of that (or at least how I
>>> understood those discussions) was to get away from the 2D block-based
>>> internals, and go to a simpler "table as collection of 1D arrays" model.
>>> This would also enable a simplification of the internals / BlockManager and
>>> many of the other items you mention.
>>>
>>> So I think we should at least compare a more detailed version of what I
>>> described above against your proposal. If we want to go in that direction
>>> long term, I am not sure extensive work on the current 2D blocks-based
>>> BlockManager is worth our time.
>>>
>>> Joris
>>>
>>> Op di 11 jun. 2019 om 22:38 schreef Brock Mendel <jbrockmendel at gmail.com
>>> >:
>>>
>>>> I've been working on arithmetic/comparison bugs and more recently on
>>>> performance problems caused by fixing some of those bugs.  After trying
>>>> less-invasive approaches, I've concluded a fairly big fix is called for.
>>>> This is an RFC for that proposed fix.
>>>>
>>>> ------
>>>> In 0.24.0 we fixed some arithmetic bugs in DataFrame operations by
>>>> making DataFrame arithmetic ops operate column-by-column, dispatching to
>>>> the Series implementations.  This led to a significant performance hit for
>>>> operations on DataFrames with many columns (#24990, #26061).
>>>>
>>>> To restore the lost performance, we need to have these operations take
>>>> place at the Block level.  To prevent DataFrame behavior from diverging
>>>> from Series behavior (again), we need to retain a single shared
>>>> implementation.
>>>>
>>>> This is a proposal for how to meet these two needs.
>>>>
>>>> Proposal:
>>>> - Allow EA to support 2D arrays
>>>> - Use PandasArray to back Block subclasses currently backed by ndarray
>>>> - Implement arithmetic and comparison ops directly on PandasArray, then
>>>> have Series, DataFrame, and Index ops pass through to the PandasArray
>>>> implementations.
>>>>
>>>> Fixes:
>>>> - Performance degradation in DataFrame ops (#24990, #26061)
>>>> - The last remaining inconsistencies between Index and Series ops
>>>> (#19322, #18824)
>>>> - Most of the xfailing arithmetic tests
>>>> - #22120: Transposing dataframe loses dtype and ExtensionArray
>>>> - #24600 BUG: DataFrame[Sparse] quantile fails because SparseArray has
>>>> no reshape
>>>> - #23925 DataFrame Quantile Broken with Datetime Data
>>>>
>>>> Other Upsides:
>>>> - Series constructor could dispatch to pd.array, de-duplicating a lot
>>>> of code.
>>>> - Easier to move to Arrow backend if Blocks are numpy-naive.
>>>> - Make EA closer to a drop-in replacement for np.ndarray, necessary if
>>>> we want e.g. xarray to find them directly useful (#24716, #24583)
>>>> - Block/BlockManager simplifications, see below.
>>>>
>>>> Downsides:
>>>> - Existing constructors assume 1D
>>>> - Existing downstream authors assume 1D
>>>> - Reduction ops (of which there aren't many) don't have axis kwarg ATM
>>>>    - But for PandasArray they just pass through to nanops, which
>>>> already have+test the axis kwargs
>>>>    - For DatetimeArray/TimedeltaArray/PeriodArray, I'm the one
>>>> implementing the reductions and am OK with this extra complication.
>>>>
>>>> Block Simplifications:
>>>> - Blocks have three main attributes: values, mgr_locs, and ndim
>>>> - ndim is _usually_ the same as values.ndim, the exceptions being for
>>>> cases where type(values) is restricted to 1D
>>>> - Without these restrictions, we can get rid of:
>>>>    - Block.ndim, associated kludgy ndim-checking code
>>>>    - numerous can-this-be-reshaped/transposed checks and special cases
>>>> in Block and BlockManager code (which are buggy anyway, e.g. #23925)
>>>> - With ndim gone, we can then get rid of mgr_locs!
>>>>    - The blocks themselves never use mgr_locs except when passing to
>>>> their own constructors.
>>>>    - mgr_locs makes _much_ more sense as an attribute of the
>>>> BlockManager
>>>> - With mgr_locs gone, Block becomes just a thin wrapper around an EA
>>>>
>>>> Implementation Strategy:
>>>> - Remove the 1D restriction
>>>>    - Fairly small tweak, EA subclass must define `shape` instead of
>>>> `__len__`; other attrs are defined in terms of shape.
>>>>    - Define `transpose`, `T`, `reshape`, and `ravel`
>>>> - With this done, several tasks can proceed in parallel:
>>>>    - simplifications in core.internals, as special-cases for 1D-only
>>>> can be removed
>>>>    - implement and test arithmetic ops on PandasArray
>>>>    - back Blocks with PandasArray
>>>>    - back Index (and numeric subclasses) with PandasArray
>>>> - Change DataFrame, Series, Index ops to pass through to underlying
>>>> Blocks/PandasArrays
>>>> _______________________________________________
>>>> Pandas-dev mailing list
>>>> Pandas-dev at python.org
>>>> https://mail.python.org/mailman/listinfo/pandas-dev