[Pandas-dev] Arithmetic Proposal

Tom Augspurger tom.augspurger88 at gmail.com
Wed Jun 12 11:56:02 EDT 2019


On Wed, Jun 12, 2019 at 9:46 AM Brock Mendel <jbrockmendel at gmail.com> wrote:

> TL;DR:
>
> > So while I agree that Blocks being backed by a maybe 1D / maybe 2D array
> > causes no end of headaches
>
> For readers who don't find the performance issue compelling, the bugs and
> complexity this addresses should be compelling.
>
> --------
>
> > Could we instead dispatch both Series and DataFrame ops to Block ops
> > (which then do the op on the ndarray or dispatch to the EA)?
>
> @TomAugspurger Yes, though as mentioned in the OP, my attempts so far to
> make this work have failed.
>
> This suggestion boils down to effectively implementing these ops on Block,
> which is the opposite of the direction we want to be taking the Block
> classes.  In terms of Separation of Concerns it makes much more sense for
> the array-like operations to be defined on a dedicated array class, in this
> case PandasArray.
>

I think we're in agreement here.

> Moreover, implementing them on PandasArray gives us "for free" consistency
> between Series/DataFrame, Index, and PandasArray ops, whereas implementing
> them on Block gives only Series/DataFrame consistency.
>
> > 10% performance boost for those (I’m taking that figure from one of your
> > comments in #24990).
>
> @WillAyd that comment referred to the cost of instantiating the DataFrame,
> not the arithmetic op.  Earlier in that same comment I refer to the
> arithmetic op as being 10x slower, not 10% slower.
>
> > I’ve done zero work with blocks and I think they definitely come at an
> > extra development / maintenance cost.
>
> I've done a bunch of work with blocks, mostly trying to get code _out_ of
> them.  Ignore the entire performance issue: allowing EA to be 2D (heck,
> even restricted to (1, N) and (N, 1) would be enough!) would let us rip out
> so much (buggy) code I'll shed tears of joy.
>
>
Stepping back a bit, I see two potential issues we'd like to solve:

1. The current structure of

- Container (DataFrame, Series, Index) ->
  - Block (DataFrame / Series only) ->
    - Array (ndarray or EA)

is bad for two reasons. First, Indexes don't have Blocks, which argues for
putting more functionality on the Array so that code is shared between all
the containers. Second, the Array can be an ndarray or an EA, and they're
different enough that an EA isn't a drop-in replacement for an ndarray.

2. Arrays being either 1D or 2D causes many issues.
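
To illustrate issue 2, here is a minimal sketch of the kind of branching a
maybe-1D / maybe-2D backing array forces on its holder. It is not taken from
the pandas code base: `OneDimOnlyArray` and `block_shape` are hypothetical
names invented for this example, and only NumPy is assumed.

```python
import numpy as np


class OneDimOnlyArray:
    """Hypothetical stand-in for an array type restricted to 1D (not a pandas class)."""

    def __init__(self, values):
        values = np.asarray(values)
        if values.ndim != 1:
            raise ValueError("this array type only supports 1D data")
        self._values = values

    @property
    def ndim(self):
        return 1

    def __len__(self):
        return len(self._values)


def block_shape(values, n_columns):
    """Return the (n_columns, n_rows) shape a block-like holder would report.

    Hypothetical helper: it exists only to show the special-casing that
    maybe-1D / maybe-2D values require.
    """
    if values.ndim == 1:
        # A 1D-only array can back exactly one column, so the 2D shape
        # has to be faked instead of being read off the values.
        if n_columns != 1:
            raise ValueError("1D values can only hold a single column")
        return (1, len(values))
    # A 2D ndarray already carries the (n_columns, n_rows) shape directly.
    if values.shape[0] != n_columns:
        raise ValueError("number of columns does not match the values")
    return values.shape


shape_2d = block_shape(np.zeros((3, 4)), n_columns=3)
shape_1d = block_shape(OneDimOnlyArray([1, 2, 3, 4]), n_columns=1)
```

If every array could be 2D, the first branch and its checks would not be
needed.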

A few questions:

Q1: Do those two issues accurately capture your concerns as well?
Q2: Can you clarify: with 2D EAs would *all* EAs stored within pandas be 2D
internally (and Series / Index would squeeze before data gets back to the
user)? Otherwise, I don't see how we get the internal simplification.
Q3: What do you think about a simple, private PandasArray-like thing that
*is* allowed to be 2D, and itself wraps a 2D ndarray? That solves my
problem 1, but doesn't address problem 2.
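
To make Q3 concrete, here is a rough sketch of what such a wrapper might
look like. Everything in it is an assumption for illustration:
`TwoDCapableArray` and `_binop` are made-up names, not existing pandas API,
and a real version would be private and would need comparisons, reductions,
and dtype handling. The point is only that the ops live in one place and
that the size-related attributes are derived from `shape`.

```python
import operator

import numpy as np


class TwoDCapableArray:
    """Hypothetical wrapper around a 1D or 2D ndarray (not pandas API)."""

    def __init__(self, values):
        values = np.asarray(values)
        if values.ndim not in (1, 2):
            raise ValueError("expected 1D or 2D data")
        self._ndarray = values

    # Size-related attributes are all derived from ``shape``.
    @property
    def shape(self):
        return self._ndarray.shape

    @property
    def ndim(self):
        return len(self.shape)

    def __len__(self):
        return self.shape[0]

    @property
    def T(self):
        return type(self)(self._ndarray.T)

    def reshape(self, *shape):
        return type(self)(self._ndarray.reshape(*shape))

    def ravel(self):
        return type(self)(self._ndarray.ravel())

    def _binop(self, other, op):
        # Single shared implementation; containers would delegate here.
        if isinstance(other, TwoDCapableArray):
            other = other._ndarray
        return type(self)(op(self._ndarray, other))

    def __add__(self, other):
        return self._binop(other, operator.add)

    def __sub__(self, other):
        return self._binop(other, operator.sub)

    def __mul__(self, other):
        return self._binop(other, operator.mul)


# A 2D "block" of two columns with three rows each, and a single 1D column.
block_values = TwoDCapableArray(np.arange(6).reshape(2, 3))
column_values = TwoDCapableArray(np.arange(3))

# One op over the whole 2D array; NumPy broadcasts the 1D operand.
summed = block_values + column_values
```

Since this only wraps ndarray, it shares code between the containers
(problem 1) but leaves the 1D-only EAs as they are (problem 2).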

Tom




> On Wed, Jun 12, 2019 at 7:53 AM William Ayd via Pandas-dev <
> pandas-dev at python.org> wrote:
>
>> I’m wary of expanding operations done at the Block level. As a core
>> developer for over a year now, I’ve done zero work with blocks and I think
>> they definitely come at an extra development / maintenance cost.
>>
>> I think wide DataFrames are the exception rather than the norm so it’s
>> probably not worth code to eke out a 10% performance boost for those (I’m
>> taking that figure from one of your comments in #24990).
>>
>> - Will
>>
>> On Jun 11, 2019, at 10:08 PM, Tom Augspurger <tom.augspurger88 at gmail.com>
>> wrote:
>>
>> One general question, motivated by Joris' same concern about the future
>> simplified BlockManager: why do block-based, rather than column-based, ops
>> require 2D Extension Arrays? You say
>>
>> > by making DataFrame arithmetic ops operate column-by-column, dispatching
>> > to the Series implementations.
>>
>> Could we instead dispatch both Series and DataFrame ops to Block ops (which
>> then do the op on the ndarray or dispatch to the EA)? If I understand your
>> proposal correctly, then you still have the general DataFrame -> Block ->
>> Array nesting doll. It seems like that should work equally well with our
>> current mix of 2-D and 1-D blocks.
>>
>> So while I agree that Blocks being backed by a maybe 1D / maybe 2D array
>> causes no end of headaches, I don't see why block-based ops need 2D EAs
>> (though I'm not especially familiar with this area; I could easily be
>> missing something basic).
>>
>> - Tom
>>
>> On Tue, Jun 11, 2019 at 5:25 PM Stephan Hoyer <shoyer at gmail.com> wrote:
>>
>>> Indeed, it's worth considering if perhaps it would be OK to have a
>>> performance regression for very wide dataframes instead.
>>>
>>> With regards to xarray, 2D extension arrays are interesting but still
>>> not particularly helpful. We would still need a wrapper to make them fully
>>> N-D, which we need for our data model.
>>>
>>> On Tue, Jun 11, 2019 at 6:18 PM Joris Van den Bossche <
>>> jorisvandenbossche at gmail.com> wrote:
>>>
>>>> Hi Brock,
>>>>
>>>> Thanks a lot for starting this discussion and the detailed proposal!
>>>>
>>>> I will try to look at it in more detail tomorrow, but one general
>>>> remark: from time to time, we talked about "getting rid of the
>>>> BlockManager" or "simplifying the BlockManager" (although I am not sure if
>>>> there is any specific github issue about it, might be from in-person
>>>> discussions). One of the interpretations of that (or at least how I
>>>> understood those discussions) was to get away from the 2D block based
>>>> internals, and go to a simpler "table as collection of 1D arrays" model.
>>>> This would also enable a simplification of the internals / BlockManager and
>>>> many of the other items you mention.
>>>>
>>>> So I think we should at least compare a more detailed version of what I
>>>> described above against your proposal. If we want to go in that
>>>> direction long term, I am not sure extensive work on the current 2D
>>>> blocks-based BlockManager is worth our time.
>>>>
>>>> Joris
>>>>
>>>> Op di 11 jun. 2019 om 22:38 schreef Brock Mendel <
>>>> jbrockmendel at gmail.com>:
>>>>
>>>>> I've been working on arithmetic/comparison bugs and more recently on
>>>>> performance problems caused by fixing some of those bugs.  After trying
>>>>> less-invasive approaches, I've concluded a fairly big fix is called for.
>>>>> This is an RFC for that proposed fix.
>>>>>
>>>>> ------
>>>>> In 0.24.0 we fixed some arithmetic bugs in DataFrame operations by
>>>>> making DataFrame arithmetic ops operate column-by-column, dispatching to
>>>>> the Series implementations.  This led to a significant performance hit for
>>>>> operations on DataFrames with many columns (#24990, #26061).
>>>>>
>>>>> To restore the lost performance, we need to have these operations take
>>>>> place at the Block level.  To prevent DataFrame behavior from diverging
>>>>> from Series behavior (again), we need to retain a single shared
>>>>> implementation.
>>>>>
>>>>> This is a proposal for how to meet these two needs.
>>>>>
>>>>> Proposal:
>>>>> - Allow EA to support 2D arrays
>>>>> - Use PandasArray to back Block subclasses currently backed by ndarray
>>>>> - Implement arithmetic and comparison ops directly on PandasArray,
>>>>> then have Series, DataFrame, and Index ops pass through to the PandasArray
>>>>> implementations.
>>>>>
>>>>> Fixes:
>>>>> - Performance degradation in DataFrame ops (#24990, #26061)
>>>>> - The last remaining inconsistencies between Index and Series ops
>>>>> (#19322, #18824)
>>>>> - Most of the xfailing arithmetic tests
>>>>> - #22120: Transposing dataframe loses dtype and ExtensionArray
>>>>> - #24600 BUG: DataFrame[Sparse] quantile fails because SparseArray has
>>>>> no reshape
>>>>> - #23925 DataFrame Quantile Broken with Datetime Data
>>>>>
>>>>> Other Upsides:
>>>>> - Series constructor could dispatch to pd.array, de-duplicating a lot
>>>>> of code.
>>>>> - Easier to move to Arrow backend if Blocks are numpy-naive.
>>>>> - Make EA closer to a drop-in replacement for np.ndarray, necessary if
>>>>> we want e.g. xarray to find them directly useful (#24716, #24583)
>>>>> - Block/BlockManager simplifications, see below.
>>>>>
>>>>> Downsides:
>>>>> - Existing constructors assume 1D
>>>>> - Existing downstream authors assume 1D
>>>>> - Reduction ops (of which there aren't many) don't have axis kwarg ATM
>>>>>    - But for PandasArray they just pass through to nanops, which
>>>>> already have+test the axis kwargs
>>>>>    - For DatetimeArray/TimedeltaArray/PeriodArray, I'm the one
>>>>> implementing the reductions and am OK with this extra complication.
>>>>>
>>>>> Block Simplifications:
>>>>> - Blocks have three main attributes: values, mgr_locs, and ndim
>>>>> - ndim is _usually_ the same as values.ndim, the exceptions being for
>>>>> cases where type(values) is restricted to 1D
>>>>> - Without these restrictions, we can get rid of:
>>>>>    - Block.ndim, associated kludgy ndim-checking code
>>>>>    - numerous can-this-be-reshaped/transposed checks and special cases
>>>>> in Block and BlockManager code (which are buggy anyway, e.g. #23925)
>>>>> - With ndim gone, we can then get rid of mgr_locs!
>>>>>    - The blocks themselves never use mgr_locs except when passing to
>>>>> their own constructors.
>>>>>    - mgr_locs makes _much_ more sense as an attribute of the
>>>>> BlockManager
>>>>> - With mgr_locs gone, Block becomes just a thin wrapper around an EA
>>>>>
>>>>> Implementation Strategy:
>>>>> - Remove the 1D restriction
>>>>>    - Fairly small tweak: EA subclass must define `shape` instead of
>>>>> `__len__`; other attrs are defined in terms of shape.
>>>>>    - Define `transpose`, `T`, `reshape`, and `ravel`
>>>>> - With this done, several tasks can proceed in parallel:
>>>>>    - simplifications in core.internals, as special-cases for 1D-only
>>>>> can be removed
>>>>>    - implement and test arithmetic ops on PandasArray
>>>>>    - back Blocks with PandasArray
>>>>>    - back Index (and numeric subclasses) with PandasArray
>>>>> - Change DataFrame, Series, Index ops to pass through to underlying
>>>>> Blocks/PandasArrays
>>>>> _______________________________________________
>>>>> Pandas-dev mailing list
>>>>> Pandas-dev at python.org
>>>>> https://mail.python.org/mailman/listinfo/pandas-dev