[Pandas-dev] Arithmetic Proposal

Wed Jun 12 12:17:44 EDT 2019

The reason we have Blocks in the first place is that performance on same
dtypes is much better on 2D containers. I don't think we can match that
by holding everything as 1D and dispatching ops via Python. So longer
term the solution is either to use a table of 1D arrays with pyarrow (as
the holder, with kernels to operate on them), or to hold as 1D and use
something like numba to perform the kernel operations.

So we either need to keep the Blocks around as a way to hold things and do
block operations (i.e. we actually hold things as 2D), or change to holding
1D as indicated above.

Now we currently have a hybrid approach: numpy-array-backed blocks are 2D,
while EAs are 1D.

I fully agree this hybrid approach is, and has been, the cause of many
issues.
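
To make the hybrid concrete, here is a minimal sketch of the kind of
ndim branching it forces on internal code. The helper name
`column_from_block_values` is hypothetical (it is not a pandas function),
and a plain 1D ndarray stands in for an EA; the only layout assumed is
that ndarray-backed block values are shaped (n_columns, n_rows).

```python
import numpy as np


def column_from_block_values(values, i):
    """Return column i of a block's values (hypothetical helper).

    Assumes 2D values are laid out as (n_columns, n_rows) and that
    1D values hold exactly one column.
    """
    if values.ndim == 2:
        # ndarray-backed case: each row of `values` is one column
        return values[i]
    # 1D-backed case (stand-in for an EA): only column 0 exists
    if i != 0:
        raise IndexError("1D-backed values hold a single column")
    return values


two_d = np.arange(6).reshape(2, 3)  # two columns, three rows each
one_d = np.array([10, 11, 12])      # one column, three rows

second_column = column_from_block_values(two_d, 1)
only_column = column_from_block_values(one_d, 0)
```

Every code path that touches block values needs a branch like this one,
which is where the special cases pile up.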

Our contract on EA is 1D and I agree that changing this is not a good idea
(at least publicly).

So here's another proposal (a bit half-baked but....):

You *could* build a single-dtyped container that actually holds the 1D
arrays themselves. Then you could put EA arrays and numpy arrays on the
same footing, meaning each 'Block' would be exactly the same.

- This would make operations the *same* across all 'Blocks', reducing
complexity
- We could simply take views on 2D numpy arrays to avoid the performance
penalty of copying (as we construct from a 2D numpy array a lot); this
causes some aggregation ops to be much slower than if we actually copy,
but copying has a cost too
- Ops could be defined on EA & Pandas Arrays; these can then operate
array-by-array (within a Block), or using numba we could implement the ops
in a way that we can get a pretty big speedup for a particular kernel
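
A rough sketch of what I mean, using only numpy. The `ColumnBlock` class
and its method names are made up for illustration and are not existing
pandas code; the 2D input here is laid out as (n_rows, n_columns), so
each column is a strided view.

```python
import numpy as np


class ColumnBlock:
    """Hypothetical single-dtype container holding a list of 1D arrays."""

    def __init__(self, columns):
        self.columns = list(columns)
        dtypes = {col.dtype for col in self.columns}
        if len(dtypes) > 1:
            raise TypeError("all columns must share one dtype")

    @classmethod
    def from_2d(cls, values):
        # Basic slicing returns views, so no data is copied here
        return cls([values[:, i] for i in range(values.shape[1])])

    def binary_op(self, op, other):
        # Apply the op array-by-array within the block
        return ColumnBlock([op(col, other) for col in self.columns])


values = np.arange(6, dtype="float64").reshape(3, 2)
block = ColumnBlock.from_2d(values)
result = block.binary_op(np.add, 1.0)

# The columns alias the original 2D buffer instead of copying it
shares = all(np.shares_memory(col, values) for col in block.columns)
# A column view of a C-ordered 2D array is strided, not contiguous
contiguous = block.columns[0].flags["C_CONTIGUOUS"]
```

The non-contiguous views are the trade-off from the second bullet: we
skip the copy at construction, but later ops walk strided memory.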

Jeff

On Wed, Jun 12, 2019 at 11:56 AM Tom Augspurger <tom.augspurger88 at gmail.com>
wrote:

>
>
> On Wed, Jun 12, 2019 at 9:46 AM Brock Mendel <jbrockmendel at gmail.com>
> wrote:
>
>> TL;DR:
>>
>> > So while I agree that Blocks being backed by a maybe 1D / maybe 2D
>> > array causes no end of headaches
>>
>> For readers who don't find the performance issue compelling, the bugs and
>> complexity this addresses should be compelling.
>>
>> --------
>>
>> > Could we instead dispatch both Series and DataFrame ops to Block ops
>> > (which then do the op on the ndarray or dispatch to the EA)?
>>
>> @TomAugspurger Yes, though as mentioned in the OP, my attempts so far to
>> make this work have failed.
>>
>> This suggestion boils down to effectively implementing these ops on
>> Block, which is the opposite of the direction we want to be taking the
>> Block classes.  In terms of Separation of Concerns it makes much more sense
>> for the array-like operations to be defined on a dedicated array class, in
>> this case PandasArray.
>>
>
> I think we're in agreement here.
>
>> Moreover, implementing them on PandasArray gives us "for free" consistency
>> between Series/DataFrame, Index, and PandasArray ops, whereas implementing
>> them on Block gives only Series/DataFrame consistency.
>>
>> > 10% performance boost for those (I’m taking that figure from one of
>> > your comments in #24990).
>>
>> @WillAyd that comment referred to the cost of instantiating the
>> DataFrame, not the arithmetic op.  Earlier in that same comment I refer to
>> the arithmetic op as being 10x slower, not 10% slower.
>>
>> > I’ve done zero work with blocks and I think they definitely come at an
>> > extra development / maintenance cost.
>>
>> I've done a bunch of work with blocks, mostly trying to get code _out_ of
>> them.  Ignore the entire performance issue: allowing EA to be 2D (heck,
>> even restricted to (1, N) and (N, 1) would be enough!) would let us rip out
>> so much (buggy) code I'll shed tears of joy.
>>
>>
> Stepping back a bit, I see two potential issues we'd like to solve
>
> 1. The current structure of
>
> - Container (dataframe, series, index) ->
>   - Block (DataFrame / Series only) ->
>   - Array (ndarray or EA)
>
> is bad for two reasons: first, Indexes don't have Blocks; this argues for
> putting more functionality on the Array, to share code between all the
> containers; second, Array can be an ndarray or an EA. They're different
> enough that EA isn't a drop-in replacement for ndarray.
>
> 2. Arrays being either 1D or 2D causes many issues.
>
> A few questions
>
> Q1: Do those two issues accurately capture your concerns as well?
> Q2: Can you clarify: with 2D EAs would *all* EAs stored within pandas be
> 2D internally (and Series / Index would squeeze before data gets back to
> the user)? Otherwise, I don't see how we get the internal simplification.
> Q3: What do you think about a simple, private PandasArray-like thing that
> *is* allowed to be 2D, and itself wraps a 2D ndarray? That solves my
> problem 1, but doesn't address problem 2.
>
> Tom
>
>
>
>
>> On Wed, Jun 12, 2019 at 7:53 AM William Ayd via Pandas-dev <
>> pandas-dev at python.org> wrote:
>>
>>> I’m wary of expanding operations done at the Block level. As a core
>>> developer for over a year now, I’ve done zero work with blocks and I think
>>> they definitely come at an extra development / maintenance cost.
>>>
>>> I think wide DataFrames are the exception rather than the norm so it’s
>>> probably not worth code to eke out a 10% performance boost for those (I’m
>>> taking that figure from one of your comments in #24990).
>>>
>>> - Will
>>>
>>> On Jun 11, 2019, at 10:08 PM, Tom Augspurger <tom.augspurger88 at gmail.com>
>>> wrote:
>>>
>>> One general question, motivated by Joris' same concern about the future
>>> simplified BlockManager: why do block-based, rather than column-based,
>>> ops require 2D Extension Arrays? You say
>>>
>>> > by making DataFrame arithmetic ops operate column-by-column,
>>> > dispatching to the Series implementations.
>>>
>>> Could we instead dispatch both Series and DataFrame ops to Block ops
>>> (which then do the op on the ndarray or dispatch to the EA)? If I
>>> understand your proposal correctly, then you still have the general
>>> DataFrame -> Block -> Array nesting doll. It seems like that should work
>>> equally well with our current mix of 2-D and 1-D blocks.
>>>
>>> So while I agree that Blocks being backed by a maybe 1D / maybe 2D array
>>> causes no end of headaches, I don't see why block-based ops need 2D EAs
>>> (though I'm not especially familiar with this area; I could easily be
>>> missing something basic).
>>>
>>> - Tom
>>>
>>> On Tue, Jun 11, 2019 at 5:25 PM Stephan Hoyer <shoyer at gmail.com> wrote:
>>>
>>>> Indeed, it's worth considering if perhaps it would be OK to have a
>>>> performance regression for very wide dataframes instead.
>>>>
>>>> With regards to xarray, 2D extension arrays are interesting but still
>>>> not particularly helpful. We would still need a wrapper to make them fully
>>>> N-D, which we need for our data model.
>>>>
>>>> On Tue, Jun 11, 2019 at 6:18 PM Joris Van den Bossche <
>>>> jorisvandenbossche at gmail.com> wrote:
>>>>
>>>>> Hi Brock,
>>>>>
>>>>> Thanks a lot for starting this discussion and the detailed proposal!
>>>>>
>>>>> I will try to look at it in more detail tomorrow, but one general
>>>>> remark: from time to time, we talked about "getting rid of the
>>>>> BlockManager" or "simplifying the BlockManager" (although I am not sure if
>>>>> there is any specific github issue about it, might be from in-person
>>>>> discussions). One of the interpretations of that (or at least how I
>>>>> understood those discussions) was to get away from the 2D block-based
>>>>> internals, and go to a simpler "table as collection of 1D arrays" model.
>>>>> This would also enable a simplification of the internals / BlockManager and
>>>>> many of the other items you mention.
>>>>>
>>>>> So I think we should at least compare a more detailed version of what
>>>>> I described above against your proposal. If we want to go in that
>>>>> direction long term, I am not sure extensive work on the current 2D
>>>>> blocks-based BlockManager is worth our time.
>>>>>
>>>>> Joris
>>>>>
>>>>> Op di 11 jun. 2019 om 22:38 schreef Brock Mendel <
>>>>> jbrockmendel at gmail.com>:
>>>>>
>>>>>> I've been working on arithmetic/comparison bugs and more recently on
>>>>>> performance problems caused by fixing some of those bugs.  After trying
>>>>>> less-invasive approaches, I've concluded a fairly big fix is called for.
>>>>>> This is an RFC for that proposed fix.
>>>>>>
>>>>>> ------
>>>>>> In 0.24.0 we fixed some arithmetic bugs in DataFrame operations by
>>>>>> making DataFrame arithmetic ops operate column-by-column, dispatching to
>>>>>> the Series implementations.  This led to a significant performance hit for
>>>>>> operations on DataFrames with many columns (#24990, #26061).
>>>>>>
>>>>>> To restore the lost performance, we need to have these operations
>>>>>> take place at the Block level.  To prevent DataFrame behavior from
>>>>>> diverging from Series behavior (again), we need to retain a single
>>>>>> shared implementation.
>>>>>>
>>>>>> This is a proposal for how to meet these two needs.
>>>>>>
>>>>>> Proposal:
>>>>>> - Allow EA to support 2D arrays
>>>>>> - Use PandasArray to back Block subclasses currently backed by ndarray
>>>>>> - Implement arithmetic and comparison ops directly on PandasArray,
>>>>>> then have Series, DataFrame, and Index ops pass through to the PandasArray
>>>>>> implementations.
>>>>>>
>>>>>> Fixes:
>>>>>> - Performance degradation in DataFrame ops (#24990, #26061)
>>>>>> - The last remaining inconsistencies between Index and Series ops
>>>>>> (#19322, #18824)
>>>>>> - Most of the xfailing arithmetic tests
>>>>>> - #22120: Transposing dataframe loses dtype and ExtensionArray
>>>>>> - #24600 BUG: DataFrame[Sparse] quantile fails because SparseArray
>>>>>> has no reshape
>>>>>> - #23925 DataFrame Quantile Broken with Datetime Data
>>>>>>
>>>>>> Other Upsides:
>>>>>> - Series constructor could dispatch to pd.array, de-duplicating a lot
>>>>>> of code.
>>>>>> - Easier to move to Arrow backend if Blocks are numpy-naive.
>>>>>> - Make EA closer to a drop-in replacement for np.ndarray, necessary
>>>>>> if we want e.g. xarray to find them directly useful (#24716, #24583)
>>>>>> - Block/BlockManager simplifications, see below.
>>>>>>
>>>>>> Downsides:
>>>>>> - Existing constructors assume 1D
>>>>>> - Existing downstream authors assume 1D
>>>>>> - Reduction ops (of which there aren't many) don't have axis kwarg ATM
>>>>>>    - But for PandasArray they just pass through to nanops, which
>>>>>> already have+test the axis kwargs
>>>>>>    - For DatetimeArray/TimedeltaArray/PeriodArray, I'm the one
>>>>>> implementing the reductions and am OK with this extra complication.
>>>>>>
>>>>>> Block Simplifications:
>>>>>> - Blocks have three main attributes: values, mgr_locs, and ndim
>>>>>> - ndim is _usually_ the same as values.ndim, the exceptions being for
>>>>>> cases where type(values) is restricted to 1D
>>>>>> - Without these restrictions, we can get rid of:
>>>>>>    - Block.ndim, associated kludgy ndim-checking code
>>>>>>    - numerous can-this-be-reshaped/transposed checks and special
>>>>>> cases in Block and BlockManager code (which are buggy anyway, e.g. #23925)
>>>>>> - With ndim gone, we can then get rid of mgr_locs!
>>>>>>    - The blocks themselves never use mgr_locs except when passing to
>>>>>> their own constructors.
>>>>>>    - mgr_locs makes _much_ more sense as an attribute of the
>>>>>> BlockManager
>>>>>> - With mgr_locs gone, Block becomes just a thin wrapper around an EA
>>>>>>
>>>>>> Implementation Strategy:
>>>>>> - Remove the 1D restriction
>>>>>>    - Fairly small tweak, EA subclass must define `shape` instead of
>>>>>> `__len__`; other attrs are defined in terms of shape.
>>>>>>    - Define `transpose`, `T`, `reshape`, and `ravel`
>>>>>> - With this done, several tasks can proceed in parallel:
>>>>>>    - simplifications in core.internals, as special-cases for 1D-only
>>>>>> can be removed
>>>>>>    - implement and test arithmetic ops on PandasArray
>>>>>>    - back Blocks with PandasArray
>>>>>>    - back Index (and numeric subclasses) with PandasArray
>>>>>> - Change DataFrame, Series, Index ops to pass through to underlying
>>>>>> Blocks/PandasArrays
>>>>>> _______________________________________________
>>>>>> Pandas-dev mailing list
>>>>>> Pandas-dev at python.org
>>>>>> https://mail.python.org/mailman/listinfo/pandas-dev
>>>>>>