[Pandas-dev] Arithmetic Proposal

Wed Jun 12 08:53:22 EDT 2019

I’m wary of expanding the operations done at the Block level. In over a year as a core developer, I’ve done zero work with Blocks, and I think they definitely come with an extra development / maintenance cost.

I think wide DataFrames are the exception rather than the norm, so it’s probably not worth the extra code to eke out a 10% performance boost for those (I’m taking that figure from one of your comments in #24990).

- Will

> On Jun 11, 2019, at 10:08 PM, Tom Augspurger <tom.augspurger88 at gmail.com> wrote:
> 
> One general question, motivated by Joris' same concern about the future
> simplified BlockManager: why do block-based, rather than column-based, ops
> require 2D Extension Arrays? You say
> 
> > by making DataFrame arithmetic ops operate column-by-column, dispatching to
> > the Series implementations.
> 
> Could we instead dispatch both Series and DataFrame ops to Block ops (which then
> do the op on the ndarray or dispatch to the EA)? If I understand your proposal
> correctly, then you still have the general DataFrame -> Block -> Array nesting
> doll. It seems like that should work equally well with our current mix of 2-D
> and 1-D blocks.
> 
> So while I agree that Blocks being backed by a maybe 1D / maybe 2D array causes
> no end of headaches, I don't see why block-based ops need 2D EAs (though I'm not
> especially familiar with this area; I could easily be missing something basic).
> 
> - Tom
> 
> On Tue, Jun 11, 2019 at 5:25 PM Stephan Hoyer <shoyer at gmail.com> wrote:
> Indeed, it's worth considering if perhaps it would be OK to have a performance regression for very wide dataframes instead.
> 
> With regard to xarray, 2D extension arrays are interesting but still not particularly helpful. We would still need a wrapper to make them fully N-D, which we need for our data model.
> 
> On Tue, Jun 11, 2019 at 6:18 PM Joris Van den Bossche <jorisvandenbossche at gmail.com> wrote:
> Hi Brock,
> 
> Thanks a lot for starting this discussion and the detailed proposal!
> 
> I will try to look at it in more detail tomorrow, but one general remark: from time to time, we have talked about "getting rid of the BlockManager" or "simplifying the BlockManager" (although I am not sure there is a specific GitHub issue about it; it might be from in-person discussions). One of the interpretations of that (or at least how I understood those discussions) was to get away from the 2D block-based internals and go to a simpler "table as collection of 1D arrays" model. This would also enable a simplification of the internals / BlockManager and many of the other items you mention.
> 
> So I think we should at least compare a more detailed version of what I described above against your proposal. If we want to go in that direction long term, I am not sure extensive work on the current 2D block-based BlockManager is worth our time.
> 
> Joris
> 
> On Tue, Jun 11, 2019 at 22:38, Brock Mendel <jbrockmendel at gmail.com> wrote:
> I've been working on arithmetic/comparison bugs and more recently on performance problems caused by fixing some of those bugs.  After trying less-invasive approaches, I've concluded a fairly big fix is called for.  This is an RFC for that proposed fix.
> 
> ------
> In 0.24.0 we fixed some arithmetic bugs in DataFrame operations by making DataFrame arithmetic ops operate column-by-column, dispatching to the Series implementations.  This led to a significant performance hit for operations on DataFrames with many columns (#24990, #26061).
> 
> To restore the lost performance, we need to have these operations take place
> at the Block level.  To prevent DataFrame behavior from diverging from Series
> behavior (again), we need to retain a single shared implementation.
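As a rough illustration of the two requirements above (this is not code from pandas itself), the sketch below uses plain NumPy arrays to stand in for a wide, homogeneous DataFrame. The function names `array_add`, `add_column_by_column`, and `add_block_wise` are hypothetical, and the (rows, columns) layout is a simplification. Both paths call the same shared implementation; they differ only in how many times they call it.

```python
import numpy as np


def array_add(left, right):
    """Single shared implementation; works for 1D columns and 2D blocks alike."""
    return np.add(left, right)


def add_column_by_column(block, other):
    """Call the shared op once per column, as the column-by-column path would."""
    columns = [array_add(block[:, i], other[:, i]) for i in range(block.shape[1])]
    return np.column_stack(columns)


def add_block_wise(block, other):
    """Call the shared op once on the whole 2D block."""
    return array_add(block, other)


# A "wide" homogeneous frame: few rows, many columns.
rng = np.random.default_rng(0)
left = rng.random((4, 1000))
right = rng.random((4, 1000))

by_column = add_column_by_column(left, right)  # 1000 calls to array_add
by_block = add_block_wise(left, right)         # 1 call to array_add
```

The results are identical; the per-call overhead of the column-by-column path is what grows with the number of columns.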
> 
> This is a proposal for how to meet these two needs.
> 
> Proposal:
> - Allow EA to support 2D arrays
> - Use PandasArray to back Block subclasses currently backed by ndarray
> - Implement arithmetic and comparison ops directly on PandasArray, then have Series, DataFrame, and Index ops pass through to the PandasArray implementations.
> 
> Fixes:
> - Performance degradation in DataFrame ops (#24990, #26061)
> - The last remaining inconsistencies between Index and Series ops (#19322, #18824)
> - Most of the xfailing arithmetic tests
> - #22120: Transposing dataframe loses dtype and ExtensionArray
> - #24600 BUG: DataFrame[Sparse] quantile fails because SparseArray has no reshape
> - #23925 DataFrame Quantile Broken with Datetime Data
> 
> Other Upsides:
> - Series constructor could dispatch to pd.array, de-duplicating a lot of code.
> - Easier to move to Arrow backend if Blocks are numpy-naive.
> - Make EA closer to a drop-in replacement for np.ndarray, necessary if we want e.g. xarray to find them directly useful (#24716, #24583)
> - Block/BlockManager simplifications, see below.
> 
> Downsides:
> - Existing constructors assume 1D
> - Existing downstream authors assume 1D
> - Reduction ops (of which there aren't many) don't have axis kwarg ATM
>   	- But for PandasArray they just pass through to nanops, which already have and test the axis kwargs
>   	- For DatetimeArray/TimedeltaArray/PeriodArray, I'm the one implementing the reductions and am OK with this extra complication.
> 
> Block Simplifications:
> - Blocks have three main attributes: values, mgr_locs, and ndim
> - ndim is _usually_ the same as values.ndim, the exceptions being for cases where type(values) is restricted to 1D
> - Without these restrictions, we can get rid of:
>   	- Block.ndim, associated kludgy ndim-checking code
>   	- numerous can-this-be-reshaped/transposed checks and special cases in Block and BlockManager code (which are buggy anyway, e.g. #23925)
> - With ndim gone, we can then get rid of mgr_locs!
>   	- The blocks themselves never use mgr_locs except when passing to their own constructors.
>   	- mgr_locs makes _much_ more sense as an attribute of the BlockManager
> - With mgr_locs gone, Block becomes just a thin wrapper around an EA
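To make the end state of these simplifications a little more concrete, here is a toy sketch, assuming the placement information lives on the manager and the block holds only values. `ThinBlock` and `ToyManager` are hypothetical names for illustration only; they are not pandas classes, and the (rows, columns) layout is a simplification.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class ThinBlock:
    """Hypothetical Block: nothing but the (possibly 2D) values."""

    values: np.ndarray


class ToyManager:
    """Hypothetical manager that owns the column placement (mgr_locs)."""

    def __init__(self, blocks, mgr_locs):
        # mgr_locs[i] lists the frame column positions held by blocks[i],
        # in the same order as the columns of blocks[i].values.
        self.blocks = blocks
        self.mgr_locs = mgr_locs

    def get_column(self, position):
        """Return the 1D values for the frame column at `position`."""
        for block, locs in zip(self.blocks, self.mgr_locs):
            if position in locs:
                return block.values[:, locs.index(position)]
        raise IndexError(position)


floats = ThinBlock(np.array([[1.0, 2.0], [3.0, 4.0]]))  # frame columns 0 and 2
ints = ThinBlock(np.array([[10], [20]]))                 # frame column 1
manager = ToyManager([floats, ints], mgr_locs=[[0, 2], [1]])

middle = manager.get_column(1)
last = manager.get_column(2)
```

In this sketch the block never needs to know where its columns sit in the frame; only the manager does.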
> 
> Implementation Strategy:
> - Remove the 1D restriction
>   	- Fairly small tweak: an EA subclass must define `shape` instead of `__len__`; other attrs are defined in terms of shape.
>   	- Define `transpose`, `T`, `reshape`, and `ravel`
> - With this done, several tasks can proceed in parallel:
>   	- simplifications in core.internals, as special-cases for 1D-only can be removed
>   	- implement and test arithmetic ops on PandasArray
>   	- back Blocks with PandasArray
>   	- back Index (and numeric subclasses) with PandasArray
> - Change DataFrame, Series, Index ops to pass through to underlying Blocks/PandasArrays
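The first step of the strategy above ("define `shape` instead of `__len__`") could look roughly like the following sketch. `ShapeFirstArray` is a hypothetical class wrapping a NumPy array; it is not the pandas ExtensionArray interface, and it only illustrates deriving `__len__`, `ndim`, and `size` from `shape` and adding `transpose`, `T`, `reshape`, and `ravel`.

```python
import numpy as np


class ShapeFirstArray:
    """Hypothetical array wrapper whose only required size attribute is `shape`."""

    def __init__(self, data):
        self._data = np.asarray(data)

    @property
    def shape(self):
        return self._data.shape

    # Everything below is derived from `shape`.
    def __len__(self):
        return self.shape[0]

    @property
    def ndim(self):
        return len(self.shape)

    @property
    def size(self):
        return int(np.prod(self.shape))

    # Reshaping methods named in the proposal; each returns a new wrapper.
    def reshape(self, *shape):
        return type(self)(self._data.reshape(*shape))

    def transpose(self):
        return type(self)(self._data.transpose())

    @property
    def T(self):
        return self.transpose()

    def ravel(self):
        return type(self)(self._data.ravel())


arr = ShapeFirstArray([[1, 2, 3], [4, 5, 6]])
```

A 1D-only array remains a valid special case here: its `shape` is simply a 1-tuple, and `len()` keeps its current meaning.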
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev
