[Pandas-dev] Arithmetic Proposal

Tue Jun 11 16:38:18 EDT 2019

I've been working on arithmetic/comparison bugs and more recently on
performance problems caused by fixing some of those bugs.  After trying
less-invasive approaches, I've concluded a fairly big fix is called for.
This is an RFC for that proposed fix.

------
In 0.24.0 we fixed some arithmetic bugs in DataFrame operations by making
DataFrame arithmetic ops operate column-by-column, dispatching to the
Series implementations.  This led to a significant performance hit for
operations on DataFrames with many columns (#24990, #26061).

To restore the lost performance, we need to have these operations take place
at the Block level.  To prevent DataFrame behavior from diverging from Series
behavior (again), we need to retain a single shared implementation.
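
To make the cost difference concrete, here is a rough sketch using plain
NumPy rather than pandas internals.  The names `add_columnwise` and
`add_blockwise` are hypothetical, and the 2D array only stands in for a
single homogeneous Block; the layout is chosen for readability.

```python
import numpy as np


def add_columnwise(values, other):
    # One Python-level operation per column, analogous to dispatching
    # each DataFrame column to the Series implementation.
    out = np.empty_like(values)
    for i in range(values.shape[1]):
        out[:, i] = values[:, i] + other
    return out


def add_blockwise(values, other):
    # A single vectorized operation on the whole 2D array, analogous to
    # operating at the Block level.
    return values + other


values = np.arange(12, dtype="float64").reshape(3, 4)
result_cols = add_columnwise(values, 1.0)
result_block = add_blockwise(values, 1.0)
```

Both functions return the same values; the column-wise version just pays
Python overhead once per column, which is what hurts with many columns.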

This is a proposal for how to meet these two needs.

Proposal:
- Allow EA to support 2D arrays
- Use PandasArray to back Block subclasses currently backed by ndarray
- Implement arithmetic and comparison ops directly on PandasArray, then
have Series, DataFrame, and Index ops pass through to the PandasArray
implementations.
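
A minimal sketch of the pass-through idea, assuming toy classes: `ToyArray`
stands in for PandasArray and `ToyContainer` for Series/Index (1D data) or
a single-block DataFrame (2D data).  Neither class is pandas API; the point
is only that the op logic lives in one place.

```python
import operator

import numpy as np


class ToyArray:
    # Hypothetical stand-in for PandasArray: owns the data and the single
    # implementation of arithmetic and comparison ops.
    def __init__(self, values):
        self.values = np.asarray(values)

    def _binop(self, other, op):
        if isinstance(other, ToyArray):
            other = other.values
        return ToyArray(op(self.values, other))

    def __add__(self, other):
        return self._binop(other, operator.add)

    def __lt__(self, other):
        return self._binop(other, operator.lt)


class ToyContainer:
    # Hypothetical stand-in for Series, Index, or DataFrame: holds no op
    # logic of its own and passes everything through to the array.
    def __init__(self, array):
        self.array = array

    @staticmethod
    def _unwrap(other):
        return other.array if isinstance(other, ToyContainer) else other

    def __add__(self, other):
        return ToyContainer(self.array + self._unwrap(other))

    def __lt__(self, other):
        return ToyContainer(self.array < self._unwrap(other))


ser = ToyContainer(ToyArray([1, 2, 3]))
frame = ToyContainer(ToyArray([[1, 2, 3], [4, 5, 6]]))

ser_sum = ser + 1
frame_sum = frame + 1
frame_mask = frame < 3
```

Because the 1D and 2D containers share `ToyArray._binop`, their behavior
cannot drift apart, and the 2D case is still a single vectorized operation.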

Fixes:
- Performance degradation in DataFrame ops (#24990, #26061)
- The last remaining inconsistencies between Index and Series ops (#19322,
#18824)
- Most of the xfailing arithmetic tests
- #22120: Transposing dataframe loses dtype and ExtensionArray
- #24600 BUG: DataFrame[Sparse] quantile fails because SparseArray has no
reshape
- #23925 DataFrame Quantile Broken with Datetime Data

Other Upsides:
- Series constructor could dispatch to pd.array, de-duplicating a lot of
code.
- Easier to move to Arrow backend if Blocks are numpy-naive.
- Make EA closer to a drop-in replacement for np.ndarray, necessary if we
want e.g. xarray to find them directly useful (#24716, #24583)
- Block/BlockManager simplifications, see below.

Downsides:
- Existing constructors assume 1D
- Existing downstream authors assume 1D
- Reduction ops (of which there aren't many) don't have an axis kwarg ATM
   - But for PandasArray they just pass through to nanops, which already
have and test the axis kwarg
   - For DatetimeArray/TimedeltaArray/PeriodArray, I'm the one implementing
the reductions and am OK with this extra complication.
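
For reference, this is roughly what an axis-aware reduction on 2D data
looks like.  `reduce_sum` is a hypothetical helper, and `np.nansum` is used
here as a stand-in for pandas' internal nanops, which is an assumption made
to keep the sketch self-contained.

```python
import numpy as np


def reduce_sum(values, axis=None, skipna=True):
    # Hypothetical helper; np.nansum stands in for the nanops function
    # that a PandasArray reduction would pass through to.
    func = np.nansum if skipna else np.sum
    return func(values, axis=axis)


values = np.array([[1.0, np.nan, 3.0], [4.0, 5.0, np.nan]])

total = reduce_sum(values)            # reduce over all elements
per_row = reduce_sum(values, axis=1)  # one result per row
per_col = reduce_sum(values, axis=0)  # one result per column
```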

Block Simplifications:
- Blocks have three main attributes: values, mgr_locs, and ndim
- ndim is _usually_ the same as values.ndim, the exceptions being for cases
where type(values) is restricted to 1D
- Without these restrictions, we can get rid of:
   - Block.ndim, associated kludgy ndim-checking code
   - numerous can-this-be-reshaped/transposed checks and special cases in
Block and BlockManager code (which are buggy anyway, e.g. #23925)
- With ndim gone, we can then get rid of mgr_locs!
   - The blocks themselves never use mgr_locs except when passing to their
own constructors.
   - mgr_locs makes _much_ more sense as an attribute of the BlockManager
- With mgr_locs gone, Block becomes just a thin wrapper around an EA

Implementation Strategy:
- Remove the 1D restriction
   - Fairly small tweak: EA subclasses must define `shape` instead of
`__len__`; other attrs are defined in terms of shape.
   - Define `transpose`, `T`, `reshape`, and `ravel`
- With this done, several tasks can proceed in parallel:
   - simplifications in core.internals, as special-cases for 1D-only can be
removed
   - implement and test arithmetic ops on PandasArray
   - back Blocks with PandasArray
   - back Index (and numeric subclasses) with PandasArray
- Change DataFrame, Series, Index ops to pass through to underlying
Blocks/PandasArrays
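
The first step above could look something like the following sketch.
`ShapedArray` is a hypothetical class backed by an ndarray, not the actual
ExtensionArray interface; it only illustrates deriving `__len__`, `ndim`,
and `size` from `shape` and adding the four reshaping methods.

```python
import numpy as np


class ShapedArray:
    # Hypothetical EA-like class; not pandas' ExtensionArray API.
    def __init__(self, data):
        self._data = np.asarray(data)

    @property
    def shape(self):
        # The one attribute a subclass would be required to define.
        return self._data.shape

    # Everything below is derived from shape.
    @property
    def ndim(self):
        return len(self.shape)

    @property
    def size(self):
        return int(np.prod(self.shape))

    def __len__(self):
        return self.shape[0]

    # Reshaping methods delegate to the backing ndarray in this sketch.
    def reshape(self, *shape):
        return type(self)(self._data.reshape(*shape))

    def transpose(self):
        return type(self)(self._data.T)

    @property
    def T(self):
        return self.transpose()

    def ravel(self):
        return type(self)(self._data.ravel())


arr = ShapedArray([1, 2, 3, 4, 5, 6])
two_d = arr.reshape(2, 3)
```

Existing 1D subclasses would keep working as long as `shape` returns
`(len(self),)`, which is the backwards-compatible default one would expect.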