[Pandas-dev] Arithmetic Proposal
Brock Mendel
jbrockmendel at gmail.com
Tue Jun 11 16:38:18 EDT 2019
I've been working on arithmetic/comparison bugs and more recently on
performance problems caused by fixing some of those bugs. After trying
less-invasive approaches, I've concluded a fairly big fix is called for.
This is an RFC for that proposed fix.
------
In 0.24.0 we fixed some arithmetic bugs in DataFrame operations by making
DataFrame arithmetic ops operate column-by-column, dispatching to the
Series implementations. This led to a significant performance hit for
operations on DataFrames with many columns (#24990, #26061).
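For illustration only, here is a rough sketch of the two strategies on a plain 2D ndarray. The function names are hypothetical and this is not the pandas code path, just the shape of the difference: one operation per column versus one operation on the whole block.

```python
import numpy as np


def add_columnwise(values, other):
    # Hypothetical helper: one addition per column, mimicking
    # column-by-column dispatch to a 1D implementation.
    out = np.empty_like(values)
    for i in range(values.shape[1]):
        out[:, i] = values[:, i] + other[:, i]
    return out


def add_blockwise(values, other):
    # Hypothetical helper: a single addition on the whole 2D array.
    return values + other


values = np.arange(12, dtype="float64").reshape(3, 4)
other = np.ones((3, 4))

# Both strategies agree on the result; they differ in how many
# Python-level operations are needed as the column count grows.
assert np.array_equal(add_columnwise(values, other), add_blockwise(values, other))
```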
To restore the lost performance, we need to have these operations take place
at the Block level. To prevent DataFrame behavior from diverging from Series
behavior (again), we need to retain a single shared implementation.
This is a proposal for how to meet these two needs.
Proposal:
- Allow EA (ExtensionArray) to support 2D arrays
- Use PandasArray to back Block subclasses currently backed by ndarray
- Implement arithmetic and comparison ops directly on PandasArray, then
have Series, DataFrame, and Index ops pass through to the PandasArray
implementations.
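A minimal sketch of what "pass through to the PandasArray implementations" could look like. All class names here (ToyArray, ToySeries, ToyFrame) are hypothetical stand-ins, not pandas classes; the point is only that the arithmetic lives in one place and the containers delegate to it.

```python
import operator

import numpy as np


class ToyArray:
    """Hypothetical stand-in for PandasArray; wraps a 1D or 2D ndarray."""

    def __init__(self, values):
        self._values = np.asarray(values)

    def _arith(self, other, op):
        # The single shared implementation of arithmetic.
        if isinstance(other, ToyArray):
            other = other._values
        return ToyArray(op(self._values, other))

    def __add__(self, other):
        return self._arith(other, operator.add)

    def __mul__(self, other):
        return self._arith(other, operator.mul)


class ToySeries:
    """Hypothetical 1D container; holds a 1D ToyArray."""

    def __init__(self, array):
        self.array = array

    def __add__(self, other):
        # Unwrap another container if given one, then delegate.
        other = getattr(other, "array", other)
        return ToySeries(self.array + other)


class ToyFrame:
    """Hypothetical 2D container; holds a 2D ToyArray."""

    def __init__(self, array):
        self.array = array

    def __add__(self, other):
        other = getattr(other, "array", other)
        return ToyFrame(self.array + other)


s = ToySeries(ToyArray([1, 2, 3]))
f = ToyFrame(ToyArray([[1, 2], [3, 4]]))
s_plus_one = s + 1
f_plus_f = f + f
```

Because neither container implements addition itself, the 1D and 2D cases cannot drift apart in this sketch.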
Fixes:
- Performance degradation in DataFrame ops (#24990, #26061)
- The last remaining inconsistencies between Index and Series ops (#19322,
#18824)
- Most of the xfailing arithmetic tests
- #22120: Transposing dataframe loses dtype and ExtensionArray
- #24600 BUG: DataFrame[Sparse] quantile fails because SparseArray has no
reshape
- #23925 DataFrame Quantile Broken with Datetime Data
Other Upsides:
- Series constructor could dispatch to pd.array, de-duplicating a lot of
code.
- Easier to move to Arrow backend if Blocks are numpy-naive.
- Make EA closer to a drop-in replacement for np.ndarray, necessary if we
want e.g. xarray to find them directly useful (#24716, #24583)
- Block/BlockManager simplifications, see below.
Downsides:
- Existing constructors assume 1D
- Existing downstream authors assume 1D
- Reduction ops (of which there aren't many) don't have axis kwarg ATM
  - But for PandasArray they just pass through to nanops, which already
    have and test the axis kwargs
  - For DatetimeArray/TimedeltaArray/PeriodArray, I'm the one implementing
    the reductions and am OK with this extra complication.
Block Simplifications:
- Blocks have three main attributes: values, mgr_locs, and ndim
- ndim is _usually_ the same as values.ndim, the exceptions being for cases
where type(values) is restricted to 1D
- Without these restrictions, we can get rid of:
  - Block.ndim and the associated kludgy ndim-checking code
  - numerous can-this-be-reshaped/transposed checks and special cases in
    Block and BlockManager code (which are buggy anyway, e.g. #23925)
- With ndim gone, we can then get rid of mgr_locs!
  - The blocks themselves never use mgr_locs except when passing to their
    own constructors.
  - mgr_locs makes _much_ more sense as an attribute of the BlockManager
- With mgr_locs gone, Block becomes just a thin wrapper around an EA
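A rough sketch of the end state described above, with hypothetical names (ToyBlock, ToyManager) and a toy layout of one row of `values` per column of the frame, chosen for this example only. The block keeps just its values, ndim is derived rather than stored, and the column placement lives on the manager.

```python
import numpy as np


class ToyBlock:
    """Hypothetical thin wrapper: stores only `values`."""

    def __init__(self, values):
        self.values = values

    @property
    def ndim(self):
        # Derived from the values, so it can never disagree with them.
        return self.values.ndim


class ToyManager:
    """Hypothetical manager that owns the column placement."""

    def __init__(self, blocks, locs):
        # locs[i] lists the frame column positions held by blocks[i].
        self.blocks = blocks
        self.locs = locs

    def get_column(self, position):
        for block, block_locs in zip(self.blocks, self.locs):
            if position in block_locs:
                return block.values[block_locs.index(position)]
        raise IndexError(position)


floats = ToyBlock(np.array([[1.0, 2.0], [3.0, 4.0]]))  # frame columns 0 and 2
ints = ToyBlock(np.array([[10, 20]]))                   # frame column 1
mgr = ToyManager([floats, ints], [[0, 2], [1]])
```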
Implementation Strategy:
- Remove the 1D restriction
  - Fairly small tweak, EA subclass must define `shape` instead of
    `__len__`; other attrs are defined in terms of shape.
- Define `transpose`, `T`, `reshape`, and `ravel`
- With this done, several tasks can proceed in parallel:
  - simplifications in core.internals, as special-cases for 1D-only can be
    removed
  - implement and test arithmetic ops on PandasArray
  - back Blocks with PandasArray
  - back Index (and numeric subclasses) with PandasArray
- Change DataFrame, Series, Index ops to pass through to underlying
Blocks/PandasArrays
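
To give a feel for the first step, here is a minimal sketch of an array class whose only required attribute is `shape`, with everything else derived from it. ToyNDArray is a hypothetical name and the ndarray backing is an assumption of the example, not the actual ExtensionArray interface.

```python
import math

import numpy as np


class ToyNDArray:
    """Hypothetical array whose other attributes derive from `shape`."""

    def __init__(self, values):
        self._values = np.asarray(values)

    @property
    def shape(self):
        # The one attribute a subclass would have to define.
        return self._values.shape

    # Everything below is written in terms of `shape`.
    @property
    def ndim(self):
        return len(self.shape)

    @property
    def size(self):
        return math.prod(self.shape)

    def __len__(self):
        return self.shape[0]

    # Reshaping methods, delegated to the backing ndarray in this sketch.
    def reshape(self, *shape):
        return type(self)(self._values.reshape(*shape))

    def transpose(self):
        return type(self)(self._values.transpose())

    @property
    def T(self):
        return self.transpose()

    def ravel(self):
        return type(self)(self._values.ravel())


arr = ToyNDArray(range(6))
arr2d = arr.reshape(2, 3)
```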