[Pandas-dev] Arithmetic Proposal

Joris Van den Bossche jorisvandenbossche at gmail.com
Thu Jun 20 11:42:49 EDT 2019


Thanks Brock for the update and POC.

I certainly don't yet fully understand all the details and consequences of
the proposal for the EAs (given the discussion on GitHub ..), but I have the
feeling that this moves the complexity of dealing with both 1D and 2D from
our internals / BlockManager to the ExtensionArray.
Or, if we still want to allow EAs that are strictly 1D: it moves that
complexity to the ExtensionBlock (which is of course a more focused part of
the internals, possibly a plus, but that also means that internal and
external 1D EAs start to deviate more).

I am not sure that I find that a net improvement. I would rather keep some
complexity concentrated in our internals than expose that complexity on
the ExtensionArrays.
But maybe I don't work enough in the internals code to really understand
the problem this is trying to solve.

Repeating myself from earlier in this thread: if we want to put
considerable effort into refactoring the internals, I think we should
seriously consider other options.

Joris

On Wed, Jun 19, 2019 at 9:26 PM Brock Mendel <jbrockmendel at gmail.com> wrote:

> A Proof of Concept is up at
> https://github.com/pandas-dev/pandas/pull/26914.  A brief overview:
>
> - implement ReshapeMixin for EAs that wrap an ndarray (e.g. DatetimeArray,
> PeriodArray, Categorical, CyberPandas, ...).  For this type of EA, the
> implementation is pretty trivial (a rough sketch follows after this list).
> - Patch DatetimeArray.__getitem__ and _box_values
> - See how much core.internals simplification becomes feasible
>    - DatetimeTZBlock can now use the base class implementations for shape,
> _slice, copy, iget, and interpolate.
>    - DatetimeTZBlock.where becomes a thin wrapper around Block.where,
> handling the cases where Block.where incorrectly casts to object dtype.
>    - With minor additional edits to DatetimeArray, we could also remove
> the need for DatetimeTZBlock to override diff, shift, and take_nd.
>    - If we allow DatetimeTZBlock to hold multiple columns, _unstack could
> also use the base class implementation.
>
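> A rough sketch of what such a mixin might look like (the `_ndarray`
> attribute and `_from_backing_data` constructor below are illustrative
> assumptions, not necessarily what the PR uses):
>
> ```
> class ReshapeMixin:
>     """Sketch: 2D support for an EA that wraps a single ndarray.
>
>     Assumes the subclass keeps its data in ``self._ndarray`` and can
>     rebuild itself from an ndarray via ``self._from_backing_data``.
>     """
>
>     @property
>     def shape(self):
>         return self._ndarray.shape
>
>     @property
>     def ndim(self):
>         return self._ndarray.ndim
>
>     def reshape(self, *shape):
>         # Reshape the backing ndarray; dtype metadata is unchanged.
>         return self._from_backing_data(self._ndarray.reshape(*shape))
>
>     @property
>     def T(self):
>         return self._from_backing_data(self._ndarray.T)
>
>     def ravel(self, order="C"):
>         return self._from_backing_data(self._ndarray.ravel(order))
> ```
>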
> This also turned up existing bugs (
> https://github.com/pandas-dev/pandas/issues/26864) that I speculate would
> be easier to address with less Block/BlockManager complexity.
>
> Bottom Line: The relevant array operations are going to be defined for EAs
> regardless.  The question is whether they are going to be defined directly
> on the EAs (and tested in isolation), or defined by the Blocks (and tested
> indirectly).  I advocate the former.
>
> On Thu, Jun 13, 2019 at 12:48 PM Tom Augspurger <
> tom.augspurger88 at gmail.com> wrote:
>
>> OK thanks, I think I understand things better now. IMO, the most
>> promising line of development is internally reshaping EAs to be (N, 1). No
>> concrete
>> thoughts on how to do this yet though.
>>
>> And there's another reason I'd prefer not to have public 2-D EAs yet.
>> Aside from the potential future block manager simplification, 2+-dimensional
>> arrays open up a bunch of complexity for EA authors, especially around
>> indexing, take, and concat. I'd prefer to delay that while we still have
>> other
>> options on the table that look promising.
>>
>>
>> On Thu, Jun 13, 2019 at 2:28 PM Brock Mendel <jbrockmendel at gmail.com>
>> wrote:
>>
>>> > I think I was missing a subtle point; We'll still have a mix of 1-D
>>> and 2-D blocks under your proposal. What we *won't* have is cases where the
>>> Block.shape doesn't match the Block.values.shape?
>>>
>>> Correct.  For curious readers, "mix" here only means that both 1D and 2D
>>> Blocks will exist.  Within a DataFrame you will only ever see 2D
>>> Blocks.  Within a Series you will only ever find a single 1D Block.  (At
>>> least under the current proposal; Tom's option 1 above would change this.)
>>>
>>> > So assuming we want the laudable goal of Block.{ndim,shape} ==
>>> Block.values.{ndim,shape},
>>>
>>> Following that train of thought (hopefully this helps explain the
>>> promised simplifications):
>>>  - now blocks don't need an ndim attribute or kwarg for the constructor
>>>  - so their only attributes are mgr_locs and values
>>>  - and hey, when do they ever use self.mgr_locs?  Only when calling
>>> their own constructors!  These make more sense as a BlockManager attribute
>>> anyway.
>>>  - so... if Block's only attribute is `values`, maybe we can get rid of
>>> Block altogether?
>>>
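>>> A toy illustration of that endpoint (purely hypothetical, not actual
>>> pandas code): if a Block carries nothing but `values`, the manager can
>>> just hold the arrays and their column locations itself.
>>>
>>> ```
>>> class SimpleManager:
>>>     """Toy sketch: (column locations, 2D values) pairs, no Block layer."""
>>>
>>>     def __init__(self, arrays, mgr_locs):
>>>         self.arrays = list(arrays)      # each one 2D: (n_columns_held, n_rows)
>>>         self.mgr_locs = list(mgr_locs)  # which DataFrame columns each array holds
>>>
>>>     def iget(self, col):
>>>         # Return the single row of whichever array holds column `col`.
>>>         for locs, arr in zip(self.mgr_locs, self.arrays):
>>>             if col in locs:
>>>                 return arr[list(locs).index(col)]
>>>         raise IndexError(col)
>>> ```
>>>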
>>> > 1. Allow BlockManager to store a mix of 1-D and 2-D blocks.
>>> > 2. Allow EAs to be (at least internally) reshaped to 2D (possibly just
>>> (N, 1) or (1, N)).
>>> >
>>> > Is that fair? Doing number 1 sounds really bad / difficult, right?
>>>
>>> Option 1: I haven't really thought about it, but my intuition is that it
>>> would cause new headaches.  I expect this would show up in the
>>> reshape/concat code, which I'm not as familiar with.
>>>
>>> Option 2: allowing (N,1) and (1,N) would give us most of the discussed
>>> simplifications (and bugfixes) in Block/BlockManager.  Even without the
>>> performance considerations, I would consider this a massive win.
>>>
>>> > And I think it would help with the *regression* in arithmetic
>>> performance, right? Since ndarray-backed blocks would be allowed to be (N,
>>> P)?
>>>
>>> Not directly, but I think it would make one of the
>>> previously-tried-and-failed approaches less problematic.
>>>
>>>
>>> On Thu, Jun 13, 2019 at 11:30 AM Tom Augspurger <
>>> tom.augspurger88 at gmail.com> wrote:
>>>
>>>>
>>>>
>>>> On Thu, Jun 13, 2019 at 12:20 PM Brock Mendel <jbrockmendel at gmail.com>
>>>> wrote:
>>>>
>>>>> > can you give your thoughts about the idea of having _only_ 1D
>>>>> Blocks? That could also solve a lot of the complexity of the internals (no
>>>>> 1D vs 2D) and have many of the advantages you mentioned in the first email,
>>>>> but I don't think you really answered that aspect.
>>>>>
>>>>> My read from the earlier email was that you were going to present a
>>>>> more detailed proposal for what this would look like.  Going all-1D would
>>>>> solve the sometimes-1D-sometimes-2D problem, but I think it would cause
>>>>> real problems for transpose and reductions with axis=1 (I've seen it argued
>>>>> that these are not common use cases).  It also wouldn't change the fact
>>>>> that we need BlockManager or something like it to do alignment.  Getting
>>>>> array-like operations out of BlockManager/Block and into PandasArray could
>>>>> work with the all-1D idea.
>>>>>
>>>>> > 1. The current structure of [...] They're different enough that EA
>>>>> isn't a drop-in replacement for ndarray.
>>>>> > 2. Arrays being either 1D or 2D causes many issues.
>>>>> > Q1: Do those two issues accurately capture your concerns as well?
>>>>>
>>>>> Pretty much, yes.
>>>>>
>>>>> > Q2: Can you clarify: with 2D EAs would *all* EAs stored within
>>>>> pandas be 2D internally (and Series / Index would squeeze before data gets
>>>>> back to the user)? Otherwise, I don't see how we get the internal
>>>>> simplification.
>>>>>
>>>>> My thought was that only EAs backing Blocks inside DataFrames would
>>>>> get reshaped.  Everything else would retain their existing dimensions.  It
>>>>> isn't clear to me why you'd want 2D backing Index/Series, though I'm open
>>>>> to being convinced.
>>>>>
>>>>
>>>> I think I was missing a subtle point; We'll still have a mix of 1-D and
>>>> 2-D blocks under your proposal. What we *won't* have is cases where the
>>>> Block.shape doesn't match the Block.values.shape?
>>>>
>>>> ```
>>>> In [9]: df = pd.DataFrame({"A": [1, 2], 'B': pd.array([1, 2],
>>>> dtype='Int64')})
>>>>
>>>> In [10]: df._data.blocks
>>>> Out[10]:
>>>> (IntBlock: slice(0, 1, 1), 1 x 2, dtype: int64,
>>>>  ExtensionBlock: slice(1, 2, 1), 1 x 2, dtype: Int64)
>>>>
>>>> In [11]: df._data.blocks[0].shape == df._data.blocks[0].values.shape
>>>> Out[11]: True
>>>>
>>>> In [12]: df._data.blocks[1].shape == df._data.blocks[1].values.shape
>>>> Out[12]: False
>>>>
>>>> ```
>>>>
>>>> So assuming we want the laudable goal of Block.{ndim,shape} ==
>>>> Block.values.{ndim,shape}, we have two options:
>>>>
>>>> 1. Allow BlockManager to store a mix of 1-D and 2-D blocks.
>>>> 2. Allow EAs to be (at least internally) reshaped to 2D (possibly just
>>>> (N, 1) or (1, N)).
>>>>
>>>> Is that fair? Doing number 1 sounds really bad / difficult, right?
>>>>
>>>>> Demonstrating the internal simplification may require a Proof of
>>>>> Concept.
>>>>>
>>>>> > Our contract on EA is 1D and I agree that changing this is not a
>>>>> good idea (at least publicly).
>>>>>
>>>>> Another slightly hacky option would be to secretly allow EAs to be
>>>>> temporarily 2D while backing DataFrame Blocks, but restrict that to just
>>>>> (N, 1) or (1, N).  That wouldn't do anything to address the arithmetic
>>>>> performance, but might let us de-kludge the 1D/2D code in core.internals.
>>>>>
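>>>>> For concreteness, one hypothetical shape that could take (invented names,
>>>>> just a sketch): a thin wrapper that presents a 1D EA to the internals as
>>>>> if it were (1, N), and is unwrapped before anything reaches the user.
>>>>>
>>>>> ```
>>>>> class FakeTwoDim:
>>>>>     """Toy wrapper exposing a 1D ExtensionArray as a (1, N) array-like,
>>>>>     only ever seen by Block/BlockManager code."""
>>>>>
>>>>>     ndim = 2
>>>>>
>>>>>     def __init__(self, ea):
>>>>>         self._ea = ea  # the real, public 1D ExtensionArray
>>>>>
>>>>>     @property
>>>>>     def shape(self):
>>>>>         return (1, len(self._ea))
>>>>>
>>>>>     def __getitem__(self, key):
>>>>>         # Only the access patterns the internals use need to work,
>>>>>         # e.g. wrapped[0] -> the underlying 1D EA.
>>>>>         if key == 0 or key == (0, slice(None)):
>>>>>             return self._ea
>>>>>         raise NotImplementedError(key)
>>>>> ```
>>>>>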
>>>>
>>>> I would be curious to see this. And I think it would help with the
>>>> *regression* in arithmetic performance, right? Since ndarray-backed blocks
>>>> would be allowed to be (N, P)?
>>>>
>>>>
>>>>> On Wed, Jun 12, 2019 at 1:56 PM Joris Van den Bossche <
>>>>> jorisvandenbossche at gmail.com> wrote:
>>>>>
>>>>>> On Wed, Jun 12, 2019 at 6:18 PM Jeff Reback <jeffreback at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> ...
>>>>>>>
>>>>>>> So here's another proposal (a bit half-baked, but...):
>>>>>>>
>>>>>>> You *could* build a single dtyped container that actually holds the
>>>>>>> 1D arrays themselves. Then you could put EA arrays and numpy arrays on the
>>>>>>> same footing, meaning each
>>>>>>> 'Block' would be exactly the same.
>>>>>>>
>>>>>>> - This would make operations the *same* across all 'Blocks',
>>>>>>> reducing complexity
>>>>>>> - We could simply take views on 2D numpy arrays to avoid the
>>>>>>> performance penalty of copying (as we construct from a 2D numpy array
>>>>>>> a lot); this makes some aggregation ops much slower than if we
>>>>>>> actually copied, but copying has a cost too
>>>>>>> - Ops could be defined on EAs & Pandas Arrays; these can then operate
>>>>>>> array-by-array (within a Block), or, using numba, we could implement the
>>>>>>> ops in a way that gets a pretty big speedup for a particular kernel
>>>>>>>
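>>>>>>> A toy sketch of that "single container of 1D arrays" idea (names
>>>>>>> invented for illustration, not a worked-out design):
>>>>>>>
>>>>>>> ```
>>>>>>> class ArrayContainer:
>>>>>>>     """Toy column store: every column is a 1D array (numpy or EA)."""
>>>>>>>
>>>>>>>     def __init__(self, columns):
>>>>>>>         self.columns = list(columns)  # one 1D array per column
>>>>>>>
>>>>>>>     @classmethod
>>>>>>>     def from_2d(cls, values):
>>>>>>>         # Iterating a 2D ndarray yields row views, so no copy is made here.
>>>>>>>         return cls(list(values))
>>>>>>>
>>>>>>>     def apply(self, op, other):
>>>>>>>         # Ops run column-by-column; each column dispatches to its own type.
>>>>>>>         return ArrayContainer([op(col, other) for col in self.columns])
>>>>>>> ```
>>>>>>>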
>>>>>>
>>>>>> How would this proposal avoid the above-mentioned performance
>>>>>> implication of doing ops column-by-column?
>>>>>>
>>>>>> In general, I think we should try to do a few basic benchmarks of
>>>>>> what the performance impact would be for some typical use cases when all
>>>>>> ops are done column-by-column / all columns are stored as separate blocks
>>>>>> (Jeff had a branch at some point that made this optional), to have a better
>>>>>> idea of the (dis)advantages of the different proposals.
>>>>>>
>>>>>> Brock, can you give your thoughts about the idea of having _only_ 1D
>>>>>> Blocks? That could also solve a lot of the complexity of the internals (no
>>>>>> 1D vs 2D) and have many of the advantages you mentioned in the first email,
>>>>>> but I don't think you really answered that aspect.
>>>>>>
>>>>>> Joris
>>>>>>
>>>>