[Pandas-dev] Arithmetic Proposal

Brock Mendel jbrockmendel at gmail.com
Thu Jun 13 15:28:29 EDT 2019


> I think I was missing a subtle point; we'll still have a mix of 1-D and
2-D blocks under your proposal. What we *won't* have is cases where the
Block.shape doesn't match the Block.values.shape?

Correct.  For curious readers, "mix" here only means that both 1D Blocks and
2D Blocks will exist.  Within a DataFrame you will only ever see 2D
Blocks.  Within a Series you will only ever find a single 1D Block.  (At
least under the current proposal; Tom's option 1 above would change this.)
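
To make that concrete, here is roughly what the invariant looks like as code
(illustrative only: these assertions poke at the private `_data` attribute,
and the EA-backed one fails on master today but would hold under the
proposal):

```
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": pd.array([1, 2], dtype="Int64")})
ser = df["B"]

# DataFrame: every Block is 2D and its values match the Block's shape
for blk in df._data.blocks:
    assert blk.shape == blk.values.shape  # currently fails for the Int64 column
    assert blk.values.ndim == 2

# Series: exactly one Block, and its values stay 1D
(blk,) = ser._data.blocks
assert blk.values.ndim == 1
```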

> So assuming we want the laudable goal of Block.{ndim,shape} ==
Block.values.{ndim,shape},

Following that train of thought (hopefully this helps explain the promised
simplifications):
 - now blocks don't need an ndim attribute or kwarg for the constructor
 - so their only attributes are mgr_locs and values
 - and hey, when do they ever use self.mgr_locs?  Only when calling their
own constructors!  These make more sense as a BlockManager attribute anyway.
 - so... if Block's only attribute is `values`, maybe we can get rid of
Block altogether?
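
A rough sketch of where that train of thought ends up (hypothetical names,
not real pandas classes; just to illustrate the shape of the simplification):

```
import numpy as np

class SketchManager:
    """Hypothetical end state: no Block class at all; the manager holds
    bare 2D arrays plus their column placements (what mgr_locs tracks today)."""

    def __init__(self, arrays, placements):
        self.arrays = arrays          # list of 2D ndarrays / 2D-capable EAs
        self.placements = placements  # per-array column locations

# two int64 columns consolidated into one (2, N) array at columns 0 and 2,
# plus a float64 column at column 1
mgr = SketchManager(
    arrays=[np.array([[1, 2, 3], [4, 5, 6]]),
            np.array([[1.5, 2.5, 3.5]])],
    placements=[np.array([0, 2]), np.array([1])],
)
```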

> 1. Allow BlockManager to store a mix of 1-D and 2-D blocks.
> 2. Allow EAs to be (at least internally) reshaped to 2D (possibly just
(N, 1) or (1, N)).
>
> Is that fair? Doing number 1 sounds really bad / difficult, right?

Option 1: I haven't really thought about it, but my intuition is that it would
cause new headaches.  I expect this would show up in the reshape/concat
code, which I'm not as familiar with.

Option 2: allowing (N,1) and (1,N) would give us most of the discussed
simplifications (and bugfixes) in Block/BlockManager.  Even without the
performance considerations, I would consider this a massive win.
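
One way to picture option 2 (purely illustrative; `TwoDWrapper` is invented
for this sketch and is not a proposed API): a 1D EA gets presented to the
internals as (1, N) without touching the public 1D contract:

```
import pandas as pd

class TwoDWrapper:
    """Hypothetical: expose a 1D ExtensionArray as a (1, N) array-like,
    for internal Block use only."""

    ndim = 2

    def __init__(self, ea):
        self._ea = ea

    @property
    def shape(self):
        return (1, len(self._ea))

    def __getitem__(self, key):
        row, col = key  # only the trivial row index is ever needed
        assert row in (0, slice(None))
        return self._ea[col]

ea = pd.array([1, 2, 3], dtype="Int64")
wrapped = TwoDWrapper(ea)
assert wrapped.shape == (1, 3)
assert wrapped[0, 2] == 3
```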

> And I think it would help with the *regression* in arithmetic
performance, right? Since ndarray-backed blocks would be allowed to be (N,
P)?

Not directly, but I think it would make one of the
previously-tried-and-failed approaches less problematic.


On Thu, Jun 13, 2019 at 11:30 AM Tom Augspurger <tom.augspurger88 at gmail.com>
wrote:

>
>
> On Thu, Jun 13, 2019 at 12:20 PM Brock Mendel <jbrockmendel at gmail.com>
> wrote:
>
>> > can you give your thoughts about the idea of having _only_ 1D Blocks?
>> That could also solve a lot of the complexity of the internals (no 1D vs
>> 2D) and have many of the advantages you mentioned in the first email, but I
>> don't think you really answered that aspect.
>>
>> My read from the earlier email was that you were going to present a more
>> detailed proposal for what this would look like.  Going all-1D would solve
>> the sometimes-1D-sometimes-2D problem, but I think it would cause real
>> problems for transpose and reductions with axis=1 (I've seen it argued that
>> these are not common use cases).  It also wouldn't change the fact that we
>> need BlockManager or something like it to do alignment.  Getting array-like
>> operations out of BlockManager/Block and into PandasArray could work with
>> the all-1D idea.
>>
>> > 1. The current structure of [...] They're different enough that EA
>> isn't a drop-in replacement for ndarray.
>> > 2. Arrays being either 1D or 2D causes many issues.
>> > Q1: Do those two issues accurately capture your concerns as well?
>>
>> Pretty much, yes.
>>
>> > Q2: Can you clarify: with 2D EAs would *all* EAs stored within pandas
>> be 2D internally (and Series / Index would squeeze before data gets back to
>> the user)? Otherwise, I don't see how we get the internal simplification.
>>
>> My thought was that only EAs backing Blocks inside DataFrames would get
>> reshaped.  Everything else would retain their existing dimensions.  It
>> isn't clear to me why you'd want 2D backing Index/Series, though I'm open
>> to being convinced.
>>
>
> I think I was missing a subtle point; we'll still have a mix of 1-D and
> 2-D blocks under your proposal. What we *won't* have is cases where the
> Block.shape doesn't match the Block.values.shape?
>
> ```
> In [9]: df = pd.DataFrame({"A": [1, 2], 'B': pd.array([1, 2],
> dtype='Int64')})
>
> In [10]: df._data.blocks
> Out[10]:
> (IntBlock: slice(0, 1, 1), 1 x 2, dtype: int64,
>  ExtensionBlock: slice(1, 2, 1), 1 x 2, dtype: Int64)
>
> In [11]: df._data.blocks[0].shape == df._data.blocks[0].values.shape
> Out[11]: True
>
> In [12]: df._data.blocks[1].shape == df._data.blocks[1].values.shape
> Out[12]: False
>
> ```
>
> So assuming we want the laudable goal of Block.{ndim,shape} ==
> Block.values.{ndim,shape}, we have two options
>
> 1. Allow BlockManager to store a mix of 1-D and 2-D blocks.
> 2. Allow EAs to be (at least internally) reshaped to 2D (possibly just (N,
> 1) or (1, N)).
>
> Is that fair? Doing number 1 sounds really bad / difficult, right?
>
> Demonstrating the internal simplification may require a Proof of Concept.
>>
>> > Our contract on EA is 1D and I agree that changing this is not a good
>> idea (at least publicly).
>>
>> Another slightly hacky option would be to secretly allow EAs to be
>> temporarily 2D while backing DataFrame Blocks, but restrict that to just
>> (N, 1) or (1, N).  That wouldn't do anything to address the arithmetic
>> performance, but might let us de-kludge the 1D/2D code in core.internals.
>>
>
> I would be curious to see this. And I think it would help with the
> *regression* in arithmetic performance, right? Since ndarray-backed blocks
> would be allowed to be (N, P)?
>
>
>> On Wed, Jun 12, 2019 at 1:56 PM Joris Van den Bossche <
>> jorisvandenbossche at gmail.com> wrote:
>>
>>> On Wed, Jun 12, 2019 at 18:18, Jeff Reback <jeffreback at gmail.com> wrote:
>>>
>>>> ...
>>>>
>>>> So here's another proposal (a bit half-baked, but...):
>>>>
>>>> You *could* build a single dtyped container that actually holds the 1D
>>>> arrays themselves. Then you could put EA arrays and numpy arrays on the
>>>> same footing, meaning each 'Block' would be exactly the same.
>>>>
>>>> - This would make operations the *same* across all 'Blocks', reducing
>>>> complexity
>>>> - We could simply take views on 2D numpy arrays to avoid the
>>>> performance penalty of copying (since we often construct from a 2D numpy
>>>> array); this makes some aggregation ops much slower than if we actually
>>>> copied, but copying has a cost too
>>>> - Ops could be defined on EA & Pandas Arrays; these can then operate
>>>> array-by-array (within a Block), or, using numba, we could implement the
>>>> ops in a way that gets a pretty big speedup for a particular kernel
>>>>
>>>
>>> How would this proposal avoid the above-mentioned performance
>>> implication of doing ops column-by-column?
>>>
>>> In general, I think we should try to do a few basic benchmarks of what
>>> the performance impact would be for some typical use cases when all ops are
>>> done column-by-column / all columns are stored as separate blocks (Jeff had
>>> a branch at some point that made this optional), to get a better idea of
>>> the (dis)advantages of the different proposals.
>>>
>>> Brock, can you give your thoughts about the idea of having _only_ 1D
>>> Blocks? That could also solve a lot of the complexity of the internals (no
>>> 1D vs 2D) and have many of the advantages you mentioned in the first email,
>>> but I don't think you really answered that aspect.
>>>
>>> Joris
>>>
>