[Pandas-dev] Arithmetic Proposal

Thu Jun 13 13:19:46 EDT 2019

> can you give your thoughts about the idea of having _only_ 1D Blocks?
> That could also solve a lot of the complexity of the internals (no 1D vs
> 2D) and have many of the advantages you mentioned in the first email, but I
> don't think you really answered that aspect.

My read from the earlier email was that you were going to present a more
detailed proposal for what this would look like.  Going all-1D would solve
the sometimes-1D-sometimes-2D problem, but I think it would cause real
problems for transpose and reductions with axis=1 (I've seen it argued that
these are not common use cases).  It also wouldn't change the fact that we
need BlockManager or something like it to do alignment.  Getting array-like
operations out of BlockManager/Block and into PandasArray could work with
the all-1D idea.
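
To make the transpose and axis=1 concern concrete, here is a minimal
NumPy-only sketch.  It does not touch pandas internals: a plain list of 1D
arrays stands in for an all-1D layout, and `columns_1d`,
`sum_axis1_from_columns` and `transpose_columns` are hypothetical names made
up for illustration.

```python
import numpy as np

# One 2D block: 4 rows, 3 columns, single dtype.
block_2d = np.arange(12, dtype="float64").reshape(4, 3)

# Hypothetical all-1D layout: one separate array per column.
columns_1d = [block_2d[:, i].copy() for i in range(block_2d.shape[1])]


def sum_axis1_from_columns(columns):
    """Row-wise sum when every column is its own 1D array."""
    out = np.zeros(len(columns[0]), dtype=columns[0].dtype)
    for col in columns:  # one Python-level step per column
        out += col
    return out


def transpose_columns(columns):
    """Transpose by stacking (copying) the columns, then splitting rows."""
    stacked = np.stack(columns, axis=1)  # copies into a fresh 2D array
    return [stacked[i, :].copy() for i in range(stacked.shape[0])]


row_sums_2d = block_2d.sum(axis=1)                # single vectorized call
row_sums_1d = sum_axis1_from_columns(columns_1d)  # loop over columns
transposed = transpose_columns(columns_1d)        # 4 arrays of length 3
block_2d_T = block_2d.T                           # a view, no copy
```

With the 2D block, the reduction is one call and the transpose is a view;
with separate columns, both need a per-column loop or an intermediate copy.
How much that costs in practice would need benchmarking.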

> 1. The current structure of [...] They're different enough that EA isn't
> a drop-in replacement for ndarray.
> 2. Arrays being either 1D or 2D causes many issues.
> Q1: Do those two issues accurately capture your concerns as well?

Pretty much, yes.

> Q2: Can you clarify: with 2D EAs would *all* EAs stored within pandas be
> 2D internally (and Series / Index would squeeze before data gets back to
> the user)? Otherwise, I don't see how we get the internal simplification.

My thought was that only EAs backing Blocks inside DataFrames would get
reshaped.  Everything else would retain their existing dimensions.  It
isn't clear to me why you'd want 2D backing Index/Series, though I'm open
to being convinced.

Demonstrating the internal simplification may require a Proof of Concept.

> Our contract on EA is 1D and I agree that changing this is not a good
> idea (at least publicly).

Another slightly hacky option would be to secretly allow EAs to be
temporarily 2D while backing DataFrame Blocks, but restrict that to just
(N, 1) or (1, N).  That wouldn't do anything to address the arithmetic
performance, but might let us de-kludge the 1D/2D code in core.internals.
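
A rough sketch of what the (N, 1) / (1, N) restriction could look like.
This is not existing pandas code: `Restricted2D` is a hypothetical wrapper
invented for illustration, and a NumPy array stands in for the 1D
ExtensionArray.

```python
import numpy as np


class Restricted2D:
    """Hypothetical wrapper presenting a 1D array as (1, N) or (N, 1)."""

    def __init__(self, values_1d, shape):
        n = len(values_1d)
        if shape not in [(1, n), (n, 1)]:
            raise ValueError(f"shape must be (1, {n}) or ({n}, 1), got {shape}")
        self._values = values_1d
        self.shape = shape

    @property
    def ndim(self):
        return 2

    @property
    def T(self):
        # Transposing only swaps the shape; the 1D data is shared, not copied.
        return Restricted2D(self._values, self.shape[::-1])

    def squeeze(self):
        # Hand back the original 1D array, e.g. before returning data to users.
        return self._values


values = np.array([10, 20, 30])
wrapped = Restricted2D(values, (1, 3))
```

The point of the sketch is only that the underlying data stays 1D, so the
public EA contract would not have to change; whether this actually
simplifies core.internals is a separate question.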

On Wed, Jun 12, 2019 at 1:56 PM Joris Van den Bossche <
jorisvandenbossche at gmail.com> wrote:

> On Wed, Jun 12, 2019 at 18:18, Jeff Reback <jeffreback at gmail.com> wrote:
>
>> ...
>>
>> So here's another proposal (a bit half-baked but...):
>>
>> You *could* build a single dtyped container that actually holds the 1D
>> arrays themselves. Then you could put EA arrays and numpy arrays on the
>> same footing, meaning each 'Block' would be exactly the same.
>>
>> - This would make operations the *same* across all 'Blocks', reducing
>> complexity
>> - We could simply take views on 2D numpy arrays to actually avoid a
>> performance penalty of copying (as we can construct from a 2D numpy array
>> a lot); this causes some aggregation ops to be much slower than if we
>> actually copy, but that has a cost too
>> - Ops could be defined on EA & Pandas Arrays; these can then operate
>> array-by-array (within a Block), or using numba we could implement the ops
>> in a way that we can get a pretty big speedup for a particular kernel
>>
>
> How would this proposal avoid the above-mentioned performance implication
> of doing ops column-by-column?
>
> In general, I think we should try to do a few basic benchmarks on what the
> performance impact would be for some typical use cases when all ops are
> done column-by-column / all columns are stored as separate blocks (Jeff had
> a branch at some point that made this optional). That would give us a
> better idea of the (dis)advantages of the different proposals.
>
> Brock, can you give your thoughts about the idea of having _only_ 1D
> Blocks? That could also solve a lot of the complexity of the internals (no
> 1D vs 2D) and have many of the advantages you mentioned in the first email,
> but I don't think you really answered that aspect.
>
> Joris