[Pandas-dev] Arithmetic Proposal

Joris Van den Bossche jorisvandenbossche at gmail.com
Wed Jun 12 16:55:52 EDT 2019


On Wed, Jun 12, 2019 at 18:18, Jeff Reback <jeffreback at gmail.com> wrote:

> ...
>
> So here's another proposal (a bit half-baked but....):
>
> You *could* build a single dtyped container that actually holds the 1D
> arrays themselves. Then you could put EA arrays and numpy arrays on the
> same footing, meaning each 'Block' would be exactly the same.
>
> - This would make operations the *same* across all 'Blocks', reducing
> complexity
> - We could simply take views on 2D numpy arrays to actually avoid a
> performance penalty of copying (as we construct from 2D numpy arrays
> a lot); this causes some aggregation ops to be much slower than if we
> actually copy, but that has a cost too
> - Ops could be defined on EA & Pandas Arrays; these can then operate
> array-by-array (within a Block), or using numba we could implement the ops
> in a way that we can get a pretty big speedup for a particular kernel
>

How would this proposal avoid the above-mentioned performance implication
of doing ops column-by-column?

In general, I think we should try to do a few basic benchmarks on what the
performance impact would be for some typical use cases when all ops are
done column-by-column / all columns are stored as separate blocks (Jeff had
a branch at some point that made this optional). That would give us a better
idea of the (dis)advantages of the different proposals.
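
A very rough sketch of what such a basic benchmark could look like, using
plain NumPy arrays as a stand-in for blocks rather than actual pandas
internals. The function names, the array sizes, and the choice of addition as
the op are all assumptions for illustration; a real comparison would need to
go through the DataFrame code paths and cover more shapes and ops.

```python
import timeit

import numpy as np


def add_as_single_block(left, right):
    """Add two 2D arrays in one vectorized call (the 'one 2D block' case)."""
    return left + right


def add_column_by_column(left_cols, right_cols):
    """Add two lists of 1D arrays pairwise (the 'one block per column' case)."""
    return [lcol + rcol for lcol, rcol in zip(left_cols, right_cols)]


def run_benchmark(n_rows=1000, n_cols=100, repeat=20):
    """Return the average seconds per call for both strategies."""
    rng = np.random.default_rng(0)
    left = rng.standard_normal((n_rows, n_cols))
    right = rng.standard_normal((n_rows, n_cols))

    # Store each column as its own contiguous 1D array.
    left_cols = [left[:, i].copy() for i in range(n_cols)]
    right_cols = [right[:, i].copy() for i in range(n_cols)]

    # Sanity check: both strategies must produce the same values.
    block_result = add_as_single_block(left, right)
    col_result = np.column_stack(add_column_by_column(left_cols, right_cols))
    assert np.allclose(block_result, col_result)

    t_block = timeit.timeit(
        lambda: add_as_single_block(left, right), number=repeat
    )
    t_cols = timeit.timeit(
        lambda: add_column_by_column(left_cols, right_cols), number=repeat
    )
    return {"block": t_block / repeat, "column_by_column": t_cols / repeat}


timings = run_benchmark()
print(timings)
```

Varying `n_rows` and `n_cols` (for example, many short columns versus a few
long ones) would show where the per-column overhead starts to matter.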

Brock, can you give your thoughts about the idea of having _only_ 1D
Blocks? That could also solve a lot of the complexity of the internals (no
1D vs 2D) and have many of the advantages you mentioned in the first email,
but I don't think you really answered that aspect.

Joris