[Pandas-dev] How Far do we take ExtensionArrays?

Joris Van den Bossche jorisvandenbossche at gmail.com
Fri Feb 8 07:26:40 EST 2019


On Wed, 16 Jan 2019 at 18:16, Tom Augspurger <tom.augspurger88 at gmail.com>
wrote:

> This is something I've been mulling over the past few days: how much do we
> want ExtensionArrays to change pandas?
>
> [...]
>
> As another semi-example, users may be interested in storing some or all
> their data on a GPU in an ExtensionArray or arrays backed by GPU-memory. I
> suspect that some things work quite well currently, e.g.
> `Series.sort_values` will dispatch to the `ExtensionArray.argsort`, which
> can use a GPU-accelerated sorting algorithm. But other parts of pandas
> (anything in Cython, for example) won't necessarily work. How much are we
> willing to refactor pandas' internals to support something that's going to
> live outside pandas (as a GPU extension array likely would)?
>

To take a practical example: for a groupby operation, we dispatch to the
ExtensionArray for the factorization step, but the actual computation of the
grouped reductions is still done in Cython. Is that the kind of thing you
were thinking about?
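
A toy sketch of that two-step split, in plain Python rather than the actual
pandas code. The names `factorize` and `grouped_sum` below are illustrative
stand-ins, not pandas internals: the first is the part an array type can
provide itself, the second is a generic reduction that only ever sees integer
codes and plain numeric values.

```python
# Toy illustration only: plain-Python stand-ins, not pandas internals.

def factorize(keys):
    """Map each key to an integer code; return (codes, uniques).

    This is the step that an array type can implement on its own data.
    """
    positions = {}  # key -> code, in order of first appearance
    uniques = []
    codes = []
    for key in keys:
        if key not in positions:
            positions[key] = len(uniques)
            uniques.append(key)
        codes.append(positions[key])
    return codes, uniques


def grouped_sum(codes, values, ngroups):
    """Generic grouped reduction driven purely by the integer codes."""
    out = [0] * ngroups
    for code, value in zip(codes, values):
        out[code] += value
    return out


keys = ["a", "b", "a", "c", "b"]
vals = [1, 2, 3, 4, 5]

codes, uniques = factorize(keys)               # array-specific step
sums = grouped_sum(codes, vals, len(uniques))  # generic step
result = dict(zip(uniques, sums))
```

The point of the sketch is that the second function never touches the
original keys, but it does need the values in a form it understands, and
that is where an array living outside pandas runs into trouble.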


> Finally (and this may be a topic for another day) have people thought about
> how 3rd-party EAs fit in with the potential block manager rewrite? IIUC, one
> of the goals there was a stable C API to the memory inside a DataFrame. Does
> anyone know how that would work with an array that doesn't (or can't)
> implement the buffer protocol?
>

Third-party EAs certainly fit the idea of getting rid of blocks and having
just 1D arrays, I would say (we could extend the current numpy-backed
PandasArrays that are now only used in `.array`). But if the idea is to
rewrite the block manager in C/Cython, that might be more difficult.
However, if a future version of pandas were backed by Arrow, we wouldn't
necessarily need our own C API, as a reference to the Arrow table / arrays
might be sufficient? Of course that depends on how tightly we want to depend
on Arrow, as that might limit the extensibility with other backends.

On Mon, 4 Feb 2019 at 17:26, Uwe L. Korn <xhochy at gmail.com> wrote:

> Hello Tom,
>
> overall I really like the concept of ExtensionArrays but for more advanced
> usage I think there is still a lot to do. At the moment, an implementer is
> quite well off when the ExtensionArray can be coerced into a numpy array.
> Once you have data that is not well represented by a numpy array, you need
> to develop many more algorithms.
>

I think that is more or less to be expected. All our internal algorithms
are based on numpy arrays. I think it would be an interesting idea to see
if we can/want to expose some of our algorithms to external users (e.g.
external ExtensionArray implementors). But even if we do that, it wouldn't
really help for the fletcher case, given the different memory layout.
A shorter-term idea that I would find interesting is to see to what extent
xtensor could be used for the algorithms.

Joris


> For fletcher this has been a major hurdle for me (and the reason I'm not
> implementing so much). This might also just be because my backing library
> (Apache Arrow) is still missing a lot of numerical operations. I hope to
> have some time in the next months to work more on this, and then we can see
> how many issues pop up. In the end, though, I would like to avoid coercing
> to NumPy arrays as much as possible, as the conversion of arrays with nulls
> adds some computational overhead.
>