[Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap

Mon Jan 11 14:55:24 EST 2016

>
> I don't see alternative ways for pandas to have a truly healthy
> relationship with more general purpose array / scientific computing
> libraries without being able to add new pandas functionality in a
> clean way, and without requiring us to get patches accepted (and
> released) in NumPy or DyND.
>

Indeed, I think my disagreement is mostly about the order in which we
approach these problems.

> Can you clarify what aspects of this plan are disagreeable /
> contentious?

See my comments below.

> Are you arguing for pandas becoming more of a companion
> tool / user interface layer for NumPy or DyND?
>

Not quite. Pandas has some fantastic and highly usable data structures
(Series, DataFrame, Index). These certainly don't belong in NumPy or DyND.

However, the array-based ecosystem certainly could use improvements to
dtypes (e.g., datetime and categorical) and dtype-specific methods (e.g.,
for strings) just as much as pandas. I do firmly believe that pushing these
types of improvements upstream, rather than implementing them independently
for pandas, would yield benefits for the broader ecosystem. With the right
infrastructure, generalizing things to arrays is not much more work.

I'd like to see pandas itself focus more on the data structures and less on
the data types. This would let us share more work with the "general purpose
array / scientific computing libraries".

> 1) Introduce a proper (from a software engineering perspective)
> logical data type abstraction that models the way that pandas already
> works, but cleaning up all the mess (implicit upcasts, lack of a real
> "NA" scalar value, making pandas-specific methods like unique,
> factorize, match, etc. true "array methods")
>

New abstractions have a cost. A new logical data type abstraction is better
than no proper abstraction at all, but (in principle) one shared data type
abstraction should be enough.

A proper logical data type abstraction would be an improvement over the
current situation, but if there's a way we could introduce one less
abstraction (by improving things upstream in a general purpose array
library) that would help even more.
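
To make the quoted proposal concrete, here is a minimal sketch of what a
logical data type with a real "NA" scalar and true array methods might look
like. Everything in it (NAType, NA, Int64Array, from_sequence) is a
hypothetical name made up for illustration; it is not the pandas, NumPy, or
DyND API, and a real implementation would use native buffers rather than
Python lists.

```python
from dataclasses import dataclass
from typing import List


class NAType:
    """Hypothetical singleton standing in for a real "NA" scalar."""

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __repr__(self):
        return "NA"


NA = NAType()


@dataclass
class Int64Array:
    """Hypothetical logical integer type: physical values plus a validity mask.

    Missing entries are tracked in the mask, so the integers never need to be
    upcast to float to make room for a NaN.
    """

    values: List[int]
    valid: List[bool]

    @classmethod
    def from_sequence(cls, items):
        # Store a placeholder 0 wherever the input is NA; the mask records it.
        values = [0 if x is NA else int(x) for x in items]
        valid = [x is not NA for x in items]
        return cls(values, valid)

    def __getitem__(self, i):
        return self.values[i] if self.valid[i] else NA

    def unique(self):
        """Distinct non-missing values, in order of first appearance."""
        seen = []
        for v, ok in zip(self.values, self.valid):
            if ok and v not in seen:
                seen.append(v)
        return seen

    def factorize(self):
        """Return (codes, uniques); missing entries get code -1,
        mirroring the sentinel that pandas.factorize uses."""
        uniques = self.unique()
        position = {v: i for i, v in enumerate(uniques)}
        codes = [position[v] if ok else -1
                 for v, ok in zip(self.values, self.valid)]
        return codes, uniques


arr = Int64Array.from_sequence([3, NA, 1, 3])
codes, uniques = arr.factorize()
```

The point of the sketch is only that the mask, the NA scalar, and methods
such as unique and factorize live on the array type itself. If a general
purpose array library offered this, pandas would not need its own copy.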

For example, we could imagine pushing to make DyND the new core for pandas.
This could be enough of a push to make DyND generally useful -- I know it
still has a few kinks to work out.

> 4) Give pandas objects a real C API so that users can manipulate and
> create pandas objects with their own native (C/C++/Cython) code.
>

> 5) Yes, absolutely improve NumPy and DyND and transition to improved
> NumPy and DyND facilities as soon as they are available and shipped
>

I like the sound of both of these.