[Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap

Mon Jan 11 18:04:58 EST 2016

I am in favor of Wes's refactoring, but for some slightly different
reasons.

I am including some in-line comments.

On Mon, Jan 11, 2016 at 2:55 PM, Stephan Hoyer <shoyer at gmail.com> wrote:

>> I don't see alternative ways for pandas to have a truly healthy
>> relationship with more general purpose array / scientific computing
>> libraries without being able to add new pandas functionality in a
>> clean way, and without requiring us to get patches accepted (and
>> released) in NumPy or DyND.
>>
>
> Indeed, I think my disagreement is mostly about the order in which we
> approach these problems.
>

I agree here. I had started on *some* of this to enable swapping numpy for
DyND in order to support IntNA (all in Python; the fundamental change was
to provide an API layer to the back-end).

>
>
>> Can you clarify what aspects of this plan are disagreeable /
>> contentious?
>
>
> See my comments below.
>
>
>> Are you arguing for pandas becoming more of a companion
>> tool / user interface layer for NumPy or DyND?
>>
>
> Not quite. Pandas has some fantastic and highly useable data (Series,
> DataFrame, Index). These certainly don't belong in NumPy or DyND.
>
> However, the array-based ecosystem certainly could use improvements to
> dtypes (e.g., datetime and categorical) and dtype specific methods (e.g.,
> for strings) just as much as pandas. I do firmly believe that pushing these
> types of improvements upstream, rather than implementing them independently
> for pandas, would yield benefits for the broader ecosystem. With the right
> infrastructure, generalizing things to arrays is not much more work.
>

I don't think either Wes or I disagree here at all. The problem was (and is)
the pace of change in the underlying libraries. It is simply too slow
for pandas development efforts.

I think the pandas efforts (and those of other libraries) can result in more
powerful fundamental libraries that get pushed upstream. However, it would
not benefit ANYONE to slow down downstream efforts. I am not sure why you
suggest that we WAIT for the upstream libraries to change. We have been
waiting forever for that. Now we have a concrete implementation of certain
data types that are useful. They (upstream) can take this and build on it
(or throw it away and make a better one, or whatever). But I don't think it
benefits anyone to WAIT for someone to change numpy first. Look at how long
it took them to (partially) fix datetimes.

xarray in particular has done the same thing to pandas, e.g. you have added
additional selection operators and syntax (e.g. passing dicts of named
axes). These changes are in fact propagating to pandas. This has taken time
(but much, much less than it took for any of pandas' changes to reach numpy).
Further, look at how long you have advocated (correctly) for labeled arrays
in numpy (for which we are still waiting).

>
> I'd like to see pandas itself focus more on the data-structures and less
> on the data types. This would let us share more work with the "general
> purpose array / scientific computing libraries".
>

Pandas IS about specifying the correct data types. It is simply incorrect
to decouple this problem from the data-structures. A lot of effort over the
years has gone into making all dtypes play nicely with each other and
within pandas.

>> 1) Introduce a proper (from a software engineering perspective)
>> logical data type abstraction that models the way that pandas already
>> works, but cleaning up all the mess (implicit upcasts, lack of a real
>> "NA" scalar value, making pandas-specific methods like unique,
>> factorize, match, etc. true "array methods")
>>
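
For readers unfamiliar with the "implicit upcasts" and the pandas-specific
array functions named in point 1), the following is a minimal sketch of the
behavior, assuming the default NumPy-backed integer dtype. The variable
names are illustrative only and are not taken from the thread.

```python
import pandas as pd

# An integer Series backed by a NumPy integer array.
s = pd.Series([1, 2, 3])

# Reindexing introduces a missing value for label 3. The default integer
# dtype has no NA representation, so the data is silently upcast to float.
r = s.reindex([0, 1, 2, 3])
print(s.dtype, r.dtype)

# factorize is exposed as a pandas-level function rather than as a method
# of the underlying array type.
codes, uniques = pd.factorize(pd.Series(["b", "a", "b"]))
print(list(codes), list(uniques))
```

The dtype change happens without any explicit cast in user code, which is
the kind of "mess" the proposed logical data type abstraction is meant to
clean up.
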
>
> New abstractions have a cost. A new logical data type abstraction is
> better than no proper abstraction at all, but (in principle), one data type
> abstraction should be enough to share.
>
>

> A proper logical data type abstraction would be an improvement over the
> current situation, but if there's a way we could introduce one less
> abstraction (by improving things upstream in a general purpose array
> library) that would help even more.
>
>
This is just pushing a problem upstream, which ultimately, given the track
record of numpy, won't be solved at all. We will be here 1 year from now
with the exact same discussion. Why are we waiting on upstream for
anything? As I said above, if something is created which upstream finds
useful on a general level, great. The great cost here is time.

> For example, we could imagine pushing to make DyND the new core for
> pandas. This could be enough of a push to make DyND generally useful -- I
> know it still has a few kinks to work out.
>
>
Maybe, but DyND has to have full compat with what is currently out there
(soonish). Then I agree this could be possible. But wouldn't it be even
better for pandas to be able to swap back-ends? Why limit ourselves to a
particular back-end if it's not that difficult?
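
To make the idea of swappable back-ends concrete, here is a rough sketch of
what an API layer between the front end and the storage could look like.
Everything in it is hypothetical: the names IntBackend, SentinelBackend,
MaskedBackend and IntSeries are invented for illustration, and the two
back-ends are standard-library stand-ins for NumPy and DyND, not the actual
prototype mentioned earlier in this message.

```python
from array import array
from typing import List, Optional, Protocol


class IntBackend(Protocol):
    """Hypothetical back-end API: the only surface the front end touches."""

    def from_values(self, values: List[Optional[int]]) -> object: ...
    def to_values(self, storage: object) -> List[Optional[int]]: ...
    def total(self, storage: object) -> int: ...


class SentinelBackend:
    """Stores int64 values and reserves the minimum int64 as the NA marker."""

    NA = -(2 ** 63)

    def from_values(self, values):
        return array("q", [self.NA if v is None else v for v in values])

    def to_values(self, storage):
        return [None if v == self.NA else v for v in storage]

    def total(self, storage):
        return sum(v for v in storage if v != self.NA)


class MaskedBackend:
    """Stores int64 values plus a separate mask marking missing entries."""

    def from_values(self, values):
        data = array("q", [0 if v is None else v for v in values])
        mask = [v is None for v in values]
        return data, mask

    def to_values(self, storage):
        data, mask = storage
        return [None if m else v for v, m in zip(data, mask)]

    def total(self, storage):
        data, mask = storage
        return sum(v for v, m in zip(data, mask) if not m)


class IntSeries:
    """Front end: knows nothing about how the integers or NAs are stored."""

    def __init__(self, values, backend: IntBackend):
        self._backend = backend
        self._storage = backend.from_values(values)

    def sum(self):
        return self._backend.total(self._storage)

    def tolist(self):
        return self._backend.to_values(self._storage)


values = [1, None, 3]
results = [
    (IntSeries(values, backend).sum(), IntSeries(values, backend).tolist())
    for backend in (SentinelBackend(), MaskedBackend())
]
print(results)
```

Both back-ends give the same answers through the same front end, so an
integer type with NA support could be provided by whichever library gets
there first without changing user-facing code.
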

>> 4) Give pandas objects a real C API so that users can manipulate and
>> create pandas objects with their own native (C/C++/Cython) code.
>>
>
>> 5) Yes, absolutely improve NumPy and DyND and transition to improved
>> NumPy and DyND facilities as soon as they are available and shipped
>>
>
> I like the sound of both of these.
>

Further, you made a point above:

> You are right that pandas has started to supplant numpy as a high level API
> for data analysis, but of course the robust (and often numpy based) Python
> ecosystem is part of what has made pandas so successful. In practice,
> ecosystem projects often want to work with more primitive objects than
> series/dataframes in their internal data structures and without numpy this
> becomes more difficult. For example, how do you concatenate a list of
> categoricals? If these were numpy arrays, we could use np.concatenate, but
> the current implementation of categorical would require a custom solution.
> First class compatibility with pandas is harder when pandas data cannot be
> used with a full ndarray API.
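
As an illustration of the "custom solution" the quoted paragraph refers to,
the sketch below concatenates two categoricals with different category sets.
It assumes a pandas release that ships union_categoricals in
pandas.api.types, a helper that was added to pandas after this thread and
is not something either author proposes here.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Two categoricals whose category sets only partly overlap.
a = pd.Categorical(["low", "high", "low"])
b = pd.Categorical(["mid", "high"])

# The integer codes of `a` and `b` refer to different category lists, so
# they cannot simply be glued together; the helper merges the categories
# and recodes the values consistently.
combined = union_categoricals([a, b])
print(list(combined))
print(list(combined.categories))
```

The need for a dedicated helper, rather than a generic array concatenation,
is exactly the friction being described in the quoted text.
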

I disagree entirely here. I think that Series/DataFrame ARE becoming
primitive objects. Look at seaborn, statsmodels, and xarray. These are
first-class users of these structures, and they need the additional
metadata attached.

Yes, categoricals are useful in numpy, and they should support them. But lots
of libraries can simply use pandas and do lots of really useful stuff.
Why reinvent the wheel and use numpy when you have DataFrames?

From a user's point of view, I don't think they even care about numpy (or
whatever drives pandas). Pandas solves a very general problem of working
with labeled data.

Jeff