[Pandas-dev] Rewriting some of the internals of pandas in C/C++? / Roadmap

Wes McKinney wesmckinn at gmail.com
Mon Jan 11 19:23:51 EST 2016


On Mon, Jan 11, 2016 at 3:04 PM, Jeff Reback <jeffreback at gmail.com> wrote:
> I am in favor of the Wes refactoring, but for some slightly different
> reasons.
>
> I am including some in-line comments.
>
> On Mon, Jan 11, 2016 at 2:55 PM, Stephan Hoyer <shoyer at gmail.com> wrote:
>>>
>>> I don't see alternative ways for pandas to have a truly healthy
>>> relationship with more general purpose array / scientific computing
>>> libraries without being able to add new pandas functionality in a
>>> clean way, and without requiring us to get patches accepted (and
>>> released) in NumPy or DyND.
>>
>>
>> Indeed, I think my disagreement is mostly about the order in which we
>> approach these problems.
>
>
> I agree here. I had started on *some* of this to enable swapping between
> NumPy and DyND back-ends to support IntNA (all in Python,
> but the fundamental change was to provide an API layer over the back-end).
>
>>
>>
>>>
>>> Can you clarify what aspects of this plan are disagreeable /
>>> contentious?
>>
>>
>> See my comments below.
>>
>>>
>>> Are you arguing for pandas becoming more of a companion
>>> tool / user interface layer for NumPy or DyND?
>>
>>
>> Not quite. Pandas has some fantastic and highly usable data structures
>> (Series, DataFrame, Index). These certainly don't belong in NumPy or DyND.
>>
>> However, the array-based ecosystem certainly could use improvements to
>> dtypes (e.g., datetime and categorical) and dtype specific methods (e.g.,
>> for strings) just as much as pandas. I do firmly believe that pushing these
>> types of improvements upstream, rather than implementing them independently
>> for pandas, would yield benefits for the broader ecosystem. With the right
>> infrastructure, generalizing things to arrays is not much more work.
>
>
> I don't think either Wes or I disagrees here at all. The problem was (and is)
> the pace of change in the underlying libraries. It is simply too slow
> for pandas development efforts.
>
> I think the pandas efforts (and those of other libraries) can result in more
> powerful fundamental libraries
> that get pushed upstream. However, it would not benefit ANYONE to slow down
> downstream efforts. I am not sure why you suggest that we WAIT for the
> upstream libraries to change; we have been waiting forever for that. Now we
> have a concrete implementation of certain data types that are useful. They
> (upstream) can take
> this and build on it (or throw it away and make a better one, or whatever). But
> I don't think it benefits anyone to WAIT for someone to change numpy first.
> Look at how long it took them to (partially) fix datetimes.
>
> xarray in particular has done the same thing to pandas, e.g. you have added
> additional selection operators and syntax (e.g. passing dicts of named
> axes). These changes are in fact propagating to pandas. This has taken time
> (but much, much less than it took for any of pandas's changes to reach numpy).
> Further, look at how long you have advocated (correctly) for labeled arrays
> in numpy (for which we are still waiting).
>
>>
>>
>> I'd like to see pandas itself focus more on the data-structures and less
>> on the data types. This would let us share more work with the "general
>> purpose array / scientific computing libraries".
>>
> Pandas IS about specifying the correct data types. It is simply incorrect to
> decouple this problem from the data-structures. A lot of effort over the
> years has gone into
> making all dtypes play nicely with each other and within pandas.
>
>>>
>>> 1) Introduce a proper (from a software engineering perspective)
>>> logical data type abstraction that models the way that pandas already
>>> works, while cleaning up all the mess (implicit upcasts, lack of a real
>>> "NA" scalar value, making pandas-specific methods like unique,
>>> factorize, match, etc. true "array methods")
>>
>>
>> New abstractions have a cost. A new logical data type abstraction is
>> better than no proper abstraction at all, but, in principle, one data type
>> abstraction should be enough for the ecosystem to share.
>>
>
>>
>> A proper logical data type abstraction would be an improvement over the
>> current situation, but if there's a way we could introduce one less
>> abstraction (by improving things upstream in a general purpose array
>> library) that would help even more.
>>
>
> This is just pushing a problem upstream, which ultimately, given the track
> record of numpy, won't be solved at all. We will be here 1 year from now
> with the exact same discussion. Why are we waiting on upstream for anything?
> As I said above, if something is created which upstream finds useful on a
> general level, great. The great cost here is time.
>
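
To make the terminology concrete, here is a minimal sketch, in plain Python
with hypothetical names (this is not the actual pandas internals), of a
logical data type layer that owns the dtype and NA semantics, exposes
unique/factorize as true array methods, and keeps the physical storage
swappable underneath:

import numpy as np


class LogicalArray:
    """Hypothetical pairing of physical storage with a logical dtype."""

    def __init__(self, values, logical_dtype):
        # Physical storage could be a NumPy array today, something else later.
        self._values = np.asarray(values)
        self.dtype = logical_dtype  # e.g. "int64[nullable]", "categorical"

    def unique(self):
        # Delegates to the physical back-end; callers never touch NumPy.
        return LogicalArray(np.unique(self._values), self.dtype)

    def factorize(self):
        # Integer codes plus unique values, as a method of the array itself
        # rather than a standalone pandas function.
        uniques, codes = np.unique(self._values, return_inverse=True)
        return codes, LogicalArray(uniques, self.dtype)


arr = LogicalArray([3, 1, 3, 2], "int64[nullable]")
codes, uniques = arr.factorize()  # codes -> array([2, 0, 2, 1])
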
>>
>> For example, we could imagine pushing to make DyND the new core for
>> pandas. This could be enough of a push to make DyND generally useful -- I
>> know it still has a few kinks to work out.
>>
>
> maybe, but DyND has to have full compatibility with what is currently out there
> (soonish). Then I agree this could be possible. But wouldn't it be even
> better
> for pandas to be able to swap back-ends? Why limit ourselves to a particular
> backend if it's not that difficult?
>
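
A toy sketch of the back-end-swapping idea above (a hypothetical API, not
pandas code): pandas-level code goes through a tiny storage factory, and a
NumPy or DyND constructor can be registered behind it without callers changing:

import numpy as np

_BACKENDS = {}


def register_backend(name, constructor):
    """Register a callable that turns a Python sequence into backing storage."""
    _BACKENDS[name] = constructor


def make_storage(data, backend="numpy"):
    # Block/column construction goes through here instead of calling
    # np.asarray directly, so the physical back-end stays pluggable.
    return _BACKENDS[backend](data)


register_backend("numpy", np.asarray)
# If/when DyND is installed and mature enough, it could be registered too,
# e.g. register_backend("dynd", nd.array) -- name shown for illustration only.

storage = make_storage([1, 2, 3])  # NumPy-backed by default
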

I think Jeff and I are on the same page here. 5 years ago we were
having the *exact same* discussions around NumPy and adding new data
type functionality. 5 years is a staggering amount of time in open
source. It was less than 5 years between pandas not existing and being
a super popular project with 2/3 of a best-selling O'Reilly book
written about it. To wit, DyND exists in large part because of the
difficulty in making progress within NumPy.

Now, as 5 years ago, I think we should be acting in the best interests
of pandas users, and what I've been describing is intended as a
straightforward (though definitely labor intensive) and relatively
low-risk plan that will "future-proof" the pandas user API for at
least the next few years, and probably much longer. If we find that
enabling some internals to use DyND is the right choice, we can do
that in a non-invasive way while carefully minding data
interoperability. Meaningful performance benefits would be a clear
motivation.

To be 100% open and transparent (in the spirit of pandas's new
governance docs): Before committing to using DyND in any binding way
(i.e. required, as opposed to opt-in) in pandas, I'd really like to
see more evidence from 3rd parties without direct financial interest
(i.e. employment or equity from Continuum) that DyND is "the future of
Python array computing"; in the absence of significant user and
community code contribution, it still feels like a political quagmire
left over from the Continuum-Enthought rift in 2011.

- Wes

>>>
>>> 4) Give pandas objects a real C API so that users can manipulate and
>>> create pandas objects with their own native (C/C++/Cython) code.
>>
>>
>>> 5) Yes, absolutely improve NumPy and DyND and transition to improved
>>> NumPy and DyND facilities as soon as they are available and shipped
>>
>>
>> I like the sound of both of these.
>
>
>
> Further, you made a point above:
>
>> You are right that pandas has started to supplant numpy as a high level
>> API for data analysis, but of course the robust (and often numpy based)
>> Python ecosystem is part of what has made pandas so successful. In practice,
>> ecosystem projects often want to work with more primitive objects than
>> series/dataframes in their internal data structures, and without numpy this
>> becomes more difficult. For example, how do you concatenate a list of
>> categoricals? If these were numpy arrays, we could use np.concatenate, but
>> the current implementation of categorical would require a custom solution.
>> First-class compatibility with pandas is harder when pandas data cannot be
>> used with a full ndarray API.
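
A small illustration of the concatenation point above (assuming a reasonably
recent pandas; exact dtype-preservation behaviour varies by version):

import numpy as np
import pandas as pd

a = pd.Categorical(["x", "y"], categories=["x", "y", "z"])
b = pd.Categorical(["z", "x"], categories=["x", "y", "z"])

# Falling back to numpy loses the dtype: the result is a plain ndarray of
# labels, with no memory of the categories or their order.
flat = np.concatenate([np.asarray(a), np.asarray(b)])

# Staying inside pandas keeps the categorical dtype, but only via pandas
# containers (Series + concat) rather than a plain ndarray API.
combined = pd.concat([pd.Series(a), pd.Series(b)], ignore_index=True)
print(combined.dtype)  # category
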
>
>
> I disagree entirely here. I think that Series/DataFrame ARE becoming
> primitive objects. Look at seaborn, statsmodels, and xarray. These are
> first-class users of these structures, which need the additional meta-data
> attached.
>
> Yes, categoricals are useful in numpy, and numpy should support them. But lots
> of libraries can simply use pandas and do lots of really useful stuff.
> However, why reinvent the wheel with numpy when you have DataFrames?
>
> From a user's point of view, I don't think they even care about numpy (or
> whatever drives pandas). Pandas solves a very general problem of working with
> labeled data.
>
> Jeff

