[Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap

Stephan Hoyer shoyer at gmail.com
Tue Jan 12 19:06:55 EST 2016


I think I'm mostly on the same page as well. Five years has certainly been
too long.

I agree that it would be premature to commit to using DyND in a binding way
in pandas. A lot seems to be up in the air with regards to dtypes in Python
right now (yes, particularly from projects sponsored by Continuum).

So I would advocate for proceeding with the refactor for now (which will
have numerous other benefits), and seeing how the situation evolves. If it
seems like we are in a plausible position to unify the dtype system with a
tool like DyND, then let's seriously consider that down the road. Either
way, explicit interfaces (e.g., to_numpy(), to_dynd()) will help.
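(As a rough sketch of what I mean by explicit interfaces -- every name
below is hypothetical, not an actual pandas API:)

```python
import numpy as np

class LogicalArray:
    """Hypothetical pandas-internal array: it *has* storage and a mask,
    rather than *being* a numpy array, and converts only explicitly."""

    def __init__(self, values, mask=None):
        self._values = np.asarray(values)
        # the mask marks missing entries -- a real NA, independent of dtype
        self._mask = (np.zeros(len(self._values), dtype=bool)
                      if mask is None else np.asarray(mask, dtype=bool))

    def to_numpy(self):
        # explicit handoff to the numpy world; NA forces a float upcast here
        if self._mask.any():
            out = self._values.astype(float)
            out[self._mask] = np.nan
            return out
        return self._values.copy()

    # a to_dynd() would live alongside, converting to a DyND array instead

arr = LogicalArray([1, 2, 3], mask=[False, True, False])
print(arr.to_numpy())  # 1.0, nan, 3.0
```

The point is just that the conversion is a named method the user (or a
downstream library) calls, not an implicit cast.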

On Mon, Jan 11, 2016 at 4:23 PM, Wes McKinney <wesmckinn at gmail.com> wrote:

> On Mon, Jan 11, 2016 at 3:04 PM, Jeff Reback <jeffreback at gmail.com> wrote:
> > I am in favor of the Wes refactoring, but for some slightly different
> > reasons.
> >
> > I am including some in-line comments.
> >
> > On Mon, Jan 11, 2016 at 2:55 PM, Stephan Hoyer <shoyer at gmail.com> wrote:
> >>>
> >>> I don't see alternative ways for pandas to have a truly healthy
> >>> relationship with more general purpose array / scientific computing
> >>> libraries without being able to add new pandas functionality in a
> >>> clean way, and without requiring us to get patches accepted (and
> >>> released) in NumPy or DyND.
> >>
> >>
> >> Indeed, I think my disagreement is mostly about the order in which we
> >> approach these problems.
> >
> >
> > I agree here. I had started on *some* of this to enable a swappable
> > numpy/DyND back-end to support IntNA (all in python, but the
> > fundamental change was to provide an API layer to the back-end).
> >
> >>
> >>
> >>>
> >>> Can you clarify what aspects of this plan are disagreeable /
> >>> contentious?
> >>
> >>
> >> See my comments below.
> >>
> >>>
> >>> Are you arguing for pandas becoming more of a companion
> >>> tool / user interface layer for NumPy or DyND?
> >>
> >>
> >> Not quite. Pandas has some fantastic and highly usable data structures
> >> (Series, DataFrame, Index). These certainly don't belong in NumPy or DyND.
> >>
> >> However, the array-based ecosystem certainly could use improvements to
> >> dtypes (e.g., datetime and categorical) and dtype-specific methods
> >> (e.g., for strings) just as much as pandas. I do firmly believe that
> >> pushing these types of improvements upstream, rather than implementing
> >> them independently for pandas, would yield benefits for the broader
> >> ecosystem. With the right infrastructure, generalizing things to arrays
> >> is not much more work.
> >
> >
> > I don't think Wes nor I disagree here at all. The problem was (and is)
> > the pace of change in the underlying libraries. It is simply too slow
> > for pandas development efforts.
> >
> > I think the pandas efforts (and other libraries) can result in more
> > powerful fundamental libraries that get pushed upstream. However, it
> > would not benefit ANYONE to slow down downstream efforts. I am not sure
> > why you suggest that we WAIT for the upstream libraries to change? We
> > have been waiting forever for that. Now we have a concrete
> > implementation of certain data types that are useful. They (upstream)
> > can take this and build on it (or throw it away and make a better one,
> > or whatever). But I don't think it benefits anyone to WAIT for someone
> > to change numpy first. Look at how long it took them to (partially) fix
> > datetimes.
> >
> > xarray in particular has done the same thing to pandas, e.g. you have
> > added additional selection operators and syntax (e.g. passing dicts of
> > named axes). These changes are in fact propagating to pandas. This has
> > taken time (but much, much less than it took for any of pandas's
> > changes to reach numpy). Further, look at how long you have advocated
> > (correctly) for labeled arrays in numpy (for which we are still
> > waiting).
> >
> >>
> >>
> >> I'd like to see pandas itself focus more on the data-structures and less
> >> on the data types. This would let us share more work with the "general
> >> purpose array / scientific computing libraries".
> >>
> > Pandas IS about specifying the correct data types. It is simply
> > incorrect to decouple this problem from the data structures. A lot of
> > effort over the years has gone into making all dtypes play nice with
> > each other and within pandas.
> >
> >>>
> >>> 1) Introduce a proper (from a software engineering perspective)
> >>> logical data type abstraction that models the way that pandas already
> >>> works, but cleaning up all the mess (implicit upcasts, lack of a real
> >>> "NA" scalar value, making pandas-specific methods like unique,
> >>> factorize, match, etc. true "array methods")
> >>
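(A minimal sketch of what point 1 could look like in code -- the class and
method names below are hypothetical, not an existing API:)

```python
import numpy as np

class LogicalDtype:
    """Hypothetical logical dtype: describes semantics, not storage."""
    name = "int64[na]"
    na_value = None  # a real NA scalar, not a dtype-dependent sentinel

class LogicalTypeArray:
    """Hypothetical array that carries pandas-specific operations
    (unique, factorize, ...) as true array methods."""

    def __init__(self, data):
        self._data = list(data)

    def unique(self):
        # order-preserving uniques, as pandas already does
        seen, out = set(), []
        for v in self._data:
            if v not in seen:
                seen.add(v)
                out.append(v)
        return out

    def factorize(self):
        # codes index into the array of uniques; -1 would mark NA
        uniques = self.unique()
        lookup = {v: i for i, v in enumerate(uniques)}
        codes = np.array([lookup[v] for v in self._data])
        return codes, uniques

arr = LogicalTypeArray(["a", "b", "a", "c"])
codes, uniques = arr.factorize()
print(codes.tolist(), uniques)  # [0, 1, 0, 2] ['a', 'b', 'c']
```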
> >>
> >> New abstractions have a cost. A new logical data type abstraction is
> >> better than no proper abstraction at all, but (in principle) one data
> >> type abstraction should be enough to share.
> >>
> >
> >>
> >> A proper logical data type abstraction would be an improvement over the
> >> current situation, but if there's a way we could introduce one less
> >> abstraction (by improving things upstream in a general purpose array
> >> library) that would help even more.
> >>
> >
> > This is just pushing a problem upstream, which ultimately, given the
> > track record of numpy, won't be solved at all. We will be here a year
> > from now having the exact same discussion. Why are we waiting on
> > upstream for anything? As I said above, if something is created which
> > upstream finds useful on a general level, great. The great cost here is
> > time.
> >
> >>
> >> For example, we could imagine pushing to make DyND the new core for
> >> pandas. This could be enough of a push to make DyND generally useful --
> >> I know it still has a few kinks to work out.
> >>
> >
> > Maybe, but DyND has to have full compatibility with what is currently
> > out there (soonish). Then I agree this could be possible. But wouldn't
> > it be even better for pandas to be able to swap back-ends? Why limit
> > ourselves to a particular backend if it's not that difficult?
> >
>
> I think Jeff and I are on the same page here. 5 years ago we were
> having the *exact same* discussions around NumPy and adding new data
> type functionality. 5 years is a staggering amount of time in open
> source. It was less than 5 years between pandas not existing and being
> a super popular project with 2/3 of a best-selling O'Reilly book
> written about it. To wit, DyND exists in large part because of the
> difficulty in making progress within NumPy.
>
> Now, as 5 years ago, I think we should be acting in the best interests
> of pandas users, and what I've been describing is intended as a
> straightforward (though definitely labor intensive) and relatively
> low-risk plan that will "future-proof" the pandas user API for at
> least the next few years, and probably much longer. If we find that
> enabling some internals to use DyND is the right choice, we can do
> that in a non-invasive way while carefully minding data
> interoperability. Meaningful performance benefits would be a clear
> motivation.
>
> To be 100% open and transparent (in the spirit of pandas's new
> governance docs): Before committing to using DyND in any binding way
> (i.e. required, as opposed to opt-in) in pandas, I'd really like to
> see more evidence from 3rd parties without direct financial interest
> (i.e. employment or equity from Continuum) that DyND is "the future of
> Python array computing"; in the absence of significant user and
> community code contribution, it still feels like a political quagmire
> leftover from the Continuum-Enthought rift in 2011.
>
> - Wes
>
> >>>
> >>> 4) Give pandas objects a real C API so that users can manipulate and
> >>> create pandas objects with their own native (C/C++/Cython) code.
> >>
> >>
> >>> 5) Yes, absolutely improve NumPy and DyND and transition to improved
> >>> NumPy and DyND facilities as soon as they are available and shipped
> >>
> >>
> >> I like the sound of both of these.
> >
> >
> >
> > Further, you made a point above:
> >
> >> You are right that pandas has started to supplant numpy as a high-level
> >> API for data analysis, but of course the robust (and often numpy-based)
> >> Python ecosystem is part of what has made pandas so successful. In
> >> practice, ecosystem projects often want to work with more primitive
> >> objects than series/dataframes in their internal data structures, and
> >> without numpy this becomes more difficult. For example, how do you
> >> concatenate a list of categoricals? If these were numpy arrays, we
> >> could use np.concatenate, but the current implementation of categorical
> >> would require a custom solution. First-class compatibility with pandas
> >> is harder when pandas data cannot be used with a full ndarray API.
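(To make the categorical-concatenation point concrete -- a sketch using the
pandas API as it stands: np.concatenate falls back to a plain object array,
while a hand-rolled union of categories keeps the categorical structure:)

```python
import numpy as np
import pandas as pd

a = pd.Categorical(["x", "y"], categories=["x", "y", "z"])
b = pd.Categorical(["z", "x"], categories=["x", "y", "z"])

# np.concatenate coerces to an object array: the categorical dtype is lost
flat = np.concatenate([np.asarray(a), np.asarray(b)])
print(flat.dtype)  # object

# the "custom solution": union the categories, then remap each block's codes
cats = a.categories.union(b.categories)
codes = np.concatenate([cats.get_indexer(np.asarray(a)),
                        cats.get_indexer(np.asarray(b))])
merged = pd.Categorical.from_codes(codes, categories=cats)
print(list(merged))  # ['x', 'y', 'z', 'x']
```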
> >
> >
> > I disagree entirely here. I think that Series/DataFrame ARE becoming
> > primitive objects. Look at seaborn, statsmodels, and xarray: these are
> > first-class users of these structures, which need the additional
> > metadata attached.
> >
> > Yes, categoricals are useful in numpy, and numpy should support them.
> > But lots of libraries can simply use pandas and do lots of really
> > useful stuff. However, why reinvent the wheel with numpy when you have
> > DataFrames?
> >
> > From a user's point of view, I don't think they even care about numpy
> > (or whatever drives pandas). Pandas solves a very general problem of
> > working with labeled data.
> >
> > Jeff
>