[Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap

Jeff Reback jeffreback at gmail.com
Wed Mar 23 17:16:47 EDT 2016


https://github.com/apache/arrow/tree/master/python/pyarrow

Looking pretty good. I assume there is a notion of an extension dtype (to support dtypes/schemas that other systems may not have), in order to implement things like categorical / datetime with tz, etc.?

then libpandas becomes a pretty thin wrapper around this


> On Mar 16, 2016, at 12:44 PM, Stephan Hoyer <shoyer at gmail.com> wrote:
> 
> After taking a step back and starting a new job, I am coming around to Wes's perspective here.
> 
> The lack of integer-NAs and the overly complex/unpredictable internal memory model are major shortcomings (along with the indexing API) for using pandas in production software.
> 
> Compatibility with the rest of the SciPy ecosystem is important, but it shouldn't hold pandas back. There's no good reason why pandas needs to be built on a library for strided n-dimensional arrays -- that's a lot more complexity than we need.
> 
> Best,
> Stephan
> 
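The integer-NA issue Stephan mentions is easy to demonstrate (a sketch with the default pandas dtypes, not the newer nullable extension types):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3], dtype="int64")
# introducing a missing value silently upcasts the whole Series to float64,
# because NaN is the only available missing-value sentinel for this dtype
s2 = s.reindex([0, 1, 2, 3])
```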
> 
>> On Tue, Jan 12, 2016 at 5:42 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
>> On Tue, Jan 12, 2016 at 4:06 PM, Stephan Hoyer <shoyer at gmail.com> wrote:
>> > I think I'm mostly on the same page as well. Five years has certainly been
>> > too long.
>> >
>> > I agree that it would be premature to commit to using DyND in a binding way
>> > in pandas. A lot seems to be up in the air with regards to dtypes in Python
>> > right now (yes, particularly from projects sponsored by Continuum).
>> >
>> > So I would advocate for proceeding with the refactor for now (which will
>> > have numerous other benefits), and see how the situation evolves. If it
>> > seems like we are in a plausible position to unify the dtype system with a
>> > tool like DyND, then let's seriously consider that down the road. Either
>> > way, explicit interfaces (e.g., to_numpy(), to_dynd()) will help.
>> >
>> 
>> +1 -- I think our long term goal should be to have a common physical
>> memory representation. If pandas internally stays slightly malleable
>> (in a non-user-visible-way) we can conform to a standard (presuming
>> one develops) with less user-land disruption. If a standard does not
>> develop we can just shrug our shoulders and do what's best for pandas.
>> We'll have to think about how this will affect pandas's future C API
>> (zero-copy interop guarantees): we might make the C API in the first
>> release more clearly not-for-production use.
>> 
>> Aside: There doesn't even seem to be consensus at the moment on
>> missing data representation. Sentinels, for example, cause
>> interoperability problems with ODBC / databases and Apache ecosystem
>> projects (e.g. HDFS file formats, Thrift, Spark, Kafka, etc.). If we
>> build a C interface to Avro or Parquet in pandas right now we'll have
>> to convert bitmasks to pandas's bespoke sentinels. To be clear, R has
>> this problem too. I see good arguments for even nixing NaN in floating
>> point arrays, as heretical as that might sound. Ironically I used to
>> be in favor of sentinels, but I realized it was an isolationist view.
>> 
>> -W
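The sentinel-versus-bitmask distinction can be sketched in a few lines (a hypothetical toy representation, not any library's actual memory layout):

```python
import numpy as np

# sentinel style (pandas, R): a magic value stored inside the data itself
sentinel = np.array([1.0, np.nan, 3.0])        # NaN means "missing"

# bitmask style (Arrow, Parquet, Avro): data plus a separate validity mask,
# so any dtype (including integers) can carry nulls without upcasting
values = np.array([1, 0, 3], dtype=np.int64)   # payload in the null slot is arbitrary
valid = np.array([True, False, True])

# the two representations agree at the non-null positions
assert np.array_equal(sentinel[valid], values[valid].astype(float))
```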
>> 
>> > On Mon, Jan 11, 2016 at 4:23 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
>> >>
>> >> On Mon, Jan 11, 2016 at 3:04 PM, Jeff Reback <jeffreback at gmail.com> wrote:
>> >> > I am in favor of the Wes refactoring, but for some slightly different
>> >> > reasons.
>> >> >
>> >> > I am including some in-line comments.
>> >> >
>> >> > On Mon, Jan 11, 2016 at 2:55 PM, Stephan Hoyer <shoyer at gmail.com> wrote:
>> >> >>>
>> >> >>> I don't see alternative ways for pandas to have a truly healthy
>> >> >>> relationship with more general purpose array / scientific computing
>> >> >>> libraries without being able to add new pandas functionality in a
>> >> >>> clean way, and without requiring us to get patches accepted (and
>> >> >>> released) in NumPy or DyND.
>> >> >>
>> >> >>
>> >> >> Indeed, I think my disagreement is mostly about the order in which we
>> >> >> approach these problems.
>> >> >
>> >> >
>> >> > I agree here. I had started on *some* of this, to enable swapping
>> >> > numpy for DyND to support IntNA (all in Python, but the fundamental
>> >> > change was to provide an API layer to the back-end).
>> >> >
>> >> >>
>> >> >>
>> >> >>>
>> >> >>> Can you clarify what aspects of this plan are disagreeable /
>> >> >>> contentious?
>> >> >>
>> >> >>
>> >> >> See my comments below.
>> >> >>
>> >> >>>
>> >> >>> Are you arguing for pandas becoming more of a companion
>> >> >>> tool / user interface layer for NumPy or DyND?
>> >> >>
>> >> >>
>> >> >> Not quite. Pandas has some fantastic and highly usable data structures
>> >> >> (Series, DataFrame, Index). These certainly don't belong in NumPy or
>> >> >> DyND.
>> >> >>
>> >> >> However, the array-based ecosystem certainly could use improvements to
>> >> >> dtypes (e.g., datetime and categorical) and dtype specific methods
>> >> >> (e.g.,
>> >> >> for strings) just as much as pandas. I do firmly believe that pushing
>> >> >> these
>> >> >> types of improvements upstream, rather than implementing them
>> >> >> independently
>> >> >> for pandas, would yield benefits for the broader ecosystem. With the
>> >> >> right
>> >> >> infrastructure, generalizing things to arrays is not much more work.
>> >> >
>> >> >
>> >> > I don't think Wes or I disagree here at all. The problem was (and is)
>> >> > the pace of change in the underlying libraries. It is simply too slow
>> >> > for pandas development efforts.
>> >> >
>> >> > I think the pandas efforts (and other libraries) can result in more
>> >> > powerful fundamental libraries that get pushed upstream. However, it
>> >> > would not benefit ANYONE to slow down downstream efforts. I am not sure
>> >> > why you suggest that we WAIT for the upstream libraries to change. We
>> >> > have been waiting forever for that. Now we have a concrete
>> >> > implementation of certain data types that are useful. They (upstream)
>> >> > can take this and build on it (or throw it away and make a better one,
>> >> > or whatever). But I don't think it benefits anyone to WAIT for someone
>> >> > to change numpy first. Look at how long it took them to (partially) fix
>> >> > datetimes.
>> >> >
>> >> > xarray in particular has done the same thing to pandas, e.g. you have
>> >> > added additional selection operators and syntax (e.g. passing dicts of
>> >> > named axes). These changes are in fact propagating to pandas. This has
>> >> > taken time (but much, much less than it took for any of pandas's
>> >> > changes to reach numpy). Further, look at how long you have advocated
>> >> > (correctly) for labeled arrays in numpy (for which we are still
>> >> > waiting).
>> >> >
>> >> >>
>> >> >>
>> >> >> I'd like to see pandas itself focus more on the data-structures and
>> >> >> less
>> >> >> on the data types. This would let us share more work with the "general
>> >> >> purpose array / scientific computing libraries".
>> >> >>
>> >> > Pandas IS about specifying the correct data types. It is simply
>> >> > incorrect to decouple this problem from the data-structures. A lot of
>> >> > effort over the years has gone into making all dtypes play nicely with
>> >> > each other and within pandas.
>> >> >
>> >> >>>
>> >> >>> 1) Introduce a proper (from a software engineering perspective)
>> >> >>> logical data type abstraction that models the way that pandas already
>> >> >>> works, but cleaning up all the mess (implicit upcasts, lack of a real
>> >> >>> "NA" scalar value, making pandas-specific methods like unique,
>> >> >>> factorize, match, etc. true "array methods")
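As an aside, factorize as pandas exposes it today illustrates the kind of "true array method" being described here (a sketch using the current top-level pd.factorize API):

```python
import pandas as pd

# factorize maps values to integer codes plus the array of uniques --
# the building block behind categoricals, groupby, and joins
codes, uniques = pd.factorize(["b", "a", "b", "c"])
# codes   -> [0, 1, 0, 2]
# uniques -> ["b", "a", "c"]
```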
>> >> >>
>> >> >>
>> >> >> New abstractions have a cost. A new logical data type abstraction is
>> >> >> better than no proper abstraction at all, but (in principle) one data
>> >> >> type abstraction should be enough to share.
>> >> >>
>> >> >
>> >> >>
>> >> >> A proper logical data type abstraction would be an improvement over the
>> >> >> current situation, but if there's a way we could introduce one less
>> >> >> abstraction (by improving things upstream in a general purpose array
>> >> >> library) that would help even more.
>> >> >>
>> >> >
>> >> > This is just pushing a problem upstream, which ultimately, given the
>> >> > track record of numpy, won't be solved at all. We will be here 1 year
>> >> > from now having the exact same discussion. Why are we waiting on
>> >> > upstream for anything? As I said above, if something is created which
>> >> > upstream finds useful on a general level, great. The real cost here is
>> >> > time.
>> >> >
>> >> >>
>> >> >> For example, we could imagine pushing to make DyND the new core for
>> >> >> pandas. This could be enough of a push to make DyND generally useful --
>> >> >> I
>> >> >> know it still has a few kinks to work out.
>> >> >>
>> >> >
>> >> > Maybe, but DyND has to have full compatibility with what is currently
>> >> > out there (soonish). Then I agree this could be possible. But wouldn't
>> >> > it be even better for pandas to be able to swap back-ends? Why limit
>> >> > ourselves to a particular backend if it's not that difficult?
>> >> >
>> >>
>> >> I think Jeff and I are on the same page here. 5 years ago we were
>> >> having the *exact same* discussions around NumPy and adding new data
>> >> type functionality. 5 years is a staggering amount of time in open
>> >> source. It was less than 5 years between pandas not existing and being
>> >> a super popular project with 2/3 of a best-selling O'Reilly book
>> >> written about it. To wit, DyND exists in large part because of the
>> >> difficulty in making progress within NumPy.
>> >>
>> >> Now, as 5 years ago, I think we should be acting in the best interests
>> >> of pandas users, and what I've been describing is intended as a
>> >> straightforward (though definitely labor intensive) and relatively
>> >> low-risk plan that will "future-proof" the pandas user API for at
>> >> least the next few years, and probably much longer. If we find that
>> >> enabling some internals to use DyND is the right choice, we can do
>> >> that in a non-invasive way while carefully minding data
>> >> interoperability. Meaningful performance benefits would be a clear
>> >> motivation.
>> >>
>> >> To be 100% open and transparent (in the spirit of pandas's new
>> >> governance docs): Before committing to using DyND in any binding way
>> >> (i.e. required, as opposed to opt-in) in pandas, I'd really like to
>> >> see more evidence from 3rd parties without direct financial interest
>> >> (i.e. employment or equity from Continuum) that DyND is "the future of
>> >> Python array computing"; in the absence of significant user and
>> >> community code contribution, it still feels like a political quagmire
>> >> leftover from the Continuum-Enthought rift in 2011.
>> >>
>> >> - Wes
>> >>
>> >> >>>
>> >> >>> 4) Give pandas objects a real C API so that users can manipulate and
>> >> >>> create pandas objects with their own native (C/C++/Cython) code.
>> >> >>
>> >> >>
>> >> >>> 5) Yes, absolutely improve NumPy and DyND and transition to improved
>> >> >>> NumPy and DyND facilities as soon as they are available and shipped
>> >> >>
>> >> >>
>> >> >> I like the sound of both of these.
>> >> >
>> >> >
>> >> >
>> >> > Further you made a point above
>> >> >
>> >> >> You are right that pandas has started to supplant numpy as a high level
>> >> >> API for data analysis, but of course the robust (and often numpy based)
>> >> >> Python ecosystem is part of what has made pandas so successful. In
>> >> >> practice,
>> >> >> ecosystem projects often want to work with more primitive objects than
>> >> >> series/dataframes in their internal data structures and without numpy
>> >> >> this
>> >> >> becomes more difficult. For example, how do you concatenate a list of
>> >> >> categoricals? If these were numpy arrays, we could use np.concatenate,
>> >> >> but
>> >> >> the current implementation of categorical would require a custom
>> >> >> solution.
>> >> >> First-class compatibility with pandas is harder when pandas data
>> >> >> cannot be used with a full ndarray API.
>> >> >
>> >> >
>> >> > I disagree entirely here. I think that Series/DataFrame ARE becoming
>> >> > primitive objects. Look at seaborn, statsmodels, and xarray. These are
>> >> > first-class users of these structures, which need the additional
>> >> > meta-data attached.
>> >> >
>> >> > Yes, categoricals are useful in numpy, and it should support them. But
>> >> > lots of libraries can simply use pandas and do lots of really useful
>> >> > stuff. However, why reinvent the wheel and use numpy when you have
>> >> > DataFrames?
>> >> >
>> >> > From a user's point of view, I don't think they even care about numpy
>> >> > (or whatever drives pandas). Pandas solves a very general problem of
>> >> > working with labeled data.
>> >> >
>> >> > Jeff
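For reference, the categorical-concatenation case Stephan raises above later grew a dedicated helper in pandas; a sketch with pandas.api.types.union_categoricals (added after this thread):

```python
import pandas as pd
from pandas.api.types import union_categoricals

# two categoricals with different category sets -- np.concatenate can't
# combine these directly; union_categoricals unions the categories
a = pd.Categorical(["x", "y"])
b = pd.Categorical(["y", "z"])
merged = union_categoricals([a, b])
```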
>> >
>> >
> 
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev

