[Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap

Wes McKinney wesmckinn at gmail.com
Mon Jan 11 13:45:24 EST 2016


On Mon, Jan 11, 2016 at 9:36 AM, Stephan Hoyer <shoyer at gmail.com> wrote:
> Hi Wes,
>
> You raise some important points.
>
> I agree that pandas's patched version of the numpy dtype system is a mess.
> But despite its issues, its leaky abstraction on top of NumPy provides
> benefits. In particular, it makes pandas easy to emulate (e.g., xarray),
> extend (e.g., geopandas) and integrate with other libraries (e.g., patsy,
> Scikit-Learn, matplotlib).
>
> You are right that pandas has started to supplant numpy as a high level API
> for data analysis, but of course the robust (and often numpy based) Python
> ecosystem is part of what has made pandas so successful. In practice,
> ecosystem projects often want to work with more primitive objects than
> series/dataframes in their internal data structures and without numpy this
> becomes more difficult. For example, how do you concatenate a list of
> categoricals? If these were numpy arrays, we could use np.concatenate, but
> the current implementation of categorical would require a custom solution.
> First class compatibility with pandas is harder when pandas data cannot be
> used with a full ndarray API.
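
The categorical-concatenation point above is concrete: np.concatenate coerces each Categorical to a plain object array, silently dropping the dtype, while the pandas-specific helper (union_categoricals, which pandas later added for exactly this case) preserves it. A minimal sketch:

```python
import numpy as np
import pandas as pd
from pandas.api.types import union_categoricals

a = pd.Categorical(["x", "y"], categories=["x", "y", "z"])
b = pd.Categorical(["z", "x"], categories=["x", "y", "z"])

# np.concatenate falls back to coercing each Categorical to a plain
# object array, so the categorical dtype is silently lost
flat = np.concatenate([a, b])
print(flat.dtype)  # object -- the categories are gone

# the pandas-specific solution preserves the dtype and categories
combined = union_categoricals([a, b])
print(list(combined))             # ['x', 'y', 'z', 'x']
print(list(combined.categories))  # ['x', 'y', 'z']
```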
>
> Likewise, hiding implementation details retains some flexibility for us (as
> developers), but in an ideal world, we would know we have the right
> abstraction, and then could expose the implementation as an advanced API!
> This is the case for some very mature projects, such as NumPy. Pandas is not
> really here yet (with the block manager), but it might be something to
> strive towards in this rewrite.
>
> At this point, I suppose the ship has sailed (e.g., with categorical in
> .values) on full numpy compatibility. So we absolutely do need explicit
> interfaces for converting to NumPy, rather than the current implicit
> guarantees about .values -- which we violated with categorical. Something
> like your suggested .to_numpy() method would indeed be an improvement over
> the current state, where we half-pretend that NumPy could be used as an
> advanced API for pandas, even though it doesn't really work.
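
pandas did later add exactly such an explicit interface (Series.to_numpy()). A quick illustration of the distinction with a categorical Series, where .values does not return an ndarray but the explicit conversion does:

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", "a"]))

# .values half-pretends to be a NumPy escape hatch, but for categorical
# data it returns a pandas Categorical, not an ndarray
print(type(s.values))

# the explicit conversion always yields a real ndarray
out = s.to_numpy()
print(type(out), out.dtype)  # <class 'numpy.ndarray'> object
```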
>
> I'm sure you would agree that -- at least in theory -- it would be nice to
> push dtype improvements upstream to numpy, but that is obviously more work
> (for a variety of reasons) than starting from scratch in pandas. Of course,
> I think pandas has a need and right to exist as a separate library. But I do
> think building off of NumPy made it stronger, and pushing improvements
> upstream would be a better way to go. This has been my approach, and is why
> I've worked on both pandas and NumPy.
>
> The bottom line is that I don't agree that this is the most productive path
> forward -- I would opt for improving NumPy or DyND instead, which I believe
> would cause much less pain downstream -- but given that I'm not going to be
> the person doing the work, I will defer to your judgment. Pandas is
> certainly in need of holistic improvements and the maturity of a v1.0
> release, and that's not something I'm in a position to push myself.
>

This seems like a false dichotomy to me. I'm not arguing for forging
a NumPy-free or DyND-free path, but rather making DyND's or NumPy's
physical memory representation and array computing infrastructure more
clearly implementation details of pandas that have limited
user-visibility (except when using NumPy / DyND-based tools is
necessary).

The main problems we have faced with NumPy are:

- Much more difficult to extend
- Legacy code makes major changes difficult or impossible
- pandas users likely represent a minority (but perhaps a plurality,
at this point) of NumPy users

DyND's scope, as I understand it, is to be used for more use cases
than an internal detail of pandas objects. It doesn't have the legacy
baggage, but it will face similar challenges around being a general
purpose array library versus a more domain-specific analytics and data
preparation library.

pandas already has what can be called a "logical type system" (see
e.g. https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md
for other examples of logical type representations). We use NumPy
dtypes for the physical memory representation along with various
conventions for pandas-specific behavior like missing data, but they
are weakly abstracted in a way that's definitely harmful for users.
What I am arguing is

1) Introduce a proper (from a software engineering perspective)
logical data type abstraction that models the way that pandas already
works, but cleans up all the mess (implicit upcasts, lack of a real
"NA" scalar value, making pandas-specific methods like unique,
factorize, match, etc. true "array methods")

2) Use NumPy physical dtypes (for now) as the primary target physical
representation

3) Layer new machinery (like bitmasks) on top of raw NumPy arrays to
add new features to pandas

4) Give pandas objects a real C API so that users can manipulate and
create pandas objects with their own native (C/C++/Cython) code.

5) Yes, absolutely improve NumPy and DyND and transition to improved
NumPy and DyND facilities as soon as they are available and shipped
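
A minimal sketch of how points 1-3 fit together (the names below are illustrative, not actual pandas API): a logical integer type whose physical storage is an ordinary NumPy int64 array plus a validity mask, so missing values need no implicit upcast to float64, and methods like unique become true array methods on the logical type:

```python
import numpy as np

# Hypothetical sketch: a logical int64 type layered over a NumPy
# physical representation plus a validity mask (a bitmask in spirit,
# a bool array here for simplicity).
class LogicalInt64Array:
    def __init__(self, values, valid):
        self.values = np.asarray(values, dtype=np.int64)  # physical storage
        self.valid = np.asarray(valid, dtype=bool)        # validity mask

    def isna(self):
        # a real NA concept, independent of any float NaN sentinel
        return ~self.valid

    def unique(self):
        # a true "array method", operating only on valid entries
        return np.unique(self.values[self.valid])

arr = LogicalInt64Array([1, 2, 2, 7], [True, True, True, False])
print(arr.unique())          # [1 2]
print(arr.isna().tolist())   # [False, False, False, True]
```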

I don't see how pandas can have a truly healthy relationship with
more general purpose array / scientific computing libraries unless we
can add new pandas functionality in a clean way, without requiring us
to get patches accepted (and released) in NumPy or DyND.

Can you clarify what aspects of this plan are disagreeable /
contentious? Are you arguing for pandas becoming more of a companion
tool / user interface layer for NumPy or DyND?

cheers,
Wes

> Best,
> Stephan
>
> P.S. apologies for the delay -- it's been a busy week.
>
>
> On Wed, Jan 6, 2016 at 12:15 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
>>
>> I also will add that there is an ideology that has existed in the
>> scientific Python community since 2011 at least which is this: pandas
>> should not have existed; it should be part of NumPy instead.
>>
>> In my opinion, that misses the point of pandas, both then and now.
>>
>> There's a large and mostly new class of Python users working on
>> domain-specific industry analytics problems for whom pandas is the
>> most important tool that they use on a daily basis. Their knowledge of
>> NumPy is limited, beyond the aspects of the ndarray API that are the
>> same in pandas. High level APIs and accessibility for them are
>> extremely important. But their skill sets and the problems they are
>> solving are not the same ones on the whole that you would have heard
>> discussed at SciPy 2010.
>>
>> Sometime in 2015, "Python for Data Analysis" sold its 100,000th copy.
>> I have 5 foreign translations sitting on my shelf -- this represents a
>> very large group of people that we have all collectively enabled by
>> developing pandas -- for a lot of people, pandas is the main reason
>> they use Python!
>>
>> So the summary of all this is: pandas is much more important as a
>> project now than it was 5 years ago. Our relationship with our library
>> dependencies like NumPy should reflect that. Downstream pandas
>> consumers should similarly eventually concern themselves more with
>> pandas compatibility (rather than always assuming that NumPy arrays
>> are the only intermediary). This is a philosophical shift, but one
>> that will ultimately benefit the usability of the stack.
>>
>> On Wed, Jan 6, 2016 at 11:45 AM, Jeff Reback <jeffreback at gmail.com> wrote:
>> > I'll just apologize right up front! hahah.
>> >
>> > No I think I have been pushing on these extras in pandas to help move it
>> > forward. I have commented a bit
>> > on Stephan's issue here about why I didn't push for these in numpy.
>> > numpy is
>> > fairly slow moving
>> > (though it moves faster lately; I suspect the pace when Wes was developing
>> > pandas was not much faster).
>> >
>> > So pandas was essentially 'fixing' lots of bug / compat issues in numpy.
>> >
>> > To the extent that we can keep the current user facing API the same
>> > (high
>> > likelihood I think), I'm willing
>> > to accept *some* breakage with the pandas->duck-like array container
>> > API in
>> > order to provide swappable containers.
>> >
>> > For example I recall that in doing datetime w/tz, that we wanted
>> > Series.values to return a numpy array (which it DOES!)
>> > but it is actually lossy (it loses the tz). Same thing with the
>> > Categorical
>> > example Wes gave. I don't think these requirements
>> > should hold pandas back!
>> >
>> > People are increasingly using pandas as the API for their work. That
>> > makes
>> > it very important that we can handle
>> > lots of input properly, w/o the handcuffs of numpy.
>> >
>> > All this said, I'll reiterate Wes's (and others') point that back-compat
>> > is
>> > extremely important. (I in fact try
>> > to bend over backwards to provide this, sometimes it's too much of
>> > course!).
>> > E.g. take the resample changes to API
>> >
>> > Was originally going to just do a hard break, but this turns off people
>> > when
>> > they have to update their code or else.
>> >
>> > my 4c (incrementing!)
>> >
>> > Jeff
>> >
>
>

