[Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap

Stephan Hoyer shoyer at gmail.com
Mon Jan 11 12:36:42 EST 2016


Hi Wes,

You raise some important points.

I agree that pandas's patched version of the numpy dtype system is a mess.
But despite its issues, its leaky abstraction on top of NumPy provides
benefits. In particular, it makes pandas easy to emulate (e.g., xarray),
extend (e.g., geopandas) and integrate with other libraries (e.g., patsy,
Scikit-Learn, matplotlib).

You are right that pandas has started to supplant numpy as a high level API
for data analysis, but of course the robust (and often numpy based) Python
ecosystem is part of what has made pandas so successful. In practice,
ecosystem projects often want to work with more primitive objects than
series/dataframes in their internal data structures and without numpy this
becomes more difficult. For example, how do you concatenate a list of
categoricals? If these were numpy arrays, we could use np.concatenate, but
the current implementation of categorical would require a custom solution.
First class compatibility with pandas is harder when pandas data cannot be
used with a full ndarray API.
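
A minimal sketch of the categorical case. It assumes a later pandas release
that provides `pandas.api.types.union_categoricals` (an assumption; no such
helper is named in this thread), and the variable names are illustrative only:

```python
import numpy as np
import pandas as pd
from pandas.api.types import union_categoricals  # assumed: later pandas releases only

a = pd.Categorical(["x", "y"])
b = pd.Categorical(["y", "z"])

# Going through NumPy means coercing each Categorical to a plain array first,
# so the concatenated result is an ndarray and the categorical dtype is lost.
plain = np.concatenate([np.asarray(a), np.asarray(b)])

# Keeping the categorical dtype takes a pandas-specific helper instead.
merged = union_categoricals([a, b])

print(type(plain).__name__, type(merged).__name__)
```

The point is only that the generic ndarray route and the dtype-preserving
route are different code paths, which is the extra burden on downstream
libraries described above.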

Likewise, hiding implementation details retains some flexibility for us (as
developers), but in an ideal world, we would know we have the right
abstraction, and then could expose the implementation as an advanced API!
This is the case for some very mature projects, such as NumPy. Pandas is
not really there yet (with the block manager), but it might be something to
strive towards in this rewrite.

At this point, I suppose the ship has sailed (e.g., with categorical in
.values) on full numpy compatibility. So we absolutely do need explicit
interfaces for converting to NumPy, rather than the current implicit
guarantees about .values -- which we violated with categorical. Something
like your suggested .to_numpy() method would indeed be an improvement over
the current state, where we half-pretend that NumPy could be used as an
advanced API for pandas, even though it doesn't really work.
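
A small sketch of the difference between the implicit and explicit routes,
assuming a pandas release new enough to have `Series.to_numpy()` (it was added
in 0.24, well after this message); the Series contents are made up:

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", "a"]))

# Implicit route: .values hands back the pandas Categorical, not an ndarray.
vals = s.values

# Explicit routes: both always produce a NumPy array.
arr = np.asarray(s)
arr2 = s.to_numpy()  # assumed available (pandas >= 0.24)

print(type(vals).__name__, type(arr).__name__, type(arr2).__name__)
```

Code that needs an ndarray can ask for one explicitly instead of relying on
what .values happens to return for a given dtype.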

I'm sure you would agree that -- at least in theory -- it would be nice to
push dtype improvements upstream to numpy, but that is obviously more work
(for a variety of reasons) than starting from scratch in pandas. Of course,
I think pandas has a need and right to exist as a separate library. But I
do think building off of NumPy made it stronger, and pushing improvements
upstream would be a better way to go. This has been my approach, and is why
I've worked on both pandas and NumPy.

The bottom line is that I don't agree that this is the most productive path
forward -- I would opt for improving NumPy or DyND instead, which I believe
would cause much less pain downstream -- but given that I'm not going to be
the person doing the work, I will defer to your judgment. Pandas is
certainly in need of holistic improvements and the maturity of a v1.0
release, and that's not something I'm in a position to push myself.

Best,
Stephan

P.S. apologies for the delay -- it's been a busy week.


On Wed, Jan 6, 2016 at 12:15 PM, Wes McKinney <wesmckinn at gmail.com> wrote:

> I also will add that there is an ideology that has existed in the
> scientific Python community since 2011 at least which is this: pandas
> should not have existed; it should be part of NumPy instead.
>
> In my opinion, that misses the point of pandas, both then and now.
>
> There's a large and mostly new class of Python users working on
> domain-specific industry analytics problems for whom pandas is the
> most important tool that they use on a daily basis. Their knowledge of
> NumPy is limited, beyond the aspects of the ndarray API that are the
> same in pandas. High-level APIs and accessibility are extremely
> important for them. But their skill sets and the problems they are
> solving are not, on the whole, the same ones that you would have heard
> discussed at SciPy 2010.
>
> Sometime in 2015, "Python for Data Analysis" sold its 100,000th copy.
> I have 5 foreign translations sitting on my shelf -- this represents a
> very large group of people that we have all collectively enabled by
> developing pandas -- for a lot of people, pandas is the main reason
> they use Python!
>
> So the summary of all this is: pandas is much more important as a
> project now than it was 5 years ago. Our relationship with our library
> dependencies like NumPy should reflect that. Downstream pandas
> consumers should similarly eventually concern themselves more with
> pandas compatibility (rather than always assuming that NumPy arrays
> are the only intermediary). This is a philosophical shift, but one
> that will ultimately benefit the usability of the stack.
>
> On Wed, Jan 6, 2016 at 11:45 AM, Jeff Reback <jeffreback at gmail.com> wrote:
> > I'll just apologize right up front! hahah.
> >
> > No I think I have been pushing on these extras in pandas to help move it
> > forward. I have commented a bit on Stephan's issue here about why I didn't
> > push for these in numpy. numpy is fairly slow moving (though it moves
> > faster lately; I suspect the pace when Wes was developing pandas was not
> > much faster).
> >
> > So pandas was essentially 'fixing' lots of bug / compat issues in numpy.
> >
> > To the extent that we can keep the current user-facing API the same (high
> > likelihood I think), I am willing to accept *some* breakage with the
> > pandas->duck-like array container API in order to provide swappable
> > containers.
> >
> > For example I recall that in doing datetime w/tz, we wanted
> > Series.values to return a numpy array (which it DOES!), but it is
> > actually lossy (it loses the tz). Same thing with the Categorical
> > example Wes gave. I don't think these requirements should hold pandas
> > back!
> >
> > People are increasingly using pandas as the API for their work. That
> > makes it very important that we can handle lots of input properly, w/o
> > the handcuffs of numpy.
> >
> > All this said, I'll reiterate Wes's (and others') points: back-compat is
> > extremely important. (I in fact try to bend over backwards to provide
> > this; sometimes it's too much of course!) E.g. take the resample changes
> > to the API.
> >
> > I was originally going to just do a hard break, but this turns off people
> > when they have to update their code or else.
> >
> > my 4c (incrementing!)
> >
> > Jeff
> >
>