[Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap

Jeff Reback jeffreback at gmail.com
Fri Dec 25 17:14:35 EST 2015


Here are some of my thoughts about pandas Roadmap / status and some
responses to Wes's thoughts.

In the last few (and upcoming) major releases we have made the
following changes:

- dtype enhancements (Categorical, Timedelta, Datetime w/tz) & making these
first class objects
- code refactoring to remove subclassing of ndarrays for Series & Index
- carving out / deprecating non-core parts of pandas
  - datareader
  - SparsePanel, WidePanel & other aliases (TimeSeries)
  - rpy, rplot, irow et al.
  - google-analytics
- API changes to make things more consistent
  - pd.rolling/expanding * -> .rolling/expanding (this is in master now)
  - .resample becoming a fully deferred operation, like groupby.
  - multi-index slicing along any level (obviates need for .xs) and allows
assignment
  - .loc/.iloc - for the most part obviates use of .ix
  - .pipe & .assign
  - plotting accessors
  - fixing of the sorting API
- many performance enhancements both micro & macro (e.g. release GIL)
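To make the API-consistency changes concrete, here is a minimal sketch of the method-chained style they enable (illustrative only, using the new-style calls):

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0]})

# method form replaces the old pd.rolling_mean(df["a"], 2)
rolled = df["a"].rolling(window=2).mean()

# .assign and .pipe compose in a single method chain
out = (
    df.assign(b=lambda d: d["a"] * 2)      # add a derived column
      .pipe(lambda d: d[d["b"] > 2])       # then filter on it
)
```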

Some on-deck enhancements are (meaning these are basically ready to go in):
  - IntervalIndex (and eventually make PeriodIndex just a sub-class of this)
  - RangeIndex
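A rough sketch of the intended usage of these two index types (assuming constructors along these lines):

```python
import pandas as pd

# RangeIndex: a memory-light integer index defined by start/stop/step,
# with no materialized array behind it
ridx = pd.RangeIndex(start=0, stop=10, step=2)

# IntervalIndex: an index of intervals, e.g. for binned/cut data
iidx = pd.interval_range(start=0, end=3)
```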

So lots of changes, though nothing really earth-shaking: more
convenience, somewhat less magic, and more flexibility.

Of course we are getting an increasing number of issues, mostly bug reports
(and lots of dupes), some edge-case enhancement requests which would add to
the existing APIs, and of course requests to expand the (already) large
codebase to other use cases.
Balancing this are a good many pull requests from many different users,
some even going deep into the internals.

Here are some things that I have talked about and could be considered for
the roadmap. Disclaimer: I do work for Continuum,
but these views are of course my own; furthermore, I am obviously a bit more
familiar with some of the 'sponsored' open-source
libraries, but I am always open to new things.

- integration / automatic deferral to numba for JIT (this would be through
.apply)
- automatic deferral to dask from groupby where appropriate / maybe a
.to_parallel (to simply return a dask.DataFrame object)
- incorporation of quantities / units (as part of the dtype)
- use of DyND to allow missing values for int dtypes
- make Period a first class dtype.
- provide some copy-on-write semantics to alleviate the chained-indexing
issues which occasionally come up with the misuse of the indexing API
- allow a 'policy' to automatically provide column blocks for dict-like
input (e.g. each column would be a block); this would allow a pass-through
API where you could put in numpy arrays where you have views and have them
preserved rather than copied automatically. Note that this would also allow
what I call 'split', where a passed-in multi-dim numpy array could be split
up into individual blocks (which actually gives a nice perf boost after the
splitting costs).
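On the chained-indexing point above, a minimal example of the problem that copy-on-write semantics would alleviate:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Chained indexing: df[df["a"] > 1] may return a copy, so assigning
# through it can silently fail to modify df:
#     df[df["a"] > 1]["b"] = 0    # unreliable (SettingWithCopyWarning)

# The single-step .loc form is the unambiguous, supported way to assign:
df.loc[df["a"] > 1, "b"] = 0
```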

In working towards some of these goals, I have come to the opinion that it
would make sense to have a neutral API protocol layer
that would allow us to swap out different engines as needed, for particular
dtypes, or *maybe* out-of-core type computations. E.g.
imagine that we replaced the in-memory block structure with a bcolz /
memmap type; in theory this should be 'easy' and just work.
I could also see us adopting *some* of the SFrame code to allow easier
interop with this API layer.

In practice, I think a nice API layer would need to be created to make this
clean / nice.

So this comes around to Wes's point about creating a c++ library for the
internals (and possibly even some of the indexing routines).
In an ideal world, of course, this would be desirable. Getting there is a
bit non-trivial I think, and IMHO might not be worth the effort. I don't
really see big performance bottlenecks. We *already* defer much of the
computation to libraries like numexpr & bottleneck (where appropriate).
Adding numba / dask to the list would be helpful.
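As an example of the kind of deferral that already exists, DataFrame.eval hands suitable expressions to numexpr when it is installed (falling back to pure python otherwise):

```python
import pandas as pd

df = pd.DataFrame({"a": [0.0, 1.0, 2.0], "b": [10.0, 20.0, 30.0]})

# DataFrame.eval parses the expression and, for large enough frames
# with numexpr installed, evaluates it in numexpr rather than python
result = df.eval("a + b")
```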

I think that almost all performance issues are the result of:

a) gross misuse of the pandas API -- how much code have you seen that does
df.apply(lambda x: x.sum())?
b) routines which operate column-by-column rather than block-by-block and
are in python space (e.g. we have an issue right now about .quantile)
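To make (a) concrete:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12.0).reshape(4, 3), columns=["a", "b", "c"])

# (a) python-space loop: .apply calls the lambda once per column
slow = df.apply(lambda x: x.sum())

# the vectorized reduction gives the same answer in native code
fast = df.sum()
```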

So I am glossing over a big goal of having a c++ library that represents
the pandas internals. This would by definition have a C API, so that
you *could* use pandas-like semantics in c/c++ and just have it work (and
then pandas would be a thin wrapper around this library).

I am not averse to this, but I think it would be quite a big effort, and not a
huge perf boost IMHO. Further there are a number of API issues w.r.t.
indexing
which need to be clarified / worked out (e.g. should we simply deprecate
[]) that are much easier to test / figure out in python space.

I also think that we have quite a large number of contributors. Moving to
c++ might make the internals a bit more impenetrable than the current
internals.
(though this would allow c++ people to contribute, so that might balance
out).

We have a limited core of devs who are familiar with things right now. If
someone happened to have a starting base for a c++ library, then I might
change my opinion here.


my 4c.

Jeff




On Thu, Dec 24, 2015 at 7:18 PM, Wes McKinney <wesmckinn at gmail.com> wrote:

> Deep thoughts during the holidays.
>
> I might be out of line here, but the interpreter-heaviness of the
> inside of pandas objects is likely to be a long-term liability and
> source of performance problems and technical debt.
>
> Has anyone put any thought into planning and beginning to execute on a
> rewrite that moves as much as possible of the internals into native /
> compiled code? I'm talking about:
>
> - pandas/core/internals
> - indexing and assignment
> - much of pandas/core/common
> - categorical and custom dtypes
> - all indexing mechanisms
>
> I'm concerned we've already exposed too much internals to users, so
> this might lead to a lot of API breakage, but it might be for the
> Greater Good. As a first step, beginning a partial migration of
> internals into some C++ classes that encapsulate the insides of
> DataFrame objects and implement indexing and block-level manipulations
> would be a good place to start. I think you could do this without too
> much disruption.
>
> As part of this internal retooling we might give consideration to
> alternative data structures for representing data internal to pandas
> objects. Now in 2015/2016, continuing to be hamstrung by NumPy's
> limitations feels somewhat anachronistic. User code is riddled with
> workarounds for data type fidelity issues and the like. Like, really,
> why not add a bitndarray (similar to ilanschnell/bitarray) for storing
> nullness for problematic types and hide this from the user? =)
>
> Since we are now a NumFOCUS-sponsored project, I feel like we might
> consider establishing some formal governance over pandas and
> publishing roadmap documents describing plans for the project and
> meeting notes from committers. There's no real
> "committer culture" for NumFOCUS projects like there is with the
> Apache Software Foundation, but we might try leading by example!
>
> Also, I believe pandas as a project has reached a level of importance
> where we ought to consider planning and execution on larger scale
> undertakings such as this for safeguarding the future.
>
> As for myself, well, I have my hands full in Big Data-land. I wish I
> could be helping more with pandas, but there are quite a few fundamental
> issues (like data interoperability, nested data handling, and file
> format support — e.g. Parquet, see
>
> http://wesmckinney.com/blog/the-problem-with-the-data-science-language-wars/
> )
> preventing Python from being more useful in industry analytics
> applications.
>
> Aside: one of the bigger mistakes I made with pandas's API design was
> making it acceptable to call class constructors — like
> pandas.DataFrame — directly (versus factory functions). Sorry about
> that! If we could convince everyone to start writing pandas.data_frame
> or dataframe instead of using the class reference it would help a lot
> with code cleanup. It's hard to plan for these things — NumPy
> interoperability seemed a lot more important in 2008 than it does now,
> so I forgive myself.
>
> cheers and best wishes for 2016,
> Wes
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev
>