[Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap

Jeff Reback jeffreback at gmail.com
Tue Dec 29 14:56:08 EST 2015


Ok, certainly not averse to using bitfields. I agree that would solve the
problem. In fact Stephan Hoyer and I briefly discussed this w.r.t.
IntervalIndex. It turns out it is just as easy to use a sentinel; in fact
that was my original idea (for int NA), really similar to how we handle
Datetime et al.
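The sentinel approach can be sketched in a few lines of plain Python (illustrative only; pandas actually does this at the C level for datetime64, where iNaT, the minimum int64 value, marks a missing entry — the function and variable names below are made up for exposition):

```python
# Illustrative sketch: sentinel-based NA for integers, analogous to how
# pandas marks missing datetimes with iNaT (the minimum int64 value).
INT64_NA = -(2**63)  # one value of the domain is reserved to mean "missing"

def add_with_sentinel(a, b):
    """Element-wise add where the sentinel propagates like NA."""
    out = []
    for x, y in zip(a, b):
        # if either side is missing, the result is missing
        out.append(INT64_NA if INT64_NA in (x, y) else x + y)
    return out
```

The trade-off versus a bitmap is that one value of the domain is sacrificed; the upside is that no extra storage or separate masking pass is needed.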

So I will create a Google doc for the discussion points.

I agree creating a minimalist C++ library is not too hard. But my original
question stands: what are the use cases? I can enumerate some here:

- 1) performance (I am not convinced of this, but could be wrong)
- 2) c-api always a good thing & other lang bindings

I suspect you are in the part 2 camp?


On Tue, Dec 29, 2015 at 2:49 PM, Wes McKinney <wesmckinn at gmail.com> wrote:

> I will write a more detailed response to some of these things after
> the new year, but, in particular, re: missing values, can you or
> someone tell me why creating an object that contains a NumPy array and
> a bitmap is not sufficient? If we can add a lightweight C/C++ class
> layer between NumPy function calls (e.g. arithmetic) and pandas
> function calls, then I see no reason why we cannot have
>
> Int32Array->add
>
> and
>
> Float32Array->add
>
> do the right thing (the former would be responsible for bitmasking to
> propagate NA values; the latter would defer to NumPy). If we can put
> all the internals of pandas objects inside a black box, we can add
> layers of virtual function indirection without a performance penalty
> (whereas adding more interpreter overhead with more abstraction layers
> does add up to a perf penalty).
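As a rough illustration of the semantics being described (the actual proposal is a C/C++ class layer; the class name comes from the message, but the pure-Python representation, with a plain list of bools standing in for the bitmap, is just for exposition):

```python
class Int32Array:
    """Sketch of an integer array carrying a validity bitmap.

    In the proposed design this would be a C/C++ class sitting between
    NumPy function calls and pandas function calls; here a list of
    bools stands in for the bitmap purely to show the NA rule.
    """

    def __init__(self, values, valid):
        assert len(values) == len(valid)
        self.values = list(values)
        self.valid = list(valid)   # True = value present, False = NA

    def add(self, other):
        # A result slot is valid only where both inputs are valid;
        # masked-out slots keep a dummy value that the bitmap hides.
        valid = [a and b for a, b in zip(self.valid, other.valid)]
        values = [x + y if v else 0
                  for x, y, v in zip(self.values, other.values, valid)]
        return Int32Array(values, valid)

    def to_list(self, na=None):
        return [x if v else na for x, v in zip(self.values, self.valid)]
```

A Float32Array, by contrast, could defer entirely to the underlying NumPy add, since NaN already propagates through float arithmetic.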
>
> I don't think this is too scary -- I would be willing to create a
> small POC C++ library to prototype something like what I'm talking
> about.
>
> Since pandas has limited points of contact with NumPy I don't think
> this would end up being too onerous.
>
> For the record, I'm pretty allergic to "advanced C++"; I think it is a
> useful tool, and if you pick a sane 20% subset of the C++11 spec and
> follow the Google C++ style guide, it's not inaccessible to intermediate
> developers. More or less "C plus OOP and easier object lifetime
> management (shared_ptr/unique_ptr, etc.)". As soon as you add a lot of
> template metaprogramming, C++ library development quickly becomes
> inaccessible to all but the C++ Jedi.
>
> Maybe let's start a Google document on "pandas roadmap" where we can
> break down the 1-2 year goals and some of these infrastructure issues
> and have our discussion there? (obviously publish this someplace once
> we're done)
>
> - Wes
>
> On Fri, Dec 25, 2015 at 2:14 PM, Jeff Reback <jeffreback at gmail.com> wrote:
> > Here are some of my thoughts about pandas Roadmap / status and some
> > responses to Wes's thoughts.
> >
> > In the last few (and upcoming) major releases we have made the
> > following changes:
> >
> > - dtype enhancements (Categorical, Timedelta, Datetime w/tz) & making
> >   these first class objects
> > - code refactoring to remove subclassing of ndarrays for Series & Index
> > - carving out / deprecating non-core parts of pandas
> >   - datareader
> >   - SparsePanel, WidePanel & other aliases (TimeSeries)
> >   - rpy, rplot, irow et al.
> >   - google-analytics
> > - API changes to make things more consistent
> >   - pd.rolling/expanding * -> .rolling/expanding (this is in master now)
> >   - .resample becoming a fully deferred operation, like groupby
> >   - multi-index slicing along any level (obviates need for .xs) and
> >     allows assignment
> >   - .loc/.iloc - for the most part obviates use of .ix
> >   - .pipe & .assign
> >   - plotting accessors
> >   - fixing of the sorting API
> > - many performance enhancements both micro & macro (e.g. release GIL)
> >
> > Some on-deck enhancements (meaning these are basically ready to go in):
> >   - IntervalIndex (and eventually make PeriodIndex just a sub-class
> >     of this)
> >   - RangeIndex
> >
> > So: lots of changes, though nothing really earth-shaking, just more
> > convenience, somewhat less magic, and more flexibility.
> >
> > Of course we are getting an increasing number of issues, mostly bug
> > reports (and lots of dupes), some edge-case enhancements which would
> > add to the existing APIs, and of course requests to expand the
> > (already large) code base to other use cases. Balancing this are a
> > good many pull requests from many different users, some even deep
> > into the internals.
> >
> > Here are some things that I have talked about and could be considered
> > for the roadmap. Disclaimer: I do work for Continuum, but these views
> > are of course my own; furthermore, I am obviously a bit more familiar
> > with some of the 'sponsored' open-source libraries, but I am always
> > open to new things.
> >
> > - integration / automatic deferral to numba for JIT (this would be
> >   through .apply)
> > - automatic deferral to dask from groupby where appropriate / maybe a
> >   .to_parallel (to simply return a dask.DataFrame object)
> > - incorporation of quantities / units (as part of the dtype)
> > - use of DyND to allow missing values for int dtypes
> > - make Period a first class dtype.
> > - provide some copy-on-write semantics to alleviate the chained-indexing
> >   issues which occasionally come up with misuse of the indexing API
> > - allow a 'policy' to automatically provide column blocks for dict-like
> >   input (e.g. each column would be a block); this would allow a
> >   pass-thru API where you could put in numpy arrays where you have
> >   views and have them preserved rather than copied automatically.
> >   Note that this would also allow what I call 'split', where a
> >   passed-in multi-dim numpy array could be split up into individual
> >   blocks (which actually gives a nice perf boost after the splitting
> >   costs).
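A tiny pure-Python sketch of the view-preserving behavior that such a 'policy' would enable (the class and method names are hypothetical; real pandas consolidates same-dtype columns into 2-D blocks, which is exactly what forces the copy):

```python
class ColumnBlockFrame:
    """Hypothetical sketch: one block per column, no consolidation.

    Because each input column is kept by reference rather than copied
    into a consolidated 2-D block, caller-supplied arrays remain live
    views into the frame's data.
    """

    def __init__(self, data):
        # keep references to the caller's columns -- no copy made
        self.blocks = dict(data)

    def column(self, name):
        return self.blocks[name]
```

With this policy, mutating the original column is visible through the frame (and vice versa), which is the view-preserving behavior described above.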
> >
> > In working towards some of these goals, I have come to the opinion
> > that it would make sense to have a neutral API protocol layer that
> > would allow us to swap out different engines as needed, for particular
> > dtypes, or *maybe* for out-of-core computations. E.g. imagine that we
> > replaced the in-memory block structure with a bcolz / memmap type; in
> > theory this should be 'easy' and just work. I could also see us
> > adopting *some* of the SFrame code to allow easier interop with this
> > API layer.
> >
> > In practice, I think a nice API layer would need to be created to
> > make this clean / nice.
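In outline, such a protocol layer might look like this (names entirely hypothetical; the point is only that pandas-level code would program against the abstract interface, so the in-memory engine could later be swapped for a bcolz / memmap or out-of-core one):

```python
import abc

class ColumnEngine(abc.ABC):
    """Hypothetical protocol that pandas-level code would target."""

    @abc.abstractmethod
    def take(self, indices):
        """Return the values at the given positions."""

    @abc.abstractmethod
    def reduce_sum(self):
        """Return the sum of all values."""

class InMemoryEngine(ColumnEngine):
    """Default engine: plain in-memory storage."""

    def __init__(self, values):
        self.values = list(values)

    def take(self, indices):
        return [self.values[i] for i in indices]

    def reduce_sum(self):
        return sum(self.values)
```

A memmap- or dask-backed engine would implement the same two methods, and everything above the protocol would be unchanged.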
> >
> > So this comes around to Wes's point about creating a C++ library for
> > the internals (and possibly even some of the indexing routines). In
> > an ideal world, of course, this would be desirable. Getting there is
> > a bit non-trivial, I think, and IMHO might not be worth the effort. I
> > don't really see big performance bottlenecks. We *already* defer much
> > of the computation to libraries like numexpr & bottleneck (where
> > appropriate). Adding numba / dask to the list would be helpful.
> >
> > I think that almost all performance issues are the result of:
> >
> > a) gross misuse of the pandas API. How much code have you seen that
> >    does df.apply(lambda x: x.sum())?
> > b) routines which operate column-by-column rather than block-by-block
> >    and are in python space (e.g. we have an issue right now about
> >    .quantile)
> >
> > So I am glossing over a big goal of having a c++ library that
> > represents the pandas internals. This would by definition have a
> > C API, so you *could* use pandas-like semantics in c/c++ and just
> > have it work (and then pandas would be a thin wrapper around this
> > library).
> >
> > I am not averse to this, but I think it would be quite a big effort,
> > and not a huge perf boost IMHO. Further, there are a number of API
> > issues w.r.t. indexing which need to be clarified / worked out (e.g.
> > should we simply deprecate []?) that are much easier to test / figure
> > out in python space.
> >
> > I also think that we have quite a large number of contributors.
> > Moving to c++ might make the internals a bit more impenetrable than
> > the current internals (though this would allow c++ people to
> > contribute, so that might balance out).
> >
> > We have a limited core of devs who right now are familiar with
> > things. If someone happened to have a starting base for a c++
> > library, then I might change my opinion here.
> >
> >
> > my 4c.
> >
> > Jeff
> >
> >
> >
> >
> > On Thu, Dec 24, 2015 at 7:18 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
> >>
> >> Deep thoughts during the holidays.
> >>
> >> I might be out of line here, but the interpreter-heaviness of the
> >> inside of pandas objects is likely to be a long-term liability and
> >> source of performance problems and technical debt.
> >>
> >> Has anyone put any thought into planning and beginning to execute on a
> >> rewrite that moves as much as possible of the internals into native /
> >> compiled code? I'm talking about:
> >>
> >> - pandas/core/internals
> >> - indexing and assignment
> >> - much of pandas/core/common
> >> - categorical and custom dtypes
> >> - all indexing mechanisms
> >>
> >> I'm concerned we've already exposed too much internals to users, so
> >> this might lead to a lot of API breakage, but it might be for the
> >> Greater Good. As a first step, beginning a partial migration of
> >> internals into some C++ classes that encapsulate the insides of
> >> DataFrame objects and implement indexing and block-level manipulations
> >> would be a good place to start. I think you could do this without
> >> too much disruption.
> >>
> >> As part of this internal retooling we might give consideration to
> >> alternative data structures for representing data internal to pandas
> >> objects. Now in 2015/2016, continuing to be hamstrung by NumPy's
> >> limitations feels somewhat anachronistic. User code is riddled with
> >> workarounds for data type fidelity issues and the like. Like, really,
> >> why not add a bitndarray (similar to ilanschnell/bitarray) for storing
> >> nullness for problematic types and hide this from the user? =)
> >>
> >> Since we are now a NumFOCUS-sponsored project, I feel like we might
> >> consider establishing some formal governance over pandas and
> >> publishing roadmap documents describing plans for the project and
> >> meeting notes from committers. There's no real "committer culture"
> >> for NumFOCUS projects like there is with the Apache Software
> >> Foundation, but we might try leading by example!
> >>
> >> Also, I believe pandas as a project has reached a level of importance
> >> where we ought to consider planning and execution on larger scale
> >> undertakings such as this for safeguarding the future.
> >>
> >> As for myself, well, I have my hands full in Big Data-land. I wish I
> >> could be helping more with pandas, but there are quite a few
> >> fundamental issues (like data interoperability, nested data handling,
> >> and file format support — e.g. Parquet, see
> >> http://wesmckinney.com/blog/the-problem-with-the-data-science-language-wars/)
> >> preventing Python from being more useful in industry analytics
> >> applications.
> >>
> >> Aside: one of the bigger mistakes I made with pandas's API design was
> >> making it acceptable to call class constructors — like
> >> pandas.DataFrame — directly (versus factory functions). Sorry about
> >> that! If we could convince everyone to start writing pandas.data_frame
> >> or dataframe instead of using the class reference it would help a lot
> >> with code cleanup. It's hard to plan for these things — NumPy
> >> interoperability seemed a lot more important in 2008 than it does now,
> >> so I forgive myself.
> >>
> >> cheers and best wishes for 2016,
> >> Wes
> >> _______________________________________________
> >> Pandas-dev mailing list
> >> Pandas-dev at python.org
> >> https://mail.python.org/mailman/listinfo/pandas-dev
> >
> >
>