[Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap

Phillip Cloud cpcloud at gmail.com
Tue Dec 29 15:14:06 EST 2015


Maybe this is saying the same thing as Wes, but how far would something
like this get us?

// warning: things are probably not this simple
struct data_array_t {
    void *primitive;                 // scalar data
    data_array_t *nested;            // nested data
    boost::dynamic_bitset<> isnull;  // might have to create our own to avoid boost
    schema_t schema;                 // not sure exactly what this looks like
};
typedef std::map<std::string, data_array_t> data_frame_t;  // probably not this simple

To answer Jeff's use-case question: I think the use cases are (1) freedom
from NumPy (mostly), and (2) no more block manager, which frees us from the
limitations of the block memory layout. In particular, the ability to take
advantage of memory-mapped IO would be a big win, IMO.
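To make the bitmap idea concrete, here is a rough Python mock-up (the class and
names are invented purely for illustration, and a plain numpy bool array stands
in for the bitset): arithmetic runs on the raw value buffer, and null-ness
propagates by AND-ing the validity bitmaps.

```python
import numpy as np

# Hypothetical mock-up of the bitmap approach -- not pandas API.
# "valid" is True where a value is present, False where it is null.
class Int64Array:
    def __init__(self, values, valid):
        self.values = np.asarray(values, dtype=np.int64)
        self.valid = np.asarray(valid, dtype=bool)

    def add(self, other):
        # NumPy does the arithmetic on the raw buffers; nulls propagate
        # by AND-ing the validity bitmaps (null + anything = null).
        return Int64Array(self.values + other.values,
                          self.valid & other.valid)

a = Int64Array([1, 2, 3], [True, False, True])
b = Int64Array([10, 20, 30], [True, True, True])
c = a.add(b)
# c.valid is [True, False, True]: position 1 stays null.
```

A float version, by contrast, could defer entirely to NumPy and let NaN carry
the null-ness, which is the asymmetry Wes describes below.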

On Tue, Dec 29, 2015 at 2:50 PM Wes McKinney <wesmckinn at gmail.com> wrote:

> I will write a more detailed response to some of these things after
> the new year, but, in particular, re: missing values, can you or
> someone tell me why creating an object that contains a NumPy array and
> a bitmap is not sufficient? If we can add a lightweight C/C++ class
> layer between NumPy function calls (e.g. arithmetic) and pandas
> function calls, then I see no reason why we cannot have
>
> Int32Array->add
>
> and
>
> Float32Array->add
>
> do the right thing (the former would be responsible for bitmasking to
> propagate NA values; the latter would defer to NumPy). If we can put
> all the internals of pandas objects inside a black box, we can add
> layers of virtual function indirection without a performance penalty
> (whereas in Python, each extra abstraction layer brings interpreter
> overhead that does add up to a perf penalty).
>
> I don't think this is too scary -- I would be willing to create a
> small POC C++ library to prototype something like what I'm talking
> about.
>
> Since pandas has limited points of contact with NumPy I don't think
> this would end up being too onerous.
>
> For the record, I'm pretty allergic to "advanced C++"; I think it is a
> useful tool, and if you pick a sane 20% subset of the C++11 spec and
> follow Google C++ style, it's quite accessible to intermediate
> developers. More or less "C plus OOP and easier object lifetime
> management (shared_ptr/unique_ptr, etc.)". As soon as you add a lot of
> template metaprogramming, C++ library development quickly becomes
> inaccessible to all but the C++ Jedi.
>
> Maybe let's start a Google document on "pandas roadmap" where we can
> break down the 1-2 year goals and some of these infrastructure issues
> and have our discussion there? (obviously publish this someplace once
> we're done)
>
> - Wes
>
> On Fri, Dec 25, 2015 at 2:14 PM, Jeff Reback <jeffreback at gmail.com> wrote:
> > Here are some of my thoughts about pandas Roadmap / status and some
> > responses to Wes's thoughts.
> >
> > In the last few (and upcoming) major releases we have made the
> > following changes:
> >
> > - dtype enhancements (Categorical, Timedelta, Datetime w/tz) & making
> >   these first-class objects
> > - code refactoring to remove subclassing of ndarrays for Series & Index
> > - carving out / deprecating non-core parts of pandas
> >   - datareader
> >   - SparsePanel, WidePanel & other aliases (TimeSeries)
> >   - rpy, rplot, irow et al.
> >   - google-analytics
> > - API changes to make things more consistent
> >   - pd.rolling_*/pd.expanding_* -> .rolling/.expanding (this is in
> >     master now)
> >   - .resample becoming a fully deferred object, like groupby
> >   - multi-index slicing along any level (obviates the need for .xs) and
> >     allows assignment
> >   - .loc/.iloc - for the most part obviates use of .ix
> >   - .pipe & .assign
> >   - plotting accessors
> >   - fixing of the sorting API
> > - many performance enhancements both micro & macro (e.g. release GIL)
> >
> > Some on-deck enhancements (meaning these are basically ready to go in):
> >   - IntervalIndex (and eventually make PeriodIndex just a sub-class of
> >     this)
> >   - RangeIndex
> >
> > so lots of changes, though nothing really earth-shaking, just more
> > convenience, somewhat reduced magicness, and more flexibility.
> >
> > Of course we are getting more and more issues, mostly bug reports (and
> > lots of dupes), some edge-case enhancements which would add to the
> > existing APIs and, of course, requests to expand the (already) large
> > codebase to other use cases. Balancing this is a good many pull requests
> > from many different users, some even deep into the internals.
> >
> > Here are some things that I have talked about and that could be
> > considered for the roadmap. Disclaimer: I do work for Continuum, but
> > these views are of course my own; furthermore, obviously I am a bit more
> > familiar with some of the 'sponsored' open-source libraries, but I am
> > always open to new things.
> >
> > - integration / automatic deferral to numba for JIT (this would be thru
> >   .apply)
> > - automatic deferral to dask from groupby where appropriate / maybe a
> >   .to_parallel (to simply return a dask.DataFrame object)
> > - incorporation of quantities / units (as part of the dtype)
> > - use of DyND to allow missing values for int dtypes
> > - make Period a first class dtype.
> > - provide some copy-on-write semantics to alleviate the chained-indexing
> >   issues which occasionally come up with the misuse of the indexing API
> > - allow a 'policy' to automatically provide column blocks for dict-like
> >   input (e.g. each column would be a block); this would allow a pass-thru
> >   API where you could put in numpy arrays where you have views and have
> >   them preserved rather than copied automatically. Note that this would
> >   also allow what I call 'split', where a passed-in multi-dim numpy array
> >   could be split up into individual blocks (which actually gives a nice
> >   perf boost after the splitting costs).
> >
> > In working towards some of these goals, I have come to the opinion that
> > it would make sense to have a neutral API protocol layer that would
> > allow us to swap out different engines as needed, for particular dtypes,
> > or *maybe* out-of-core type computations. E.g. imagine that we replaced
> > the in-memory block structure with a bcolz / memmap type; in theory this
> > should be 'easy' and just work. I could also see us adopting *some* of
> > the SFrame code to allow easier interop with this API layer.
> >
> > In practice, I think a nice API layer would need to be created to make
> > this clean / nice.
> >
> > So this comes around to Wes's point about creating a C++ library for
> > the internals (and possibly even some of the indexing routines). In an
> > ideal world, of course, this would be desirable. Getting there is a bit
> > non-trivial I think, and IMHO might not be worth the effort. I don't
> > really see big performance bottlenecks. We *already* defer much of the
> > computation to libraries like numexpr & bottleneck (where appropriate).
> > Adding numba / dask to the list would be helpful.
> >
> > I think that almost all performance issues are the result of:
> >
> > a) gross misuse of the pandas API. How much code have you seen that
> > does df.apply(lambda x: x.sum())?
> > b) routines which operate column-by-column rather than block-by-block
> > and are in python space (e.g. we have an issue right now about
> > .quantile)
> >
> > So I am glossing over a big goal of having a C++ library that represents
> > the pandas internals. This would by definition have a C API, so that you
> > *could* use pandas-like semantics in C/C++ and just have it work (and
> > then pandas would be a thin wrapper around this library).
> >
> > I am not averse to this, but I think it would be quite a big effort, and
> > not a huge perf boost IMHO. Further, there are a number of API issues
> > w.r.t. indexing which need to be clarified / worked out (e.g. should we
> > simply deprecate []?) that are much easier to test / figure out in
> > python space.
> >
> > I also think that we have quite a large number of contributors. Moving
> > to C++ might make the internals a bit more impenetrable than the current
> > internals (though this would allow C++ people to contribute, so that
> > might balance out).
> >
> > We have a limited core of devs who right now are familiar with things.
> > If someone happened to have a starting base for a C++ library, then I
> > might change opinions here.
> >
> >
> > my 4c.
> >
> > Jeff
> >
> >
> >
> >
> > On Thu, Dec 24, 2015 at 7:18 PM, Wes McKinney <wesmckinn at gmail.com>
> > wrote:
> >>
> >> Deep thoughts during the holidays.
> >>
> >> I might be out of line here, but the interpreter-heaviness of the
> >> inside of pandas objects is likely to be a long-term liability and
> >> source of performance problems and technical debt.
> >>
> >> Has anyone put any thought into planning and beginning to execute on a
> >> rewrite that moves as much as possible of the internals into native /
> >> compiled code? I'm talking about:
> >>
> >> - pandas/core/internals
> >> - indexing and assignment
> >> - much of pandas/core/common
> >> - categorical and custom dtypes
> >> - all indexing mechanisms
> >>
> >> I'm concerned we've already exposed too much internals to users, so
> >> this might lead to a lot of API breakage, but it might be for the
> >> Greater Good. As a first step, beginning a partial migration of
> >> internals into some C++ classes that encapsulate the insides of
> >> DataFrame objects and implement indexing and block-level manipulations
> >> would be a good place to start. I think you could do this without too
> >> much disruption.
> >>
> >> As part of this internal retooling we might give consideration to
> >> alternative data structures for representing data internal to pandas
> >> objects. Now in 2015/2016, continuing to be hamstrung by NumPy's
> >> limitations feels somewhat anachronistic. User code is riddled with
> >> workarounds for data type fidelity issues and the like. Like, really,
> >> why not add a bitndarray (similar to ilanschnell/bitarray) for storing
> >> nullness for problematic types and hide this from the user? =)
> >>
> >> Since we are now a NumFOCUS-sponsored project, I feel like we might
> >> consider establishing some formal governance over pandas and
> >> publishing meetings notes and roadmap documents describing plans for
> >> the project and meetings notes from committers. There's no real
> >> "committer culture" for NumFOCUS projects like there is with the
> >> Apache Software Foundation, but we might try leading by example!
> >>
> >> Also, I believe pandas as a project has reached a level of importance
> >> where we ought to consider planning and execution on larger scale
> >> undertakings such as this for safeguarding the future.
> >>
> >> As for myself, well, I have my hands full in Big Data-land. I wish I
> >> could be helping more with pandas, but there are quite a few
> >> fundamental issues (like data interoperability, nested data handling,
> >> and file format support — e.g. Parquet, see
> >> http://wesmckinney.com/blog/the-problem-with-the-data-science-language-wars/)
> >> preventing Python from being more useful in industry analytics
> >> applications.
> >>
> >> Aside: one of the bigger mistakes I made with pandas's API design was
> >> making it acceptable to call class constructors — like
> >> pandas.DataFrame — directly (versus factory functions). Sorry about
> >> that! If we could convince everyone to start writing pandas.data_frame
> >> or dataframe instead of using the class reference it would help a lot
> >> with code cleanup. It's hard to plan for these things — NumPy
> >> interoperability seemed a lot more important in 2008 than it does now,
> >> so I forgive myself.
> >>
> >> cheers and best wishes for 2016,
> >> Wes
> >> _______________________________________________
> >> Pandas-dev mailing list
> >> Pandas-dev at python.org
> >> https://mail.python.org/mailman/listinfo/pandas-dev
> >
> >
>

