[Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap

Wes McKinney wesmckinn at gmail.com
Tue Dec 29 16:12:52 EST 2015


The other huge thing this will enable is copy-on-write for various
kinds of views, which should cut down on some of the defensive copying
in the library and reduce memory usage.
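
A minimal sketch of what I mean, with hypothetical names (and ignoring
thread safety, which a real version would have to handle):

#include <cstddef>
#include <memory>
#include <vector>

template <typename T>
class CowBuffer {
 public:
  explicit CowBuffer(std::vector<T> data)
      : data_(std::make_shared<std::vector<T>>(std::move(data))) {}

  // Views just share the underlying buffer -- no defensive copy.
  CowBuffer view() const { return *this; }

  const T& operator[](std::size_t i) const { return (*data_)[i]; }

  // Mutation copies only if someone else still holds a view.
  void set(std::size_t i, T value) {
    if (data_.use_count() > 1) {
      data_ = std::make_shared<std::vector<T>>(*data_);
    }
    (*data_)[i] = value;
  }

 private:
  std::shared_ptr<std::vector<T>> data_;
};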

On Tue, Dec 29, 2015 at 1:02 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
> Basically the approach is
>
> 1) Base dtype type
> 2) Base array type with K >= 1 dimensions
> 3) Base scalar type
> 4) Base index type
> 5) "Wrapper" subclasses for all NumPy types fitting into categories
> #1, #2, #3, #4
> 6) Subclasses for pandas-specific types like category, datetimeTZ, etc.
> 7) NDFrame, as cpcloud wrote, is just a list of these
>
> Indexes and axis labels / column names can get layered on top.
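>
> In C++ terms, a very rough sketch of that layering might look like the
> following (all names hypothetical, not a committed design):
>
> #include <cstdint>
> #include <memory>
> #include <vector>
>
> struct DataType {                      // 1) base dtype type
>   virtual ~DataType() = default;
> };
>
> class Array {                          // 2) base array type, K >= 1 dims
>  public:
>   virtual ~Array() = default;
>   virtual const DataType& type() const = 0;
>   virtual int64_t length() const = 0;
> };
>
> struct Scalar {                        // 3) base scalar type
>   virtual ~Scalar() = default;
> };
>
> struct Index {                         // 4) base index type
>   virtual ~Index() = default;
> };
>
> // 5) NumPy wrapper subclasses and 6) pandas-specific types (category,
> // datetimeTZ, ...) would both derive from Array; 7) NDFrame is then
> // just a list of columns plus an index.
> struct NDFrame {
>   std::vector<std::shared_ptr<Array>> columns;
>   std::shared_ptr<Index> index;
> };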
>
> After we do all this, we can look at adding nested types (arrays, maps,
> structs) to better support JSON.
>
> - Wes
>
> On Tue, Dec 29, 2015 at 12:14 PM, Phillip Cloud <cpcloud at gmail.com> wrote:
>> Maybe this is saying the same thing as Wes, but how far would something like
>> this get us?
>>
>> // warning: things are probably not this simple
>>
>> #include <map>
>> #include <string>
>> #include <boost/dynamic_bitset.hpp>
>>
>> struct schema_t {};  // placeholder; not sure exactly what this looks like
>>
>> struct data_array_t {
>>     void *primitive;                 // scalar data
>>     data_array_t *nested;            // nested data
>>     boost::dynamic_bitset<> isnull;  // might have to create our own to avoid boost
>>     schema_t schema;
>> };
>>
>> typedef std::map<std::string, data_array_t> data_frame_t;  // probably not this simple
>>
>> To answer Jeff’s use-case question: I think the use cases are 1)
>> freedom from numpy (mostly), and 2) no more block manager, which frees
>> us from the limitations of the block memory layout. In particular, the
>> ability to take advantage of memory-mapped IO would be a big win IMO.
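>>
>> For instance (a POSIX-only sketch, assuming a made-up file layout of
>> raw int64 values), a column could be backed directly by a mapped file
>> with no copying at all:
>>
>> #include <fcntl.h>
>> #include <sys/mman.h>
>> #include <sys/stat.h>
>> #include <unistd.h>
>> #include <cstdint>
>> #include <cstddef>
>>
>> // Map a file of raw int64 values and use it as column storage, zero-copy.
>> const int64_t* map_int64_column(const char* path, std::size_t* out_len) {
>>     int fd = open(path, O_RDONLY);
>>     if (fd < 0) return nullptr;
>>     struct stat st;
>>     if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
>>     void* addr = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
>>     close(fd);  // the mapping outlives the descriptor
>>     if (addr == MAP_FAILED) return nullptr;
>>     *out_len = st.st_size / sizeof(int64_t);
>>     return static_cast<const int64_t*>(addr);
>> }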
>>
>>
>> On Tue, Dec 29, 2015 at 2:50 PM Wes McKinney <wesmckinn at gmail.com> wrote:
>>>
>>> I will write a more detailed response to some of these things after
>>> the new year, but, in particular, re: missing values, can you or
>>> someone tell me why creating an object that contains a NumPy array and
>>> a bitmap is not sufficient? If we can add a lightweight C/C++ class
>>> layer between NumPy function calls (e.g. arithmetic) and pandas
>>> function calls, then I see no reason why we cannot have
>>>
>>> Int32Array->add
>>>
>>> and
>>>
>>> Float32Array->add
>>>
>>> do the right thing (the former would be responsible for bitmasking to
>>> propagate NA values; the latter would defer to NumPy). If we can put
>>> all the internals of pandas objects inside a black box, we can add
>>> layers of virtual function indirection without a performance penalty
>>> (whereas in Python, the interpreter overhead added by each extra
>>> abstraction layer does add up to a perf penalty).
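>>>
>>> As a concrete sketch of the bitmasking idea (names hypothetical, and
>>> using a byte-per-value validity mask where a real version would pack
>>> bits):
>>>
>>> #include <cstdint>
>>> #include <cstddef>
>>> #include <vector>
>>>
>>> struct Int32Array {
>>>   std::vector<int32_t> values;
>>>   std::vector<uint8_t> valid;  // 1 = value present, 0 = NA
>>>
>>>   Int32Array add(const Int32Array& other) const {
>>>     Int32Array out;
>>>     const std::size_t n = values.size();
>>>     out.values.resize(n);
>>>     out.valid.resize(n);
>>>     for (std::size_t i = 0; i < n; ++i) {
>>>       // NA propagates: the result is valid only where both inputs are.
>>>       out.valid[i] = valid[i] & other.valid[i];
>>>       out.values[i] = values[i] + other.values[i];  // garbage where invalid
>>>     }
>>>     return out;
>>>   }
>>> };
>>>
>>> A Float32Array, by contrast, would skip the mask and defer to NumPy,
>>> using NaN as it does today.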
>>>
>>> I don't think this is too scary -- I would be willing to create a
>>> small POC C++ library to prototype something like what I'm talking
>>> about.
>>>
>>> Since pandas has limited points of contact with NumPy I don't think
>>> this would end up being too onerous.
>>>
>>> For the record, I'm pretty allergic to "advanced C++". I think that if
>>> you pick a sane 20% subset of the C++11 spec and follow Google C++
>>> style, it's a useful tool that stays accessible to intermediate
>>> developers: more or less "C plus OOP and easier object lifetime
>>> management (shared/unique_ptr, etc.)". As soon as you add a lot of
>>> template metaprogramming, C++ library development quickly becomes
>>> inaccessible to all but the C++ Jedi.
>>>
>>> Maybe let's start a Google document on "pandas roadmap" where we can
>>> break down the 1-2 year goals and some of these infrastructure issues
>>> and have our discussion there? (obviously publish this someplace once
>>> we're done)
>>>
>>> - Wes
>>>
>>> On Fri, Dec 25, 2015 at 2:14 PM, Jeff Reback <jeffreback at gmail.com> wrote:
>>> > Here are some of my thoughts about pandas Roadmap / status and some
>>> > responses to Wes's thoughts.
>>> >
>>> > In the last few (and upcoming) major releases we have made the
>>> > following changes:
>>> >
>>> > - dtype enhancements (Categorical, Timedelta, Datetime w/tz) & making
>>> > these first-class objects
>>> > - code refactoring to remove subclassing of ndarrays for Series & Index
>>> > - carving out / deprecating non-core parts of pandas
>>> >   - datareader
>>> >   - SparsePanel, WidePanel & other aliases (TimeSeries)
>>> >   - rpy, rplot, irow et al.
>>> >   - google-analytics
>>> > - API changes to make things more consistent
>>> >   - pd.rolling/expanding * -> .rolling/expanding (this is in master now)
>>> >   - .resample becoming fully deferred, like groupby
>>> >   - multi-index slicing along any level (obviates need for .xs) and
>>> > allows assignment
>>> >   - .loc/.iloc - for the most part obviates use of .ix
>>> >   - .pipe & .assign
>>> >   - plotting accessors
>>> >   - fixing of the sorting API
>>> > - many performance enhancements both micro & macro (e.g. release GIL)
>>> >
>>> > Some on-deck enhancements (meaning these are basically ready to go in):
>>> >   - IntervalIndex (and eventually make PeriodIndex just a sub-class
>>> > of this)
>>> >   - RangeIndex
>>> >
>>> > so lots of changes, though nothing really earth-shaking, just more
>>> > convenience, reducing magicness somewhat, and providing flexibility.
>>> >
>>> > Of course we are getting an increasing number of issues, mostly bug
>>> > reports (and lots of dupes), some edge-case enhancements which would
>>> > add to the existing APIs, and of course requests to expand the
>>> > (already) large codebase to other use cases. Balancing this are a
>>> > good many pull requests from many different users, some even deep
>>> > into the internals.
>>> >
>>> > Here are some things that I have talked about and that could be
>>> > considered for the roadmap. Disclaimer: I do work for Continuum, but
>>> > these views are of course my own; furthermore, I am obviously a bit
>>> > more familiar with some of the 'sponsored' open-source libraries, but
>>> > I am always open to new things.
>>> >
>>> > - integration / automatic deferral to numba for JIT (this would be
>>> > through .apply)
>>> > - automatic deferral to dask from groupby where appropriate / maybe a
>>> > .to_parallel (to simply return a dask.DataFrame object)
>>> > - incorporation of quantities / units (as part of the dtype)
>>> > - use of DyND to allow missing values for int dtypes
>>> > - make Period a first-class dtype
>>> > - provide some copy-on-write semantics to alleviate the
>>> > chained-indexing issues which occasionally come up with misuse of the
>>> > indexing API
>>> > - allow a 'policy' to automatically provide column blocks for
>>> > dict-like input (e.g. each column would be a block); this would allow
>>> > a pass-through API where you could put in numpy arrays on which you
>>> > have views and have them preserved rather than copied automatically.
>>> > Note that this would also allow what I call 'split', where a
>>> > passed-in multi-dim numpy array could be split up into individual
>>> > blocks (which actually gives a nice perf boost after the splitting
>>> > costs).
>>> >
>>> > In working towards some of these goals, I have come to the opinion
>>> > that it would make sense to have a neutral API protocol layer that
>>> > would allow us to swap out different engines as needed, for
>>> > particular dtypes, or *maybe* out-of-core type computations. E.g.
>>> > imagine that we replaced the in-memory block structure with a
>>> > bcolz / memmap type; in theory this should be 'easy' and just work.
>>> > I could also see us adopting *some* of the SFrame code to allow
>>> > easier interop with this API layer.
>>> >
>>> > In practice, I think a nice API layer would need to be created to
>>> > make this clean / nice.
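>>> >
>>> > Roughly, the shape of that protocol layer (a toy sketch with made-up
>>> > names; it could just as well be a Python-level protocol):
>>> >
>>> > #include <cstdint>
>>> > #include <vector>
>>> >
>>> > // Neutral layer: the frame talks to column storage only through this.
>>> > struct ColumnEngine {
>>> >   virtual ~ColumnEngine() = default;
>>> >   virtual int64_t length() const = 0;
>>> >   virtual double value_at(int64_t i) const = 0;  // toy accessor
>>> > };
>>> >
>>> > // The current in-memory storage would be one implementation...
>>> > struct InMemoryEngine : ColumnEngine {
>>> >   std::vector<double> data;
>>> >   int64_t length() const override { return (int64_t)data.size(); }
>>> >   double value_at(int64_t i) const override { return data[i]; }
>>> > };
>>> >
>>> > // ...and a bcolz- or memmap-backed engine would be another, with the
>>> > // layers above it unchanged.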
>>> >
>>> > So this comes around to Wes's point about creating a c++ library for
>>> > the internals (and possibly even some of the indexing routines).
>>> > In an ideal world, of course this would be desirable. Getting there
>>> > is a bit non-trivial I think, and IMHO might not be worth the effort.
>>> > I don't really see big performance bottlenecks. We *already* defer
>>> > much of the computation to libraries like numexpr & bottleneck (where
>>> > appropriate). Adding numba / dask to the list would be helpful.
>>> >
>>> > I think that almost all performance issues are the result of:
>>> >
>>> > a) gross misuse of the pandas API. How much code have you seen that
>>> > does df.apply(lambda x: x.sum())?
>>> > b) routines which operate column-by-column rather than
>>> > block-by-block and are in python space (e.g. we have an issue right
>>> > now about .quantile)
>>> >
>>> > So I am glossing over a big goal of having a c++ library that
>>> > represents the pandas internals. This would by definition have a
>>> > C API, so you *could* use pandas-like semantics in c/c++ and just
>>> > have it work (and then pandas would be a thin wrapper around this
>>> > library).
>>> >
>>> > I am not averse to this, but I think it would be quite a big effort,
>>> > and not a huge perf boost IMHO. Further, there are a number of API
>>> > issues w.r.t. indexing which need to be clarified / worked out (e.g.
>>> > should we simply deprecate []?) that are much easier to test / figure
>>> > out in python space.
>>> >
>>> > I also think that we have quite a large number of contributors.
>>> > Moving to c++ might make the internals a bit more impenetrable than
>>> > the current internals (though this would allow c++ people to
>>> > contribute, so that might balance out).
>>> >
>>> > We have a limited core of devs who right now are familiar with
>>> > things. If someone happened to have a starting base for a c++
>>> > library, then I might change opinions here.
>>> >
>>> >
>>> > my 4c.
>>> >
>>> > Jeff
>>> >
>>> >
>>> >
>>> >
>>> > On Thu, Dec 24, 2015 at 7:18 PM, Wes McKinney <wesmckinn at gmail.com>
>>> > wrote:
>>> >>
>>> >> Deep thoughts during the holidays.
>>> >>
>>> >> I might be out of line here, but the interpreter-heaviness of the
>>> >> inside of pandas objects is likely to be a long-term liability and
>>> >> source of performance problems and technical debt.
>>> >>
>>> >> Has anyone put any thought into planning and beginning to execute on a
>>> >> rewrite that moves as much as possible of the internals into native /
>>> >> compiled code? I'm talking about:
>>> >>
>>> >> - pandas/core/internals
>>> >> - indexing and assignment
>>> >> - much of pandas/core/common
>>> >> - categorical and custom dtypes
>>> >> - all indexing mechanisms
>>> >>
>>> >> I'm concerned we've already exposed too much of the internals to
>>> >> users, so this might lead to a lot of API breakage, but it might be
>>> >> for the Greater Good. As a first step, beginning a partial migration
>>> >> of internals into some C++ classes that encapsulate the insides of
>>> >> DataFrame objects and implement indexing and block-level
>>> >> manipulations would be a good place to start. I think you could do
>>> >> this without too much disruption.
>>> >>
>>> >> As part of this internal retooling we might give consideration to
>>> >> alternative data structures for representing data internal to pandas
>>> >> objects. Now in 2015/2016, continuing to be hamstrung by NumPy's
>>> >> limitations feels somewhat anachronistic. User code is riddled with
>>> >> workarounds for data type fidelity issues and the like. Like, really,
>>> >> why not add a bitndarray (similar to ilanschnell/bitarray) for storing
>>> >> nullness for problematic types and hide this from the user? =)
>>> >>
>>> >> Since we are now a NumFOCUS-sponsored project, I feel like we might
>>> >> consider establishing some formal governance over pandas and
>>> >> publishing roadmap documents describing plans for the project, along
>>> >> with meeting notes from committers. There's no real "committer
>>> >> culture" for NumFOCUS projects like there is with the Apache
>>> >> Software Foundation, but we might try leading by example!
>>> >>
>>> >> Also, I believe pandas as a project has reached a level of
>>> >> importance where we ought to consider planning and executing
>>> >> larger-scale undertakings such as this to safeguard the future.
>>> >>
>>> >> As for myself, well, I have my hands full in Big Data-land. I wish I
>>> >> could be helping more with pandas, but there are quite a few
>>> >> fundamental issues (like data interoperability, nested data
>>> >> handling, and file format support — e.g. Parquet, see
>>> >> http://wesmckinney.com/blog/the-problem-with-the-data-science-language-wars/)
>>> >> preventing Python from being more useful in industry analytics
>>> >> applications.
>>> >>
>>> >> Aside: one of the bigger mistakes I made with pandas's API design was
>>> >> making it acceptable to call class constructors — like
>>> >> pandas.DataFrame — directly (versus factory functions). Sorry about
>>> >> that! If we could convince everyone to start writing pandas.data_frame
>>> >> or dataframe instead of using the class reference it would help a lot
>>> >> with code cleanup. It's hard to plan for these things — NumPy
>>> >> interoperability seemed a lot more important in 2008 than it does now,
>>> >> so I forgive myself.
>>> >>
>>> >> cheers and best wishes for 2016,
>>> >> Wes

