[Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap

Irwin Zaid izaid at continuum.io
Tue Dec 29 18:31:59 EST 2015


Hi Wes (and others),

I've been following this conversation with interest. I do think it would be
worth exploring DyND rather than setting up yet another rewrite of NumPy
functionality, especially since DyND is already an optional dependency of
pandas.

For things like Integer NA and new dtypes, DyND is there and ready to do
this.

Irwin

On Tue, Dec 29, 2015 at 5:18 PM, Wes McKinney <wesmckinn at gmail.com> wrote:

> Can you link to the PR you're talking about?
>
> I will see about spending a few hours setting up a libpandas.so as a C++
> shared library where we can run some experiments and validate whether it
> can solve the integer-NA problem and be a place to put new data types
> (categorical and friends). I'm +1 on targeting
>
> Would it also be worth making a wish list of APIs we might consider
> breaking in a pandas 1.0 release that also features this new "native core"?
> Might as well right some wrongs while we're doing some invasive work on the
> internals; some breakage might be unavoidable. We can always maintain a
> pandas legacy 0.x.x maintenance branch (providing a conda binary build) for
> legacy users where showstopper bugs can get fixed.
>
> On Tue, Dec 29, 2015 at 1:20 PM, Jeff Reback <jeffreback at gmail.com> wrote:
> > Wes, your last point is noted as well. I *think* we can actually do this
> > now (well, there is a PR out there).
> >
> > On Tue, Dec 29, 2015 at 4:12 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
> >>
> >> The other huge thing this will enable is copy-on-write for
> >> various kinds of views, which should cut down on some of the defensive
> >> copying in the library and reduce memory usage.
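> >>
> >> Roughly what I have in mind (a toy sketch only; the names are made up and
> >> nothing like this exists in pandas today): a column holds a shared buffer,
> >> views share it for free, and the buffer is copied only at the moment of
> >> the first write while another view is still alive.
> >>
> >> #include <cstddef>
> >> #include <memory>
> >> #include <vector>
> >>
> >> struct Column {
> >>     std::shared_ptr<std::vector<double>> data;
> >>
> >>     // a view shares the buffer: O(1), no defensive copy
> >>     Column view() const { return Column{data}; }
> >>
> >>     void set(std::size_t i, double v) {
> >>         if (data.use_count() > 1)   // another view still sees this buffer
> >>             data = std::make_shared<std::vector<double>>(*data);  // copy now
> >>         (*data)[i] = v;             // mutate our now-private copy
> >>     }
> >> };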
> >>
> >> On Tue, Dec 29, 2015 at 1:02 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
> >> > Basically the approach is
> >> >
> >> > 1) Base dtype type
> >> > 2) Base array type with K >= 1 dimensions
> >> > 3) Base scalar type
> >> > 4) Base index type
> >> > 5) "Wrapper" subclasses for all NumPy types fitting into categories
> >> > #1, #2, #3, #4
> >> > 6) Subclasses for pandas-specific types like category, datetimeTZ, etc.
> >> > 7) NDFrame as cpcloud wrote is just a list of these
> >> >
> >> > Indexes and axis labels / column names can get layered on top.
> >> >
> >> > After we do all this we can look at adding nested types (arrays, maps,
> >> > structs) to better support JSON.
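> >> >
> >> > To make the shape of this concrete, here is a very rough sketch of the
> >> > hierarchy (hypothetical names, nothing settled, wrapper bodies elided):
> >> >
> >> > #include <cstdint>
> >> > #include <memory>
> >> > #include <string>
> >> > #include <utility>
> >> > #include <vector>
> >> >
> >> > struct DataType {                                // 1) base dtype
> >> >     virtual ~DataType() = default;
> >> >     virtual std::string name() const = 0;
> >> > };
> >> >
> >> > struct Array {                                   // 2) base array, K >= 1 dims
> >> >     virtual ~Array() = default;
> >> >     virtual std::shared_ptr<DataType> type() const = 0;
> >> >     virtual std::int64_t length() const = 0;
> >> > };
> >> >
> >> > struct Scalar { virtual ~Scalar() = default; };  // 3) base scalar
> >> > struct Index  { virtual ~Index()  = default; };  // 4) base index
> >> >
> >> > // 5) wrappers for NumPy-backed arrays and 6) pandas-only types
> >> > // (category, datetimeTZ, ...) would both subclass Array:
> >> > //   struct Int64Array : Array { ... };
> >> > //   struct CategoricalArray : Array { ... };
> >> >
> >> > struct NDFrame {                                 // 7) a list of named arrays
> >> >     std::vector<std::pair<std::string, std::shared_ptr<Array>>> columns;
> >> > };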
> >> >
> >> > - Wes
> >> >
> >> > On Tue, Dec 29, 2015 at 12:14 PM, Phillip Cloud <cpcloud at gmail.com>
> >> > wrote:
> >> >> Maybe this is saying the same thing as Wes, but how far would something
> >> >> like this get us?
> >> >>
> >> >> // warning: things are probably not this simple
> >> >>
> >> >> struct data_array_t {
> >> >>     void *primitive;                // scalar data
> >> >>     data_array_t *nested;           // nested data
> >> >>     boost::dynamic_bitset<> isnull; // might have to create our own to avoid boost
> >> >>     schema_t schema;                // not sure exactly what this looks like
> >> >> };
> >> >>
> >> >> typedef std::map<std::string, data_array_t> data_frame_t;  // probably not this simple
> >> >>
> >> >> To answer Jeff’s use-case question: I think that the use cases are 1)
> >> >> freedom from numpy (mostly) and 2) no more block manager, which frees us
> >> >> from the limitations of the block memory layout. In particular, the
> >> >> ability to take advantage of memory-mapped IO would be a big win IMO.
> >> >>
> >> >>
> >> >> On Tue, Dec 29, 2015 at 2:50 PM Wes McKinney <wesmckinn at gmail.com>
> >> >> wrote:
> >> >>>
> >> >>> I will write a more detailed response to some of these things after
> >> >>> the new year, but, in particular, re: missing values, can you or
> >> >>> someone tell me why creating an object that contains a NumPy array and
> >> >>> a bitmap is not sufficient? If we can add a lightweight C/C++ class
> >> >>> layer between NumPy function calls (e.g. arithmetic) and pandas
> >> >>> function calls, then I see no reason why we cannot have
> >> >>>
> >> >>> Int32Array->add
> >> >>>
> >> >>> and
> >> >>>
> >> >>> Float32Array->add
> >> >>>
> >> >>> do the right thing (the former would be responsible for bitmasking to
> >> >>> propagate NA values; the latter would defer to NumPy). If we can put
> >> >>> all the internals of pandas objects inside a black box, we can add
> >> >>> layers of virtual function indirection without a performance penalty
> >> >>> (whereas in Python, each extra abstraction layer adds interpreter
> >> >>> overhead, and that does add up to a perf penalty).
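> >> >>>
> >> >>> Concretely, the kind of thing I mean (a toy sketch only; std::vector<bool>
> >> >>> stands in for a real packed bitmap, and it assumes equal lengths):
> >> >>>
> >> >>> #include <cstddef>
> >> >>> #include <cstdint>
> >> >>> #include <vector>
> >> >>>
> >> >>> struct Int32Array {
> >> >>>     std::vector<std::int32_t> values;
> >> >>>     std::vector<bool> valid;   // validity bitmap: false means NA
> >> >>>
> >> >>>     Int32Array add(const Int32Array& other) const {
> >> >>>         Int32Array out;
> >> >>>         out.values.resize(values.size());
> >> >>>         out.valid.resize(values.size());
> >> >>>         for (std::size_t i = 0; i < values.size(); ++i) {
> >> >>>             out.valid[i]  = valid[i] && other.valid[i];   // NA propagation
> >> >>>             out.values[i] = values[i] + other.values[i];  // ignored if not valid
> >> >>>         }
> >> >>>         return out;
> >> >>>     }
> >> >>> };
> >> >>>
> >> >>> A Float32Array::add would skip the mask entirely and defer to NumPy (NaN
> >> >>> already carries the NA).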
> >> >>>
> >> >>> I don't think this is too scary -- I would be willing to create a
> >> >>> small POC C++ library to prototype something like what I'm talking
> >> >>> about.
> >> >>>
> >> >>> Since pandas has limited points of contact with NumPy I don't think
> >> >>> this would end up being too onerous.
> >> >>>
> >> >>> For the record, I'm pretty allergic to "advanced C++"; I think it is a
> >> >>> useful tool: if you pick a sane 20% subset of the C++11 spec and follow
> >> >>> Google C++ style, it's not inaccessible to intermediate
> >> >>> developers. More or less "C plus OOP and easier object lifetime
> >> >>> management (shared/unique_ptr, etc.)". As soon as you add a lot of
> >> >>> template metaprogramming, C++ library development quickly becomes
> >> >>> inaccessible except to the C++ Jedi.
> >> >>>
> >> >>> Maybe let's start a Google document on "pandas roadmap" where we can
> >> >>> break down the 1-2 year goals and some of these infrastructure issues
> >> >>> and have our discussion there? (obviously publish this someplace once
> >> >>> we're done)
> >> >>>
> >> >>> - Wes
> >> >>>
> >> >>> On Fri, Dec 25, 2015 at 2:14 PM, Jeff Reback <jeffreback at gmail.com>
> >> >>> wrote:
> >> >>> > Here are some of my thoughts about the pandas roadmap / status, and some
> >> >>> > responses to Wes's thoughts.
> >> >>> >
> >> >>> > In the last few (and upcoming) major releases we have made the
> >> >>> > following changes:
> >> >>> >
> >> >>> > - dtype enhancements (Categorical, Timedelta, Datetime w/tz) & making
> >> >>> > these first class objects
> >> >>> > - code refactoring to remove subclassing of ndarrays for Series & Index
> >> >>> > - carving out / deprecating non-core parts of pandas
> >> >>> >   - datareader
> >> >>> >   - SparsePanel, WidePanel & other aliases (TimeSeries)
> >> >>> >   - rpy, rplot, irow et al.
> >> >>> >   - google-analytics
> >> >>> > - API changes to make things more consistent
> >> >>> >   - pd.rolling/expanding * -> .rolling/expanding (this is in master now)
> >> >>> >   - .resample becoming a fully deferred operation, like groupby
> >> >>> >   - multi-index slicing along any level (obviates need for .xs) and
> >> >>> > allows assignment
> >> >>> >   - .loc/.iloc - for the most part obviates use of .ix
> >> >>> >   - .pipe & .assign
> >> >>> >   - plotting accessors
> >> >>> >   - fixing of the sorting API
> >> >>> > - many performance enhancements, both micro & macro (e.g. releasing the
> >> >>> > GIL)
> >> >>> >
> >> >>> > Some on-deck enhancements are (meaning these are basically ready to go in):
> >> >>> >   - IntervalIndex (and eventually make PeriodIndex just a sub-class of this)
> >> >>> >   - RangeIndex
> >> >>> >
> >> >>> > so lots of changes, though nothing really earth-shaking: just more
> >> >>> > convenience, somewhat less magic, and more flexibility.
> >> >>> >
> >> >>> > Of course we are getting an increasing number of issues, mostly bug
> >> >>> > reports (and lots of dupes), some edge-case enhancements which would add
> >> >>> > to the existing APIs, and of course requests to expand the (already)
> >> >>> > large codebase to other use cases.
> >> >>> > Balancing this are a good many pull requests from many different users,
> >> >>> > some even deep into the internals.
> >> >>> >
> >> >>> > Here are some things that I have talked about and that could be
> >> >>> > considered for the roadmap. Disclaimer: I do work for Continuum, but
> >> >>> > these views are of course my own; furthermore, I am obviously a bit more
> >> >>> > familiar with some of the 'sponsored' open-source libraries, but I am
> >> >>> > always open to new things.
> >> >>> >
> >> >>> > - integration / automatic deferral to numba for JIT (this would be thru
> >> >>> > .apply)
> >> >>> > - automatic deferral to dask from groupby where appropriate / maybe a
> >> >>> > .to_parallel (to simply return a dask.DataFrame object)
> >> >>> > - incorporation of quantities / units (as part of the dtype)
> >> >>> > - use of DyND to allow missing values for int dtypes
> >> >>> > - make Period a first class dtype
> >> >>> > - provide some copy-on-write semantics to alleviate the chained-indexing
> >> >>> > issues which occasionally come up with misuse of the indexing API
> >> >>> > - allow a 'policy' to automatically provide column blocks for dict-like
> >> >>> > input (e.g. each column would be a block); this would allow a pass-thru
> >> >>> > API where you could put in numpy arrays where you have views and have
> >> >>> > them preserved rather than copied automatically. Note that this would
> >> >>> > also allow what I call 'split', where a passed-in multi-dim numpy array
> >> >>> > could be split up into individual blocks (which actually gives a nice
> >> >>> > perf boost after the splitting costs).
> >> >>> >
> >> >>> > In working towards some of these goals, I have come to the opinion that
> >> >>> > it would make sense to have a neutral API protocol layer that would allow
> >> >>> > us to swap out different engines as needed, for particular dtypes, or
> >> >>> > *maybe* out-of-core type computations. E.g. imagine that we replaced the
> >> >>> > in-memory block structure with a bcolz / memmap type; in theory this
> >> >>> > should be 'easy' and just work.
> >> >>> > I could also see us adopting *some* of the SFrame code to allow easier
> >> >>> > interop with this API layer.
> >> >>> >
> >> >>> > In practice, I think a nice API layer would need to be created to make
> >> >>> > this clean / nice.
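> >> >>> >
> >> >>> > Very roughly, the kind of engine-swapping I am imagining (a purely
> >> >>> > illustrative sketch; none of these types exist, the real layer would
> >> >>> > probably be defined at the python level, and it is in C++ here only to
> >> >>> > match the other sketches in this thread):
> >> >>> >
> >> >>> > #include <cstdint>
> >> >>> > #include <vector>
> >> >>> >
> >> >>> > // pandas-level code talks only to the abstract interface, so an
> >> >>> > // in-memory block, a bcolz-backed column, or a memory-mapped one
> >> >>> > // could be swapped in per column / per dtype.
> >> >>> > struct ColumnEngine {
> >> >>> >     virtual ~ColumnEngine() = default;
> >> >>> >     virtual std::int64_t length() const = 0;
> >> >>> >     virtual double get(std::int64_t i) const = 0;
> >> >>> > };
> >> >>> >
> >> >>> > struct InMemoryEngine : ColumnEngine {
> >> >>> >     std::vector<double> values;
> >> >>> >     std::int64_t length() const override {
> >> >>> >         return static_cast<std::int64_t>(values.size());
> >> >>> >     }
> >> >>> >     double get(std::int64_t i) const override { return values[i]; }
> >> >>> > };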
> >> >>> >
> >> >>> > So this comes around to Wes's point about creating a C++ library for
> >> >>> > the internals (and possibly even some of the indexing routines).
> >> >>> > In an ideal world, of course this would be desirable. Getting there is a
> >> >>> > bit non-trivial I think, and IMHO might not be worth the effort. I don't
> >> >>> > really see big performance bottlenecks. We *already* defer much of the
> >> >>> > computation to libraries like numexpr & bottleneck (where appropriate).
> >> >>> > Adding numba / dask to the list would be helpful.
> >> >>> >
> >> >>> > I think that almost all performance issues are the result of:
> >> >>> >
> >> >>> > a) gross misuse of the pandas API. How much code have you seen that
> >> >>> > does df.apply(lambda x: x.sum())?
> >> >>> > b) routines which operate column-by-column rather than block-by-block
> >> >>> > and are in python space (e.g. we have an issue right now about .quantile)
> >> >>> >
> >> >>> > So I am glossing over a big goal of having a C++ library that represents
> >> >>> > the pandas internals. This would by definition have a C API so that you
> >> >>> > *could* use pandas-like semantics in C/C++ and just have it work (and
> >> >>> > then pandas would be a thin wrapper around this library).
> >> >>> >
> >> >>> > I am not averse to this, but I think it would be quite a big effort, and
> >> >>> > not a huge perf boost IMHO. Further, there are a number of API issues
> >> >>> > w.r.t. indexing which need to be clarified / worked out (e.g. should we
> >> >>> > simply deprecate []?) that are much easier to test / figure out in python
> >> >>> > space.
> >> >>> >
> >> >>> > I also think that we have quite a large number of contributors. Moving
> >> >>> > to C++ might make the internals a bit more impenetrable than the current
> >> >>> > internals (though this would allow C++ people to contribute, so that
> >> >>> > might balance out).
> >> >>> >
> >> >>> > We have a limited core of devs who right now are familiar with things.
> >> >>> > If someone happened to have a starting base for a C++ library, then I
> >> >>> > might change opinions here.
> >> >>> >
> >> >>> >
> >> >>> > my 4c.
> >> >>> >
> >> >>> > Jeff
> >> >>> >
> >> >>> >
> >> >>> >
> >> >>> >
> >> >>> > On Thu, Dec 24, 2015 at 7:18 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
> >> >>> >>
> >> >>> >> Deep thoughts during the holidays.
> >> >>> >>
> >> >>> >> I might be out of line here, but the interpreter-heaviness of the
> >> >>> >> inside of pandas objects is likely to be a long-term liability and
> >> >>> >> source of performance problems and technical debt.
> >> >>> >>
> >> >>> >> Has anyone put any thought into planning and beginning to execute on
> >> >>> >> a rewrite that moves as much as possible of the internals into native /
> >> >>> >> compiled code? I'm talking about:
> >> >>> >>
> >> >>> >> - pandas/core/internals
> >> >>> >> - indexing and assignment
> >> >>> >> - much of pandas/core/common
> >> >>> >> - categorical and custom dtypes
> >> >>> >> - all indexing mechanisms
> >> >>> >>
> >> >>> >> I'm concerned we've already exposed too much of the internals to users,
> >> >>> >> so this might lead to a lot of API breakage, but it might be for the
> >> >>> >> Greater Good. As a first step, beginning a partial migration of
> >> >>> >> internals into some C++ classes that encapsulate the insides of
> >> >>> >> DataFrame objects and implement indexing and block-level manipulations
> >> >>> >> would be a good place to start. I think you could do this without too
> >> >>> >> much disruption.
> >> >>> >>
> >> >>> >> As part of this internal retooling we might give consideration to
> >> >>> >> alternative data structures for representing data internal to
> >> >>> >> pandas
> >> >>> >> objects. Now in 2015/2016, continuing to be hamstrung by NumPy's
> >> >>> >> limitations feels somewhat anachronistic. User code is riddled with
> >> >>> >> workarounds for data type fidelity issues and the like. Like,
> >> >>> >> really,
> >> >>> >> why not add a bitndarray (similar to ilanschnell/bitarray) for
> >> >>> >> storing
> >> >>> >> nullness for problematic types and hide this from the user? =)
> >> >>> >>
> >> >>> >> Since we are now a NumFOCUS-sponsored project, I feel like we might
> >> >>> >> consider establishing some formal governance over pandas and
> >> >>> >> publishing roadmap documents describing plans for the project and
> >> >>> >> meeting notes from committers. There's no real
> >> >>> >> "committer culture" for NumFOCUS projects like there is with the
> >> >>> >> Apache Software Foundation, but we might try leading by example!
> >> >>> >>
> >> >>> >> Also, I believe pandas as a project has reached a level of importance
> >> >>> >> where we ought to consider planning and executing larger-scale
> >> >>> >> undertakings such as this to safeguard its future.
> >> >>> >>
> >> >>> >> As for myself, well, I have my hands full in Big Data-land. I wish I
> >> >>> >> could be helping more with pandas, but there are quite a few fundamental
> >> >>> >> issues (like data interoperability, nested data handling, and file
> >> >>> >> format support — e.g. Parquet, see
> >> >>> >> http://wesmckinney.com/blog/the-problem-with-the-data-science-language-wars/)
> >> >>> >> preventing Python from being more useful in industry analytics
> >> >>> >> applications.
> >> >>> >>
> >> >>> >> Aside: one of the bigger mistakes I made with pandas's API design was
> >> >>> >> making it acceptable to call class constructors — like
> >> >>> >> pandas.DataFrame — directly (versus factory functions). Sorry about
> >> >>> >> that! If we could convince everyone to start writing pandas.data_frame
> >> >>> >> or dataframe instead of using the class reference it would help a lot
> >> >>> >> with code cleanup. It's hard to plan for these things — NumPy
> >> >>> >> interoperability seemed a lot more important in 2008 than it does now,
> >> >>> >> so I forgive myself.
> >> >>> >>
> >> >>> >> cheers and best wishes for 2016,
> >> >>> >> Wes