[Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap

Wes McKinney wesmckinn at gmail.com
Tue Dec 29 18:18:04 EST 2015


Can you link to the PR you're talking about?

I will see about spending a few hours setting up a libpandas.so as a C++
shared library where we can run some experiments and validate whether it
can solve the integer-NA problem and be a place to put new data types
(categorical and friends). I'm +1 on targeting

Would it also be worth making a wish list of APIs we might consider
breaking in a pandas 1.0 release that also features this new "native core"?
Might as well right some wrongs while we're doing some invasive work on the
internals; some breakage might be unavoidable. We can always maintain a
pandas legacy 0.x.x maintenance branch (providing a conda binary build) for
legacy users where showstopper bugs can get fixed.

On Tue, Dec 29, 2015 at 1:20 PM, Jeff Reback <jeffreback at gmail.com> wrote:
> Wes, your last is noted as well. I *think* we can actually do this now
> (well, there is a PR out there).
>
> On Tue, Dec 29, 2015 at 4:12 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
>>
>> The other huge thing this will enable is copy-on-write for various
>> kinds of views, which should cut down on some of the defensive copying
>> in the library and reduce memory usage.
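>>
>> Concretely, a rough illustration of the copy-on-write idea (the names
>> and layout here are just placeholders, not a design anyone has agreed
>> on):
>>
>> #include <cstddef>
>> #include <memory>
>> #include <vector>
>>
>> struct ArrayData {
>>     std::vector<double> values;  // the underlying buffer
>> };
>>
>> class Column {
>>   public:
>>     explicit Column(std::shared_ptr<ArrayData> data) : data_(std::move(data)) {}
>>
>>     // A view shares the underlying buffer; nothing is copied here.
>>     Column View() const { return Column(data_); }
>>
>>     // Mutation copies the buffer only if another view still references it.
>>     void Set(std::size_t i, double v) {
>>         if (data_.use_count() > 1) {
>>             data_ = std::make_shared<ArrayData>(*data_);
>>         }
>>         data_->values[i] = v;
>>     }
>>
>>   private:
>>     std::shared_ptr<ArrayData> data_;
>> };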
>>
>> On Tue, Dec 29, 2015 at 1:02 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
>> > Basically the approach is
>> >
>> > 1) Base dtype type
>> > 2) Base array type with K >= 1 dimensions
>> > 3) Base scalar type
>> > 4) Base index type
>> > 5) "Wrapper" subclasses for all NumPy types fitting into categories
>> > #1, #2, #3, #4
>> > 6) Subclasses for pandas-specific types like category, datetimeTZ, etc.
>> > 7) NDFrame as cpcloud wrote is just a list of these
>> >
>> > Indexes and axis labels / column names can get layered on top.
>> >
>> > After we do all this we can look at adding nested types (arrays, maps,
>> > structs) to better support JSON.
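>> >
>> > To make the shape of this concrete, a sketch only (none of these
>> > class names or signatures are settled):
>> >
>> > #include <cstdint>
>> > #include <memory>
>> > #include <string>
>> > #include <vector>
>> >
>> > struct DataType {                    // 1) base dtype type
>> >     virtual ~DataType() = default;
>> >     virtual std::string name() const = 0;
>> > };
>> >
>> > struct Array {                       // 2) base array type, K >= 1 dims
>> >     virtual ~Array() = default;
>> >     virtual std::shared_ptr<DataType> type() const = 0;
>> >     virtual int64_t length() const = 0;
>> > };
>> >
>> > struct Scalar { virtual ~Scalar() = default; };  // 3) base scalar type
>> > struct Index  { virtual ~Index()  = default; };  // 4) base index type
>> >
>> > // 5) wrappers for NumPy-backed memory and 6) pandas-specific types
>> > //    would subclass Array (bodies elided, so these stay abstract here)
>> > struct NumPyArray : Array { /* wraps the NumPy buffer */ };
>> > struct CategoricalArray : Array { /* codes + categories */ };
>> >
>> > // 7) an NDFrame is then more or less a list of named arrays
>> > struct NDFrame {
>> >     std::vector<std::string> column_names;
>> >     std::vector<std::shared_ptr<Array>> columns;
>> > };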
>> >
>> > - Wes
>> >
>> > On Tue, Dec 29, 2015 at 12:14 PM, Phillip Cloud <cpcloud at gmail.com> wrote:
>> >> Maybe this is saying the same thing as Wes, but how far would
>> >> something like this get us?
>> >>
>> >> // warning: things are probably not this simple
>> >>
>> >> struct data_array_t {
>> >>     void *primitive;                 // scalar data
>> >>     data_array_t *nested;            // nested data
>> >>     boost::dynamic_bitset<> isnull;  // might have to create our own to avoid boost
>> >>     schema_t schema;                 // not sure exactly what this looks like
>> >> };
>> >>
>> >> typedef std::map<std::string, data_array_t> data_frame_t;  // probably not this simple
>> >>
>> >> To answer Jeff’s use-case question: I think that the use cases are
>> >> 1) freedom from numpy (mostly) 2) no more block manager which frees
>> >> us from the limitations of the block memory layout. In particular,
>> >> the ability to take advantage of memory mapped IO would be a big win
>> >> IMO.
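>> >>
>> >> For the memory-mapped part, roughly this kind of thing (a POSIX-only
>> >> sketch with no error handling, purely to illustrate; nothing here is
>> >> an actual proposal):
>> >>
>> >> #include <cstdint>
>> >> #include <fcntl.h>
>> >> #include <sys/mman.h>
>> >> #include <sys/stat.h>
>> >> #include <unistd.h>
>> >>
>> >> struct MappedDoubleColumn {
>> >>     const double* values = nullptr;
>> >>     int64_t length = 0;
>> >> };
>> >>
>> >> // Back a column's values directly by a file on disk instead of a
>> >> // heap-allocated block; the OS pages data in as it is touched.
>> >> MappedDoubleColumn MapColumn(const char* path) {
>> >>     int fd = open(path, O_RDONLY);
>> >>     struct stat st;
>> >>     fstat(fd, &st);
>> >>     void* addr = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
>> >>     close(fd);  // the mapping stays valid after close
>> >>     MappedDoubleColumn col;
>> >>     col.values = static_cast<const double*>(addr);
>> >>     col.length = static_cast<int64_t>(st.st_size / sizeof(double));
>> >>     return col;
>> >> }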
>> >>
>> >>
>> >> On Tue, Dec 29, 2015 at 2:50 PM Wes McKinney <wesmckinn at gmail.com> wrote:
>> >>>
>> >>> I will write a more detailed response to some of these things after
>> >>> the new year, but, in particular, re: missing values, can you or
>> >>> someone tell me why creating an object that contains a NumPy array
>> >>> and a bitmap is not sufficient? If we can add a lightweight C/C++
>> >>> class layer between NumPy function calls (e.g. arithmetic) and
>> >>> pandas function calls, then I see no reason why we cannot have
>> >>>
>> >>> Int32Array->add
>> >>>
>> >>> and
>> >>>
>> >>> Float32Array->add
>> >>>
>> >>> do the right thing (the former would be responsible for bitmasking to
>> >>> propagate NA values; the latter would defer to NumPy). If we can put
>> >>> all the internals of pandas objects inside a black box, we can add
>> >>> layers of virtual function indirection without a performance penalty
>> >>> (whereas adding more abstraction layers at the interpreter level does
>> >>> add up to a perf penalty).
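>> >>>
>> >>> Roughly what I mean, as a sketch only (none of these names or
>> >>> signatures are settled):
>> >>>
>> >>> #include <cstddef>
>> >>> #include <cstdint>
>> >>> #include <vector>
>> >>>
>> >>> // An int32 array plus a validity bitmap (one bit per value).
>> >>> struct Int32Array {
>> >>>     std::vector<int32_t> values;
>> >>>     std::vector<uint8_t> valid_bits;  // packed bitmap, 1 = not null
>> >>> };
>> >>>
>> >>> // Element-wise add: the arithmetic itself could be handed to NumPy;
>> >>> // the wrapper's only job is to AND the bitmaps so NA propagates.
>> >>> Int32Array Add(const Int32Array& left, const Int32Array& right) {
>> >>>     Int32Array out;
>> >>>     out.values.resize(left.values.size());
>> >>>     out.valid_bits.resize(left.valid_bits.size());
>> >>>     for (std::size_t i = 0; i < left.values.size(); ++i) {
>> >>>         out.values[i] = left.values[i] + right.values[i];
>> >>>     }
>> >>>     for (std::size_t i = 0; i < left.valid_bits.size(); ++i) {
>> >>>         out.valid_bits[i] = left.valid_bits[i] & right.valid_bits[i];
>> >>>     }
>> >>>     return out;
>> >>> }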
>> >>>
>> >>> I don't think this is too scary -- I would be willing to create a
>> >>> small POC C++ library to prototype something like what I'm talking
>> >>> about.
>> >>>
>> >>> Since pandas has limited points of contact with NumPy I don't think
>> >>> this would end up being too onerous.
>> >>>
>> >>> For the record, I'm pretty allergic to "advanced C++"; I think it is
>> >>> a useful tool, and if you pick a sane 20% subset of the C++11 spec
>> >>> and follow Google C++ style it's not inaccessible to intermediate
>> >>> developers. More or less "C plus OOP and easier object lifetime
>> >>> management (shared/unique_ptr, etc.)". As soon as you add a lot of
>> >>> template metaprogramming, C++ library development quickly becomes
>> >>> inaccessible except to the C++-Jedi.
>> >>>
>> >>> Maybe let's start a Google document on "pandas roadmap" where we can
>> >>> break down the 1-2 year goals and some of these infrastructure issues
>> >>> and have our discussion there? (obviously publish this someplace once
>> >>> we're done)
>> >>>
>> >>> - Wes
>> >>>
>> >>> On Fri, Dec 25, 2015 at 2:14 PM, Jeff Reback <jeffreback at gmail.com> wrote:
>> >>> > Here are some of my thoughts about pandas Roadmap / status and some
>> >>> > responses to Wes's thoughts.
>> >>> >
>> >>> > In the last few (and upcoming) major releases we have made the
>> >>> > following changes:
>> >>> >
>> >>> > - dtype enhancements (Categorical, Timedelta, Datetime w/tz) &
>> >>> >   making these first class objects
>> >>> > - code refactoring to remove subclassing of ndarrays for Series &
>> >>> >   Index
>> >>> > - carving out / deprecating non-core parts of pandas
>> >>> >   - datareader
>> >>> >   - SparsePanel, WidePanel & other aliases (TimeSeries)
>> >>> >   - rpy, rplot, irow et al.
>> >>> >   - google-analytics
>> >>> > - API changes to make things more consistent
>> >>> >   - pd.rolling/expanding * -> .rolling/expanding (this is in master now)
>> >>> >   - .resample becoming fully deferred, like groupby
>> >>> >   - multi-index slicing along any level (obviates need for .xs) and
>> >>> >     allows assignment
>> >>> >   - .loc/.iloc - for the most part obviates use of .ix
>> >>> >   - .pipe & .assign
>> >>> >   - plotting accessors
>> >>> >   - fixing of the sorting API
>> >>> > - many performance enhancements both micro & macro (e.g. releasing
>> >>> >   the GIL)
>> >>> >
>> >>> > Some on-deck enhancements are (meaning these are basically ready
>> >>> > to go in):
>> >>> >   - IntervalIndex (and eventually make PeriodIndex just a sub-class
>> >>> >     of this)
>> >>> >   - RangeIndex
>> >>> >
>> >>> > so lots of changes, though nothing really earth-shaking, just more
>> >>> > convenience, reducing magicness somewhat and providing flexibility.
>> >>> >
>> >>> > Of course we are getting an increasing number of issues, mostly bug
>> >>> > reports (and lots of dupes), some edge-case enhancements which would
>> >>> > add to the existing APIs, and of course requests to expand the
>> >>> > (already) large codebase to other use cases. Balancing this are a
>> >>> > good many pull requests from many different users, some even deep
>> >>> > into the internals.
>> >>> >
>> >>> > Here are some things that I have talked about and that could be
>> >>> > considered for the roadmap. Disclaimer: I do work for Continuum,
>> >>> > but these views are of course my own; furthermore, obviously I am
>> >>> > a bit more familiar with some of the 'sponsored' open-source
>> >>> > libraries, but I am always open to new things.
>> >>> >
>> >>> > - integration / automatic deferral to numba for JIT (this would be
>> >>> >   thru .apply)
>> >>> > - automatic deferral to dask from groupby where appropriate / maybe
>> >>> >   a .to_parallel (to simply return a dask.DataFrame object)
>> >>> > - incorporation of quantities / units (as part of the dtype)
>> >>> > - use of DyND to allow missing values for int dtypes
>> >>> > - make Period a first class dtype
>> >>> > - provide some copy-on-write semantics to alleviate the
>> >>> >   chained-indexing issues which occasionally come up with misuse of
>> >>> >   the indexing API
>> >>> > - allow a 'policy' to automatically provide column blocks for
>> >>> >   dict-like input (e.g. each column would be a block); this would
>> >>> >   allow a pass-thru API where you could put in numpy arrays where
>> >>> >   you have views and have them preserved rather than copied
>> >>> >   automatically. Note that this would also allow what I call
>> >>> >   'split', where a passed-in multi-dim numpy array could be split
>> >>> >   up into individual blocks (which actually gives a nice perf boost
>> >>> >   after the splitting costs).
>> >>> >
>> >>> > In working towards some of these goals, I have come to the opinion
>> >>> > that it would make sense to have a neutral API protocol layer that
>> >>> > would allow us to swap out different engines as needed, for
>> >>> > particular dtypes, or *maybe* out-of-core type computations. E.g.
>> >>> > imagine that we replaced the in-memory block structure with a bcolz
>> >>> > / memmap type; in theory this should be 'easy' and just work.
>> >>> > I could also see us adopting *some* of the SFrame code to allow
>> >>> > easier interop with this API layer.
>> >>> >
>> >>> > In practice, I think a nice API layer would need to be created to
>> >>> > make this clean / nice.
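>> >>> >
>> >>> > Roughly the kind of interface I have in mind (just a sketch; the
>> >>> > names here are placeholders, not a proposal):
>> >>> >
>> >>> > #include <cstdint>
>> >>> > #include <memory>
>> >>> > #include <vector>
>> >>> >
>> >>> > // A neutral "engine" interface: pandas-level code talks to this,
>> >>> > // and an in-memory, bcolz-backed, or memory-mapped engine could
>> >>> > // plug in behind it.
>> >>> > class ColumnEngine {
>> >>> >   public:
>> >>> >     virtual ~ColumnEngine() = default;
>> >>> >     virtual int64_t length() const = 0;
>> >>> >     virtual double GetDouble(int64_t i) const = 0;
>> >>> >     virtual std::shared_ptr<ColumnEngine> Slice(int64_t start,
>> >>> >                                                 int64_t stop) const = 0;
>> >>> > };
>> >>> >
>> >>> > // One possible backend: plain in-memory storage.
>> >>> > class InMemoryColumn : public ColumnEngine {
>> >>> >   public:
>> >>> >     explicit InMemoryColumn(std::vector<double> data) : data_(std::move(data)) {}
>> >>> >     int64_t length() const override { return static_cast<int64_t>(data_.size()); }
>> >>> >     double GetDouble(int64_t i) const override { return data_[i]; }
>> >>> >     std::shared_ptr<ColumnEngine> Slice(int64_t start, int64_t stop) const override {
>> >>> >         return std::make_shared<InMemoryColumn>(
>> >>> >             std::vector<double>(data_.begin() + start, data_.begin() + stop));
>> >>> >     }
>> >>> >   private:
>> >>> >     std::vector<double> data_;
>> >>> > };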
>> >>> >
>> >>> > So this comes around to Wes's point about creating a c++ library
>> >>> > for the internals (and possibly even some of the indexing routines).
>> >>> > In an ideal world, of course this would be desirable. Getting there
>> >>> > is a bit non-trivial I think, and IMHO might not be worth the
>> >>> > effort. I don't really see big performance bottlenecks. We *already*
>> >>> > defer much of the computation to libraries like numexpr & bottleneck
>> >>> > (where appropriate). Adding numba / dask to the list would be
>> >>> > helpful.
>> >>> >
>> >>> > I think that almost all performance issues are the result of:
>> >>> >
>> >>> > a) gross misuse of the pandas API. How much code have you seen that
>> >>> >    does df.apply(lambda x: x.sum()) instead of just df.sum()?
>> >>> > b) routines which operate column-by-column rather than
>> >>> >    block-by-block and are in python space (e.g. we have an issue
>> >>> >    right now about .quantile)
>> >>> >
>> >>> > So I am glossing over a big goal of having a c++ library that
>> >>> > represents the pandas internals. This would by definition have a
>> >>> > c-API so that you *could* use pandas-like semantics in c/c++ and
>> >>> > just have it work (and then pandas would be a thin wrapper around
>> >>> > this library).
>> >>> >
>> >>> > I am not averse to this, but I think it would be quite a big
>> >>> > effort, and not a huge perf boost IMHO. Further, there are a number
>> >>> > of API issues w.r.t. indexing which need to be clarified / worked
>> >>> > out (e.g. should we simply deprecate []?) that are much easier to
>> >>> > test / figure out in python space.
>> >>> >
>> >>> > I also think that we have quite a large number of contributors.
>> >>> > Moving to c++ might make the internals a bit more impenetrable than
>> >>> > the current internals (though this would allow c++ people to
>> >>> > contribute, so that might balance out).
>> >>> >
>> >>> > We have a limited core of devs who right now are familiar with
>> >>> > things. If someone happened to have a starting base for a c++
>> >>> > library, then I might change opinions here.
>> >>> >
>> >>> >
>> >>> > my 4c.
>> >>> >
>> >>> > Jeff
>> >>> >
>> >>> >
>> >>> >
>> >>> >
>> >>> > On Thu, Dec 24, 2015 at 7:18 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
>> >>> >>
>> >>> >> Deep thoughts during the holidays.
>> >>> >>
>> >>> >> I might be out of line here, but the interpreter-heaviness of the
>> >>> >> inside of pandas objects is likely to be a long-term liability and
>> >>> >> source of performance problems and technical debt.
>> >>> >>
>> >>> >> Has anyone put any thought into planning and beginning to execute
>> >>> >> on a rewrite that moves as much as possible of the internals into
>> >>> >> native / compiled code? I'm talking about:
>> >>> >>
>> >>> >> - pandas/core/internals
>> >>> >> - indexing and assignment
>> >>> >> - much of pandas/core/common
>> >>> >> - categorical and custom dtypes
>> >>> >> - all indexing mechanisms
>> >>> >>
>> >>> >> I'm concerned we've already exposed too much of the internals to
>> >>> >> users, so this might lead to a lot of API breakage, but it might be
>> >>> >> for the Greater Good. As a first step, beginning a partial
>> >>> >> migration of internals into some C++ classes that encapsulate the
>> >>> >> insides of DataFrame objects and implement indexing and block-level
>> >>> >> manipulations would be a good place to start. I think you could do
>> >>> >> this without too much disruption.
>> >>> >>
>> >>> >> As part of this internal retooling we might give consideration to
>> >>> >> alternative data structures for representing data internal to
>> >>> >> pandas objects. Now in 2015/2016, continuing to be hamstrung by
>> >>> >> NumPy's limitations feels somewhat anachronistic. User code is
>> >>> >> riddled with workarounds for data type fidelity issues and the
>> >>> >> like. Like, really, why not add a bitndarray (similar to
>> >>> >> ilanschnell/bitarray) for storing nullness for problematic types
>> >>> >> and hide this from the user? =)
>> >>> >>
>> >>> >> Since we are now a NumFOCUS-sponsored project, I feel like we
>> >>> >> might consider establishing some formal governance over pandas,
>> >>> >> publishing roadmap documents describing plans for the project, and
>> >>> >> publishing meeting notes from committers. There's no real
>> >>> >> "committer culture" for NumFOCUS projects like there is with the
>> >>> >> Apache Software Foundation, but we might try leading by example!
>> >>> >>
>> >>> >> Also, I believe pandas as a project has reached a level of
>> >>> >> importance where we ought to consider planning and executing
>> >>> >> larger-scale undertakings such as this for safeguarding the future.
>> >>> >>
>> >>> >> As for myself, well, I have my hands full in Big Data-land. I
>> >>> >> wish I could be helping more with pandas, but there are quite a
>> >>> >> few fundamental issues (like data interoperability, nested data
>> >>> >> handling, and file format support — e.g. Parquet, see
>> >>> >> http://wesmckinney.com/blog/the-problem-with-the-data-science-language-wars/)
>> >>> >> preventing Python from being more useful in industry analytics
>> >>> >> applications.
>> >>> >>
>> >>> >> Aside: one of the bigger mistakes I made with pandas's API design
>> >>> >> was making it acceptable to call class constructors — like
>> >>> >> pandas.DataFrame — directly (versus factory functions). Sorry
>> >>> >> about that! If we could convince everyone to start writing
>> >>> >> pandas.data_frame or dataframe instead of using the class reference
>> >>> >> it would help a lot with code cleanup. It's hard to plan for these
>> >>> >> things — NumPy interoperability seemed a lot more important in 2008
>> >>> >> than it does now, so I forgive myself.
>> >>> >>
>> >>> >> cheers and best wishes for 2016,
>> >>> >> Wes
>> >>> >> _______________________________________________
>> >>> >> Pandas-dev mailing list
>> >>> >> Pandas-dev at python.org
>> >>> >> https://mail.python.org/mailman/listinfo/pandas-dev
>> >>> >
>> >>> >
>> >>> _______________________________________________
>> >>> Pandas-dev mailing list
>> >>> Pandas-dev at python.org
>> >>> https://mail.python.org/mailman/listinfo/pandas-dev
>> _______________________________________________
>> Pandas-dev mailing list
>> Pandas-dev at python.org
>> https://mail.python.org/mailman/listinfo/pandas-dev
>
>

