[Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap

Irwin Zaid izaid at continuum.io
Tue Dec 29 19:17:25 EST 2015


Yeah, that seems reasonable and I totally agree a Pandas wrapper layer
would be necessary.

I'll keep an eye on this and I'd like to help if I can.

Irwin

On Tue, Dec 29, 2015 at 6:01 PM, Wes McKinney <wesmckinn at gmail.com> wrote:

> I'm not suggesting a rewrite of NumPy functionality but rather pandas
> functionality that is currently written in a mishmash of Cython and Python.
> Happy to experiment with changing the internal compute infrastructure and
> data representation to DyND after this first stage of cleanup is done. Even
> if we use DyND, a pretty extensive pandas wrapper layer will be necessary.
>
>
> On Tuesday, December 29, 2015, Irwin Zaid <izaid at continuum.io> wrote:
>
>> Hi Wes (and others),
>>
>> I've been following this conversation with interest. I do think it would
>> be worth exploring DyND rather than setting up yet another rewrite of
>> NumPy functionality, especially because DyND is already an optional
>> dependency of pandas.
>>
>> For things like Integer NA and new dtypes, DyND is there and ready to do
>> this.
>>
>> Irwin
>>
>> On Tue, Dec 29, 2015 at 5:18 PM, Wes McKinney <wesmckinn at gmail.com>
>> wrote:
>>
>>> Can you link to the PR you're talking about?
>>>
>>> I will see about spending a few hours setting up a libpandas.so as a C++
>>> shared library where we can run some experiments and validate whether it
>>> can solve the integer-NA problem and be a place to put new data types
>>> (categorical and friends). I'm +1 on targeting
>>>
>>> Would it also be worth making a wish list of APIs we might consider
>>> breaking in a pandas 1.0 release that also features this new "native core"?
>>> Might as well right some wrongs while we're doing some invasive work on the
>>> internals; some breakage might be unavoidable. We can always maintain a
>>> pandas legacy 0.x.x maintenance branch (providing a conda binary build) for
>>> legacy users where showstopper bugs can get fixed.
>>>
>>> On Tue, Dec 29, 2015 at 1:20 PM, Jeff Reback <jeffreback at gmail.com>
>>> wrote:
>>> > Wes, your last point is noted as well. I *think* we can actually do
>>> > this now (well, there is a PR out there).
>>> >
>>> > On Tue, Dec 29, 2015 at 4:12 PM, Wes McKinney <wesmckinn at gmail.com>
>>> wrote:
>>> >>
>>> >> The other huge thing this will enable is copy-on-write for
>>> >> various kinds of views, which should cut down on some of the defensive
>>> >> copying in the library and reduce memory usage.
>>> >>
>>> >> On Tue, Dec 29, 2015 at 1:02 PM, Wes McKinney <wesmckinn at gmail.com>
>>> wrote:
>>> >> > Basically the approach is
>>> >> >
>>> >> > 1) Base dtype type
>>> >> > 2) Base array type with K >= 1 dimensions
>>> >> > 3) Base scalar type
>>> >> > 4) Base index type
>>> >> > 5) "Wrapper" subclasses for all NumPy types fitting into categories
>>> >> > #1, #2, #3, #4
>>> >> > 6) Subclasses for pandas-specific types like category, datetimeTZ,
>>> etc.
>>> >> > 7) NDFrame as cpcloud wrote is just a list of these
>>> >> >
>>> >> > Indexes and axis labels / column names can get layered on top.
>>> >> >
>>> >> > After we do all this we can look at adding nested types (arrays,
>>> maps,
>>> >> > structs) to better support JSON.
>>> >> >
>>> >> > - Wes
>>> >> >
>>> >> > On Tue, Dec 29, 2015 at 12:14 PM, Phillip Cloud <cpcloud at gmail.com>
>>> >> > wrote:
>>> >> >> Maybe this is saying the same thing as Wes, but how far would
>>> >> >> something like this get us?
>>> >> >>
>>> >> >> // warning: things are probably not this simple
>>> >> >>
>>> >> >> struct data_array_t {
>>> >> >>     void *primitive;       // scalar data
>>> >> >>     data_array_t *nested;  // nested data
>>> >> >>     boost::dynamic_bitset isnull;  // might have to create our own
>>> >> >>                                    // to avoid boost
>>> >> >>     schema_t schema;  // not sure exactly what this looks like
>>> >> >> };
>>> >> >>
>>> >> >> typedef std::map<string, data_array_t> data_frame_t;  // probably
>>> >> >>                                                       // not this simple
>>> >> >>
>>> >> >> To answer Jeff’s use-case question: I think that the use cases are
>>> >> >> 1) freedom from numpy (mostly) and 2) no more block manager, which
>>> >> >> frees us from the limitations of the block memory layout. In
>>> >> >> particular, the ability to take advantage of memory-mapped IO would
>>> >> >> be a big win IMO.
>>> >> >>
>>> >> >>
>>> >> >> On Tue, Dec 29, 2015 at 2:50 PM Wes McKinney <wesmckinn at gmail.com>
>>> >> >> wrote:
>>> >> >>>
>>> >> >>> I will write a more detailed response to some of these things
>>> after
>>> >> >>> the new year, but, in particular, re: missing values, can you or
>>> >> >>> someone tell me why creating an object that contains a NumPy
>>> array and
>>> >> >>> a bitmap is not sufficient? If we can add a lightweight C/C++
>>> >> >>> class layer between NumPy function calls (e.g. arithmetic) and
>>> >> >>> pandas function calls, then I see no reason why we cannot have
>>> >> >>>
>>> >> >>> Int32Array->add
>>> >> >>>
>>> >> >>> and
>>> >> >>>
>>> >> >>> Float32Array->add
>>> >> >>>
>>> >> >>> do the right thing (the former would be responsible for bitmasking
>>> >> >>> to propagate NA values; the latter would defer to NumPy). If we can
>>> >> >>> put all the internals of pandas objects inside a black box, we can
>>> >> >>> add layers of virtual function indirection without a performance
>>> >> >>> penalty (whereas in Python, piling interpreter overhead onto each
>>> >> >>> abstraction layer does add up to a perf penalty).
>>> >> >>>
>>> >> >>> I don't think this is too scary -- I would be willing to create a
>>> >> >>> small POC C++ library to prototype something like what I'm talking
>>> >> >>> about.
>>> >> >>>
>>> >> >>> Since pandas has limited points of contact with NumPy I don't
>>> think
>>> >> >>> this would end up being too onerous.
>>> >> >>>
>>> >> >>> For the record, I'm pretty allergic to "advanced C++"; I think it
>>> >> >>> is a useful tool, and if you pick a sane 20% subset of the C++11
>>> >> >>> spec and follow Google C++ style, it's quite accessible to
>>> >> >>> intermediate developers. More or less "C plus OOP and easier
>>> >> >>> object lifetime management (shared/unique_ptr, etc.)". As soon as
>>> >> >>> you add a lot of template metaprogramming, C++ library development
>>> >> >>> quickly becomes inaccessible except to the C++ Jedi.
>>> >> >>>
>>> >> >>> Maybe let's start a Google document on "pandas roadmap" where we
>>> can
>>> >> >>> break down the 1-2 year goals and some of these infrastructure
>>> issues
>>> >> >>> and have our discussion there? (obviously publish this someplace
>>> once
>>> >> >>> we're done)
>>> >> >>>
>>> >> >>> - Wes
>>> >> >>>
>>> >> >>> On Fri, Dec 25, 2015 at 2:14 PM, Jeff Reback <
>>> jeffreback at gmail.com>
>>> >> >>> wrote:
>>> >> >>> > Here are some of my thoughts about pandas Roadmap / status and
>>> some
>>> >> >>> > responses to Wes's thoughts.
>>> >> >>> >
>>> >> >>> > In the last few (and upcoming) major releases we have made the
>>> >> >>> > following changes:
>>> >> >>> >
>>> >> >>> > - dtype enhancements (Categorical, Timedelta, Datetime w/tz) &
>>> >> >>> > making
>>> >> >>> > these
>>> >> >>> > first class objects
>>> >> >>> > - code refactoring to remove subclassing of ndarrays for Series
>>> &
>>> >> >>> > Index
>>> >> >>> > - carving out / deprecating non-core parts of pandas
>>> >> >>> >   - datareader
>>> >> >>> >   - SparsePanel, WidePanel & other aliases (TimeSeries)
>>> >> >>> >   - rpy, rplot, irow et al.
>>> >> >>> >   - google-analytics
>>> >> >>> > - API changes to make things more consistent
>>> >> >>> >   - pd.rolling/expanding * -> .rolling/expanding (this is in
>>> >> >>> >     master now)
>>> >> >>> >   - .resample becoming a fully deferred object, like groupby.
>>> >> >>> >   - multi-index slicing along any level (obviates need for .xs)
>>> and
>>> >> >>> > allows
>>> >> >>> > assignment
>>> >> >>> >   - .loc/.iloc - for the most part obviates use of .ix
>>> >> >>> >   - .pipe & .assign
>>> >> >>> >   - plotting accessors
>>> >> >>> >   - fixing of the sorting API
>>> >> >>> > - many performance enhancements both micro & macro (e.g. release
>>> >> >>> > GIL)
>>> >> >>> >
>>> >> >>> > Some on-deck enhancements are (meaning these are basically
>>> ready to
>>> >> >>> > go
>>> >> >>> > in):
>>> >> >>> >   - IntervalIndex (and eventually make PeriodIndex just a
>>> sub-class
>>> >> >>> > of
>>> >> >>> > this)
>>> >> >>> >   - RangeIndex
>>> >> >>> >
>>> >> >>> > So: lots of changes, though nothing really earth-shaking, just
>>> >> >>> > more convenience, somewhat reduced magicness, and more
>>> >> >>> > flexibility.
>>> >> >>> >
>>> >> >>> > Of course we are getting an increasing number of issues, mostly
>>> >> >>> > bug reports (and lots of dupes), some edge-case enhancements
>>> >> >>> > which add to the existing APIs, and of course requests to expand
>>> >> >>> > the (already) large codebase to other use cases.
>>> >> >>> > Balancing this are a good many pull-requests from many different
>>> >> >>> > users,
>>> >> >>> > some
>>> >> >>> > even deep into the internals.
>>> >> >>> >
>>> >> >>> > Here are some things that I have talked about and could be
>>> >> >>> > considered
>>> >> >>> > for
>>> >> >>> > the roadmap. Disclaimer: I do work for Continuum
>>> >> >>> > but these views are of course my own; furthermore obviously I
>>> am a
>>> >> >>> > bit
>>> >> >>> > more
>>> >> >>> > familiar with some of the 'sponsored' open-source
>>> >> >>> > libraries, but always open to new things.
>>> >> >>> >
>>> >> >>> > - integration / automatic deferral to numba for JIT (this would
>>> be
>>> >> >>> > thru
>>> >> >>> > .apply)
>>> >> >>> > - automatic deferral to dask from groupby where appropriate /
>>> >> >>> > maybe a .to_parallel (to simply return a dask.DataFrame object)
>>> >> >>> > - incorporation of quantities / units (as part of the dtype)
>>> >> >>> > - use of DyND to allow missing values for int dtypes
>>> >> >>> > - make Period a first class dtype.
>>> >> >>> > - provide some copy-on-write semantics to alleviate the
>>> >> >>> > chained-indexing
>>> >> >>> > issues which occasionally come up with the misuse of the
>>> >> >>> > indexing API
>>> >> >>> > - allow a 'policy' to automatically provide column blocks for
>>> >> >>> > dict-like input (e.g. each column would be a block). This would
>>> >> >>> > allow a pass-thru API where you could put in numpy arrays where
>>> >> >>> > you have views and have them preserved rather than copied
>>> >> >>> > automatically. Note that this would also allow what I call
>>> >> >>> > 'split', where a passed-in multi-dim numpy array could be split
>>> >> >>> > up into individual blocks (which actually gives a nice perf
>>> >> >>> > boost after the splitting costs).
>>> >> >>> >
>>> >> >>> > In working towards some of these goals, I have come to the
>>> >> >>> > opinion that it would make sense to have a neutral API protocol
>>> >> >>> > layer that would allow us to swap out different engines as
>>> >> >>> > needed, for particular dtypes, or *maybe* out-of-core type
>>> >> >>> > computations. E.g.
>>> >> >>> > imagine that we replaced the in-memory block structure with a
>>> >> >>> > bcolz / memmap type; in theory this should be 'easy' and just
>>> >> >>> > work.
>>> >> >>> > I could also see us adopting *some* of the SFrame code to allow
>>> >> >>> > easier
>>> >> >>> > interop with this API layer.
>>> >> >>> >
>>> >> >>> > In practice, I think a nice API layer would need to be created
>>> to
>>> >> >>> > make
>>> >> >>> > this
>>> >> >>> > clean / nice.
>>> >> >>> >
>>> >> >>> > So this comes around to Wes's point about creating a c++
>>> library for
>>> >> >>> > the
>>> >> >>> > internals (and possibly even some of the indexing routines).
>>> >> >>> > In an ideal world, of course, this would be desirable. Getting
>>> >> >>> > there is a bit
>>> >> >>> > non-trivial I think, and IMHO might not be worth the effort. I
>>> don't
>>> >> >>> > really see big performance bottlenecks. We *already* defer much
>>> of
>>> >> >>> > the
>>> >> >>> > computation to libraries like numexpr & bottleneck (where
>>> >> >>> > appropriate).
>>> >> >>> > Adding numba / dask to the list would be helpful.
>>> >> >>> >
>>> >> >>> > I think that almost all performance issues are the result of:
>>> >> >>> >
>>> >> >>> > a) gross misuse of the pandas API. How much code have you seen
>>> >> >>> > that does df.apply(lambda x: x.sum())?
>>> >> >>> > b) routines which operate column-by-column rather than
>>> >> >>> > block-by-block and are in python space (e.g. we have an issue
>>> >> >>> > right now about .quantile)
>>> >> >>> >
>>> >> >>> > So I am glossing over a big goal of having a c++ library that
>>> >> >>> > represents
>>> >> >>> > the
>>> >> >>> > pandas internals. This would by definition have a C API so that
>>> >> >>> > you *could* use pandas-like semantics in C/C++ and just have it
>>> >> >>> > work (and then pandas would be a thin wrapper around this
>>> >> >>> > library).
>>> >> >>> >
>>> >> >>> > I am not averse to this, but I think it would be quite a big
>>> >> >>> > effort, and not a huge perf boost IMHO. Further, there are a
>>> >> >>> > number of API issues w.r.t. indexing which need to be clarified
>>> >> >>> > / worked out (e.g. should we simply deprecate []) that are much
>>> >> >>> > easier to test / figure out in python space.
>>> >> >>> >
>>> >> >>> > I also think that we have quite a large number of contributors.
>>> >> >>> > Moving to C++ might make the internals a bit more impenetrable
>>> >> >>> > than the current internals.
>>> >> >>> > (though this would allow c++ people to contribute, so that might
>>> >> >>> > balance
>>> >> >>> > out).
>>> >> >>> >
>>> >> >>> > We have a limited core of devs who right now are familiar with
>>> >> >>> > things. If
>>> >> >>> > someone happened to have a starting base for a C++ library,
>>> >> >>> > then I might change my opinion here.
>>> >> >>> >
>>> >> >>> >
>>> >> >>> > my 4c.
>>> >> >>> >
>>> >> >>> > Jeff
>>> >> >>> >
>>> >> >>> >
>>> >> >>> >
>>> >> >>> >
>>> >> >>> > On Thu, Dec 24, 2015 at 7:18 PM, Wes McKinney <
>>> wesmckinn at gmail.com>
>>> >> >>> > wrote:
>>> >> >>> >>
>>> >> >>> >> Deep thoughts during the holidays.
>>> >> >>> >>
>>> >> >>> >> I might be out of line here, but the interpreter-heaviness of
>>> the
>>> >> >>> >> inside of pandas objects is likely to be a long-term liability
>>> and
>>> >> >>> >> source of performance problems and technical debt.
>>> >> >>> >>
>>> >> >>> >> Has anyone put any thought into planning and beginning to
>>> execute
>>> >> >>> >> on a
>>> >> >>> >> rewrite that moves as much as possible of the internals into
>>> native
>>> >> >>> >> /
>>> >> >>> >> compiled code? I'm talking about:
>>> >> >>> >>
>>> >> >>> >> - pandas/core/internals
>>> >> >>> >> - indexing and assignment
>>> >> >>> >> - much of pandas/core/common
>>> >> >>> >> - categorical and custom dtypes
>>> >> >>> >> - all indexing mechanisms
>>> >> >>> >>
>>> >> >>> >> I'm concerned we've already exposed too much of the internals
>>> >> >>> >> to users, so
>>> >> >>> >> this might lead to a lot of API breakage, but it might be for
>>> the
>>> >> >>> >> Greater Good. As a first step, beginning a partial migration of
>>> >> >>> >> internals into some C++ classes that encapsulate the insides of
>>> >> >>> >> DataFrame objects and implement indexing and block-level
>>> >> >>> >> manipulations
>>> >> >>> >> would be a good place to start. I think you could do this
>>> >> >>> >> without too much disruption.
>>> >> >>> >>
>>> >> >>> >> As part of this internal retooling we might give consideration
>>> to
>>> >> >>> >> alternative data structures for representing data internal to
>>> >> >>> >> pandas
>>> >> >>> >> objects. Now in 2015/2016, continuing to be hamstrung by
>>> NumPy's
>>> >> >>> >> limitations feels somewhat anachronistic. User code is riddled
>>> with
>>> >> >>> >> workarounds for data type fidelity issues and the like. Like,
>>> >> >>> >> really,
>>> >> >>> >> why not add a bitndarray (similar to ilanschnell/bitarray) for
>>> >> >>> >> storing
>>> >> >>> >> nullness for problematic types and hide this from the user? =)
>>> >> >>> >>
>>> >> >>> >> Since we are now a NumFOCUS-sponsored project, I feel like we
>>> >> >>> >> might consider establishing some formal governance over pandas
>>> >> >>> >> and publishing roadmap documents describing plans for the
>>> >> >>> >> project, along with meeting notes from committers. There's no
>>> >> >>> >> real "committer culture" for NumFOCUS projects like there is
>>> >> >>> >> with the Apache Software Foundation, but we might try leading
>>> >> >>> >> by example!
>>> >> >>> >>
>>> >> >>> >> Also, I believe pandas as a project has reached a level of
>>> >> >>> >> importance
>>> >> >>> >> where we ought to consider planning and execution on larger
>>> scale
>>> >> >>> >> undertakings such as this for safeguarding the future.
>>> >> >>> >>
>>> >> >>> >> As for myself, well, I have my hands full in Big Data-land. I
>>> wish
>>> >> >>> >> I
>>> >> >>> >> could be helping more with pandas, but there are quite a few
>>> >> >>> >> fundamental issues (like data interoperability, nested data
>>> >> >>> >> handling, and file format support, e.g. Parquet, see
>>> >> >>> >> http://wesmckinney.com/blog/the-problem-with-the-data-science-language-wars/)
>>> >> >>> >> preventing Python from being more useful in industry analytics
>>> >> >>> >> applications.
>>> >> >>> >>
>>> >> >>> >> Aside: one of the bigger mistakes I made with pandas's API
>>> design
>>> >> >>> >> was
>>> >> >>> >> making it acceptable to call class constructors — like
>>> >> >>> >> pandas.DataFrame — directly (versus factory functions). Sorry
>>> about
>>> >> >>> >> that! If we could convince everyone to start writing
>>> >> >>> >> pandas.data_frame
>>> >> >>> >> or dataframe instead of using the class reference it would
>>> help a
>>> >> >>> >> lot
>>> >> >>> >> with code cleanup. It's hard to plan for these things — NumPy
>>> >> >>> >> interoperability seemed a lot more important in 2008 than it
>>> does
>>> >> >>> >> now,
>>> >> >>> >> so I forgive myself.
>>> >> >>> >>
>>> >> >>> >> cheers and best wishes for 2016,
>>> >> >>> >> Wes
>>> >> >>> >> _______________________________________________
>>> >> >>> >> Pandas-dev mailing list
>>> >> >>> >> Pandas-dev at python.org
>>> >> >>> >> https://mail.python.org/mailman/listinfo/pandas-dev
>>> >> >>> >
>>> >> >>> >
>>> >
>>> >
>>>
>>>
>>>
>>>
>>