[Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap

Fri Jan 1 21:06:35 EST 2016

ok I moved the document to the Pandas folder, where the same group should
be able to edit/upload/etc. lmk if any issues

On Fri, Jan 1, 2016 at 8:48 PM, Wes McKinney <wesmckinn at gmail.com> wrote:

> Thanks Jeff. Can you create and share a shared Drive folder containing
> this where I can put other auxiliary / follow up documents?
>
> On Fri, Jan 1, 2016 at 5:23 PM, Jeff Reback <jeffreback at gmail.com> wrote:
> > I changed the doc so that the core dev people can edit. I *think* that
> > everyone should be able to view/comment though.
> >
> > On Fri, Jan 1, 2016 at 8:13 PM, Wes McKinney <wesmckinn at gmail.com>
> wrote:
> >>
> >> Jeff -- can you require log-in for editing on this document?
> >>
> >> https://docs.google.com/document/d/151ct8jcZWwh7XStptjbLsda6h2b3C0IuiH_hfZnUA58/edit#
> >>
> >> There are a number of anonymous edits.
> >>
> >> On Wed, Dec 30, 2015 at 6:04 PM, Wes McKinney <wesmckinn at gmail.com>
> wrote:
> >> > I cobbled together an ugly start of a c++->cython->pandas toolchain
> >> > here
> >> >
> >> > https://github.com/wesm/pandas/tree/libpandas-native-core
> >> >
> >> > I used a mix of Kudu, Impala, and dynd-python cmake sources, so it's a
> >> > bit messy at the moment but it should be sufficient to run some real
> >> > experiments with a little more work. I reckon it's like a 6 month
> >> > project to tear out the insides of Series and DataFrame and replace them
> >> > with a new "native core", but we should be able to get enough info to
> >> > see whether it's a viable plan within a month or so.
> >> >
> >> > The end goal is to create "private" extension types in Cython that can
> >> > be the new base classes for Series and NDFrame; these will hold a
> >> > reference to a C++ object that contains wrapped NumPy arrays and
> >> > other metadata (like pandas-only dtypes).
> >> >
> >> > It might be too hard to try to replace a single usage of block manager
> >> > as a first experiment, so I'll try to create a minimal "SeriesLite"
> >> > that supports 3 dtypes
> >> >
> >> > 1) float64 with nans
> >> > 2) int64 with a bitmask for NAs
> >> > 3) category type for one of these
> >> >
> >> > Just want to get a feel for the extensibility and offer an NA
> >> > singleton Python object (a la None) for getting and setting NAs across
> >> > these 3 dtypes.
> >> >
> >> > If we end up going down this route, any way to place a moratorium on
> >> > invasive work on pandas internals (outside bug fixes)?
> >> >
> >> > Pedantic aside: I'd rather avoid shipping thirdparty C/C++ libraries
> >> > like googletest and friends in pandas if we can. Cloudera folks have
> >> > been working on a portable C++ library toolchain for Impala and other
> >> > projects at https://github.com/cloudera/native-toolchain, but it is
> >> > only being tested on Linux and OS X. Most google libraries should
> >> > build out of the box on MSVC but it'll be something to keep an eye on.
> >> >
> >> > BTW thanks to the libdynd developers for pioneering the c++ lib <->
> >> > python-c++ lib <-> cython toolchain; being able to build Cython
> >> > extensions directly from cmake is a godsend
> >> >
> >> > HNY all
> >> > Wes
> >> >
> >> > On Tue, Dec 29, 2015 at 4:17 PM, Irwin Zaid <izaid at continuum.io>
> wrote:
> >> >> Yeah, that seems reasonable and I totally agree a Pandas wrapper layer
> >> >> would be necessary.
> >> >>
> >> >> I'll keep an eye on this and I'd like to help if I can.
> >> >>
> >> >> Irwin
> >> >>
> >> >>
> >> >> On Tue, Dec 29, 2015 at 6:01 PM, Wes McKinney <wesmckinn at gmail.com>
> >> >> wrote:
> >> >>>
> >> >>> I'm not suggesting a rewrite of NumPy functionality but rather pandas
> >> >>> functionality that is currently written in a mishmash of Cython and
> >> >>> Python. Happy to experiment with changing the internal compute
> >> >>> infrastructure and data representation to DyND after this first stage
> >> >>> of cleanup is done. Even if we use DyND a pretty extensive pandas
> >> >>> wrapper layer will be necessary.
> >> >>>
> >> >>>
> >> >>> On Tuesday, December 29, 2015, Irwin Zaid <izaid at continuum.io>
> wrote:
> >> >>>>
> >> >>>> Hi Wes (and others),
> >> >>>>
> >> >>>> I've been following this conversation with interest. I do think it
> >> >>>> would be worth exploring DyND, rather than setting up yet another
> >> >>>> rewrite of NumPy-functionality. Especially because DyND is already an
> >> >>>> optional dependency of Pandas.
> >> >>>>
> >> >>>> For things like Integer NA and new dtypes, DyND is there and ready to
> >> >>>> do this.
> >> >>>>
> >> >>>> Irwin
> >> >>>>
> >> >>>> On Tue, Dec 29, 2015 at 5:18 PM, Wes McKinney <wesmckinn at gmail.com
> >
> >> >>>> wrote:
> >> >>>>>
> >> >>>>> Can you link to the PR you're talking about?
> >> >>>>>
> >> >>>>> I will see about spending a few hours setting up a libpandas.so as a
> >> >>>>> C++ shared library where we can run some experiments and validate
> >> >>>>> whether it can solve the integer-NA problem and be a place to put new
> >> >>>>> data types (categorical and friends). I'm +1 on targeting
> >> >>>>>
> >> >>>>> Would it also be worth making a wish list of APIs we might consider
> >> >>>>> breaking in a pandas 1.0 release that also features this new "native
> >> >>>>> core"? Might as well right some wrongs while we're doing some invasive
> >> >>>>> work on the internals; some breakage might be unavoidable. We can
> >> >>>>> always maintain a pandas legacy 0.x.x maintenance branch (providing a
> >> >>>>> conda binary build) for legacy users where showstopper bugs can get
> >> >>>>> fixed.
> >> >>>>>
> >> >>>>> On Tue, Dec 29, 2015 at 1:20 PM, Jeff Reback <
> jeffreback at gmail.com>
> >> >>>>> wrote:
> >> >>>>> > Wes your last is noted as well. I *think* we can actually do this
> >> >>>>> > now (well there is a PR out there).
> >> >>>>> >
> >> >>>>> > On Tue, Dec 29, 2015 at 4:12 PM, Wes McKinney
> >> >>>>> > <wesmckinn at gmail.com>
> >> >>>>> > wrote:
> >> >>>>> >>
> >> >>>>> >> The other huge thing this will enable is copy-on-write
> >> >>>>> >> for
> >> >>>>> >> various kinds of views, which should cut down on some of the
> >> >>>>> >> defensive
> >> >>>>> >> copying in the library and reduce memory usage.
> >> >>>>> >>
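
A minimal sketch of the copy-on-write idea mentioned above, assuming a
single-threaded setting. CowArray and its methods are hypothetical names
made up for illustration; nothing here is an existing pandas or NumPy
API. Copies share one buffer, and the buffer is duplicated only when a
holder that is not the sole owner writes to it.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical illustration only; not a pandas or NumPy API.
// Copies of a CowArray share one buffer until one of them is written to.
class CowArray {
 public:
  explicit CowArray(std::vector<double> values)
      : data_(std::make_shared<std::vector<double>>(std::move(values))) {}

  double Get(std::size_t i) const {
    assert(i < data_->size());
    return (*data_)[i];
  }

  // Write path: detach from the shared buffer first if anyone else holds it.
  // use_count() is only a reliable ownership test without concurrent copies.
  void Set(std::size_t i, double value) {
    assert(i < data_->size());
    if (data_.use_count() > 1) {
      data_ = std::make_shared<std::vector<double>>(*data_);
    }
    (*data_)[i] = value;
  }

  // True while both objects still point at the same underlying buffer.
  bool SharesBufferWith(const CowArray& other) const {
    return data_ == other.data_;
  }

 private:
  std::shared_ptr<std::vector<double>> data_;
};
```

Copying a CowArray costs one reference-count increment; the defensive
copy is paid only by the first writer, which is the saving the message
above is pointing at.
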
> >> >>>>> >> On Tue, Dec 29, 2015 at 1:02 PM, Wes McKinney
> >> >>>>> >> <wesmckinn at gmail.com>
> >> >>>>> >> wrote:
> >> >>>>> >> > Basically the approach is
> >> >>>>> >> >
> >> >>>>> >> > 1) Base dtype type
> >> >>>>> >> > 2) Base array type with K >= 1 dimensions
> >> >>>>> >> > 3) Base scalar type
> >> >>>>> >> > 4) Base index type
> >> >>>>> >> > 5) "Wrapper" subclasses for all NumPy types fitting into categories
> >> >>>>> >> >    #1, #2, #3, #4
> >> >>>>> >> > 6) Subclasses for pandas-specific types like category, datetimeTZ,
> >> >>>>> >> >    etc.
> >> >>>>> >> > 7) NDFrame as cpcloud wrote is just a list of these
> >> >>>>> >> >
> >> >>>>> >> > Indexes and axis labels / column names can get layered on top.
> >> >>>>> >> >
> >> >>>>> >> > After we do all this we can look at adding nested types (arrays,
> >> >>>>> >> > maps, structs) to better support JSON.
> >> >>>>> >> >
> >> >>>>> >> > - Wes
> >> >>>>> >> >
> >> >>>>> >> > On Tue, Dec 29, 2015 at 12:14 PM, Phillip Cloud
> >> >>>>> >> > <cpcloud at gmail.com>
> >> >>>>> >> > wrote:
> >> >>>>> >> >> Maybe this is saying the same thing as Wes, but how far would
> >> >>>>> >> >> something like this get us?
> >> >>>>> >> >>
> >> >>>>> >> >> // warning: things are probably not this simple
> >> >>>>> >> >>
> >> >>>>> >> >> struct data_array_t {
> >> >>>>> >> >>     void *primitive;  // scalar data
> >> >>>>> >> >>     data_array_t *nested; // nested data
> >> >>>>> >> >>     boost::dynamic_bitset isnull;  // might have to create our own to avoid boost
> >> >>>>> >> >>     schema_t schema;  // not sure exactly what this looks like
> >> >>>>> >> >> };
> >> >>>>> >> >>
> >> >>>>> >> >> typedef std::map<string, data_array_t> data_frame_t;  // probably not this simple
> >> >>>>> >> >>
> >> >>>>> >> >> To answer Jeff’s use-case question: I think that the use cases are
> >> >>>>> >> >> 1) freedom from numpy (mostly) 2) no more block manager, which
> >> >>>>> >> >> frees us from the limitations of the block memory layout. In
> >> >>>>> >> >> particular, the ability to take advantage of memory mapped IO
> >> >>>>> >> >> would be a big win IMO.
> >> >>>>> >> >>
> >> >>>>> >> >>
> >> >>>>> >> >> On Tue, Dec 29, 2015 at 2:50 PM Wes McKinney
> >> >>>>> >> >> <wesmckinn at gmail.com>
> >> >>>>> >> >> wrote:
> >> >>>>> >> >>>
> >> >>>>> >> >>> I will write a more detailed response to some of these things
> >> >>>>> >> >>> after the new year, but, in particular, re: missing values, can
> >> >>>>> >> >>> you or someone tell me why creating an object that contains a
> >> >>>>> >> >>> NumPy array and a bitmap is not sufficient? If we can add a
> >> >>>>> >> >>> lightweight C/C++ class layer between NumPy function calls (e.g.
> >> >>>>> >> >>> arithmetic) and pandas function calls, then I see no reason why
> >> >>>>> >> >>> we cannot have
> >> >>>>> >> >>>
> >> >>>>> >> >>> Int32Array->add
> >> >>>>> >> >>>
> >> >>>>> >> >>> and
> >> >>>>> >> >>>
> >> >>>>> >> >>> Float32Array->add
> >> >>>>> >> >>>
> >> >>>>> >> >>> do the right thing (the former would be responsible for
> >> >>>>> >> >>> bitmasking to propagate NA values; the latter would defer to
> >> >>>>> >> >>> NumPy). If we can put all the internals of pandas objects inside
> >> >>>>> >> >>> a black box, we can add layers of virtual function indirection
> >> >>>>> >> >>> without a performance penalty (whereas adding more interpreter
> >> >>>>> >> >>> overhead with more abstraction layers does add up to a perf
> >> >>>>> >> >>> penalty).
> >> >>>>> >> >>>
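
A minimal sketch of the "array plus bitmap" idea described above, under
stated assumptions. The class name Int32Array comes from the message,
but every member shown here (Set, SetNA, IsNA, Value, Add) and the bit
layout (one validity bit per element, set bit means "present") are
hypothetical choices made for illustration, not an existing pandas
interface. Add propagates NA by requiring both input slots to be valid.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical illustration only; none of these names exist in pandas.
// An int32 array paired with a validity bitmap (bit set = value present).
class Int32Array {
 public:
  // All slots start out as NA.
  explicit Int32Array(std::size_t length)
      : values_(length, 0), valid_((length + 7) / 8, 0) {}

  std::size_t length() const { return values_.size(); }

  void Set(std::size_t i, std::int32_t value) {
    values_[i] = value;
    valid_[i / 8] |= static_cast<std::uint8_t>(1u << (i % 8));
  }

  void SetNA(std::size_t i) {
    valid_[i / 8] &= static_cast<std::uint8_t>(~(1u << (i % 8)));
  }

  bool IsNA(std::size_t i) const {
    return (valid_[i / 8] & (1u << (i % 8))) == 0;
  }

  // Only meaningful when !IsNA(i).
  std::int32_t Value(std::size_t i) const { return values_[i]; }

  // Element-wise addition; a slot is NA in the result if it is NA in
  // either input. Integer overflow is not handled in this sketch.
  Int32Array Add(const Int32Array& other) const {
    assert(length() == other.length());
    Int32Array out(length());
    for (std::size_t i = 0; i < length(); ++i) {
      if (!IsNA(i) && !other.IsNA(i)) {
        out.Set(i, values_[i] + other.values_[i]);
      }
    }
    return out;
  }

 private:
  std::vector<std::int32_t> values_;
  std::vector<std::uint8_t> valid_;  // one bit per element
};
```

A float array would not need the bitmap at all, since NaN already marks
missing slots; that is the asymmetry between the two add calls in the
message.
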
> >> >>>>> >> >>> I don't think this is too scary -- I would be willing to create a
> >> >>>>> >> >>> small POC C++ library to prototype something like what I'm
> >> >>>>> >> >>> talking about.
> >> >>>>> >> >>>
> >> >>>>> >> >>> Since pandas has limited points of contact with NumPy I don't
> >> >>>>> >> >>> think this would end up being too onerous.
> >> >>>>> >> >>>
> >> >>>>> >> >>> For the record, I'm pretty allergic to "advanced C++". I think it
> >> >>>>> >> >>> is a useful tool: if you pick a sane 20% subset of the C++11 spec
> >> >>>>> >> >>> and follow Google C++ style, it's not very inaccessible to
> >> >>>>> >> >>> intermediate developers. More or less "C plus OOP and easier
> >> >>>>> >> >>> object lifetime management (shared/unique_ptr, etc.)". As soon as
> >> >>>>> >> >>> you add a lot of template metaprogramming, C++ library
> >> >>>>> >> >>> development quickly becomes inaccessible except to the C++-Jedi.
> >> >>>>> >> >>>
> >> >>>>> >> >>> Maybe let's start a Google document on "pandas roadmap" where we
> >> >>>>> >> >>> can break down the 1-2 year goals and some of these
> >> >>>>> >> >>> infrastructure issues and have our discussion there? (obviously
> >> >>>>> >> >>> publish this someplace once we're done)
> >> >>>>> >> >>>
> >> >>>>> >> >>> - Wes
> >> >>>>> >> >>>
> >> >>>>> >> >>> On Fri, Dec 25, 2015 at 2:14 PM, Jeff Reback
> >> >>>>> >> >>> <jeffreback at gmail.com>
> >> >>>>> >> >>> wrote:
> >> >>>>> >> >>> > Here are some of my thoughts about pandas Roadmap / status and
> >> >>>>> >> >>> > some responses to Wes's thoughts.
> >> >>>>> >> >>> >
> >> >>>>> >> >>> > In the last few (and upcoming) major releases we have made the
> >> >>>>> >> >>> > following changes:
> >> >>>>> >> >>> >
> >> >>>>> >> >>> > - dtype enhancements (Categorical, Timedelta, Datetime w/tz) &
> >> >>>>> >> >>> >   making these first class objects
> >> >>>>> >> >>> > - code refactoring to remove subclassing of ndarrays for Series
> >> >>>>> >> >>> >   & Index
> >> >>>>> >> >>> > - carving out / deprecating non-core parts of pandas
> >> >>>>> >> >>> >   - datareader
> >> >>>>> >> >>> >   - SparsePanel, WidePanel & other aliases (TimeSeries)
> >> >>>>> >> >>> >   - rpy, rplot, irow et al.
> >> >>>>> >> >>> >   - google-analytics
> >> >>>>> >> >>> > - API changes to make things more consistent
> >> >>>>> >> >>> >   - pd.rolling/expanding * -> .rolling/expanding (this is in
> >> >>>>> >> >>> >     master now)
> >> >>>>> >> >>> >   - .resample becoming fully deferred, like groupby
> >> >>>>> >> >>> >   - multi-index slicing along any level (obviates need for .xs)
> >> >>>>> >> >>> >     and allows assignment
> >> >>>>> >> >>> >   - .loc/.iloc - for the most part obviates use of .ix
> >> >>>>> >> >>> >   - .pipe & .assign
> >> >>>>> >> >>> >   - plotting accessors
> >> >>>>> >> >>> >   - fixing of the sorting API
> >> >>>>> >> >>> > - many performance enhancements both micro & macro (e.g. release
> >> >>>>> >> >>> >   GIL)
> >> >>>>> >> >>> >
> >> >>>>> >> >>> > Some on-deck enhancements are (meaning these are basically ready
> >> >>>>> >> >>> > to go in):
> >> >>>>> >> >>> >   - IntervalIndex (and eventually make PeriodIndex just a
> >> >>>>> >> >>> >     sub-class of this)
> >> >>>>> >> >>> >   - RangeIndex
> >> >>>>> >> >>> >
> >> >>>>> >> >>> > so lots of changes, though nothing really earth shaking, just
> >> >>>>> >> >>> > more convenience, reducing magicness somewhat and providing
> >> >>>>> >> >>> > flexibility.
> >> >>>>> >> >>> >
> >> >>>>> >> >>> > Of course we are getting increasing issues, mostly bug reports
> >> >>>>> >> >>> > (and lots of dupes), some edge case enhancements which can add
> >> >>>>> >> >>> > to the existing API's and of course, requests to expand the
> >> >>>>> >> >>> > (already) large code to other usecases. Balancing this are a
> >> >>>>> >> >>> > good many pull-requests from many different users, some even
> >> >>>>> >> >>> > deep into the internals.
> >> >>>>> >> >>> >
> >> >>>>> >> >>> > Here are some things that I have talked about and could be
> >> >>>>> >> >>> > considered for the roadmap. Disclaimer: I do work for Continuum
> >> >>>>> >> >>> > but these views are of course my own; furthermore obviously I
> >> >>>>> >> >>> > am a bit more familiar with some of the 'sponsored' open-source
> >> >>>>> >> >>> > libraries, but always open to new things.
> >> >>>>> >> >>> >
> >> >>>>> >> >>> > - integration / automatic deferral to numba for JIT (this would
> >> >>>>> >> >>> >   be thru .apply)
> >> >>>>> >> >>> > - automatic deferral to dask from groupby where appropriate /
> >> >>>>> >> >>> >   maybe a .to_parallel (to simply return a dask.DataFrame object)
> >> >>>>> >> >>> > - incorporation of quantities / units (as part of the dtype)
> >> >>>>> >> >>> > - use of DyND to allow missing values for int dtypes
> >> >>>>> >> >>> > - make Period a first class dtype.
> >> >>>>> >> >>> > - provide some copy-on-write semantics to alleviate the
> >> >>>>> >> >>> >   chained-indexing issues which occasionally come up with the
> >> >>>>> >> >>> >   misuse of the indexing API
> >> >>>>> >> >>> > - allow a 'policy' to automatically provide column blocks for
> >> >>>>> >> >>> >   dict-like input (e.g. each column would be a block); this
> >> >>>>> >> >>> >   would allow a pass-thru API where you could put in numpy
> >> >>>>> >> >>> >   arrays where you have views and have them preserved rather
> >> >>>>> >> >>> >   than copied automatically. Note that this would also allow
> >> >>>>> >> >>> >   what I call 'split', where a passed-in multi-dim numpy array
> >> >>>>> >> >>> >   could be split up into individual blocks (which actually gives
> >> >>>>> >> >>> >   a nice perf boost after the splitting costs).
> >> >>>>> >> >>> >
> >> >>>>> >> >>> > In working towards some of these goals, I have come to the
> >> >>>>> >> >>> > opinion that it would make sense to have a neutral API protocol
> >> >>>>> >> >>> > layer that would allow us to swap out different engines as
> >> >>>>> >> >>> > needed, for particular dtypes, or *maybe* out-of-core type
> >> >>>>> >> >>> > computations. E.g. imagine that we replaced the in-memory block
> >> >>>>> >> >>> > structure with a bcolz / memmap type; in theory this should be
> >> >>>>> >> >>> > 'easy' and just work. I could also see us adopting *some* of
> >> >>>>> >> >>> > the SFrame code to allow easier interop with this API layer.
> >> >>>>> >> >>> >
> >> >>>>> >> >>> > In practice, I think a nice API layer would need to be created
> >> >>>>> >> >>> > to make this clean / nice.
> >> >>>>> >> >>> >
> >> >>>>> >> >>> > So this comes around to Wes's point about creating a c++ library
> >> >>>>> >> >>> > for the internals (and possibly even some of the indexing
> >> >>>>> >> >>> > routines). In an ideal world, of course, this would be
> >> >>>>> >> >>> > desirable. Getting there is a bit non-trivial I think, and IMHO
> >> >>>>> >> >>> > might not be worth the effort. I don't really see big
> >> >>>>> >> >>> > performance bottlenecks. We *already* defer much of the
> >> >>>>> >> >>> > computation to libraries like numexpr & bottleneck (where
> >> >>>>> >> >>> > appropriate). Adding numba / dask to the list would be helpful.
> >> >>>>> >> >>> >
> >> >>>>> >> >>> > I think that almost all performance issues are the result of:
> >> >>>>> >> >>> >
> >> >>>>> >> >>> > a) gross misuse of the pandas API. How much code have you seen
> >> >>>>> >> >>> >    that does df.apply(lambda x: x.sum())
> >> >>>>> >> >>> > b) routines which operate column-by-column rather than
> >> >>>>> >> >>> >    block-by-block and are in python space (e.g. we have an issue
> >> >>>>> >> >>> >    right now about .quantile)
> >> >>>>> >> >>> >
> >> >>>>> >> >>> > So I am glossing over a big goal of having a c++ library that
> >> >>>>> >> >>> > represents the pandas internals. This would by definition have
> >> >>>>> >> >>> > a C API, so that you *could* use pandas-like semantics in c/c++
> >> >>>>> >> >>> > and just have it work (and then pandas would be a thin wrapper
> >> >>>>> >> >>> > around this library).
> >> >>>>> >> >>> >
> >> >>>>> >> >>> > I am not averse to this, but I think it would be quite a big
> >> >>>>> >> >>> > effort, and not a huge perf boost IMHO. Further, there are a
> >> >>>>> >> >>> > number of API issues w.r.t. indexing which need to be clarified
> >> >>>>> >> >>> > / worked out (e.g. should we simply deprecate []) that are much
> >> >>>>> >> >>> > easier to test / figure out in python space.
> >> >>>>> >> >>> >
> >> >>>>> >> >>> > I also think that we have quite a large number of contributors.
> >> >>>>> >> >>> > Moving to c++ might make the internals a bit more impenetrable
> >> >>>>> >> >>> > than the current internals (though this would allow c++ people
> >> >>>>> >> >>> > to contribute, so that might balance out).
> >> >>>>> >> >>> >
> >> >>>>> >> >>> > We have a limited core of devs who right now are familiar with
> >> >>>>> >> >>> > things. If someone happened to have a starting base for a c++
> >> >>>>> >> >>> > library, then I might change opinions here.
> >> >>>>> >> >>> >
> >> >>>>> >> >>> >
> >> >>>>> >> >>> > my 4c.
> >> >>>>> >> >>> >
> >> >>>>> >> >>> > Jeff
> >> >>>>> >> >>> >
> >> >>>>> >> >>> >
> >> >>>>> >> >>> >
> >> >>>>> >> >>> >
> >> >>>>> >> >>> > On Thu, Dec 24, 2015 at 7:18 PM, Wes McKinney
> >> >>>>> >> >>> > <wesmckinn at gmail.com>
> >> >>>>> >> >>> > wrote:
> >> >>>>> >> >>> >>
> >> >>>>> >> >>> >> Deep thoughts during the holidays.
> >> >>>>> >> >>> >>
> >> >>>>> >> >>> >> I might be out of line here, but the interpreter-heaviness of
> >> >>>>> >> >>> >> the inside of pandas objects is likely to be a long-term
> >> >>>>> >> >>> >> liability and source of performance problems and technical
> >> >>>>> >> >>> >> debt.
> >> >>>>> >> >>> >>
> >> >>>>> >> >>> >> Has anyone put any thought into planning and beginning to
> >> >>>>> >> >>> >> execute on a rewrite that moves as much as possible of the
> >> >>>>> >> >>> >> internals into native / compiled code? I'm talking about:
> >> >>>>> >> >>> >>
> >> >>>>> >> >>> >> - pandas/core/internals
> >> >>>>> >> >>> >> - indexing and assignment
> >> >>>>> >> >>> >> - much of pandas/core/common
> >> >>>>> >> >>> >> - categorical and custom dtypes
> >> >>>>> >> >>> >> - all indexing mechanisms
> >> >>>>> >> >>> >>
> >> >>>>> >> >>> >> I'm concerned we've already exposed too many internals to
> >> >>>>> >> >>> >> users, so this might lead to a lot of API breakage, but it
> >> >>>>> >> >>> >> might be for the Greater Good. As a first step, beginning a
> >> >>>>> >> >>> >> partial migration of internals into some C++ classes that
> >> >>>>> >> >>> >> encapsulate the insides of DataFrame objects and implement
> >> >>>>> >> >>> >> indexing and block-level manipulations would be a good place
> >> >>>>> >> >>> >> to start. I think you could do this without too much
> >> >>>>> >> >>> >> disruption.
> >> >>>>> >> >>> >>
> >> >>>>> >> >>> >> As part of this internal retooling we might give consideration
> >> >>>>> >> >>> >> to alternative data structures for representing data internal
> >> >>>>> >> >>> >> to pandas objects. Now in 2015/2016, continuing to be
> >> >>>>> >> >>> >> hamstrung by NumPy's limitations feels somewhat anachronistic.
> >> >>>>> >> >>> >> User code is riddled with workarounds for data type fidelity
> >> >>>>> >> >>> >> issues and the like. Like, really, why not add a bitndarray
> >> >>>>> >> >>> >> (similar to ilanschnell/bitarray) for storing nullness for
> >> >>>>> >> >>> >> problematic types and hide this from the user? =)
> >> >>>>> >> >>> >>
> >> >>>>> >> >>> >> Since we are now a NumFOCUS-sponsored project, I feel like we
> >> >>>>> >> >>> >> might consider establishing some formal governance over pandas
> >> >>>>> >> >>> >> and publishing roadmap documents describing plans for the
> >> >>>>> >> >>> >> project and meeting notes from committers. There's no real
> >> >>>>> >> >>> >> "committer culture" for NumFOCUS projects like there is with
> >> >>>>> >> >>> >> the Apache Software Foundation, but we might try leading by
> >> >>>>> >> >>> >> example!
> >> >>>>> >> >>> >>
> >> >>>>> >> >>> >> Also, I believe pandas as a project has reached a level of
> >> >>>>> >> >>> >> importance where we ought to consider planning and execution
> >> >>>>> >> >>> >> on larger scale undertakings such as this for safeguarding the
> >> >>>>> >> >>> >> future.
> >> >>>>> >> >>> >>
> >> >>>>> >> >>> >> As for myself, well, I have my hands full in Big Data-land. I
> >> >>>>> >> >>> >> wish I could be helping more with pandas, but there are quite
> >> >>>>> >> >>> >> a few fundamental issues (like data interoperability, nested
> >> >>>>> >> >>> >> data handling, and file format support, e.g. Parquet; see
> >> >>>>> >> >>> >> http://wesmckinney.com/blog/the-problem-with-the-data-science-language-wars/)
> >> >>>>> >> >>> >> preventing Python from being more useful in industry analytics
> >> >>>>> >> >>> >> applications.
> >> >>>>> >> >>> >>
> >> >>>>> >> >>> >> Aside: one of the bigger mistakes I made with pandas's API
> >> >>>>> >> >>> >> design was making it acceptable to call class constructors
> >> >>>>> >> >>> >> (like pandas.DataFrame) directly (versus factory functions).
> >> >>>>> >> >>> >> Sorry about that! If we could convince everyone to start
> >> >>>>> >> >>> >> writing pandas.data_frame or dataframe instead of using the
> >> >>>>> >> >>> >> class reference it would help a lot with code cleanup. It's
> >> >>>>> >> >>> >> hard to plan for these things; NumPy interoperability seemed a
> >> >>>>> >> >>> >> lot more important in 2008 than it does now, so I forgive
> >> >>>>> >> >>> >> myself.
> >> >>>>> >> >>> >>
> >> >>>>> >> >>> >> cheers and best wishes for 2016,
> >> >>>>> >> >>> >> Wes
> >> >>>>> >> >>> >> _______________________________________________
> >> >>>>> >> >>> >> Pandas-dev mailing list
> >> >>>>> >> >>> >> Pandas-dev at python.org
> >> >>>>> >> >>> >> https://mail.python.org/mailman/listinfo/pandas-dev