[Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap

Wes McKinney wesmckinn at gmail.com
Fri Jan 1 20:13:58 EST 2016


Jeff -- can you require log-in for editing on this document?
https://docs.google.com/document/d/151ct8jcZWwh7XStptjbLsda6h2b3C0IuiH_hfZnUA58/edit#

There are a number of anonymous edits.

On Wed, Dec 30, 2015 at 6:04 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
> I cobbled together an ugly start of a c++->cython->pandas toolchain here
>
> https://github.com/wesm/pandas/tree/libpandas-native-core
>
> I used a mix of Kudu, Impala, and dynd-python cmake sources, so it's a
> bit messy at the moment but it should be sufficient to run some real
> experiments with a little more work. I reckon it's like a 6-month
> project to tear out the insides of Series and DataFrame and replace
> them with a new "native core", but we should be able to get enough info to
> see whether it's a viable plan within a month or so.
>
> The end goal is to create "private" extension types in Cython that can
> be the new base classes for Series and NDFrame; these will hold a
> reference to a C++ object that contains wrappered NumPy arrays and
> other metadata (like pandas-only dtypes).
>
> It might be too hard to try to replace a single usage of the block
> manager as a first experiment, so I'll try to create a minimal
> "SeriesLite" that supports 3 dtypes:
>
> 1) float64 with nans
> 2) int64 with a bitmask for NAs
> 3) category type for one of these
>
> Just want to get a feel for the extensibility and offer an NA
> singleton Python object (a la None) for getting and setting NAs across
> these 3 dtypes.
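> As a rough illustration (names like NAType and Int64Array are
> hypothetical, not a proposed API), the NA-singleton idea might look
> like this in Python:

```python
class NAType:
    """Singleton representing a missing value, a la None."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __repr__(self):
        return "NA"

NA = NAType()


class Int64Array:
    """int64 values plus a validity bitmask: no NaN, no float upcast."""

    def __init__(self, values):
        self.values = list(values)          # stand-in for an int64 buffer
        self.valid = [True] * len(self.values)

    def __getitem__(self, i):
        return self.values[i] if self.valid[i] else NA

    def __setitem__(self, i, value):
        if value is NA:
            self.valid[i] = False           # payload stays; only mask changes
        else:
            self.values[i] = value
            self.valid[i] = True

arr = Int64Array([1, 2, 3])
arr[1] = NA
```

> The same NA object would then work for getting and setting across all
> three dtypes, with no float upcasting in the integer case.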
>
> If we end up going down this route, any way to place a moratorium on
> invasive work on pandas internals (outside bug fixes)?
>
> Pedantic aside: I'd rather avoid shipping third-party C/C++ libraries
> like googletest and friends in pandas if we can. Cloudera folks have
> been working on a portable C++ library toolchain for Impala and other
> projects at https://github.com/cloudera/native-toolchain, but it is
> only being tested on Linux and OS X. Most Google libraries should
> build out of the box on MSVC, but it'll be something to keep an eye on.
>
> BTW thanks to the libdynd developers for pioneering the c++ lib <->
> python-c++ lib <-> cython toolchain; being able to build Cython
> extensions directly from cmake is a godsend
>
> HNY all
> Wes
>
> On Tue, Dec 29, 2015 at 4:17 PM, Irwin Zaid <izaid at continuum.io> wrote:
>> Yeah, that seems reasonable and I totally agree a Pandas wrapper layer would
>> be necessary.
>>
>> I'll keep an eye on this and I'd like to help if I can.
>>
>> Irwin
>>
>>
>> On Tue, Dec 29, 2015 at 6:01 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
>>>
>>> I'm not suggesting a rewrite of NumPy functionality but rather pandas
>>> functionality that is currently written in a mishmash of Cython and Python.
>>> Happy to experiment with changing the internal compute infrastructure and
>>> data representation to DyND after this first stage of cleanup is done. Even
>>> if we use DyND, a pretty extensive pandas wrapper layer will be necessary.
>>>
>>>
>>> On Tuesday, December 29, 2015, Irwin Zaid <izaid at continuum.io> wrote:
>>>>
>>>> Hi Wes (and others),
>>>>
>>>> I've been following this conversation with interest. I do think it would
>>>> be worth exploring DyND, rather than setting up yet another rewrite of
>>>> NumPy functionality. Especially because DyND is already an optional
>>>> dependency of Pandas.
>>>>
>>>> For things like Integer NA and new dtypes, DyND is there and ready to do
>>>> this.
>>>>
>>>> Irwin
>>>>
>>>> On Tue, Dec 29, 2015 at 5:18 PM, Wes McKinney <wesmckinn at gmail.com>
>>>> wrote:
>>>>>
>>>>> Can you link to the PR you're talking about?
>>>>>
>>>>> I will see about spending a few hours setting up a libpandas.so as a C++
>>>>> shared library where we can run some experiments and validate whether it can
>>>>> solve the integer-NA problem and be a place to put new data types
>>>>> (categorical and friends). I'm +1 on targeting
>>>>>
>>>>> Would it also be worth making a wish list of APIs we might consider
>>>>> breaking in a pandas 1.0 release that also features this new "native core"?
>>>>> Might as well right some wrongs while we're doing some invasive work on the
>>>>> internals; some breakage might be unavoidable. We can always maintain a
>>>>> pandas legacy 0.x.x maintenance branch (providing a conda binary build) for
>>>>> legacy users where showstopper bugs can get fixed.
>>>>>
>>>>> On Tue, Dec 29, 2015 at 1:20 PM, Jeff Reback <jeffreback at gmail.com>
>>>>> wrote:
>>>>> > Wes your last is noted as well. I *think* we can actually do this now
>>>>> > (well
>>>>> > there is a PR out there).
>>>>> >
>>>>> > On Tue, Dec 29, 2015 at 4:12 PM, Wes McKinney <wesmckinn at gmail.com>
>>>>> > wrote:
>>>>> >>
>>>>> >> The other huge thing this will enable is copy-on-write for
>>>>> >> various kinds of views, which should cut down on some of the
>>>>> >> defensive copying in the library and reduce memory usage.
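>>>>> >> A minimal sketch of what I mean by copy-on-write, in Python (the
>>>>> >> class and names are illustrative only):

```python
class COWArray:
    """A buffer whose views copy lazily, on first write."""

    def __init__(self, data, shared=False):
        self._data = data       # list standing in for a shared buffer
        self._shared = shared   # True until this object owns its buffer

    def view(self):
        # A view shares the parent's buffer: no copy yet.
        return COWArray(self._data, shared=True)

    def __getitem__(self, i):
        return self._data[i]

    def __setitem__(self, i, value):
        if self._shared:
            # First write to a view: copy the buffer, then take ownership.
            # (A fuller design would also handle writes through the parent
            # while views are outstanding.)
            self._data = list(self._data)
            self._shared = False
        self._data[i] = value

base = COWArray([1, 2, 3])
v = base.view()
v[0] = 99           # triggers the copy; base is left untouched
```

>>>>> >> Writes to a view trigger the copy, so the defensive copy happens
>>>>> >> at most once, and only when actually needed.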
>>>>> >>
>>>>> >> On Tue, Dec 29, 2015 at 1:02 PM, Wes McKinney <wesmckinn at gmail.com>
>>>>> >> wrote:
>>>>> >> > Basically the approach is:
>>>>> >> >
>>>>> >> > 1) Base dtype type
>>>>> >> > 2) Base array type with K >= 1 dimensions
>>>>> >> > 3) Base scalar type
>>>>> >> > 4) Base index type
>>>>> >> > 5) "Wrapper" subclasses for all NumPy types fitting into
>>>>> >> > categories #1, #2, #3, #4
>>>>> >> > 6) Subclasses for pandas-specific types like category,
>>>>> >> > datetimeTZ, etc.
>>>>> >> > 7) NDFrame, as cpcloud wrote, is just a list of these
>>>>> >> >
>>>>> >> > Indexes and axis labels / column names can get layered on top.
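>>>>> >> > A skeletal Python rendering of that layering (all names are
>>>>> >> > illustrative, not an API proposal; in the real design the arrays
>>>>> >> > would wrap NumPy ndarrays, plain lists stand in here):

```python
class DType:                 # 1) base dtype type
    pass

class CategoryDType(DType):  # 6) a pandas-specific dtype
    def __init__(self, categories):
        self.categories = list(categories)

class Array:                 # 2) base array type (K >= 1 dims)
    def __init__(self, dtype, values):
        self.dtype = dtype
        self.values = values

class Scalar:                # 3) base scalar type
    def __init__(self, value):
        self.value = value

class Index(Array):          # 4) base index type
    pass

class NDFrame:               # 7) just a list of named arrays
    def __init__(self, index, columns):
        self.index = index                  # axis labels layered on top
        self.columns = dict(columns)

frame = NDFrame(
    index=Index(DType(), [0, 1]),
    columns={
        "a": Array(DType(), [1.0, 2.0]),
        "b": Array(CategoryDType(["x", "y"]), [0, 1]),
    },
)
```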
>>>>> >> >
>>>>> >> > After we do all this we can look at adding nested types (arrays,
>>>>> >> > maps,
>>>>> >> > structs) to better support JSON.
>>>>> >> >
>>>>> >> > - Wes
>>>>> >> >
>>>>> >> > On Tue, Dec 29, 2015 at 12:14 PM, Phillip Cloud <cpcloud at gmail.com>
>>>>> >> > wrote:
>>>>> >> >> Maybe this is saying the same thing as Wes, but how far would
>>>>> >> >> something like this get us?
>>>>> >> >>
>>>>> >> >> // warning: things are probably not this simple
>>>>> >> >>
>>>>> >> >> struct data_array_t {
>>>>> >> >>     void *primitive;                 // scalar data
>>>>> >> >>     data_array_t *nested;            // nested data
>>>>> >> >>     boost::dynamic_bitset<> isnull;  // might have to create our own to avoid boost
>>>>> >> >>     schema_t schema;                 // not sure exactly what this looks like
>>>>> >> >> };
>>>>> >> >>
>>>>> >> >> typedef std::map<std::string, data_array_t> data_frame_t;  // probably not this simple
>>>>> >> >>
>>>>> >> >> To answer Jeff’s use-case question: I think that the use cases
>>>>> >> >> are 1) freedom from numpy (mostly) and 2) no more block manager,
>>>>> >> >> which frees us from the limitations of the block memory layout.
>>>>> >> >> In particular, the ability to take advantage of memory-mapped IO
>>>>> >> >> would be a big win IMO.
>>>>> >> >>
>>>>> >> >>
>>>>> >> >> On Tue, Dec 29, 2015 at 2:50 PM Wes McKinney <wesmckinn at gmail.com>
>>>>> >> >> wrote:
>>>>> >> >>>
>>>>> >> >>> I will write a more detailed response to some of these things
>>>>> >> >>> after
>>>>> >> >>> the new year, but, in particular, re: missing values, can you or
>>>>> >> >>> someone tell me why creating an object that contains a NumPy
>>>>> >> >>> array and
>>>>> >> >>> a bitmap is not sufficient? If we can add a lightweight C/C++
>>>>> >> >>> class layer between NumPy function calls (e.g. arithmetic) and
>>>>> >> >>> pandas function calls, then I see no reason why we cannot have
>>>>> >> >>>
>>>>> >> >>> Int32Array->add
>>>>> >> >>>
>>>>> >> >>> and
>>>>> >> >>>
>>>>> >> >>> Float32Array->add
>>>>> >> >>>
>>>>> >> >>> do the right thing (the former would be responsible for
>>>>> >> >>> bitmasking to propagate NA values; the latter would defer to
>>>>> >> >>> NumPy). If we can put all the internals of pandas objects inside
>>>>> >> >>> a black box, we can add layers of virtual function indirection
>>>>> >> >>> without a performance penalty (whereas in pure Python, more
>>>>> >> >>> abstraction layers mean more interpreter overhead, which does
>>>>> >> >>> add up to a perf penalty).
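>>>>> >> >>> In Python pseudocode, the dtype-specific dispatch I mean would
>>>>> >> >>> look roughly like this (names hypothetical; lists stand in for
>>>>> >> >>> C buffers):

```python
class Int32Array:
    """int32 data plus a validity bitmask (True = not NA)."""

    def __init__(self, values, valid):
        self.values = values
        self.valid = valid

    def add(self, other):
        # The result is NA wherever either operand is NA.
        valid = [a and b for a, b in zip(self.valid, other.valid)]
        # The payload under an NA slot is arbitrary; 0 is used here.
        values = [x + y if ok else 0
                  for x, y, ok in zip(self.values, other.values, valid)]
        return Int32Array(values, valid)

a = Int32Array([1, 2, 3], [True, False, True])
b = Int32Array([10, 20, 30], [True, True, True])
c = a.add(b)    # NA in slot 1 propagates through the mask
```

>>>>> >> >>> Float64Array.add would skip the mask logic entirely and defer
>>>>> >> >>> to NumPy, where NaN already propagates.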
>>>>> >> >>>
>>>>> >> >>> I don't think this is too scary -- I would be willing to create a
>>>>> >> >>> small POC C++ library to prototype something like what I'm
>>>>> >> >>> talking
>>>>> >> >>> about.
>>>>> >> >>>
>>>>> >> >>> Since pandas has limited points of contact with NumPy I don't
>>>>> >> >>> think
>>>>> >> >>> this would end up being too onerous.
>>>>> >> >>>
>>>>> >> >>> For the record, I'm pretty allergic to "advanced C++"; I think
>>>>> >> >>> it is a useful tool, and if you pick a sane 20% subset of the
>>>>> >> >>> C++11 spec and follow Google C++ style, it's not inaccessible to
>>>>> >> >>> intermediate developers. More or less "C plus OOP and easier
>>>>> >> >>> object lifetime management (shared_ptr/unique_ptr, etc.)". As
>>>>> >> >>> soon as you add a lot of template metaprogramming, C++ library
>>>>> >> >>> development quickly becomes inaccessible except to the C++ Jedi.
>>>>> >> >>>
>>>>> >> >>> Maybe let's start a Google document on "pandas roadmap" where we
>>>>> >> >>> can
>>>>> >> >>> break down the 1-2 year goals and some of these infrastructure
>>>>> >> >>> issues
>>>>> >> >>> and have our discussion there? (obviously publish this someplace
>>>>> >> >>> once
>>>>> >> >>> we're done)
>>>>> >> >>>
>>>>> >> >>> - Wes
>>>>> >> >>>
>>>>> >> >>> On Fri, Dec 25, 2015 at 2:14 PM, Jeff Reback
>>>>> >> >>> <jeffreback at gmail.com>
>>>>> >> >>> wrote:
>>>>> >> >>> > Here are some of my thoughts about pandas Roadmap / status and
>>>>> >> >>> > some
>>>>> >> >>> > responses to Wes's thoughts.
>>>>> >> >>> >
>>>>> >> >>> > In the last few (and upcoming) major releases we have made
>>>>> >> >>> > the following changes:
>>>>> >> >>> >
>>>>> >> >>> > - dtype enhancements (Categorical, Timedelta, Datetime w/tz)
>>>>> >> >>> > & making these first-class objects
>>>>> >> >>> > - code refactoring to remove subclassing of ndarrays for
>>>>> >> >>> > Series & Index
>>>>> >> >>> > - carving out / deprecating non-core parts of pandas
>>>>> >> >>> >   - datareader
>>>>> >> >>> >   - SparsePanel, WidePanel & other aliases (TimeSeries)
>>>>> >> >>> >   - rpy, rplot, irow et al.
>>>>> >> >>> >   - google-analytics
>>>>> >> >>> > - API changes to make things more consistent
>>>>> >> >>> >   - pd.rolling/expanding * -> .rolling/expanding (this is in
>>>>> >> >>> > master now)
>>>>> >> >>> >   - .resample becoming fully deferred, like groupby
>>>>> >> >>> >   - multi-index slicing along any level (obviates need for
>>>>> >> >>> > .xs) and allows assignment
>>>>> >> >>> >   - .loc/.iloc - for the most part obviates use of .ix
>>>>> >> >>> >   - .pipe & .assign
>>>>> >> >>> >   - plotting accessors
>>>>> >> >>> >   - fixing of the sorting API
>>>>> >> >>> > - many performance enhancements both micro & macro (e.g.
>>>>> >> >>> > release
>>>>> >> >>> > GIL)
>>>>> >> >>> >
>>>>> >> >>> > Some on-deck enhancements are (meaning these are basically
>>>>> >> >>> > ready to
>>>>> >> >>> > go
>>>>> >> >>> > in):
>>>>> >> >>> >   - IntervalIndex (and eventually make PeriodIndex just a
>>>>> >> >>> > sub-class
>>>>> >> >>> > of
>>>>> >> >>> > this)
>>>>> >> >>> >   - RangeIndex
>>>>> >> >>> >
>>>>> >> >>> > so lots of changes, though nothing really earth-shaking, just
>>>>> >> >>> > more
>>>>> >> >>> > convenience, reducing magicness somewhat
>>>>> >> >>> > and providing flexibility.
>>>>> >> >>> >
>>>>> >> >>> > Of course we are getting an increasing number of issues,
>>>>> >> >>> > mostly bug reports (and lots of dupes), some edge-case
>>>>> >> >>> > enhancements which can add to the existing APIs, and of
>>>>> >> >>> > course requests to expand the (already) large codebase to
>>>>> >> >>> > other use cases.
>>>>> >> >>> > Balancing this are a good many pull-requests from many
>>>>> >> >>> > different
>>>>> >> >>> > users,
>>>>> >> >>> > some
>>>>> >> >>> > even deep into the internals.
>>>>> >> >>> >
>>>>> >> >>> > Here are some things that I have talked about and that could
>>>>> >> >>> > be considered for the roadmap. Disclaimer: I do work for
>>>>> >> >>> > Continuum, but these views are of course my own; furthermore,
>>>>> >> >>> > obviously I am a bit more familiar with some of the
>>>>> >> >>> > 'sponsored' open-source libraries, but I'm always open to new
>>>>> >> >>> > things.
>>>>> >> >>> >
>>>>> >> >>> > - integration / automatic deferral to numba for JIT (this
>>>>> >> >>> > would be thru .apply)
>>>>> >> >>> > - automatic deferral to dask from groupby where appropriate /
>>>>> >> >>> > maybe a .to_parallel (to simply return a dask.DataFrame object)
>>>>> >> >>> > - incorporation of quantities / units (as part of the dtype)
>>>>> >> >>> > - use of DyND to allow missing values for int dtypes
>>>>> >> >>> > - make Period a first class dtype.
>>>>> >> >>> > - provide some copy-on-write semantics to alleviate the
>>>>> >> >>> > chained-indexing issues which occasionally come up with the
>>>>> >> >>> > misuse of the indexing API
>>>>> >> >>> > - allow a 'policy' to automatically provide column blocks for
>>>>> >> >>> > dict-like input (e.g. each column would be a block); this
>>>>> >> >>> > would allow a pass-thru API where you could put in numpy
>>>>> >> >>> > arrays where you have views and have them preserved rather
>>>>> >> >>> > than copied automatically. Note that this would also allow
>>>>> >> >>> > what I call 'split', where a passed-in multi-dim numpy array
>>>>> >> >>> > could be split up into individual blocks (which actually gives
>>>>> >> >>> > a nice perf boost after the splitting costs).
>>>>> >> >>> >
>>>>> >> >>> > In working towards some of these goals, I have come to the
>>>>> >> >>> > opinion that it would make sense to have a neutral API
>>>>> >> >>> > protocol layer that would allow us to swap out different
>>>>> >> >>> > engines as needed, for particular dtypes, or *maybe*
>>>>> >> >>> > out-of-core type computations. E.g. imagine that we replaced
>>>>> >> >>> > the in-memory block structure with a bcolz / memmap type; in
>>>>> >> >>> > theory this should be 'easy' and just work.
>>>>> >> >>> > I could also see us adopting *some* of the SFrame code to allow
>>>>> >> >>> > easier
>>>>> >> >>> > interop with this API layer.
>>>>> >> >>> >
>>>>> >> >>> > In practice, I think a nice API layer would need to be created
>>>>> >> >>> > to
>>>>> >> >>> > make
>>>>> >> >>> > this
>>>>> >> >>> > clean / nice.
>>>>> >> >>> >
>>>>> >> >>> > So this comes around to Wes's point about creating a c++
>>>>> >> >>> > library for the internals (and possibly even some of the
>>>>> >> >>> > indexing routines). In an ideal world, of course, this would
>>>>> >> >>> > be desirable. Getting there is a bit non-trivial I think, and
>>>>> >> >>> > IMHO might not be worth the effort. I don't really see big
>>>>> >> >>> > performance bottlenecks. We *already* defer much of the
>>>>> >> >>> > computation to libraries like numexpr & bottleneck (where
>>>>> >> >>> > appropriate). Adding numba / dask to the list would be helpful.
>>>>> >> >>> >
>>>>> >> >>> > I think that almost all performance issues are the result of:
>>>>> >> >>> >
>>>>> >> >>> > a) gross misuse of the pandas API. How much code have you
>>>>> >> >>> > seen that does df.apply(lambda x: x.sum())?
>>>>> >> >>> > b) routines which operate column-by-column rather than
>>>>> >> >>> > block-by-block and are in python space (e.g. we have an issue
>>>>> >> >>> > right now about .quantile)
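>>>>> >> >>> > To make (a) concrete: both lines below compute the same
>>>>> >> >>> > column sums, but the first makes a Python-level call per
>>>>> >> >>> > column while the second is a single vectorized reduction:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

slow = df.apply(lambda x: x.sum())  # one Python-level call per column
fast = df.sum()                     # the idiomatic, vectorized spelling
```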
>>>>> >> >>> >
>>>>> >> >>> > So I am glossing over a big goal of having a c++ library
>>>>> >> >>> > that represents the pandas internals. This would by definition
>>>>> >> >>> > have a C API, so you *could* use pandas-like semantics in
>>>>> >> >>> > c/c++ and just have it work (and then pandas would be a thin
>>>>> >> >>> > wrapper around this library).
>>>>> >> >>> >
>>>>> >> >>> > I am not averse to this, but I think it would be quite a big
>>>>> >> >>> > effort, and not a
>>>>> >> >>> > huge perf boost IMHO. Further there are a number of API issues
>>>>> >> >>> > w.r.t.
>>>>> >> >>> > indexing
>>>>> >> >>> > which need to be clarified / worked out (e.g. should we simply
>>>>> >> >>> > deprecate
>>>>> >> >>> > [])
>>>>> >> >>> > that are much easier to test / figure out in python space.
>>>>> >> >>> >
>>>>> >> >>> > I also think that we have quite a large number of
>>>>> >> >>> > contributors. Moving to c++ might make the internals a bit
>>>>> >> >>> > more impenetrable than the current internals.
>>>>> >> >>> > (though this would allow c++ people to contribute, so that
>>>>> >> >>> > might
>>>>> >> >>> > balance
>>>>> >> >>> > out).
>>>>> >> >>> >
>>>>> >> >>> > We have a limited core of devs who right now are familiar
>>>>> >> >>> > with things. If someone happened to have a starting base for
>>>>> >> >>> > a c++ library, then I might change opinions here.
>>>>> >> >>> >
>>>>> >> >>> >
>>>>> >> >>> > my 4c.
>>>>> >> >>> >
>>>>> >> >>> > Jeff
>>>>> >> >>> >
>>>>> >> >>> >
>>>>> >> >>> >
>>>>> >> >>> >
>>>>> >> >>> > On Thu, Dec 24, 2015 at 7:18 PM, Wes McKinney
>>>>> >> >>> > <wesmckinn at gmail.com>
>>>>> >> >>> > wrote:
>>>>> >> >>> >>
>>>>> >> >>> >> Deep thoughts during the holidays.
>>>>> >> >>> >>
>>>>> >> >>> >> I might be out of line here, but the interpreter-heaviness of
>>>>> >> >>> >> the
>>>>> >> >>> >> inside of pandas objects is likely to be a long-term liability
>>>>> >> >>> >> and
>>>>> >> >>> >> source of performance problems and technical debt.
>>>>> >> >>> >>
>>>>> >> >>> >> Has anyone put any thought into planning and beginning to
>>>>> >> >>> >> execute
>>>>> >> >>> >> on a
>>>>> >> >>> >> rewrite that moves as much as possible of the internals into
>>>>> >> >>> >> native
>>>>> >> >>> >> /
>>>>> >> >>> >> compiled code? I'm talking about:
>>>>> >> >>> >>
>>>>> >> >>> >> - pandas/core/internals
>>>>> >> >>> >> - indexing and assignment
>>>>> >> >>> >> - much of pandas/core/common
>>>>> >> >>> >> - categorical and custom dtypes
>>>>> >> >>> >> - all indexing mechanisms
>>>>> >> >>> >>
>>>>> >> >>> >> I'm concerned we've already exposed too many internals to
>>>>> >> >>> >> users, so
>>>>> >> >>> >> this might lead to a lot of API breakage, but it might be for
>>>>> >> >>> >> the
>>>>> >> >>> >> Greater Good. As a first step, beginning a partial migration
>>>>> >> >>> >> of
>>>>> >> >>> >> internals into some C++ classes that encapsulate the insides
>>>>> >> >>> >> of
>>>>> >> >>> >> DataFrame objects and implement indexing and block-level
>>>>> >> >>> >> manipulations
>>>>> >> >>> >> would be a good place to start. I think you could do this
>>>>> >> >>> >> without too much disruption.
>>>>> >> >>> >>
>>>>> >> >>> >> As part of this internal retooling we might give consideration
>>>>> >> >>> >> to
>>>>> >> >>> >> alternative data structures for representing data internal to
>>>>> >> >>> >> pandas
>>>>> >> >>> >> objects. Now in 2015/2016, continuing to be hamstrung by
>>>>> >> >>> >> NumPy's
>>>>> >> >>> >> limitations feels somewhat anachronistic. User code is riddled
>>>>> >> >>> >> with
>>>>> >> >>> >> workarounds for data type fidelity issues and the like. Like,
>>>>> >> >>> >> really,
>>>>> >> >>> >> why not add a bitndarray (similar to ilanschnell/bitarray) for
>>>>> >> >>> >> storing
>>>>> >> >>> >> nullness for problematic types and hide this from the user? =)
>>>>> >> >>> >>
>>>>> >> >>> >> Since we are now a NumFOCUS-sponsored project, I feel like we
>>>>> >> >>> >> might
>>>>> >> >>> >> consider establishing some formal governance over pandas and
>>>>> >> >>> >> publishing meetings notes and roadmap documents describing
>>>>> >> >>> >> plans
>>>>> >> >>> >> for
>>>>> >> >>> >> the project and meetings notes from committers. There's no
>>>>> >> >>> >> real
>>>>> >> >>> >> "committer culture" for NumFOCUS projects like there is with
>>>>> >> >>> >> the
>>>>> >> >>> >> Apache Software Foundation, but we might try leading by
>>>>> >> >>> >> example!
>>>>> >> >>> >>
>>>>> >> >>> >> Also, I believe pandas as a project has reached a level of
>>>>> >> >>> >> importance
>>>>> >> >>> >> where we ought to consider planning and execution on larger
>>>>> >> >>> >> scale
>>>>> >> >>> >> undertakings such as this for safeguarding the future.
>>>>> >> >>> >>
>>>>> >> >>> >> As for myself, well, I have my hands full in Big Data-land.
>>>>> >> >>> >> I wish I could be helping more with pandas, but there are
>>>>> >> >>> >> quite a few fundamental issues (like data interoperability,
>>>>> >> >>> >> nested data handling, and file format support — e.g.
>>>>> >> >>> >> Parquet, see
>>>>> >> >>> >>
>>>>> >> >>> >> http://wesmckinney.com/blog/the-problem-with-the-data-science-language-wars/)
>>>>> >> >>> >> preventing Python from being more useful in industry analytics
>>>>> >> >>> >> applications.
>>>>> >> >>> >>
>>>>> >> >>> >> Aside: one of the bigger mistakes I made with pandas's API
>>>>> >> >>> >> design
>>>>> >> >>> >> was
>>>>> >> >>> >> making it acceptable to call class constructors — like
>>>>> >> >>> >> pandas.DataFrame — directly (versus factory functions). Sorry
>>>>> >> >>> >> about
>>>>> >> >>> >> that! If we could convince everyone to start writing
>>>>> >> >>> >> pandas.data_frame
>>>>> >> >>> >> or dataframe instead of using the class reference it would
>>>>> >> >>> >> help a
>>>>> >> >>> >> lot
>>>>> >> >>> >> with code cleanup. It's hard to plan for these things — NumPy
>>>>> >> >>> >> interoperability seemed a lot more important in 2008 than it
>>>>> >> >>> >> does
>>>>> >> >>> >> now,
>>>>> >> >>> >> so I forgive myself.
>>>>> >> >>> >>
>>>>> >> >>> >> cheers and best wishes for 2016,
>>>>> >> >>> >> Wes
>>>>> >> >>> >> _______________________________________________
>>>>> >> >>> >> Pandas-dev mailing list
>>>>> >> >>> >> Pandas-dev at python.org
>>>>> >> >>> >> https://mail.python.org/mailman/listinfo/pandas-dev
>>>>> >> >>> >
>>>>> >> >>> >
>>>>> >
>>>>> >
>>>>>
>>>>>
>>>>>
>>>>
>>

