[Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap

Jeff Reback jeffreback at gmail.com
Wed Jan 6 14:45:45 EST 2016


I'll just apologize right up front! hahah.

No, I think I have been pushing on these extras in pandas to help move it
forward. I have commented a bit
on Stephan's issue here <https://github.com/pydata/pandas/issues/8350> about
why I didn't push for these in numpy. numpy is fairly slow moving
(though it moves faster lately, I suspect the pace when Wes was developing
pandas was not much faster).

So pandas was essentially 'fixing' lots of bug / compat issues in numpy.

To the extent that we can keep the current user-facing API the same (high
likelihood I think), I'm willing
to accept *some* breakage with the pandas->duck-like array container API
in order to provide swappable containers.

For example, I recall that in doing datetime w/tz we wanted
Series.values to return a numpy array (which it DOES!),
but it is actually lossy (it loses the tz). Same thing with the Categorical
example Wes gave. I don't think these requirements
should hold pandas back!
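
A small illustration of the lossiness described above, assuming a pandas
install with tz-aware datetime support. The variable names are only for
illustration.

```python
import numpy as np
import pandas as pd

# A tz-aware Series: the timezone lives in the pandas dtype, not in NumPy.
s = pd.Series(pd.date_range("2016-01-06", periods=3, tz="US/Eastern"))

values = s.values  # a plain NumPy datetime64 array

print(s.dtype)       # pandas-level dtype, carries the timezone
print(values.dtype)  # NumPy dtype, has no notion of a timezone
```

The NumPy array still holds the right instants, but nothing in it records
which timezone the Series was in.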

People are increasingly using pandas as the API for their work. That makes
it very important that we can handle
lots of input properly, w/o the handcuffs of numpy.

All this said, I'll reiterate Wes's (and others') points: back-compat is
extremely important. (I in fact try
to bend over backwards to provide this, sometimes it's too much of course!).
E.g. take the resample changes to the API.

I was originally going to just do a hard break, but it turns people off
when they have to update their code or else.

my 4c (incrementing!)

Jeff


On Wed, Jan 6, 2016 at 2:37 PM, Wes McKinney <wesmckinn at gmail.com> wrote:

> On Wed, Jan 6, 2016 at 11:26 AM, Wes McKinney <wesmckinn at gmail.com> wrote:
> > hey Stephan,
> >
> > Thanks for all the thoughts. Let me make a few off-the-cuff comments.
> >
> > On Wed, Jan 6, 2016 at 10:11 AM, Stephan Hoyer <shoyer at gmail.com> wrote:
> >> I was asked about this off list, so I'll belatedly share my thoughts.
> >>
> >> First of all, I am really excited by Wes's renewed engagement in the
> >> project and his interest in rewriting pandas internals. This is quite an
> >> ambitious plan and nobody is better positioned to tackle it than Wes.
> >>
> >> I have mixed feelings about the details of the rewrite itself.
> >>
> >> +1 on the simpler internal data model. The block manager is confusing
> >> and leads to hard-to-predict performance issues related to copying data.
> >> If we can do all column additions/removals/re-orderings without a copy
> >> it will be a clear win.
> >>
> >> +0 on moving internals to C++. I do like the performance benefits, but
> >> it seems like a lot of work, and it may make pandas less friendly to new
> >> contributors.
> >>
> >
> > It really goes beyond performance benefits. If you go back to my 2013
> > talk
> > http://www.slideshare.net/wesm/practical-medium-data-analytics-with-python
> > there's a long list of architectural problems that now in 2016 haven't
> > found solutions. The only way (that I can fully reason through -- I am
> > happy to look at alternate proposals) to move the internals of pandas
> > closer to the metal is to give Series and DataFrame a C/C++ API --
> > this is the "libpandas native core" as I've been describing.
>
> I should point out that the main thing that's changed since that preso
> is "synthetic" data types like Categorical. But seeing what it took
> for Jeff et al to build that is a prime motivation for this internals
> refactoring plan.
>
> >
> >> -0 on writing a brand new dtype system just for pandas -- this stuff
> >> really belongs in NumPy (or another array library like DyND), and I am
> >> skeptical that pandas can do a complete enough job to be useful without
> >> replicating all that functionality.
> >>
> >
> > I'm curious what "a brand new dtype system" means to you. pandas
> > already has its own data type system, but it's a potpourri of
> > inconsistencies and rough edges with self-evident problems for both
> > users and developers. Some indicators:
> >
> > - Some pandas types use NaN for missing data, others None (or both),
> > others nothing at all. We lose data (integers) or bloat memory
> > (booleans) by upcasting to float-NaN or object-None.
> > - Internal functions full of is_XXX_dtype functions:
> > pandas.core.common, pandas.core.algorithms, etc.
> > - Series.values on synthetic dtypes like Categorical
> > - We use arrays of Python objects for string data
> >
> > The biggest cause IMHO is that pandas is too tightly coupled to NumPy,
> > but it's coupled in a way that makes development and extensibility
> > difficult. We've already allowed NumPy-specific details to taint the
> > pandas user API in many unpleasant ways. This isn't to say "NumPy is
> > bad" but rather "pandas tries to layer domain-specific functionality
> > [that NumPy was not designed for] on top".
> >
> > Some things I'm advocating with the internals refactor:
> >
> > 1) First class "pandas type" objects. This is not the same as a NumPy
> > dtype which has some pretty loaded implications -- in particular,
> > NumPy dtypes are implicitly coupled to an array computing framework
> > (see the function table that is attached to the PyArray_Descr object)
> >
> > 2) Pandas array container types that map user-land API calls to
> > implementation-land API calls (in NumPy, DyND, or pandas-native code
> > like pandas.core.algorithms etc.). This will make it much easier to
> > leverage innovations in NumPy and DyND without those implementation
> > details spilling over into the pandas user API
> >
> > 3) Adding a single pandas.NA singleton to have one library-wide notion
> > of a scalar null value (obviously, we can automatically map NaN and
> > None to NA for backwards compatibility).
> >
> > 4) Layering a bitmask internally on NumPy arrays (especially integer
> > and boolean) to add null-ness to types that need it. Note that this
> > does not prevent us from switching to DyND arrays with option dtype in
> > the future. If the details of how we are implementing NULL are visible
> > to the user, we have failed.
> >
> > 5) Removing the block manager in favor of simpler pandas Array (1D)
> > and Table (2D -- vector of Array) data structures
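
A minimal sketch of items 3 and 4 above, assuming NumPy is available. The
names `NAType`, `NA`, `is_null` and `MaskedInt64Array` are hypothetical and
not an existing pandas API. For brevity it uses a boolean array where a real
implementation would more likely pack the mask into bits.

```python
import math

import numpy as np


class NAType:
    """Hypothetical library-wide null scalar (one instance only)."""

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __repr__(self):
        return "NA"


NA = NAType()


def is_null(value):
    """Map the legacy null spellings (None, float NaN) onto NA."""
    return value is NA or value is None or (
        isinstance(value, float) and math.isnan(value)
    )


class MaskedInt64Array:
    """int64 data plus a boolean mask (True means 'is null')."""

    def __init__(self, values):
        values = list(values)
        self._mask = np.array([is_null(v) for v in values], dtype=bool)
        # Null slots get a placeholder 0; the mask is the source of truth.
        self._data = np.array(
            [0 if m else v for v, m in zip(values, self._mask)], dtype=np.int64
        )

    def __len__(self):
        return len(self._data)

    def __getitem__(self, i):
        return NA if self._mask[i] else int(self._data[i])

    def __setitem__(self, i, value):
        if is_null(value):
            self._mask[i] = True
            self._data[i] = 0
        else:
            self._mask[i] = False
            self._data[i] = value


arr = MaskedInt64Array([1, None, 3, float("nan")])
print([arr[i] for i in range(len(arr))])  # [1, NA, 3, NA]
```

The data stays int64 throughout. Callers only ever see `NA`, never the
placeholder or the mask.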
> >
> > I believe you can do all this without harming interoperability with
> > the ecosystem of projects that people currently use in conjunction
> > with pandas.
> >
> >> More broadly, I am concerned that this rewrite may improve the tabular
> >> computation ecosystem at the cost of inter-operability with the
> >> array-based ecosystem (numpy, scipy, sklearn, xarray, etc.). The latter
> >> has been one of the strengths of pandas and it would be a shame to see
> >> that go away.
> >>
> >
> > I have no intention of letting this happen. What I am asking from
> > you (and others reading) is to help define what constitutes
> > interoperability. What guarantees do we make to the user?
> >
> > For example, we should have very strict guidelines for the output of:
> >
> > np.asarray(pandas_obj)
> >
> > For example
> >
> > In [3]: s = pd.Series([1,2,3]*10).astype('category')
> >
> > In [4]: np.asarray(s)
> > Out[4]:
> > array([1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2,
> >        3, 1, 2, 3, 1, 2, 3])
> >
> > I see no reason why this should necessarily behave any differently.
> > The problem will come in when there is pandas data that is not
> > precisely representable in a NumPy array. Example:
> >
> > In [5]: s = pd.Series([1,2,3, 4])
> >
> > In [6]: s.dtype
> > Out[6]: dtype('int64')
> >
> > In [7]: s2 = s.reindex(np.arange(10))
> >
> > In [8]: s2.dtype
> > Out[8]: dtype('float64')
> >
> > In [9]: np.asarray(s2)
> > Out[9]: array([  1.,   2.,   3.,   4.,  nan,  nan,  nan,  nan,  nan,  nan])
> >
> > With the "new internals", s2 will still be int64 type, but we may
> > decide that np.asarray(s2) should raise an exception rather than
> > implicitly make a decision about how to perform a "lossy" conversion
> > to a NumPy array. If you are using DyND with pandas, then the
> > equivalent function would be able to implicitly convert without data
> > loss.
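
One way the "raise rather than convert lossily" option above could look,
sketched with NumPy's `__array__` hook. `NullableInt64` and its `to_numpy`
signature are hypothetical, chosen only to illustrate the contract.

```python
import numpy as np


class NullableInt64:
    """Hypothetical int64 container with a separate null mask."""

    def __init__(self, data, mask):
        self._data = np.asarray(data, dtype=np.int64)
        self._mask = np.asarray(mask, dtype=bool)  # True means 'is null'

    def __array__(self, dtype=None, copy=None):
        # Refuse to pick a lossy representation on the caller's behalf.
        if self._mask.any():
            raise ValueError(
                "int64 data with nulls has no exact NumPy representation; "
                "use to_numpy(na_value=...) to choose one explicitly"
            )
        return self._data if dtype is None else self._data.astype(dtype)

    def to_numpy(self, na_value):
        """Explicit, opt-in conversion: nulls become `na_value` as float64."""
        out = self._data.astype(np.float64)
        out[self._mask] = na_value
        return out


complete = NullableInt64([1, 2, 3, 4], [False] * 4)
print(np.asarray(complete))  # exact, so no decision is needed

partial = NullableInt64([1, 2, 0, 0], [False, False, True, True])
try:
    np.asarray(partial)
except ValueError as exc:
    print("refused:", exc)
print(partial.to_numpy(na_value=np.nan))  # the caller opted in to float/NaN
```
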
> >
> >> We're already starting to struggle with inter-operability with the new
> >> pandas dtypes and a further rewrite would make this even harder.
> >> For example, see categoricals and scikit-learn in Tom's recent post [1],
> >> or the fact that .values no longer always returns a numpy array. This has
> >> also been a challenge for xarray, which can't handle these new dtypes
> >> because we lack a suitable array backend for them.
> >
> > I'm definitely motivated in this initiative by these challenges. The
> > idea here is that with the new internals, Series.values will always
> > return the same type of object, and there will be one consistent code
> > path for getting a NumPy array out. For example, rather than:
> >
> > if isinstance(s.values, Categorical):
> >     # pandas
> >     ...
> > else:
> >     # NumPy
> >     ...
> >
> > We could have (just an idea)
> >
> > s.values.to_numpy()
> >
> > Or simply
> >
> > np.asarray(s.values)
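
A rough sketch of the "one consistent code path" idea above, assuming every
container shares a common base. `BaseArray`, `Int64Array` and
`CategoryArray` are hypothetical names, not pandas classes.

```python
import numpy as np


class BaseArray:
    """Hypothetical common base: every container exposes to_numpy()."""

    def to_numpy(self):
        raise NotImplementedError

    def __array__(self, dtype=None, copy=None):
        # np.asarray(obj) is routed through the same to_numpy() path.
        out = self.to_numpy()
        return out if dtype is None else out.astype(dtype)


class Int64Array(BaseArray):
    def __init__(self, data):
        self._data = np.asarray(data, dtype=np.int64)

    def to_numpy(self):
        return self._data


class CategoryArray(BaseArray):
    """Stores small integer codes plus the distinct categories."""

    def __init__(self, values):
        self.categories, self.codes = np.unique(values, return_inverse=True)

    def to_numpy(self):
        # Materialize the logical values by indexing categories with codes.
        return self.categories[self.codes]


for container in (Int64Array([1, 2, 3]), CategoryArray([1, 2, 3] * 2)):
    # No isinstance() branching: the caller treats both the same way.
    print(type(container).__name__, np.asarray(container))
```
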
> >
> >>
> >> Personally, I would much rather leverage a full featured library like an
> >> improved NumPy or DyND for new dtypes, because that could also be used
> >> by the array-based ecosystem. At the very least, it would be good to think
> >> about zero-copy inter-operability with array-based tools.
> >>
> >
> > I'm all for zero-copy interoperability when possible, but my gut
> > feeling is that exposing the data type system of an array library (the
> > choice of which is an implementation detail) to pandas users is an
> > inherent leaky abstraction that will continue to cause problems if we
> > plan to keep innovating inside pandas. By better hiding NumPy details
> > and types from the user we will make it much easier to swap out new
> > low level array data structures and compute components (e.g. DyND), or
> > add custom data structures or out-of-core tools (memory maps, bcolz,
> > etc.)
> >
> > I'm additionally offering to do nearly all of this replumbing of
> > pandas internals myself, and completely in my free time. What I will
> > expect in return from you all is to help enumerate our contracts with
> > the pandas user (i.e. interoperability) and to hold me accountable to
> > not break them. I know I haven't been committing code on pandas since
> > mid-2013 (after a 5 year marathon), but these architectural problems
> > have been on my mind almost constantly since then, I just haven't had
> > the bandwidth to start tackling them.
> >
> > cheers,
> > Wes
> >
> >> On the other hand, I wonder if maybe it would be better to write a
> >> native in-memory backend for Ibis instead of rewriting pandas. Ibis does
> >> seem to have an improved/simplified API which resolves many of pandas's
> >> warts. That said, it's a pretty big change from the "DataFrame as matrix"
> >> model, and pandas won't be going away anytime soon. I do like that it
> >> would force users to be more explicit about converting between tables and
> >> arrays, which might also make distinctions between the tabular and array
> >> oriented ecosystems easier to swallow.
> >>
> >> Just my two cents, from someone who has lots of opinions but who will
> >> likely stay on the sidelines for most of this work.
> >>
> >> Cheers,
> >> Stephan
> >>
> >> [1] http://tomaugspurger.github.io/categorical-pipelines.html
> >>
> >> On Fri, Jan 1, 2016 at 6:06 PM, Jeff Reback <jeffreback at gmail.com> wrote:
> >>>
> >>> ok I moved the document to the Pandas folder, where the same group
> >>> should be able to edit/upload/etc. lmk if any issues
> >>>
> >>> On Fri, Jan 1, 2016 at 8:48 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
> >>>>
> >>>> Thanks Jeff. Can you create and share a shared Drive folder containing
> >>>> this where I can put other auxiliary / follow up documents?
> >>>>
> >>>> On Fri, Jan 1, 2016 at 5:23 PM, Jeff Reback <jeffreback at gmail.com> wrote:
> >>>> > I changed the doc so that the core dev people can edit. I *think*
> >>>> > that everyone should be able to view/comment though.
> >>>> >
> >>>> > On Fri, Jan 1, 2016 at 8:13 PM, Wes McKinney <wesmckinn at gmail.com>
> >>>> > wrote:
> >>>> >>
> >>>> >> Jeff -- can you require log-in for editing on this document?
> >>>> >>
> >>>> >> https://docs.google.com/document/d/151ct8jcZWwh7XStptjbLsda6h2b3C0IuiH_hfZnUA58/edit#
> >>>> >>
> >>>> >> There are a number of anonymous edits.
> >>>> >>
> >>>> >> On Wed, Dec 30, 2015 at 6:04 PM, Wes McKinney <wesmckinn at gmail.com>
> >>>> >> wrote:
> >>>> >> > I cobbled together an ugly start of a c++->cython->pandas
> >>>> >> > toolchain here
> >>>> >> >
> >>>> >> > https://github.com/wesm/pandas/tree/libpandas-native-core
> >>>> >> >
> >>>> >> > I used a mix of Kudu, Impala, and dynd-python cmake sources, so
> >>>> >> > it's a bit messy at the moment but it should be sufficient to run
> >>>> >> > some real experiments with a little more work. I reckon it's like
> >>>> >> > a 6 month project to tear out the insides of Series and DataFrame
> >>>> >> > and replace them with a new "native core", but we should be able
> >>>> >> > to get enough info to see whether it's a viable plan within a
> >>>> >> > month or so.
> >>>> >> >
> >>>> >> > The end goal is to create "private" extension types in Cython
> >>>> >> > that can be the new base classes for Series and NDFrame; these
> >>>> >> > will hold a reference to a C++ object that contains wrapped
> >>>> >> > NumPy arrays and other metadata (like pandas-only dtypes).
> >>>> >> >
> >>>> >> > It might be too hard to try to replace a single usage of block
> >>>> >> > manager as a first experiment, so I'll try to create a minimal
> >>>> >> > "SeriesLite" that supports 3 dtypes
> >>>> >> >
> >>>> >> > 1) float64 with nans
> >>>> >> > 2) int64 with a bitmask for NAs
> >>>> >> > 3) category type for one of these
> >>>> >> >
> >>>> >> > Just want to get a feel for the extensibility and offer an NA
> >>>> >> > singleton Python object (a la None) for getting and setting NAs
> >>>> >> > across these 3 dtypes.
> >>>> >> >
> >>>> >> > If we end up going down this route, any way to place a
> >>>> >> > moratorium on invasive work on pandas internals (outside bug
> >>>> >> > fixes)?
> >>>> >> >
> >>>> >> > Pedantic aside: I'd rather avoid shipping thirdparty C/C++
> >>>> >> > libraries like googletest and friends in pandas if we can.
> >>>> >> > Cloudera folks have been working on a portable C++ library
> >>>> >> > toolchain for Impala and other projects at
> >>>> >> > https://github.com/cloudera/native-toolchain, but it is only
> >>>> >> > being tested on Linux and OS X. Most google libraries should
> >>>> >> > build out of the box on MSVC but it'll be something to keep an
> >>>> >> > eye on.
> >>>> >> >
> >>>> >> > BTW thanks to the libdynd developers for pioneering the c++ lib
> >>>> >> > <-> python-c++ lib <-> cython toolchain; being able to build
> >>>> >> > Cython extensions directly from cmake is a godsend
> >>>> >> >
> >>>> >> > HNY all
> >>>> >> > Wes
> >>>> >> >
> >>>> >> > On Tue, Dec 29, 2015 at 4:17 PM, Irwin Zaid <izaid at continuum.io>
> >>>> >> > wrote:
> >>>> >> >> Yeah, that seems reasonable and I totally agree a Pandas wrapper
> >>>> >> >> layer would be necessary.
> >>>> >> >>
> >>>> >> >> I'll keep an eye on this and I'd like to help if I can.
> >>>> >> >>
> >>>> >> >> Irwin
> >>>> >> >>
> >>>> >> >>
> >>>> >> >> On Tue, Dec 29, 2015 at 6:01 PM, Wes McKinney <wesmckinn at gmail.com>
> >>>> >> >> wrote:
> >>>> >> >>>
> >>>> >> >>> I'm not suggesting a rewrite of NumPy functionality but rather
> >>>> >> >>> pandas functionality that is currently written in a mishmash of
> >>>> >> >>> Cython and Python. Happy to experiment with changing the
> >>>> >> >>> internal compute infrastructure and data representation to DyND
> >>>> >> >>> after this first stage of cleanup is done. Even if we use DyND a
> >>>> >> >>> pretty extensive pandas wrapper layer will be necessary.
> >>>> >> >>>
> >>>> >> >>>
> >>>> >> >>> On Tuesday, December 29, 2015, Irwin Zaid <izaid at continuum.io>
> >>>> >> >>> wrote:
> >>>> >> >>>>
> >>>> >> >>>> Hi Wes (and others),
> >>>> >> >>>>
> >>>> >> >>>> I've been following this conversation with interest. I do
> >>>> >> >>>> think it would be worth exploring DyND, rather than setting up
> >>>> >> >>>> yet another rewrite of NumPy-functionality. Especially because
> >>>> >> >>>> DyND is already an optional dependency of Pandas.
> >>>> >> >>>>
> >>>> >> >>>> For things like Integer NA and new dtypes, DyND is there and
> >>>> >> >>>> ready to do this.
> >>>> >> >>>>
> >>>> >> >>>> Irwin
> >>>> >> >>>>
> >>>> >> >>>> On Tue, Dec 29, 2015 at 5:18 PM, Wes McKinney
> >>>> >> >>>> <wesmckinn at gmail.com> wrote:
> >>>> >> >>>>>
> >>>> >> >>>>> Can you link to the PR you're talking about?
> >>>> >> >>>>>
> >>>> >> >>>>> I will see about spending a few hours setting up a
> >>>> >> >>>>> libpandas.so as a C++ shared library where we can run some
> >>>> >> >>>>> experiments and validate whether it can solve the integer-NA
> >>>> >> >>>>> problem and be a place to put new data types (categorical and
> >>>> >> >>>>> friends). I'm +1 on targeting
> >>>> >> >>>>>
> >>>> >> >>>>> Would it also be worth making a wish list of APIs we might
> >>>> >> >>>>> consider breaking in a pandas 1.0 release that also features
> >>>> >> >>>>> this new "native core"? Might as well right some wrongs while
> >>>> >> >>>>> we're doing some invasive work on the internals; some breakage
> >>>> >> >>>>> might be unavoidable. We can always maintain a pandas legacy
> >>>> >> >>>>> 0.x.x maintenance branch (providing a conda binary build) for
> >>>> >> >>>>> legacy users where showstopper bugs can get fixed.
> >>>> >> >>>>>
> >>>> >> >>>>> On Tue, Dec 29, 2015 at 1:20 PM, Jeff Reback
> >>>> >> >>>>> <jeffreback at gmail.com> wrote:
> >>>> >> >>>>> > Wes your last is noted as well. I *think* we can actually
> >>>> >> >>>>> > do this now (well there is a PR out there).
> >>>> >> >>>>> >
> >>>> >> >>>>> > On Tue, Dec 29, 2015 at 4:12 PM, Wes McKinney
> >>>> >> >>>>> > <wesmckinn at gmail.com> wrote:
> >>>> >> >>>>> >>
> >>>> >> >>>>> >> The other huge thing this will enable is copy-on-write for
> >>>> >> >>>>> >> various kinds of views, which should cut down on some of
> >>>> >> >>>>> >> the defensive copying in the library and reduce memory
> >>>> >> >>>>> >> usage.
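
A toy sketch of the copy-on-write idea above, assuming NumPy buffers.
`CowArray` is a hypothetical name. The sharing flag is deliberately
simplistic, where a real implementation would track references to the
buffer.

```python
import numpy as np


class CowArray:
    """Hypothetical copy-on-write wrapper around a NumPy buffer."""

    def __init__(self, data, _shared=False):
        self._data = data
        self._shared = _shared  # True while another wrapper may share the buffer

    def view(self):
        # No copy here: both wrappers point at the same buffer for now.
        self._shared = True
        return CowArray(self._data, _shared=True)

    def __getitem__(self, i):
        return self._data[i]

    def __setitem__(self, i, value):
        if self._shared:
            # First write: take a private copy so the other wrapper is unaffected.
            self._data = self._data.copy()
            self._shared = False
        self._data[i] = value


base = CowArray(np.array([1, 2, 3]))
alias = base.view()       # shares the buffer, nothing copied yet
alias[0] = 99             # the write triggers the copy
print(base[0], alias[0])  # 1 99
```

Taking a view costs nothing up front. The copy happens only if and when
someone writes, which is what removes the need for defensive copies.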
> >>>> >> >>>>> >>
> >>>> >> >>>>> >> On Tue, Dec 29, 2015 at 1:02 PM, Wes McKinney
> >>>> >> >>>>> >> <wesmckinn at gmail.com>
> >>>> >> >>>>> >> wrote:
> >>>> >> >>>>> >> > Basically the approach is
> >>>> >> >>>>> >> >
> >>>> >> >>>>> >> > 1) Base dtype type
> >>>> >> >>>>> >> > 2) Base array type with K >= 1 dimensions
> >>>> >> >>>>> >> > 3) Base scalar type
> >>>> >> >>>>> >> > 4) Base index type
> >>>> >> >>>>> >> > 5) "Wrapper" subclasses for all NumPy types fitting into
> >>>> >> >>>>> >> >    categories #1, #2, #3, #4
> >>>> >> >>>>> >> > 6) Subclasses for pandas-specific types like category,
> >>>> >> >>>>> >> >    datetimeTZ, etc.
> >>>> >> >>>>> >> > 7) NDFrame as cpcloud wrote is just a list of these
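
A minimal sketch of item 7 above, assuming a frame is nothing more than an
ordered list of named 1D arrays. `Table` and its methods are hypothetical
names. The point is that adding, dropping and reordering columns only
touches the list, never the column buffers.

```python
import numpy as np


class Table:
    """Hypothetical 2D container: an ordered list of named 1D columns."""

    def __init__(self, columns):
        self._names = list(columns)
        # np.asarray returns an existing ndarray as-is, so nothing is copied.
        self._columns = [np.asarray(columns[name]) for name in self._names]

    def column(self, name):
        return self._columns[self._names.index(name)]

    def add_column(self, name, values):
        self._names.append(name)
        self._columns.append(np.asarray(values))

    def drop_column(self, name):
        i = self._names.index(name)
        del self._names[i], self._columns[i]

    def reorder(self, names):
        # Look columns up under the old order before replacing the names.
        self._columns = [self.column(n) for n in names]
        self._names = list(names)


a = np.arange(3)
t = Table({"a": a, "b": np.ones(3)})
t.add_column("c", np.zeros(3))
t.reorder(["c", "a", "b"])
print(t.column("a") is a)  # True: the original buffer was never copied
```
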
> >>>> >> >>>>> >> >
> >>>> >> >>>>> >> > Indexes and axis labels / column names can get layered
> >>>> >> >>>>> >> > on top.
> >>>> >> >>>>> >> >
> >>>> >> >>>>> >> > After we do all this we can look at adding nested types
> >>>> >> >>>>> >> > (arrays, maps, structs) to better support JSON.
> >>>> >> >>>>> >> >
> >>>> >> >>>>> >> > - Wes
> >>>> >> >>>>> >> >
> >>>> >> >>>>> >> > On Tue, Dec 29, 2015 at 12:14 PM, Phillip Cloud
> >>>> >> >>>>> >> > <cpcloud at gmail.com> wrote:
> >>>> >> >>>>> >> >> Maybe this is saying the same thing as Wes, but how far
> >>>> >> >>>>> >> >> would something like this get us?
> >>>> >> >>>>> >> >>
> >>>> >> >>>>> >> >> // warning: things are probably not this simple
> >>>> >> >>>>> >> >>
> >>>> >> >>>>> >> >> struct data_array_t {
> >>>> >> >>>>> >> >>     void *primitive;  // scalar data
> >>>> >> >>>>> >> >>     data_array_t *nested;  // nested data
> >>>> >> >>>>> >> >>     // might have to create our own to avoid boost
> >>>> >> >>>>> >> >>     boost::dynamic_bitset<> isnull;
> >>>> >> >>>>> >> >>     schema_t schema;  // not sure exactly what this looks like
> >>>> >> >>>>> >> >> };
> >>>> >> >>>>> >> >>
> >>>> >> >>>>> >> >> // probably not this simple
> >>>> >> >>>>> >> >> typedef std::map<std::string, data_array_t> data_frame_t;
> >>>> >> >>>>> >> >>
> >>>> >> >>>>> >> >> To answer Jeff’s use-case question: I think that the
> >>>> >> >>>>> >> >> use cases are 1) freedom from numpy (mostly) 2) no more
> >>>> >> >>>>> >> >> block manager which frees us from the limitations of the
> >>>> >> >>>>> >> >> block memory layout. In particular, the ability to take
> >>>> >> >>>>> >> >> advantage of memory mapped IO would be a big win IMO.
> >>>> >> >>>>> >> >>
> >>>> >> >>>>> >> >>
> >>>> >> >>>>> >> >> On Tue, Dec 29, 2015 at 2:50 PM Wes McKinney
> >>>> >> >>>>> >> >> <wesmckinn at gmail.com> wrote:
> >>>> >> >>>>> >> >>>
> >>>> >> >>>>> >> >>> I will write a more detailed response to some of these
> >>>> >> >>>>> >> >>> things after the new year, but, in particular, re:
> >>>> >> >>>>> >> >>> missing values, can you or someone tell me why creating
> >>>> >> >>>>> >> >>> an object that contains a NumPy array and a bitmap is
> >>>> >> >>>>> >> >>> not sufficient? If we can add a lightweight C/C++ class
> >>>> >> >>>>> >> >>> layer between NumPy function calls (e.g. arithmetic) and
> >>>> >> >>>>> >> >>> pandas function calls, then I see no reason why we
> >>>> >> >>>>> >> >>> cannot have
> >>>> >> >>>>> >> >>>
> >>>> >> >>>>> >> >>> Int32Array->add
> >>>> >> >>>>> >> >>>
> >>>> >> >>>>> >> >>> and
> >>>> >> >>>>> >> >>>
> >>>> >> >>>>> >> >>> Float32Array->add
> >>>> >> >>>>> >> >>>
> >>>> >> >>>>> >> >>> do the right thing (the former would be responsible for
> >>>> >> >>>>> >> >>> bitmasking to propagate NA values; the latter would
> >>>> >> >>>>> >> >>> defer to NumPy). If we can put all the internals of
> >>>> >> >>>>> >> >>> pandas objects inside a black box, we can add layers of
> >>>> >> >>>>> >> >>> virtual function indirection without a performance
> >>>> >> >>>>> >> >>> penalty (e.g. adding more interpreter overhead with more
> >>>> >> >>>>> >> >>> abstraction layers does add up to a perf penalty).
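
The two `add` behaviours described above can be sketched in Python, assuming
NumPy. The class names follow the message, but the bodies are illustrative
guesses and not the proposed C++ layer.

```python
import numpy as np


class Float32Array:
    """NaN already encodes missing values, so addition defers to NumPy."""

    def __init__(self, data):
        self.data = np.asarray(data, dtype=np.float32)

    def add(self, other):
        return Float32Array(self.data + other.data)


class Int32Array:
    """Integers have no NaN, so a separate mask marks the nulls."""

    def __init__(self, data, mask=None):
        self.data = np.asarray(data, dtype=np.int32)
        # True means 'is null'; the default is "nothing missing".
        self.mask = (
            np.zeros(len(self.data), dtype=bool)
            if mask is None
            else np.asarray(mask, dtype=bool)
        )

    def add(self, other):
        # Compute on the raw buffers, then OR the masks: NA + x is NA.
        return Int32Array(self.data + other.data, self.mask | other.mask)


x = Int32Array([1, 2, 3], mask=[False, True, False])
y = Int32Array([10, 20, 30], mask=[False, False, True])
z = x.add(y)
print(z.data.dtype, z.mask)  # still int32; a slot is null if either input was

f = Float32Array([1.0, np.nan]).add(Float32Array([2.0, 3.0]))
print(f.data)  # NaN propagates through plain NumPy arithmetic
```
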
> >>>> >> >>>>> >> >>>
> >>>> >> >>>>> >> >>> I don't think this is too scary -- I would be willing
> >>>> >> >>>>> >> >>> to create a small POC C++ library to prototype something
> >>>> >> >>>>> >> >>> like what I'm talking about.
> >>>> >> >>>>> >> >>>
> >>>> >> >>>>> >> >>> Since pandas has limited points of contact with NumPy I
> >>>> >> >>>>> >> >>> don't think this would end up being too onerous.
> >>>> >> >>>>> >> >>>
> >>>> >> >>>>> >> >>> For the record, I'm pretty allergic to "advanced C++"; I
> >>>> >> >>>>> >> >>> think it is a useful tool, and if you pick a sane 20%
> >>>> >> >>>>> >> >>> subset of the C++11 spec and follow Google C++ style
> >>>> >> >>>>> >> >>> it's not very inaccessible to intermediate developers.
> >>>> >> >>>>> >> >>> More or less "C plus OOP and easier object lifetime
> >>>> >> >>>>> >> >>> management (shared/unique_ptr, etc.)". As soon as you
> >>>> >> >>>>> >> >>> add a lot of template metaprogramming, C++ library
> >>>> >> >>>>> >> >>> development quickly becomes inaccessible except to the
> >>>> >> >>>>> >> >>> C++-Jedi.
> >>>> >> >>>>> >> >>>
> >>>> >> >>>>> >> >>> Maybe let's start a Google document on "pandas roadmap"
> >>>> >> >>>>> >> >>> where we can break down the 1-2 year goals and some of
> >>>> >> >>>>> >> >>> these infrastructure issues and have our discussion
> >>>> >> >>>>> >> >>> there? (obviously publish this someplace once we're
> >>>> >> >>>>> >> >>> done)
> >>>> >> >>>>> >> >>>
> >>>> >> >>>>> >> >>> - Wes
> >>>> >> >>>>> >> >>>
> >>>> >> >>>>> >> >>> On Fri, Dec 25, 2015 at 2:14 PM, Jeff Reback
> >>>> >> >>>>> >> >>> <jeffreback at gmail.com> wrote:
> >>>> >> >>>>> >> >>> > Here are some of my thoughts about pandas Roadmap /
> >>>> >> >>>>> >> >>> > status and some responses to Wes's thoughts.
> >>>> >> >>>>> >> >>> >
> >>>> >> >>>>> >> >>> > In the last few (and upcoming) major releases we have
> >>>> >> >>>>> >> >>> > made the following changes:
> >>>> >> >>>>> >> >>> >
> >>>> >> >>>>> >> >>> > - dtype enhancements (Categorical, Timedelta, Datetime
> >>>> >> >>>>> >> >>> >   w/tz) & making these first class objects
> >>>> >> >>>>> >> >>> > - code refactoring to remove subclassing of ndarrays
> >>>> >> >>>>> >> >>> >   for Series & Index
> >>>> >> >>>>> >> >>> > - carving out / deprecating non-core parts of pandas
> >>>> >> >>>>> >> >>> >   - datareader
> >>>> >> >>>>> >> >>> >   - SparsePanel, WidePanel & other aliases (TimeSeries)
> >>>> >> >>>>> >> >>> >   - rpy, rplot, irow et al.
> >>>> >> >>>>> >> >>> >   - google-analytics
> >>>> >> >>>>> >> >>> > - API changes to make things more consistent
> >>>> >> >>>>> >> >>> >   - pd.rolling/expanding * -> .rolling/expanding (this
> >>>> >> >>>>> >> >>> >     is in master now)
> >>>> >> >>>>> >> >>> >   - .resample becoming fully deferred like groupby.
> >>>> >> >>>>> >> >>> >   - multi-index slicing along any level (obviates need
> >>>> >> >>>>> >> >>> >     for .xs) and allows assignment
> >>>> >> >>>>> >> >>> >   - .loc/.iloc - for the most part obviates use of .ix
> >>>> >> >>>>> >> >>> >   - .pipe & .assign
> >>>> >> >>>>> >> >>> >   - plotting accessors
> >>>> >> >>>>> >> >>> >   - fixing of the sorting API
> >>>> >> >>>>> >> >>> > - many performance enhancements both micro & macro
> >>>> >> >>>>> >> >>> >   (e.g. release GIL)
> >>>> >> >>>>> >> >>> >
> >>>> >> >>>>> >> >>> > Some on-deck enhancements are (meaning these are
> >>>> >> >>>>> >> >>> > basically ready to go in):
> >>>> >> >>>>> >> >>> >   - IntervalIndex (and eventually make PeriodIndex just
> >>>> >> >>>>> >> >>> >     a sub-class of this)
> >>>> >> >>>>> >> >>> >   - RangeIndex
> >>>> >> >>>>> >> >>> >
> >>>> >> >>>>> >> >>> > so lots of changes, though nothing really earth
> >>>> >> >>>>> >> >>> > shaking, just more convenience, reducing magicness
> >>>> >> >>>>> >> >>> > somewhat and providing flexibility.
> >>>> >> >>>>> >> >>> >
> >>>> >> >>>>> >> >>> > Of course we are getting increasing issues, mostly bug
> >>>> >> >>>>> >> >>> > reports (and lots of dupes), some edge case
> >>>> >> >>>>> >> >>> > enhancements which can add to the existing APIs and of
> >>>> >> >>>>> >> >>> > course, requests to expand the (already) large code to
> >>>> >> >>>>> >> >>> > other use cases. Balancing this are a good many
> >>>> >> >>>>> >> >>> > pull-requests from many different users, some even deep
> >>>> >> >>>>> >> >>> > into the internals.
> >>>> >> >>>>> >> >>> >
> >>>> >> >>>>> >> >>> > Here are some things that I have talked about and
> >>>> >> >>>>> >> >>> > could be considered for the roadmap. Disclaimer: I do
> >>>> >> >>>>> >> >>> > work for Continuum but these views are of course my
> >>>> >> >>>>> >> >>> > own; furthermore obviously I am a bit more familiar
> >>>> >> >>>>> >> >>> > with some of the 'sponsored' open-source libraries, but
> >>>> >> >>>>> >> >>> > always open to new things.
> >>>> >> >>>>> >> >>> >
> >>>> >> >>>>> >> >>> > - integration / automatic deferral to numba for JIT
> >>>> >> >>>>> >> >>> >   (this would be thru .apply)
> >>>> >> >>>>> >> >>> > - automatic deferral to dask from groupby where
> >>>> >> >>>>> >> >>> >   appropriate / maybe a .to_parallel (to simply return a
> >>>> >> >>>>> >> >>> >   dask.DataFrame object)
> >>>> >> >>>>> >> >>> > - incorporation of quantities / units (as part of the
> >>>> >> >>>>> >> >>> >   dtype)
> >>>> >> >>>>> >> >>> > - use of DyND to allow missing values for int dtypes
> >>>> >> >>>>> >> >>> > - make Period a first class dtype.
> >>>> >> >>>>> >> >>> > - provide some copy-on-write semantics to alleviate
> >>>> >> >>>>> >> >>> >   the chained-indexing issues which occasionally come up
> >>>> >> >>>>> >> >>> >   with the mis-use of the indexing API
> >>>> >> >>>>> >> >>> > - allow a 'policy' to automatically provide column
> >>>> >> >>>>> >> >>> >   blocks for dict-like input (e.g. each column would be
> >>>> >> >>>>> >> >>> >   a block), this would allow a pass-thru API where you
> >>>> >> >>>>> >> >>> >   could put in numpy arrays where you have views and
> >>>> >> >>>>> >> >>> >   have them preserved rather than copied automatically.
> >>>> >> >>>>> >> >>> >   Note that this would also allow what I call 'split'
> >>>> >> >>>>> >> >>> >   where a passed in multi-dim numpy array could be split
> >>>> >> >>>>> >> >>> >   up to individual blocks (which actually gives a nice
> >>>> >> >>>>> >> >>> >   perf boost after the splitting costs).
> >>>> >> >>>>> >> >>> >
> >>>> >> >>>>> >> >>> > In working towards some of these goals, I have come
> >>>> >> >>>>> >> >>> > to the opinion that it would make sense to have a
> >>>> >> >>>>> >> >>> > neutral API protocol layer that would allow us to
> >>>> >> >>>>> >> >>> > swap out different engines as needed, for particular
> >>>> >> >>>>> >> >>> > dtypes, or *maybe* out-of-core type computations.
> >>>> >> >>>>> >> >>> > E.g. imagine that we replaced the in-memory block
> >>>> >> >>>>> >> >>> > structure with a bcolz / memmap type; in theory this
> >>>> >> >>>>> >> >>> > should be 'easy' and just work. I could also see us
> >>>> >> >>>>> >> >>> > adopting *some* of the SFrame code to allow easier
> >>>> >> >>>>> >> >>> > interop with this API layer.
> >>>> >> >>>>> >> >>> >
> >>>> >> >>>>> >> >>> > In practice, I think a nice API layer would need to
> >>>> >> >>>>> >> >>> > be created to make this clean / nice.
> >>>> >> >>>>> >> >>> >
> >>>> >> >>>>> >> >>> > So this comes around to Wes's point about creating a
> >>>> >> >>>>> >> >>> > c++ library for the internals (and possibly even
> >>>> >> >>>>> >> >>> > some of the indexing routines). In an ideal world,
> >>>> >> >>>>> >> >>> > of course this would be desirable. Getting there is
> >>>> >> >>>>> >> >>> > a bit non-trivial I think, and IMHO might not be
> >>>> >> >>>>> >> >>> > worth the effort. I don't really see big performance
> >>>> >> >>>>> >> >>> > bottlenecks. We *already* defer much of the
> >>>> >> >>>>> >> >>> > computation to libraries like numexpr & bottleneck
> >>>> >> >>>>> >> >>> > (where appropriate). Adding numba / dask to the list
> >>>> >> >>>>> >> >>> > would be helpful.
> >>>> >> >>>>> >> >>> >
> >>>> >> >>>>> >> >>> > I think that almost all performance issues are the
> >>>> >> >>>>> >> >>> > result of:
> >>>> >> >>>>> >> >>> >
> >>>> >> >>>>> >> >>> > a) gross misuse of the pandas API. How much code
> >>>> >> >>>>> >> >>> >    have you seen that does
> >>>> >> >>>>> >> >>> >    df.apply(lambda x: x.sum())?
> >>>> >> >>>>> >> >>> > b) routines which operate column-by-column rather
> >>>> >> >>>>> >> >>> >    than block-by-block and are in python space (e.g.
> >>>> >> >>>>> >> >>> >    we have an issue right now about .quantile)
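
[Illustration of point (a) above, not part of the original message. A
minimal sketch, assuming a small all-numeric frame; the column names and
data are made up. It only shows that the built-in reduction returns the
same result as the per-column lambda, without claiming any particular
speedup.]

```python
import numpy as np
import pandas as pd

# Hypothetical example frame; names and values are illustrative only.
df = pd.DataFrame(
    np.arange(12, dtype="float64").reshape(4, 3),
    columns=["a", "b", "c"],
)

# The pattern quoted above: one Python-level function call per column.
via_apply = df.apply(lambda x: x.sum())

# The built-in reduction computes the same column sums directly.
via_sum = df.sum()

# Both are Series indexed by column name with identical values.
assert via_apply.equals(via_sum)
```

[On frames with many columns the difference is the per-column Python
call overhead; timings depend on the pandas version and the data, so
none are given here.]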
> >>>> >> >>>>> >> >>> >
> >>>> >> >>>>> >> >>> > So I am glossing over a big goal of having a c++
> >>>> >> >>>>> >> >>> > library that represents the pandas internals. This
> >>>> >> >>>>> >> >>> > would by definition have a c-API so that you *could*
> >>>> >> >>>>> >> >>> > use pandas-like semantics in c/c++ and just have it
> >>>> >> >>>>> >> >>> > work (and then pandas would be a thin wrapper around
> >>>> >> >>>>> >> >>> > this library).
> >>>> >> >>>>> >> >>> >
> >>>> >> >>>>> >> >>> > I am not averse to this, but I think it would be
> >>>> >> >>>>> >> >>> > quite a big effort, and not a huge perf boost IMHO.
> >>>> >> >>>>> >> >>> > Further, there are a number of API issues w.r.t.
> >>>> >> >>>>> >> >>> > indexing which need to be clarified / worked out
> >>>> >> >>>>> >> >>> > (e.g. should we simply deprecate []) that are much
> >>>> >> >>>>> >> >>> > easier to test / figure out in python space.
> >>>> >> >>>>> >> >>> >
> >>>> >> >>>>> >> >>> > I also think that we have quite a large number of
> >>>> >> >>>>> >> >>> > contributors. Moving to c++ might make the internals
> >>>> >> >>>>> >> >>> > a bit more impenetrable than the current internals
> >>>> >> >>>>> >> >>> > (though this would allow c++ people to contribute,
> >>>> >> >>>>> >> >>> > so that might balance out).
> >>>> >> >>>>> >> >>> >
> >>>> >> >>>>> >> >>> > We have a limited core of devs who right now are
> >>>> >> >>>>> >> >>> > familiar with things. If someone happened to have a
> >>>> >> >>>>> >> >>> > starting base for a c++ library, then I might change
> >>>> >> >>>>> >> >>> > opinions here.
> >>>> >> >>>>> >> >>> >
> >>>> >> >>>>> >> >>> >
> >>>> >> >>>>> >> >>> > my 4c.
> >>>> >> >>>>> >> >>> >
> >>>> >> >>>>> >> >>> > Jeff
> >>>> >> >>>>> >> >>> >
> >>>> >> >>>>> >> >>> >
> >>>> >> >>>>> >> >>> >
> >>>> >> >>>>> >> >>> >
> >>>> >> >>>>> >> >>> > On Thu, Dec 24, 2015 at 7:18 PM, Wes McKinney <wesmckinn at gmail.com> wrote:
> >>>> >> >>>>> >> >>> >>
> >>>> >> >>>>> >> >>> >> Deep thoughts during the holidays.
> >>>> >> >>>>> >> >>> >>
> >>>> >> >>>>> >> >>> >> I might be out of line here, but the
> >>>> >> >>>>> >> >>> >> interpreter-heaviness of the inside of pandas
> >>>> >> >>>>> >> >>> >> objects is likely to be a long-term liability and
> >>>> >> >>>>> >> >>> >> source of performance problems and technical debt.
> >>>> >> >>>>> >> >>> >>
> >>>> >> >>>>> >> >>> >> Has anyone put any thought into planning and
> >>>> >> >>>>> >> >>> >> beginning to execute on a rewrite that moves as
> >>>> >> >>>>> >> >>> >> much as possible of the internals into native /
> >>>> >> >>>>> >> >>> >> compiled code? I'm talking about:
> >>>> >> >>>>> >> >>> >>
> >>>> >> >>>>> >> >>> >> - pandas/core/internals
> >>>> >> >>>>> >> >>> >> - indexing and assignment
> >>>> >> >>>>> >> >>> >> - much of pandas/core/common
> >>>> >> >>>>> >> >>> >> - categorical and custom dtypes
> >>>> >> >>>>> >> >>> >> - all indexing mechanisms
> >>>> >> >>>>> >> >>> >>
> >>>> >> >>>>> >> >>> >> I'm concerned we've already exposed too many
> >>>> >> >>>>> >> >>> >> internals to users, so this might lead to a lot of
> >>>> >> >>>>> >> >>> >> API breakage, but it might be for the Greater Good.
> >>>> >> >>>>> >> >>> >> As a first step, beginning a partial migration of
> >>>> >> >>>>> >> >>> >> internals into some C++ classes that encapsulate
> >>>> >> >>>>> >> >>> >> the insides of DataFrame objects and implement
> >>>> >> >>>>> >> >>> >> indexing and block-level manipulations would be a
> >>>> >> >>>>> >> >>> >> good place to start. I think you could do this
> >>>> >> >>>>> >> >>> >> without too much disruption.
> >>>> >> >>>>> >> >>> >>
> >>>> >> >>>>> >> >>> >> As part of this internal retooling we might give
> >>>> >> >>>>> >> >>> >> consideration to alternative data structures for
> >>>> >> >>>>> >> >>> >> representing data internal to pandas objects. Now
> >>>> >> >>>>> >> >>> >> in 2015/2016, continuing to be hamstrung by NumPy's
> >>>> >> >>>>> >> >>> >> limitations feels somewhat anachronistic. User code
> >>>> >> >>>>> >> >>> >> is riddled with workarounds for data type fidelity
> >>>> >> >>>>> >> >>> >> issues and the like. Like, really, why not add a
> >>>> >> >>>>> >> >>> >> bitndarray (similar to ilanschnell/bitarray) for
> >>>> >> >>>>> >> >>> >> storing nullness for problematic types and hide
> >>>> >> >>>>> >> >>> >> this from the user? =)
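
[Illustration of the nullness-bitmap idea above, not part of the
original message. A minimal sketch, assuming plain NumPy: np.packbits
stands in for the proposed bitndarray, and the helper names is_null and
masked_sum are hypothetical. It is not how pandas stores missing values.]

```python
import numpy as np

# Integer payload; slots marked invalid hold an arbitrary filler (0 here).
values = np.array([10, 20, 0, 40, 0], dtype=np.int64)
valid = np.array([True, True, False, True, False])

# One bit per element, padded with zeros up to a whole byte.
bits = np.packbits(valid)


def is_null(bits, i):
    """Return True if element i is marked missing in the packed mask."""
    # packbits is big-endian within a byte: element i is bit (7 - i % 8).
    bit = (int(bits[i // 8]) >> (7 - i % 8)) & 1
    return bit == 0


def masked_sum(values, bits):
    """Sum only the elements whose validity bit is set."""
    mask = np.unpackbits(bits)[: len(values)].astype(bool)
    return int(values[mask].sum())


print(is_null(bits, 2), masked_sum(values, bits))
```

[The point of the sketch is that the integer dtype is preserved and the
mask costs one bit per element, instead of casting to float to make room
for NaN.]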
> >>>> >> >>>>> >> >>> >>
> >>>> >> >>>>> >> >>> >> Since we are now a NumFOCUS-sponsored project, I
> >>>> >> >>>>> >> >>> >> feel like we might consider establishing some
> >>>> >> >>>>> >> >>> >> formal governance over pandas and publishing
> >>>> >> >>>>> >> >>> >> roadmap documents describing plans for the project
> >>>> >> >>>>> >> >>> >> and meeting notes from committers. There's no real
> >>>> >> >>>>> >> >>> >> "committer culture" for NumFOCUS projects like
> >>>> >> >>>>> >> >>> >> there is with the Apache Software Foundation, but
> >>>> >> >>>>> >> >>> >> we might try leading by example!
> >>>> >> >>>>> >> >>> >>
> >>>> >> >>>>> >> >>> >> Also, I believe pandas as a project has reached a
> >>>> >> >>>>> >> >>> >> level of importance where we ought to consider
> >>>> >> >>>>> >> >>> >> planning and execution on larger scale undertakings
> >>>> >> >>>>> >> >>> >> such as this for safeguarding the future.
> >>>> >> >>>>> >> >>> >>
> >>>> >> >>>>> >> >>> >> As for myself, well, I have my hands full in Big
> >>>> >> >>>>> >> >>> >> Data-land. I wish I could be helping more with
> >>>> >> >>>>> >> >>> >> pandas, but there are quite a few fundamental
> >>>> >> >>>>> >> >>> >> issues (like data interoperability, nested data
> >>>> >> >>>>> >> >>> >> handling, and file format support, e.g. Parquet;
> >>>> >> >>>>> >> >>> >> see
> >>>> >> >>>>> >> >>> >> http://wesmckinney.com/blog/the-problem-with-the-data-science-language-wars/)
> >>>> >> >>>>> >> >>> >> preventing Python from being more useful in
> >>>> >> >>>>> >> >>> >> industry analytics applications.
> >>>> >> >>>>> >> >>> >>
> >>>> >> >>>>> >> >>> >> Aside: one of the bigger mistakes I made with
> >>>> >> >>>>> >> >>> >> pandas's API design was making it acceptable to
> >>>> >> >>>>> >> >>> >> call class constructors, like pandas.DataFrame,
> >>>> >> >>>>> >> >>> >> directly (versus factory functions). Sorry about
> >>>> >> >>>>> >> >>> >> that! If we could convince everyone to start
> >>>> >> >>>>> >> >>> >> writing pandas.data_frame or dataframe instead of
> >>>> >> >>>>> >> >>> >> using the class reference it would help a lot with
> >>>> >> >>>>> >> >>> >> code cleanup. It's hard to plan for these things;
> >>>> >> >>>>> >> >>> >> NumPy interoperability seemed a lot more important
> >>>> >> >>>>> >> >>> >> in 2008 than it does now, so I forgive myself.
> >>>> >> >>>>> >> >>> >>
> >>>> >> >>>>> >> >>> >> cheers and best wishes for 2016,
> >>>> >> >>>>> >> >>> >> Wes
> >>>> >> >>>>> >> >>> >> _______________________________________________
> >>>> >> >>>>> >> >>> >> Pandas-dev mailing list
> >>>> >> >>>>> >> >>> >> Pandas-dev at python.org
> >>>> >> >>>>> >> >>> >> https://mail.python.org/mailman/listinfo/pandas-dev
> >>>> >> >>>>> >> >>> >
> >>>> >> >>>>> >> >>> >