From jorisvandenbossche at gmail.com Fri Dec 18 12:04:17 2015
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Fri, 18 Dec 2015 18:04:17 +0100
Subject: [Pandas-dev] Notice: backwards incompatible change to the 'resample' method considered for 0.18
Message-ID:

Dear all,

At the moment we are considering a change to the API of the resample method. See the issue "API: change .resample to be a groupby-like API" and Jeff's corresponding PR 11841.

Basically, we want to make it more similar to groupby. Code that is now written as:

    s.resample('D', how='max')

would become a two-step operation:

    s.resample('D').max()

This change makes it more consistent with groupby (as downsampling can be seen as a special case of groupby), and at the same time enables the more powerful features of the groupby API for resampling.
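As a sketch against the proposed API (exact method availability may differ in the final PR), the deferred object composes like a groupby result:

    import pandas as pd
    import numpy as np

    s = pd.Series(np.arange(48.),
                  index=pd.date_range('2015-01-01', periods=48, freq='H'))

    r = s.resample('D')      # deferred, groupby-like object
    r.max()                  # equivalent to the old s.resample('D', how='max')
    r.agg(['min', 'max'])    # groupby-style aggregation also becomes available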
In the current version of the PR, it will not silently break your code: there is a deprecation warning when using resample in the old way, and a clear error is raised in some cases (when assigning to the result of resample).

Feedback always welcome!

Regards,
Joris

From wesmckinn at gmail.com Thu Dec 24 19:18:46 2015
From: wesmckinn at gmail.com (Wes McKinney)
Date: Thu, 24 Dec 2015 16:18:46 -0800
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++?
Message-ID:

Deep thoughts during the holidays.

I might be out of line here, but the interpreter-heaviness of the inside of pandas objects is likely to be a long-term liability and source of performance problems and technical debt.

Has anyone put any thought into planning and beginning to execute on a rewrite that moves as much as possible of the internals into native / compiled code? I'm talking about:

- pandas/core/internals
- indexing and assignment
- much of pandas/core/common
- categorical and custom dtypes
- all indexing mechanisms

I'm concerned we've already exposed too much of the internals to users, so this might lead to a lot of API breakage, but it might be for the Greater Good. As a first step, beginning a partial migration of internals into some C++ classes that encapsulate the insides of DataFrame objects and implement indexing and block-level manipulations would be a good place to start. I think you could do this without too much disruption.

As part of this internal retooling we might give consideration to alternative data structures for representing data internal to pandas objects. Now in 2015/2016, continuing to be hamstrung by NumPy's limitations feels somewhat anachronistic. User code is riddled with workarounds for data type fidelity issues and the like. Like, really, why not add a bitndarray (similar to ilanschnell/bitarray) for storing nullness for problematic types and hide this from the user? =)

Since we are now a NumFOCUS-sponsored project, I feel like we might consider establishing some formal governance over pandas and publishing roadmap documents describing plans for the project and meeting notes from committers. There's no real "committer culture" for NumFOCUS projects like there is with the Apache Software Foundation, but we might try leading by example!

Also, I believe pandas as a project has reached a level of importance where we ought to consider planning and executing larger-scale undertakings such as this, to safeguard the future.

As for myself, well, I have my hands full in Big Data-land. I wish I could be helping more with pandas, but there are quite a few fundamental issues (like data interoperability, nested data handling, and file format support - e.g. Parquet, see http://wesmckinney.com/blog/the-problem-with-the-data-science-language-wars/) preventing Python from being more useful in industry analytics applications.

Aside: one of the bigger mistakes I made with pandas's API design was making it acceptable to call class constructors - like pandas.DataFrame - directly (versus factory functions). Sorry about that! If we could convince everyone to start writing pandas.data_frame or dataframe instead of using the class reference it would help a lot with code cleanup. It's hard to plan for these things - NumPy interoperability seemed a lot more important in 2008 than it does now, so I forgive myself.

cheers and best wishes for 2016,
Wes

From jeffreback at gmail.com Fri Dec 25 17:14:35 2015
From: jeffreback at gmail.com (Jeff Reback)
Date: Fri, 25 Dec 2015 17:14:35 -0500
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
Message-ID:

Here are some of my thoughts about the pandas Roadmap / status and some responses to Wes's thoughts.

In the last few (and upcoming) major releases we have made the following changes:

- dtype enhancements (Categorical, Timedelta, Datetime w/tz) & making these first class objects
- code refactoring to remove subclassing of ndarrays for Series & Index
- carving out / deprecating non-core parts of pandas
  - datareader
  - SparsePanel, WidePanel & other aliases (TimeSeries)
  - rpy, rplot, irow et al.
  - google-analytics
- API changes to make things more consistent
  - pd.rolling/expanding * -> .rolling/expanding (this is in master now)
  - .resample becoming a fully deferred object, like groupby.
  - multi-index slicing along any level (obviates need for .xs) and allows assignment (see the sketch after this list)
  - .loc/.iloc - for the most part obviates use of .ix
  - .pipe & .assign
  - plotting accessors
  - fixing of the sorting API
- many performance enhancements, both micro & macro (e.g. release GIL)
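As a minimal sketch of the multi-index slicing item above (toy data; pd.IndexSlice is the existing indexer helper):

    import pandas as pd
    import numpy as np

    idx = pd.MultiIndex.from_product([['a', 'b'], [1, 2, 3]],
                                     names=['key', 'num'])
    df = pd.DataFrame({'x': np.arange(6.)}, index=idx)

    df.loc[pd.IndexSlice[:, [1, 3]], :]      # slice the inner level; no .xs needed
    df.loc[pd.IndexSlice[:, 2], 'x'] = 0.0   # assignment through the same indexer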
Some on-deck enhancements are (meaning these are basically ready to go in):

- IntervalIndex (and eventually make PeriodIndex just a sub-class of this)
- RangeIndex

So: lots of changes, though nothing really earth-shaking, just more convenience, reducing magicness somewhat and providing flexibility.

Of course we are getting an increasing number of issues, mostly bug reports (and lots of dupes), some edge-case enhancements which can add to the existing APIs, and of course requests to expand the (already) large codebase to other use cases. Balancing this are a good many pull requests from many different users, some even deep into the internals.

Here are some things that I have talked about and could be considered for the roadmap. Disclaimer: I do work for Continuum, but these views are of course my own; furthermore I am obviously a bit more familiar with some of the 'sponsored' open-source libraries, but always open to new things.

- integration / automatic deferral to numba for JIT (this would be thru .apply)
- automatic deferral to dask from groupby where appropriate / maybe a .to_parallel (to simply return a dask.DataFrame object)
- incorporation of quantities / units (as part of the dtype)
- use of DyND to allow missing values for int dtypes
- make Period a first class dtype.
- provide some copy-on-write semantics to alleviate the chained-indexing issues which occasionally come up with the mis-use of the indexing API
- allow a 'policy' to automatically provide column blocks for dict-like input (e.g. each column would be a block); this would allow a pass-thru API where you could put in numpy arrays where you have views and have them preserved rather than copied automatically. Note that this would also allow what I call 'split', where a passed-in multi-dim numpy array could be split up into individual blocks (which actually gives a nice perf boost after the splitting costs).

In working towards some of these goals, I have come to the opinion that it would make sense to have a neutral API protocol layer that would allow us to swap out different engines as needed, for particular dtypes, or *maybe* out-of-core type computations. E.g. imagine that we replaced the in-memory block structure with a bcolz / memmap type; in theory this should be 'easy' and just work. I could also see us adopting *some* of the SFrame code to allow easier interop with this API layer.

In practice, I think a nice API layer would need to be created to make this clean / nice.

So this comes around to Wes's point about creating a C++ library for the internals (and possibly even some of the indexing routines). In an ideal world, of course this would be desirable. Getting there is a bit non-trivial I think, and IMHO might not be worth the effort. I don't really see big performance bottlenecks. We *already* defer much of the computation to libraries like numexpr & bottleneck (where appropriate). Adding numba / dask to the list would be helpful.

I think that almost all performance issues are the result of:

a) gross misuse of the pandas API. How much code have you seen that does df.apply(lambda x: x.sum())? (see the sketch below)
b) routines which operate column-by-column rather than block-by-block and are in python space (e.g. we have an issue right now about .quantile)
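To spell out (a) with a toy example (exact timings vary; the shape of the problem does not):

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(100000, 10))

    # one interpreter round-trip per column, plus a Series construction each time
    df.apply(lambda x: x.sum())

    # the vectorized block-level reduction: same result, dramatically faster
    df.sum()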
So I am glossing over a big goal of having a C++ library that represents the pandas internals. This would by definition have a C API, so that you *could* use pandas-like semantics in C/C++ and just have it work (and then pandas would be a thin wrapper around this library).

I am not averse to this, but I think it would be quite a big effort, and not a huge perf boost IMHO. Further, there are a number of API issues w.r.t. indexing which need to be clarified / worked out (e.g. should we simply deprecate []?) that are much easier to test / figure out in python space.

I also think that we have quite a large number of contributors. Moving to C++ might make the internals a bit more impenetrable than the current internals (though this would allow C++ people to contribute, so that might balance out).

We have a limited core of devs who right now are familiar with things. If someone happened to have a starting base for a C++ library, then I might change opinions here.

my 4c.

Jeff

On Thu, Dec 24, 2015 at 7:18 PM, Wes McKinney wrote:
> [...]

From wesmckinn at gmail.com Tue Dec 29 14:49:49 2015
From: wesmckinn at gmail.com (Wes McKinney)
Date: Tue, 29 Dec 2015 11:49:49 -0800
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: References: Message-ID:

I will write a more detailed response to some of these things after the new year, but, in particular, re: missing values, can you or someone tell me why creating an object that contains a NumPy array and a bitmap is not sufficient? If we can add a lightweight C/C++ class layer between NumPy function calls (e.g. arithmetic) and pandas function calls, then I see no reason why we cannot have

Int32Array->add

and

Float32Array->add

do the right thing (the former would be responsible for bitmasking to propagate NA values; the latter would defer to NumPy). If we can put all the internals of pandas objects inside a black box, we can add layers of virtual function indirection without a performance penalty (whereas adding more interpreter overhead with more abstraction layers does add up to a perf penalty).
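A toy NumPy sketch of those semantics (a boolean mask stands in for a packed bitmap; the real layer would live behind the C++ black box):

    import numpy as np

    class Int32Array(object):
        def __init__(self, values, valid):
            self.values = np.asarray(values, dtype=np.int32)
            self.valid = np.asarray(valid, dtype=bool)  # a real impl would pack bits

        def add(self, other):
            # arithmetic on the values; ANDing the validity bitmaps propagates NA
            return Int32Array(self.values + other.values,
                              self.valid & other.valid)

    a = Int32Array([1, 2, 3], [True, False, True])
    b = Int32Array([10, 20, 30], [True, True, True])
    c = a.add(b)  # c.valid is [True, False, True]; no upcast to float64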
I don't think this is too scary -- I would be willing to create a small POC C++ library to prototype something like what I'm talking about.

Since pandas has limited points of contact with NumPy, I don't think this would end up being too onerous.

For the record, I'm pretty allergic to "advanced C++"; I think it is a useful tool: if you pick a sane 20% subset of the C++11 spec and follow Google C++ style, it's not very inaccessible to intermediate developers. More or less "C plus OOP and easier object lifetime management (shared/unique_ptr, etc.)". As soon as you add a lot of template metaprogramming, C++ library development quickly becomes inaccessible except to the C++-Jedi.

Maybe let's start a Google document on "pandas roadmap" where we can break down the 1-2 year goals and some of these infrastructure issues and have our discussion there? (obviously publish this someplace once we're done)

- Wes

On Fri, Dec 25, 2015 at 2:14 PM, Jeff Reback wrote:
> [...]
From jeffreback at gmail.com Tue Dec 29 14:56:08 2015
From: jeffreback at gmail.com (Jeff Reback)
Date: Tue, 29 Dec 2015 14:56:08 -0500
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: References: Message-ID:

OK, certainly not averse to using bitfields; I agree that would solve the problem. In fact Stephan Hoyer and I briefly discussed this w.r.t. IntervalIndex, and it turns out it is just as easy to use a sentinel. In fact that was my original idea (for int NA); it is really similar to how we handle Datetime et al.
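For reference, the Datetime analogy: datetime64 columns mark nulls with the iNaT sentinel (the minimum int64) instead of carrying a mask. A sketch of the same trick for int NA:

    import numpy as np

    INT64_NA = np.iinfo(np.int64).min      # sentinel, analogous to pandas' iNaT

    values = np.array([1, INT64_NA, 3], dtype=np.int64)
    mask = values == INT64_NA              # the null mask is derived on demand

    values[~mask].sum()                    # null-aware reduction, no float upcast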
So I will create a google doc for discussion points.

I agree creating a minimalist C++ library is not too hard. But my original question stands: what are the use cases? I can enumerate some here:

- 1) performance (I am not convinced of this, but could be wrong)
- 2) a C API is always a good thing & other lang bindings

I suspect you are in the #2 camp?

On Tue, Dec 29, 2015 at 2:49 PM, Wes McKinney wrote:
> [...]
From jeffreback at gmail.com Tue Dec 29 14:59:33 2015
From: jeffreback at gmail.com (Jeff Reback)
Date: Tue, 29 Dec 2015 14:59:33 -0500
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: References: Message-ID:

Here's a link where we can discuss the roadmap:

https://docs.google.com/document/d/151ct8jcZWwh7XStptjbLsda6h2b3C0IuiH_hfZnUA58/edit?usp=sharing

On Tue, Dec 29, 2015 at 2:56 PM, Jeff Reback wrote:
> [...]
More or less "C plus OOP and easier object lifetime >> management (shared/unique_ptr, etc.)". As soon as you add a lot of >> template metaprogramming C++ library development quickly becomes >> inaccessible except to the C++-Jedi. >> >> Maybe let's start a Google document on "pandas roadmap" where we can >> break down the 1-2 year goals and some of these infrastructure issues >> and have our discussion there? (obviously publish this someplace once >> we're done) >> >> - Wes >> >> On Fri, Dec 25, 2015 at 2:14 PM, Jeff Reback >> wrote: >> > Here are some of my thoughts about pandas Roadmap / status and some >> > responses to Wes's thoughts. >> > >> > In the last few (and upcoming) major releases we have been made the >> > following changes: >> > >> > - dtype enhancements (Categorical, Timedelta, Datetime w/tz) & making >> these >> > first class objects >> > - code refactoring to remove subclassing of ndarrays for Series & Index >> > - carving out / deprecating non-core parts of pandas >> > - datareader >> > - SparsePanel, WidePanel & other aliases (TImeSeries) >> > - rpy, rplot, irow et al. >> > - google-analytics >> > - API changes to make things more consistent >> > - pd.rolling/expanding * -> .rolling/expanding (this is in master now) >> > - .resample becoming a full defered like groupby. >> > - multi-index slicing along any level (obviates need for .xs) and >> allows >> > assignment >> > - .loc/.iloc - for the most part obviates use of .ix >> > - .pipe & .assign >> > - plotting accessors >> > - fixing of the sorting API >> > - many performance enhancements both micro & macro (e.g. release GIL) >> > >> > Some on-deck enhancements are (meaning these are basically ready to go >> in): >> > - IntervalIndex (and eventually make PeriodIndex just a sub-class of >> this) >> > - RangeIndex >> > >> > so lots of changes, though nothing really earth shaking, just more >> > convenience, reducing magicness somewhat >> > and providing flexibility. >> > >> > Of course we are getting increasing issues, mostly bug reports (and >> lots of >> > dupes), some edge case enhancements >> > which can add to the existing API's and of course, requests to expand >> the >> > (already) large code to other usecases. >> > Balancing this are a good many pull-requests from many different users, >> some >> > even deep into the internals. >> > >> > Here are some things that I have talked about and could be considered >> for >> > the roadmap. Disclaimer: I do work for Continuum >> > but these views are of course my own; furthermore obviously I am a bit >> more >> > familiar with some of the 'sponsored' open-source >> > libraries, but always open to new things. >> > >> > - integration / automatic deferral to numba for JIT (this would be thru >> > .apply) >> > - automatic deferal to dask from groubpy where appropriate / maybe a >> > .to_parallel (to simply return a dask.DataFrame object) >> > - incorporation of quantities / units (as part of the dtype) >> > - use of DyND to allow missing values for int dtypes >> > - make Period a first class dtype. >> > - provide some copy-on-write semantics to alleviate the chained-indexing >> > issues which occasionaly come up with the mis-use of the indexing API >> > - allow a 'policy' to automatically provide column blocks for dict-like >> > input (e.g. each column would be a block), this would allow a pass-thru >> API >> > where you could >> > put in numpy arrays where you have views and have them preserved rather >> than >> > copied automatically. 
Note that this would also allow what I call >> 'split' >> > where a passed in >> > multi-dim numpy array could be split up to individual blocks (which >> actually >> > gives a nice perf boost after the splitting costs). >> > >> > In working towards some of these goals. I have come to the opinion that >> it >> > would make sense to have a neutral API protocol layer >> > that would allow us to swap out different engines as needed, for >> particular >> > dtypes, or *maybe* out-of-core type computations. E.g. >> > imagine that we replaced the in-memory block structure with a bclolz / >> memap >> > type; in theory this should be 'easy' and just work. >> > I could also see us adopting *some* of the SFrame code to allow easier >> > interop with this API layer. >> > >> > In practice, I think a nice API layer would need to be created to make >> this >> > clean / nice. >> > >> > So this comes around to Wes's point about creating a c++ library for the >> > internals (and possibly even some of the indexing routines). >> > In an ideal world, or course this would be desirable. Getting there is >> a bit >> > non-trivial I think, and IMHO might not be worth the effort. I don't >> > really see big performance bottlenecks. We *already* defer much of the >> > computation to libraries like numexpr & bottleneck (where appropriate). >> > Adding numba / dask to the list would be helpful. >> > >> > I think that almost all performance issues are the result of: >> > >> > a) gross misuse of the pandas API. How much code have you seen that does >> > df.apply(lambda x: x.sum()) >> > b) routines which operate column-by-column rather block-by-block and >> are in >> > python space (e.g. we have an issue right now about .quantile) >> > >> > So I am glossing over a big goal of having a c++ library that >> represents the >> > pandas internals. This would by definition have a c-API that so >> > you *could* use pandas like semantics in c/c++ and just have it work >> (and >> > then pandas would be a thin wrapper around this library). >> > >> > I am not averse to this, but I think would be quite a big effort, and >> not a >> > huge perf boost IMHO. Further there are a number of API issues w.r.t. >> > indexing >> > which need to be clarified / worked out (e.g. should we simply >> deprecate []) >> > that are much easier to test / figure out in python space. >> > >> > I also thing that we have quite a large number of contributors. Moving >> to >> > c++ might make the internals a bit more impenetrable that the current >> > internals. >> > (though this would allow c++ people to contribute, so that might balance >> > out). >> > >> > We have a limited core of devs whom right now are familar with things. >> If >> > someone happened to have a starting base for a c++ library, then I might >> > change >> > opinions here. >> > >> > >> > my 4c. >> > >> > Jeff >> > >> > >> > >> > >> > On Thu, Dec 24, 2015 at 7:18 PM, Wes McKinney >> wrote: >> >> >> >> Deep thoughts during the holidays. >> >> >> >> I might be out of line here, but the interpreter-heaviness of the >> >> inside of pandas objects is likely to be a long-term liability and >> >> source of performance problems and technical debt. >> >> >> >> Has anyone put any thought into planning and beginning to execute on a >> >> rewrite that moves as much as possible of the internals into native / >> >> compiled code? 
I'm talking about: >> >> >> >> - pandas/core/internals >> >> - indexing and assignment >> >> - much of pandas/core/common >> >> - categorical and custom dtypes >> >> - all indexing mechanisms >> >> >> >> I'm concerned we've already exposed too much internals to users, so >> >> this might lead to a lot of API breakage, but it might be for the >> >> Greater Good. As a first step, beginning a partial migration of >> >> internals into some C++ classes that encapsulate the insides of >> >> DataFrame objects and implement indexing and block-level manipulations >> >> would be a good place to start. I think you could do this wouldn't too >> >> much disruption. >> >> >> >> As part of this internal retooling we might give consideration to >> >> alternative data structures for representing data internal to pandas >> >> objects. Now in 2015/2016, continuing to be hamstrung by NumPy's >> >> limitations feels somewhat anachronistic. User code is riddled with >> >> workarounds for data type fidelity issues and the like. Like, really, >> >> why not add a bitndarray (similar to ilanschnell/bitarray) for storing >> >> nullness for problematic types and hide this from the user? =) >> >> >> >> Since we are now a NumFOCUS-sponsored project, I feel like we might >> >> consider establishing some formal governance over pandas and >> >> publishing meetings notes and roadmap documents describing plans for >> >> the project and meetings notes from committers. There's no real >> >> "committer culture" for NumFOCUS projects like there is with the >> >> Apache Software Foundation, but we might try leading by example! >> >> >> >> Also, I believe pandas as a project has reached a level of importance >> >> where we ought to consider planning and execution on larger scale >> >> undertakings such as this for safeguarding the future. >> >> >> >> As for myself, well, I have my hands full in Big Data-land. I wish I >> >> could be helping more with pandas, but there a quite a few fundamental >> >> issues (like data interoperability nested data handling and file >> >> format support ? e.g. Parquet, see >> >> >> >> >> http://wesmckinney.com/blog/the-problem-with-the-data-science-language-wars/ >> ) >> >> preventing Python from being more useful in industry analytics >> >> applications. >> >> >> >> Aside: one of the bigger mistakes I made with pandas's API design was >> >> making it acceptable to call class constructors ? like >> >> pandas.DataFrame ? directly (versus factory functions). Sorry about >> >> that! If we could convince everyone to start writing pandas.data_frame >> >> or dataframe instead of using the class reference it would help a lot >> >> with code cleanup. It's hard to plan for these things ? NumPy >> >> interoperability seemed a lot more important in 2008 than it does now, >> >> so I forgive myself. >> >> >> >> cheers and best wishes for 2016, >> >> Wes >> >> _______________________________________________ >> >> Pandas-dev mailing list >> >> Pandas-dev at python.org >> >> https://mail.python.org/mailman/listinfo/pandas-dev >> > >> > >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wesmckinn at gmail.com Tue Dec 29 15:07:26 2015 From: wesmckinn at gmail.com (Wes McKinney) Date: Tue, 29 Dec 2015 12:07:26 -0800 Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? 
In-Reply-To: References: Message-ID:

Yeah, basically creating a "libpandas" with a C API for Series and DataFrame objects (and maybe a roadmap for more interchangeable internals) is definitely what I'm talking about. We can probably move a lot of the Cython guts there, too. I think better microperformance will fall out of this naturally, but the big goal is a more maintainable and extensible core.

I'll try to find some time to hack together a CMake file that creates a libpandas suitable for static linking with a Cython extension and that links dynamically with NumPy's multiarray.so and libpythonXX. The library setup is honestly the most tedious part.

Aside: I'm working a lot on nested / Parquet-type data these days, and while this is not a "pandas problem", I want to make sure the tooling develops a reasonable C API so that interoperability between pandas and systems with different, non-NumPy-like data models will have minimal performance overhead.

- Wes

On Tue, Dec 29, 2015 at 11:59 AM, Jeff Reback wrote:
> [...]
From cpcloud at gmail.com  Tue Dec 29 15:14:06 2015
From: cpcloud at gmail.com (Phillip Cloud)
Date: Tue, 29 Dec 2015 20:14:06 +0000
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: References: Message-ID:

Maybe this is saying the same thing as Wes, but how far would something
like this get us?

// warning: things are probably not this simple

struct data_array_t {
  void *primitive;                 // scalar data
  data_array_t *nested;            // nested data
  boost::dynamic_bitset<> isnull;  // might have to create our own to avoid boost
  schema_t schema;                 // not sure exactly what this looks like
};

typedef std::map<std::string, data_array_t> data_frame_t; // probably not this simple

To answer Jeff's use-case question: I think that the use cases are 1)
freedom from numpy (mostly) and 2) no more block manager, which frees us
from the limitations of the block memory layout. In particular, the
ability to take advantage of memory-mapped IO would be a big win IMO.

On Tue, Dec 29, 2015 at 2:50 PM Wes McKinney wrote:
> I will write a more detailed response to some of these things after
> the new year, but, in particular, re: missing values, can you or
> someone tell me why creating an object that contains a NumPy array and
> a bitmap is not sufficient? If we can add a lightweight C/C++ class
> layer between NumPy function calls (e.g. arithmetic) and pandas
> function calls, then I see no reason why we cannot have
>
> Int32Array->add
>
> and
>
> Float32Array->add
>
> do the right thing (the former would be responsible for bitmasking to
> propagate NA values; the latter would defer to NumPy).
>
> [rest of quoted message snipped]
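To make the bitmap idea concrete, here is a minimal, self-contained
sketch in the spirit of Phillip's struct and Wes's Int32Array->add:
values add blindly while the validity bitmaps AND together, so an NA in
either input propagates. std::vector<bool> stands in for
boost::dynamic_bitset, and every name here is illustrative, not actual
pandas code.

#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Illustrative only -- a column is a raw values buffer plus a
// validity bitmap (true = present, false = NA).
struct Int32Column {
  std::vector<int32_t> values;
  std::vector<bool> valid;
};

// Null-aware add: values add blindly, validity bitmaps AND together,
// so an NA in either input yields NA in the output.
Int32Column add(const Int32Column& a, const Int32Column& b) {
  Int32Column out;
  out.values.resize(a.values.size());
  out.valid.resize(a.valid.size());
  for (size_t i = 0; i < a.values.size(); ++i) {
    out.values[i] = a.values[i] + b.values[i];
    out.valid[i] = a.valid[i] && b.valid[i];
  }
  return out;
}

int main() {
  Int32Column a{{1, 2, 3}, {true, false, true}};
  Int32Column b{{10, 20, 30}, {true, true, false}};
  Int32Column c = add(a, b);
  for (size_t i = 0; i < c.values.size(); ++i)
    std::cout << (c.valid[i] ? std::to_string(c.values[i]) : "NA") << "\n";
  // prints 11, NA, NA
  return 0;
}

The float case needs no bitmap at all, since NaN already propagates
through arithmetic, which is why Float32Array->add could simply defer
to NumPy.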
From wesmckinn at gmail.com  Tue Dec 29 16:02:50 2015
From: wesmckinn at gmail.com (Wes McKinney)
Date: Tue, 29 Dec 2015 13:02:50 -0800
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: References: Message-ID:

Basically the approach is

1) Base dtype type
2) Base array type with K >= 1 dimensions
3) Base scalar type
4) Base index type
5) "Wrapper" subclasses for all NumPy types fitting into categories
   #1, #2, #3, #4
6) Subclasses for pandas-specific types like category, datetimeTZ, etc.
7) NDFrame as cpcloud wrote is just a list of these

Indexes and axis labels / column names can get layered on top.

After we do all this we can look at adding nested types (arrays, maps,
structs) to better support JSON.

- Wes

On Tue, Dec 29, 2015 at 12:14 PM, Phillip Cloud wrote:
> Maybe this is saying the same thing as Wes, but how far would something
> like this get us?
>
> [rest of quoted message snipped]
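Read as a class hierarchy, those seven layers might hang together as in
the following sketch; every name is a hypothetical stand-in, not a real
or proposed pandas design.

#include <cstdint>
#include <memory>
#include <string>
#include <vector>

struct DataType { virtual ~DataType() = default; };   // 1) base dtype type
struct Int32Type final : DataType {};                 //    a NumPy-backed dtype
struct CategoryType final : DataType {};              // 6) pandas-specific dtype

struct Array {                                        // 2) base array type
  virtual ~Array() = default;
  virtual const DataType& type() const = 0;
  virtual int64_t length() const = 0;
};

// 5) a "wrapper" subclass that would hold a NumPy buffer; a plain
// vector stands in here.
struct Int32Array final : Array {
  std::vector<int32_t> data;
  Int32Type dtype;
  const DataType& type() const override { return dtype; }
  int64_t length() const override { return static_cast<int64_t>(data.size()); }
};

// 7) NDFrame is "just a list of these"; 4) the index and axis labels
// get layered on top (3, the base scalar type, is omitted for brevity).
struct NDFrame {
  std::vector<std::string> column_names;
  std::vector<std::shared_ptr<Array>> columns;
};

int main() {
  NDFrame df;
  df.column_names.push_back("a");
  df.columns.push_back(std::make_shared<Int32Array>());
  return 0;
}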
From wesmckinn at gmail.com  Tue Dec 29 16:12:52 2015
From: wesmckinn at gmail.com (Wes McKinney)
Date: Tue, 29 Dec 2015 13:12:52 -0800
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: References: Message-ID:

The other huge thing this will enable us to do is copy-on-write for
various kinds of views, which should cut down on some of the defensive
copying in the library and reduce memory usage.

On Tue, Dec 29, 2015 at 1:02 PM, Wes McKinney wrote:
> Basically the approach is
>
> [rest of quoted message snipped]
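A minimal single-threaded sketch of that copy-on-write scheme, assuming
a refcounted buffer (all names illustrative): a view only bumps a
reference count, and a write makes the one defensive copy exactly when
the buffer is still shared.

#include <cstdint>
#include <memory>
#include <vector>

// Illustrative copy-on-write column: views share one buffer; a write
// copies only if someone else still holds the buffer.
class CowColumn {
 public:
  explicit CowColumn(std::vector<int32_t> v)
      : buf_(std::make_shared<std::vector<int32_t>>(std::move(v))) {}

  // A view is just another handle on the same buffer: no data copied.
  CowColumn view() const { return *this; }

  int32_t get(size_t i) const { return (*buf_)[i]; }

  void set(size_t i, int32_t x) {
    if (buf_.use_count() > 1)  // shared with a view -> copy before writing
      buf_ = std::make_shared<std::vector<int32_t>>(*buf_);
    (*buf_)[i] = x;            // now uniquely owned; safe to mutate
  }

 private:
  std::shared_ptr<std::vector<int32_t>> buf_;
};

int main() {
  CowColumn a(std::vector<int32_t>{1, 2, 3});
  CowColumn b = a.view();  // cheap: refcount bump only
  b.set(0, 99);            // triggers the single defensive copy
  // a.get(0) == 1 still; b.get(0) == 99
  return 0;
}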
From jeffreback at gmail.com  Tue Dec 29 16:20:05 2015
From: jeffreback at gmail.com (Jeff Reback)
Date: Tue, 29 Dec 2015 16:20:05 -0500
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: References: Message-ID:

Wes, your last point is noted as well. I *think* we can actually do this
now (well, there is a PR out there).

On Tue, Dec 29, 2015 at 4:12 PM, Wes McKinney wrote:
> The other huge thing this will enable us to do is copy-on-write for
> various kinds of views, which should cut down on some of the defensive
> copying in the library and reduce memory usage.
>
> [rest of quoted message snipped]
From wesmckinn at gmail.com  Tue Dec 29 18:18:04 2015
From: wesmckinn at gmail.com (Wes McKinney)
Date: Tue, 29 Dec 2015 15:18:04 -0800
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: References: Message-ID:

Can you link to the PR you're talking about?
I will see about spending a few hours setting up a libpandas.so as a C++
shared library where we can run some experiments and validate whether it
can solve the integer-NA problem and be a place to put new data types
(categorical and friends). I'm +1 on targeting

Would it also be worth making a wish list of APIs we might consider
breaking in a pandas 1.0 release that also features this new "native
core"? Might as well right some wrongs while we're doing some invasive
work on the internals; some breakage might be unavoidable. We can always
maintain a pandas legacy 0.x.x maintenance branch (providing a conda
binary build) for legacy users where showstopper bugs can get fixed.

On Tue, Dec 29, 2015 at 1:20 PM, Jeff Reback wrote:
> Wes your last is noted as well. I *think* we can actually do this now
> (well there is a PR out there).
>
> On Tue, Dec 29, 2015 at 4:12 PM, Wes McKinney wrote:
>>
>> The other huge thing this will enable is copy-on-write for various
>> kinds of views, which should cut down on some of the defensive copying
>> in the library and reduce memory usage.
>>
>> On Tue, Dec 29, 2015 at 1:02 PM, Wes McKinney wrote:
>> > Basically the approach is
>> >
>> > 1) Base dtype type
>> > 2) Base array type with K >= 1 dimensions
>> > 3) Base scalar type
>> > 4) Base index type
>> > 5) "Wrapper" subclasses for all NumPy types fitting into categories
>> > #1, #2, #3, #4
>> > 6) Subclasses for pandas-specific types like category, datetimeTZ, etc.
>> > 7) NDFrame as cpcloud wrote is just a list of these
>> >
>> > Indexes and axis labels / column names can get layered on top.
>> >
>> > After we do all this we can look at adding nested types (arrays,
>> > maps, structs) to better support JSON.
>> >
>> > - Wes
>> >
>> > On Tue, Dec 29, 2015 at 12:14 PM, Phillip Cloud wrote:
>> >> Maybe this is saying the same thing as Wes, but how far would
>> >> something like this get us?
>> >>
>> >> // warning: things are probably not this simple
>> >>
>> >> struct data_array_t {
>> >>   void *primitive;                // scalar data
>> >>   data_array_t *nested;           // nested data
>> >>   boost::dynamic_bitset<> isnull; // might have to create our own to avoid boost
>> >>   schema_t schema;                // not sure exactly what this looks like
>> >> };
>> >>
>> >> typedef std::map<std::string, data_array_t> data_frame_t; // probably not this simple
>> >>
>> >> To answer Jeff's use-case question: I think that the use cases are
>> >> 1) freedom from numpy (mostly) 2) no more block manager, which frees
>> >> us from the limitations of the block memory layout. In particular,
>> >> the ability to take advantage of memory-mapped IO would be a big win
>> >> IMO.
>> >>
>> >> On Tue, Dec 29, 2015 at 2:50 PM Wes McKinney wrote:
>> >>>
>> >>> I will write a more detailed response to some of these things after
>> >>> the new year, but, in particular, re: missing values, can you or
>> >>> someone tell me why creating an object that contains a NumPy array
>> >>> and a bitmap is not sufficient? If we can add a lightweight C/C++
>> >>> class layer between NumPy function calls (e.g. arithmetic) and
>> >>> pandas function calls, then I see no reason why we cannot have
>> >>>
>> >>> Int32Array->add
>> >>>
>> >>> and
>> >>>
>> >>> Float32Array->add
>> >>>
>> >>> do the right thing (the former would be responsible for bitmasking
>> >>> to propagate NA values; the latter would defer to NumPy).
>> >>> If we can put all the internals of pandas objects inside a black
>> >>> box, we can add layers of virtual function indirection without a
>> >>> performance penalty (whereas in interpreted Python, each added
>> >>> abstraction layer does add up to a perf penalty).
>> >>>
>> >>> I don't think this is too scary -- I would be willing to create a
>> >>> small POC C++ library to prototype something like what I'm talking
>> >>> about.
>> >>>
>> >>> Since pandas has limited points of contact with NumPy I don't think
>> >>> this would end up being too onerous.
>> >>>
>> >>> For the record, I'm pretty allergic to "advanced C++"; I think it is
>> >>> a useful tool, and if you pick a sane 20% subset of the C++11 spec
>> >>> and follow Google C++ style it's quite accessible to intermediate
>> >>> developers. More or less "C plus OOP and easier object lifetime
>> >>> management (shared/unique_ptr, etc.)". As soon as you add a lot of
>> >>> template metaprogramming, C++ library development quickly becomes
>> >>> inaccessible except to the C++-Jedi.
>> >>>
>> >>> Maybe let's start a Google document on "pandas roadmap" where we can
>> >>> break down the 1-2 year goals and some of these infrastructure
>> >>> issues and have our discussion there? (obviously publish this
>> >>> someplace once we're done)
>> >>>
>> >>> - Wes
>> >>>
>> >>> On Fri, Dec 25, 2015 at 2:14 PM, Jeff Reback wrote:
>> >>> > Here are some of my thoughts about pandas Roadmap / status and
>> >>> > some responses to Wes's thoughts.
>> >>> >
>> >>> > In the last few (and upcoming) major releases we have made the
>> >>> > following changes:
>> >>> >
>> >>> > - dtype enhancements (Categorical, Timedelta, Datetime w/tz) &
>> >>> >   making these first class objects
>> >>> > - code refactoring to remove subclassing of ndarrays for Series &
>> >>> >   Index
>> >>> > - carving out / deprecating non-core parts of pandas
>> >>> >   - datareader
>> >>> >   - SparsePanel, WidePanel & other aliases (TimeSeries)
>> >>> >   - rpy, rplot, irow et al.
>> >>> >   - google-analytics
>> >>> > - API changes to make things more consistent
>> >>> >   - pd.rolling/expanding * -> .rolling/expanding (this is in
>> >>> >     master now)
>> >>> >   - .resample becoming a fully deferred object, like groupby
>> >>> >   - multi-index slicing along any level (obviates need for .xs)
>> >>> >     and allows assignment
>> >>> >   - .loc/.iloc - for the most part obviates use of .ix
>> >>> >   - .pipe & .assign
>> >>> >   - plotting accessors
>> >>> >   - fixing of the sorting API
>> >>> > - many performance enhancements both micro & macro (e.g. release
>> >>> >   GIL)
>> >>> >
>> >>> > Some on-deck enhancements are (meaning these are basically ready
>> >>> > to go in):
>> >>> > - IntervalIndex (and eventually make PeriodIndex just a sub-class
>> >>> >   of this)
>> >>> > - RangeIndex
>> >>> >
>> >>> > so lots of changes, though nothing really earth shaking, just more
>> >>> > convenience, reducing magicness somewhat and providing flexibility.
>> >>> >
>> >>> > Of course we are getting increasing issues, mostly bug reports
>> >>> > (and lots of dupes), some edge case enhancements which can add to
>> >>> > the existing API's and of course, requests to expand the (already)
>> >>> > large code to other usecases. Balancing this are a good many
>> >>> > pull-requests from many different users, some even deep into the
>> >>> > internals.
>> >>> > Here are some things that I have talked about and could be
>> >>> > considered for the roadmap. Disclaimer: I do work for Continuum
>> >>> > but these views are of course my own; furthermore obviously I am
>> >>> > a bit more familiar with some of the 'sponsored' open-source
>> >>> > libraries, but always open to new things.
>> >>> >
>> >>> > - integration / automatic deferral to numba for JIT (this would
>> >>> >   be thru .apply)
>> >>> > - automatic deferral to dask from groupby where appropriate /
>> >>> >   maybe a .to_parallel (to simply return a dask.DataFrame object)
>> >>> > - incorporation of quantities / units (as part of the dtype)
>> >>> > - use of DyND to allow missing values for int dtypes
>> >>> > - make Period a first class dtype.
>> >>> > - provide some copy-on-write semantics to alleviate the
>> >>> >   chained-indexing issues which occasionally come up with the
>> >>> >   mis-use of the indexing API
>> >>> > - allow a 'policy' to automatically provide column blocks for
>> >>> >   dict-like input (e.g. each column would be a block); this would
>> >>> >   allow a pass-thru API where you could put in numpy arrays where
>> >>> >   you have views and have them preserved rather than copied
>> >>> >   automatically. Note that this would also allow what I call
>> >>> >   'split', where a passed-in multi-dim numpy array could be split
>> >>> >   up into individual blocks (which actually gives a nice perf
>> >>> >   boost after the splitting costs).
>> >>> >
>> >>> > In working towards some of these goals, I have come to the
>> >>> > opinion that it would make sense to have a neutral API protocol
>> >>> > layer that would allow us to swap out different engines as
>> >>> > needed, for particular dtypes, or *maybe* out-of-core type
>> >>> > computations. E.g. imagine that we replaced the in-memory block
>> >>> > structure with a bcolz / memmap type; in theory this should be
>> >>> > 'easy' and just work. I could also see us adopting *some* of the
>> >>> > SFrame code to allow easier interop with this API layer.
>> >>> >
>> >>> > In practice, I think a nice API layer would need to be created to
>> >>> > make this clean / nice.
>> >>> >
>> >>> > So this comes around to Wes's point about creating a c++ library
>> >>> > for the internals (and possibly even some of the indexing
>> >>> > routines). In an ideal world, of course this would be desirable.
>> >>> > Getting there is a bit non-trivial I think, and IMHO might not be
>> >>> > worth the effort. I don't really see big performance bottlenecks.
>> >>> > We *already* defer much of the computation to libraries like
>> >>> > numexpr & bottleneck (where appropriate). Adding numba / dask to
>> >>> > the list would be helpful.
>> >>> >
>> >>> > I think that almost all performance issues are the result of:
>> >>> >
>> >>> > a) gross misuse of the pandas API. How much code have you seen
>> >>> >    that does df.apply(lambda x: x.sum())
>> >>> > b) routines which operate column-by-column rather than
>> >>> >    block-by-block and are in python space (e.g. we have an issue
>> >>> >    right now about .quantile)
>> >>> >
>> >>> > So I am glossing over a big goal of having a c++ library that
>> >>> > represents the pandas internals.
>> >>> > This would by definition have a C API, so you *could* use
>> >>> > pandas-like semantics in c/c++ and just have it work (and then
>> >>> > pandas would be a thin wrapper around this library).
>> >>> >
>> >>> > I am not averse to this, but I think it would be quite a big
>> >>> > effort, and not a huge perf boost IMHO. Further there are a
>> >>> > number of API issues w.r.t. indexing which need to be clarified /
>> >>> > worked out (e.g. should we simply deprecate []) that are much
>> >>> > easier to test / figure out in python space.
>> >>> >
>> >>> > I also think that we have quite a large number of contributors.
>> >>> > Moving to c++ might make the internals a bit more impenetrable
>> >>> > than the current internals (though this would allow c++ people to
>> >>> > contribute, so that might balance out).
>> >>> >
>> >>> > We have a limited core of devs who right now are familiar with
>> >>> > things. If someone happened to have a starting base for a c++
>> >>> > library, then I might change opinions here.
>> >>> >
>> >>> > my 4c.
>> >>> >
>> >>> > Jeff
>> >>> >
>> >>> > On Thu, Dec 24, 2015 at 7:18 PM, Wes McKinney wrote:
>> >>> >> [...]
>> >>> _______________________________________________
>> >>> Pandas-dev mailing list
>> >>> Pandas-dev at python.org
>> >>> https://mail.python.org/mailman/listinfo/pandas-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
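For concreteness, a minimal sketch of the NA-bitmask idea discussed above.
Everything here is a hypothetical illustration, not actual pandas or
libpandas API: it assumes equal-length inputs and uses a plain bool vector
where a real implementation would use a packed bitmap.

#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch only -- not real pandas API.
struct Int32Array {
    std::vector<int32_t> values;
    std::vector<bool> valid;  // true where the value is not NA

    // NA-aware add: an output slot is valid only if both input slots are
    // valid; valid slots use ordinary integer addition (the Float32Array
    // analogue would instead defer to NumPy, per the message above).
    Int32Array add(const Int32Array& other) const {
        Int32Array out;
        out.values.resize(values.size());
        out.valid.resize(values.size());
        for (std::size_t i = 0; i < values.size(); ++i) {
            out.valid[i] = valid[i] && other.valid[i];
            out.values[i] = out.valid[i] ? values[i] + other.values[i] : 0;
        }
        return out;
    }
};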
From wesmckinn at gmail.com Tue Dec 29 18:25:54 2015
From: wesmckinn at gmail.com (Wes McKinney)
Date: Tue, 29 Dec 2015 15:25:54 -0800
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: References: Message-ID:

Hit send by accident. I meant to say targeting pandas/core/internals.py
with the initial explorations.

On Tuesday, December 29, 2015, Wes McKinney wrote:
> Can you link to the PR you're talking about?
>
> [...]
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
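To make the 1)-7) layering Wes outlines upthread concrete, a minimal
sketch of the base types and wrappers it implies. All names are
hypothetical, illustration only:

#include <cstdint>
#include <memory>
#include <string>
#include <vector>

struct DataType {                      // 1) base dtype type
    virtual ~DataType() = default;
    virtual std::string name() const = 0;
};

struct Int64Type : DataType {          // 5) wrapper for a NumPy dtype
    std::string name() const override { return "int64"; }
};

struct CategoryType : DataType {       // 6) pandas-specific dtype
    std::string name() const override { return "category"; }
};

struct Array {                         // 2) base array type (K >= 1 dims)
    virtual ~Array() = default;
    virtual std::shared_ptr<DataType> type() const = 0;
    virtual std::int64_t length() const = 0;
};

struct Scalar { virtual ~Scalar() = default; };  // 3) base scalar type
struct Index  { virtual ~Index()  = default; };  // 4) base index type

// 7) an NDFrame is then "just a list of these", with axis labels /
// column names layered on top.
struct NDFrame {
    std::vector<std::string> column_names;
    std::vector<std::shared_ptr<Array>> columns;
};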
From jeffreback at gmail.com Tue Dec 29 18:25:31 2015
From: jeffreback at gmail.com (Jeff Reback)
Date: Tue, 29 Dec 2015 18:25:31 -0500
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: References: Message-ID:

https://github.com/pydata/pandas/pull/11500. I annotated in the shared
google doc as well. There is a section on some pandas 1.0 things to do.

On Tue, Dec 29, 2015 at 6:18 PM, Wes McKinney wrote:
> Can you link to the PR you're talking about?
>
> [...]
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
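A minimal sketch of the copy-on-write view semantics mentioned upthread
(Wes's point about cutting down on defensive copying). Hypothetical names,
and it assumes a single-threaded setting:

#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

class CowArray {
  public:
    explicit CowArray(std::vector<int64_t> data)
        : buffer_(std::make_shared<std::vector<int64_t>>(std::move(data))) {}

    // Taking a view is O(1): the new view just shares the buffer.
    CowArray view() const { return *this; }

    int64_t get(std::size_t i) const { return (*buffer_)[i]; }

    // Writing copies the buffer first if any other view still shares it,
    // so views never observe each other's mutations.
    void set(std::size_t i, int64_t v) {
        if (buffer_.use_count() > 1)
            buffer_ = std::make_shared<std::vector<int64_t>>(*buffer_);
        (*buffer_)[i] = v;
    }

  private:
    std::shared_ptr<std::vector<int64_t>> buffer_;
};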
From izaid at continuum.io Tue Dec 29 18:31:59 2015
From: izaid at continuum.io (Irwin Zaid)
Date: Tue, 29 Dec 2015 17:31:59 -0600
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: References: Message-ID:

Hi Wes (and others),

I've been following this conversation with interest. I do think it would
be worth exploring DyND, rather than setting up yet another rewrite of
NumPy functionality. Especially because DyND is already an optional
dependency of Pandas.

For things like Integer NA and new dtypes, DyND is there and ready to do
this.

Irwin

On Tue, Dec 29, 2015 at 5:18 PM, Wes McKinney wrote:
> Can you link to the PR you're talking about?
>
> [...]
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From wesmckinn at gmail.com Tue Dec 29 19:01:33 2015
From: wesmckinn at gmail.com (Wes McKinney)
Date: Tue, 29 Dec 2015 16:01:33 -0800
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: References: Message-ID:

I'm not suggesting a rewrite of NumPy functionality but rather pandas
functionality that is currently written in a mishmash of Cython and
Python. Happy to experiment with changing the internal compute
infrastructure and data representation to DyND after this first stage of
cleanup is done. Even if we use DyND a pretty extensive pandas wrapper
layer will be necessary.

On Tuesday, December 29, 2015, Irwin Zaid wrote:
> Hi Wes (and others),
>
> [...]
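A rough sketch of the "neutral API protocol layer" Jeff describes and the
wrapper layer Wes mentions here: pandas-level code programs against a
small abstract interface, and a NumPy-, DyND-, or bcolz-backed engine
plugs in behind it. All names are hypothetical illustrations:

#include <cstdint>
#include <string>

struct ArrayBackend {
    virtual ~ArrayBackend() = default;
    virtual std::string engine_name() const = 0;
    virtual std::int64_t length() const = 0;
    virtual bool is_null(std::int64_t i) const = 0;
};

// One toy engine; a DyND-backed engine would implement the same interface.
struct InMemoryBackend : ArrayBackend {
    explicit InMemoryBackend(std::int64_t n) : n_(n) {}
    std::string engine_name() const override { return "in-memory"; }
    std::int64_t length() const override { return n_; }
    bool is_null(std::int64_t i) const override { (void)i; return false; }
  private:
    std::int64_t n_;
};

// Wrapper-layer code like this never sees which engine is underneath, so
// engines can be swapped per dtype or for out-of-core use.
std::int64_t count_valid(const ArrayBackend& a) {
    std::int64_t n = 0;
    for (std::int64_t i = 0; i < a.length(); ++i)
        if (!a.is_null(i)) ++n;
    return n;
}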
>> > >> > On Tue, Dec 29, 2015 at 4:12 PM, Wes McKinney >> wrote: >> >> >> >> The other huge thing this will enable is to do is copy-on-write for >> >> various kinds of views, which should cut down on some of the defensive >> >> copying in the library and reduce memory usage. >> >> >> >> On Tue, Dec 29, 2015 at 1:02 PM, Wes McKinney >> wrote: >> >> > Basically the approach is >> >> > >> >> > 1) Base dtype type >> >> > 2) Base array type with K >= 1 dimensions >> >> > 3) Base scalar type >> >> > 4) Base index type >> >> > 5) "Wrapper" subclasses for all NumPy types fitting into categories >> >> > #1, #2, #3, #4 >> >> > 6) Subclasses for pandas-specific types like category, datetimeTZ, >> etc. >> >> > 7) NDFrame as cpcloud wrote is just a list of these >> >> > >> >> > Indexes and axis labels / column names can get layered on top. >> >> > >> >> > After we do all this we can look at adding nested types (arrays, >> maps, >> >> > structs) to better support JSON. >> >> > >> >> > - Wes >> >> > >> >> > On Tue, Dec 29, 2015 at 12:14 PM, Phillip Cloud >> >> > wrote: >> >> >> Maybe this is saying the same thing as Wes, but how far would >> something >> >> >> like >> >> >> this get us? >> >> >> >> >> >> // warning: things are probably not this simple >> >> >> >> >> >> struct data_array_t { >> >> >> void *primitive; // scalar data >> >> >> data_array_t *nested; // nested data >> >> >> boost::dynamic_bitset isnull; // might have to create our own >> to >> >> >> avoid >> >> >> boost >> >> >> schema_t schema; // not sure exactly what this looks like >> >> >> }; >> >> >> >> >> >> typedef std::map data_frame_t; // probably >> not >> >> >> this >> >> >> simple >> >> >> >> >> >> To answer Jeff?s use-case question: I think that the use cases are >> 1) >> >> >> freedom from numpy (mostly) 2) no more block manager which frees us >> >> >> from the >> >> >> limitations of the block memory layout. In particular, the ability >> to >> >> >> take >> >> >> advantage of memory mapped IO would be a big win IMO. >> >> >> >> >> >> >> >> >> On Tue, Dec 29, 2015 at 2:50 PM Wes McKinney >> >> >> wrote: >> >> >>> >> >> >>> I will write a more detailed response to some of these things after >> >> >>> the new year, but, in particular, re: missing values, can you or >> >> >>> someone tell me why creating an object that contains a NumPy array >> and >> >> >>> a bitmap is not sufficient? If we we can add a lightweight C/C++ >> class >> >> >>> layer between NumPy function calls (e.g. arithmetic) and pandas >> >> >>> function calls, then I see no reason why we cannot have >> >> >>> >> >> >>> Int32Array->add >> >> >>> >> >> >>> and >> >> >>> >> >> >>> Float32Array->add >> >> >>> >> >> >>> do the right thing (the former would be responsible for bitmasking >> to >> >> >>> propagate NA values; the latter would defer to NumPy). If we can >> put >> >> >>> all the internals of pandas objects inside a black box, we can add >> >> >>> layers of virtual function indirection without a performance >> penalty >> >> >>> (e.g. adding more interpreter overhead with more abstraction layers >> >> >>> does add up to a perf penalty). >> >> >>> >> >> >>> I don't think this is too scary -- I would be willing to create a >> >> >>> small POC C++ library to prototype something like what I'm talking >> >> >>> about. >> >> >>> >> >> >>> Since pandas has limited points of contact with NumPy I don't think >> >> >>> this would end up being too onerous. 
>> >> >>>
>> >> >>> For the record, I'm pretty allergic to "advanced C++"; I think
>> >> >>> it is a useful tool: if you pick a sane 20% subset of the C++11
>> >> >>> spec and follow Google C++ style, it's not inaccessible to
>> >> >>> intermediate developers. More or less "C plus OOP and easier
>> >> >>> object lifetime management (shared/unique_ptr, etc.)". As soon
>> >> >>> as you add a lot of template metaprogramming, C++ library
>> >> >>> development quickly becomes inaccessible except to the
>> >> >>> C++ Jedi.
>> >> >>>
>> >> >>> Maybe let's start a Google document on "pandas roadmap" where
>> >> >>> we can break down the 1-2 year goals and some of these
>> >> >>> infrastructure issues and have our discussion there? (obviously
>> >> >>> we'd publish this someplace once we're done)
>> >> >>>
>> >> >>> - Wes
>> >> >>>
>> >> >>> On Fri, Dec 25, 2015 at 2:14 PM, Jeff Reback wrote:
>> >> >>> > Here are some of my thoughts about pandas Roadmap / status
>> >> >>> > and some responses to Wes's thoughts.
>> >> >>> > [...]
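To make the bitmask idea quoted above concrete, here is a minimal C++
sketch of NA propagation through a wrapped add. The Int32Array name
comes from the discussion, but the field layout, the vector<bool>
validity mask, and the add signature are illustrative assumptions
only, not actual pandas or libpandas code (a real implementation would
wrap a NumPy buffer and use a packed bitmap):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Illustrative only: a value buffer plus a per-element validity
    // mask, where valid[i] == false means "this entry is NA".
    struct Int32Array {
      std::vector<int32_t> values;
      std::vector<bool> valid;

      // Element-wise add, assuming equal lengths. The result is NA
      // wherever either input is NA; values at invalid slots are
      // unspecified garbage, hidden behind the mask.
      Int32Array add(const Int32Array& other) const {
        Int32Array out;
        out.values.resize(values.size());
        out.valid.resize(values.size());
        for (std::size_t i = 0; i < values.size(); ++i) {
          out.valid[i] = valid[i] && other.valid[i];
          out.values[i] = values[i] + other.values[i];
        }
        return out;
      }
    };

The floating-point counterpart (Float32Array->add in the quoted text)
could skip the mask entirely and defer to the NumPy kernel, since NaN
already propagates through floating-point arithmetic.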
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From izaid at continuum.io  Tue Dec 29 19:17:25 2015
From: izaid at continuum.io (Irwin Zaid)
Date: Tue, 29 Dec 2015 18:17:25 -0600
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: 
References: 
Message-ID: 

Yeah, that seems reasonable, and I totally agree a pandas wrapper layer
would be necessary.

I'll keep an eye on this and I'd like to help if I can.
Irwin

On Tue, Dec 29, 2015 at 6:01 PM, Wes McKinney wrote:
> I'm not suggesting a rewrite of NumPy functionality but rather of the
> pandas functionality that is currently written in a mishmash of Cython
> and Python.
> [...]
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From wesmckinn at gmail.com  Wed Dec 30 21:04:01 2015
From: wesmckinn at gmail.com (Wes McKinney)
Date: Wed, 30 Dec 2015 18:04:01 -0800
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: 
References: 
Message-ID: 

I cobbled together an ugly start of a c++->cython->pandas toolchain
here:

https://github.com/wesm/pandas/tree/libpandas-native-core

I used a mix of Kudu, Impala, and dynd-python cmake sources, so it's a
bit messy at the moment, but it should be sufficient to run some real
experiments with a little more work. I reckon it's about a 6-month
project to tear out the insides of Series and DataFrame and replace
them with a new "native core", but we should be able to get enough info
to see whether it's a viable plan within a month or so.

The end goal is to create "private" extension types in Cython that can
be the new base classes for Series and NDFrame; these will hold a
reference to a C++ object that contains wrapped NumPy arrays and other
metadata (like pandas-only dtypes).
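A rough sketch of the shape such a C++ core object could take follows;
SeriesCore and Array are hypothetical names for illustration, not the
actual libpandas design:

    #include <memory>
    #include <utility>

    // Stand-in for a wrapped NumPy array plus dtype / null-mask
    // metadata; purely illustrative.
    struct Array {
      void* buffer = nullptr;  // borrowed pointer into the ndarray data
      // ... dtype tag, length, optional validity bitmap, etc.
    };

    // The C++ object a "private" Cython extension type would hold a
    // reference to; Series/NDFrame subclasses in Python would delegate
    // indexing and block-level manipulation to it.
    class SeriesCore {
     public:
      SeriesCore(std::shared_ptr<Array> data,
                 std::shared_ptr<Array> index)
          : data_(std::move(data)), index_(std::move(index)) {}

      const Array& data() const { return *data_; }
      const Array& index() const { return *index_; }

     private:
      std::shared_ptr<Array> data_;   // column values
      std::shared_ptr<Array> index_;  // axis labels / row index
    };

One design note: reference-counted ownership like this is one way the
copy-on-write semantics mentioned earlier in the thread could be
implemented cheaply, copying an Array only when it is shared and about
to be mutated.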
It might be too hard to try to replace a single usage of the block
manager as a first experiment, so I'll try to create a minimal
"SeriesLite" that supports 3 dtypes:

1) float64 with NaNs
2) int64 with a bitmask for NAs
3) category type for one of these

I just want to get a feel for the extensibility, and to offer an NA
singleton Python object (a la None) for getting and setting NAs across
these 3 dtypes (see the sketch following this message).

If we end up going down this route, is there any way to place a
moratorium on invasive work on pandas internals (outside bug fixes)?

Pedantic aside: I'd rather avoid shipping third-party C/C++ libraries
like googletest and friends in pandas if we can. Cloudera folks have
been working on a portable C++ library toolchain for Impala and other
projects at https://github.com/cloudera/native-toolchain, but it is
only being tested on Linux and OS X. Most Google libraries should build
out of the box on MSVC, but it'll be something to keep an eye on.

BTW, thanks to the libdynd developers for pioneering the C++ lib <->
Python-C++ lib <-> Cython toolchain; being able to build Cython
extensions directly from cmake is a godsend.

HNY all
Wes

On Tue, Dec 29, 2015 at 4:17 PM, Irwin Zaid wrote:
> Yeah, that seems reasonable and I totally agree a pandas wrapper layer
> would be necessary.
>
> I'll keep an eye on this and I'd like to help if I can.
>
> Irwin
> [...]
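Returning to the "SeriesLite" idea above, here is a minimal sketch of
how one NA convention could span the two numeric dtypes. All names are
illustrative assumptions, and std::optional (C++17) merely stands in
for the proposed Python-level NA singleton:

    #include <cmath>
    #include <cstddef>
    #include <cstdint>
    #include <optional>
    #include <utility>
    #include <vector>

    // float64 flavor: NaN itself encodes NA, as in pandas today.
    class Float64Column {
     public:
      explicit Float64Column(std::vector<double> v)
          : values_(std::move(v)) {}
      std::optional<double> get(std::size_t i) const {
        if (std::isnan(values_[i])) return std::nullopt;  // NA
        return values_[i];
      }
      void set_na(std::size_t i) { values_[i] = std::nan(""); }
     private:
      std::vector<double> values_;
    };

    // int64 flavor: a separate validity mask encodes NA, since there
    // is no NaN for integers.
    class Int64Column {
     public:
      explicit Int64Column(std::vector<int64_t> v)
          : values_(std::move(v)), valid_(values_.size(), true) {}
      std::optional<int64_t> get(std::size_t i) const {
        if (!valid_[i]) return std::nullopt;  // NA
        return values_[i];
      }
      void set_na(std::size_t i) { valid_[i] = false; }
     private:
      std::vector<int64_t> values_;
      std::vector<bool> valid_;
    };

The point of the sketch is the uniform interface: callers use get and
set_na the same way for both columns, while each dtype hides its own
storage convention for nullness, which is exactly the role the NA
singleton would play at the Python level.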