From wesmckinn at gmail.com Fri Jan 1 20:13:58 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Fri, 1 Jan 2016 17:13:58 -0800 Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap In-Reply-To: References: Message-ID: Jeff -- can you require log-in for editing on this document? https://docs.google.com/document/d/151ct8jcZWwh7XStptjbLsda6h2b3C0IuiH_hfZnUA58/edit# There are a number of anonymous edits. On Wed, Dec 30, 2015 at 6:04 PM, Wes McKinney wrote: > I cobbled together an ugly start of a c++->cython->pandas toolchain here > > https://github.com/wesm/pandas/tree/libpandas-native-core > > I used a mix of Kudu, Impala, and dynd-python cmake sources, so it's a > bit messy at the moment but it should be sufficient to run some real > experiments with a little more work. I reckon it's like a 6 month > project to tear out the insides of Series and DataFrame and replace it > with a new "native core", but we should be able to get enough info to > see whether it's a viable plan within a month or so. > > The end goal is to create "private" extension types in Cython that can > be the new base classes for Series and NDFrame; these will hold a > reference to a C++ object that contains wrappered NumPy arrays and > other metadata (like pandas-only dtypes). > > It might be too hard to try to replace a single usage of block manager > as a first experiment, so I'll try to create a minimal "SeriesLite" > that supports 3 dtypes > > 1) float64 with nans > 2) int64 with a bitmask for NAs > 3) category type for one of these > > Just want to get a feel for the extensibility and offer an NA > singleton Python object (a la None) for getting and setting NAs across > these 3 dtypes. > > If we end up going down this route, any way to place a moratorium on > invasive work on pandas internals (outside bug fixes)? > > Pedantic aside: I'd rather avoid shipping thirdparty C/C++ libraries > like googletest and friends in pandas if we can. Cloudera folks have > been working on a portable C++ library toolchain for Impala and other > projects at https://github.com/cloudera/native-toolchain, but it is > only being tested on Linux and OS X. Most google libraries should > build out of the box on MSVC but it'll be something to keep an eye on. > > BTW thanks to the libdynd developers for pioneering the c++ lib <-> > python-c++ lib <-> cython toolchain; being able to build Cython > extensions directly from cmake is a godsend > > HNY all > Wes > > On Tue, Dec 29, 2015 at 4:17 PM, Irwin Zaid wrote: >> Yeah, that seems reasonable and I totally agree a Pandas wrapper layer would >> be necessary. >> >> I'll keep an eye on this and I'd like to help if I can. >> >> Irwin >> >> >> On Tue, Dec 29, 2015 at 6:01 PM, Wes McKinney wrote: >>> >>> I'm not suggesting a rewrite of NumPy functionality but rather pandas >>> functionality that is currently written in a mishmash of Cython and Python. >>> Happy to experiment with changing the internal compute infrastructure and >>> data representation to DyND after this first stage of cleanup is done. Even >>> if we use DyND a pretty extensive pandas wrapper layer will be necessary. >>> >>> >>> On Tuesday, December 29, 2015, Irwin Zaid wrote: >>>> >>>> Hi Wes (and others), >>>> >>>> I've been following this conversation with interest. I do think it would >>>> be worth exploring DyND, rather than setting up yet another rewrite of >>>> NumPy-functionality. Especially because DyND is already an optional >>>> dependency of Pandas. 
>>>>
>>>> For things like Integer NA and new dtypes, DyND is there and ready to do
>>>> this.
>>>>
>>>> Irwin
>>>>
>>>> On Tue, Dec 29, 2015 at 5:18 PM, Wes McKinney
>>>> wrote:
>>>>>
>>>>> Can you link to the PR you're talking about?
>>>>>
>>>>> I will see about spending a few hours setting up a libpandas.so as a C++
>>>>> shared library where we can run some experiments and validate whether it can
>>>>> solve the integer-NA problem and be a place to put new data types
>>>>> (categorical and friends). I'm +1 on targeting
>>>>>
>>>>> Would it also be worth making a wish list of APIs we might consider
>>>>> breaking in a pandas 1.0 release that also features this new "native core"?
>>>>> Might as well right some wrongs while we're doing some invasive work on the
>>>>> internals; some breakage might be unavoidable. We can always maintain a
>>>>> pandas legacy 0.x.x maintenance branch (providing a conda binary build) for
>>>>> legacy users where showstopper bugs can get fixed.
>>>>>
>>>>> On Tue, Dec 29, 2015 at 1:20 PM, Jeff Reback
>>>>> wrote:
>>>>> > Wes your last is noted as well. I *think* we can actually do this now
>>>>> > (well there is a PR out there).
>>>>> >
>>>>> > On Tue, Dec 29, 2015 at 4:12 PM, Wes McKinney
>>>>> > wrote:
>>>>> >>
>>>>> >> The other huge thing this will enable is copy-on-write for
>>>>> >> various kinds of views, which should cut down on some of the defensive
>>>>> >> copying in the library and reduce memory usage.
>>>>> >>
>>>>> >> On Tue, Dec 29, 2015 at 1:02 PM, Wes McKinney
>>>>> >> wrote:
>>>>> >> > Basically the approach is
>>>>> >> >
>>>>> >> > 1) Base dtype type
>>>>> >> > 2) Base array type with K >= 1 dimensions
>>>>> >> > 3) Base scalar type
>>>>> >> > 4) Base index type
>>>>> >> > 5) "Wrapper" subclasses for all NumPy types fitting into categories
>>>>> >> > #1, #2, #3, #4
>>>>> >> > 6) Subclasses for pandas-specific types like category, datetimeTZ, etc.
>>>>> >> > 7) NDFrame as cpcloud wrote is just a list of these
>>>>> >> >
>>>>> >> > Indexes and axis labels / column names can get layered on top.
>>>>> >> >
>>>>> >> > After we do all this we can look at adding nested types (arrays, maps,
>>>>> >> > structs) to better support JSON.
>>>>> >> >
>>>>> >> > - Wes
>>>>> >> >
>>>>> >> > On Tue, Dec 29, 2015 at 12:14 PM, Phillip Cloud
>>>>> >> > wrote:
>>>>> >> >> Maybe this is saying the same thing as Wes, but how far would
>>>>> >> >> something like this get us?
>>>>> >> >>
>>>>> >> >> // warning: things are probably not this simple
>>>>> >> >>
>>>>> >> >> struct data_array_t {
>>>>> >> >>   void *primitive;               // scalar data
>>>>> >> >>   data_array_t *nested;          // nested data
>>>>> >> >>   boost::dynamic_bitset isnull;  // might have to create our own to avoid boost
>>>>> >> >>   schema_t schema;               // not sure exactly what this looks like
>>>>> >> >> };
>>>>> >> >>
>>>>> >> >> typedef std::map<std::string, data_array_t> data_frame_t;  // probably not this simple
>>>>> >> >>
>>>>> >> >> To answer Jeff's use-case question: I think that the use cases are
>>>>> >> >> 1) freedom from numpy (mostly) 2) no more block manager which frees us
>>>>> >> >> from the limitations of the block memory layout. In particular, the ability
>>>>> >> >> to take advantage of memory-mapped IO would be a big win IMO.
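To make the array-plus-bitmask idea above concrete, here is a minimal,
illustrative C++ sketch of an int32 array with NA support, where null-ness
lives in a separate validity bitmap that is AND-ed together during
arithmetic. All names here (Int32Array, ValidityBitmap, Add) are
hypothetical and not taken from any actual pandas branch; the sketch
assumes equal-length inputs and elides error handling.

// Sketch only: a minimal int32 array with a validity bitmask, so that
// arithmetic propagates NA values. All names are hypothetical.
#include <cstddef>
#include <cstdint>
#include <vector>

class ValidityBitmap {
 public:
  // All values start out valid (every bit set).
  explicit ValidityBitmap(std::size_t n) : bits_((n + 7) / 8, 0xFF) {}
  bool IsValid(std::size_t i) const { return bits_[i / 8] & (1 << (i % 8)); }
  void SetNull(std::size_t i) { bits_[i / 8] &= ~(1 << (i % 8)); }
 private:
  std::vector<std::uint8_t> bits_;  // 1 bit per value; 1 = valid, 0 = NA
};

class Int32Array {
 public:
  explicit Int32Array(std::vector<std::int32_t> values)
      : values_(std::move(values)), valid_(values_.size()) {}

  void SetNull(std::size_t i) { valid_.SetNull(i); }

  // Element-wise add, assuming equal lengths. The data loop operates on a
  // plain contiguous buffer; NA propagation is just an AND of the bitmaps.
  Int32Array Add(const Int32Array& other) const {
    Int32Array out(values_);
    for (std::size_t i = 0; i < values_.size(); ++i) {
      out.values_[i] = values_[i] + other.values_[i];
      if (!valid_.IsValid(i) || !other.valid_.IsValid(i)) {
        out.valid_.SetNull(i);
      }
    }
    return out;
  }

 private:
  std::vector<std::int32_t> values_;
  ValidityBitmap valid_;
};

The design point is that the value buffer stays a plain contiguous array
(so the non-NA arithmetic path could defer to NumPy or a vectorized loop),
while NA handling reduces to cheap bitmap operations; this matches the
division of labor sketched for Int32Array->add versus Float32Array->add in
the reply that follows.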
>>>>> >> >>
>>>>> >> >>
>>>>> >> >> On Tue, Dec 29, 2015 at 2:50 PM Wes McKinney
>>>>> >> >> wrote:
>>>>> >> >>>
>>>>> >> >>> I will write a more detailed response to some of these things after
>>>>> >> >>> the new year, but, in particular, re: missing values, can you or
>>>>> >> >>> someone tell me why creating an object that contains a NumPy array and
>>>>> >> >>> a bitmap is not sufficient? If we can add a lightweight C/C++ class
>>>>> >> >>> layer between NumPy function calls (e.g. arithmetic) and pandas
>>>>> >> >>> function calls, then I see no reason why we cannot have
>>>>> >> >>>
>>>>> >> >>> Int32Array->add
>>>>> >> >>>
>>>>> >> >>> and
>>>>> >> >>>
>>>>> >> >>> Float32Array->add
>>>>> >> >>>
>>>>> >> >>> do the right thing (the former would be responsible for bitmasking to
>>>>> >> >>> propagate NA values; the latter would defer to NumPy). If we can put
>>>>> >> >>> all the internals of pandas objects inside a black box, we can add
>>>>> >> >>> layers of virtual function indirection without a performance penalty
>>>>> >> >>> (whereas adding more interpreter overhead with more abstraction layers
>>>>> >> >>> does add up to a perf penalty).
>>>>> >> >>>
>>>>> >> >>> I don't think this is too scary -- I would be willing to create a
>>>>> >> >>> small POC C++ library to prototype something like what I'm talking
>>>>> >> >>> about.
>>>>> >> >>>
>>>>> >> >>> Since pandas has limited points of contact with NumPy I don't think
>>>>> >> >>> this would end up being too onerous.
>>>>> >> >>>
>>>>> >> >>> For the record, I'm pretty allergic to "advanced C++"; I think it is a
>>>>> >> >>> useful tool; if you pick a sane 20% subset of the C++11 spec and follow
>>>>> >> >>> Google C++ style, it's not very inaccessible to intermediate
>>>>> >> >>> developers. More or less "C plus OOP and easier object lifetime
>>>>> >> >>> management (shared/unique_ptr, etc.)". As soon as you add a lot of
>>>>> >> >>> template metaprogramming, C++ library development quickly becomes
>>>>> >> >>> inaccessible except to the C++-Jedi.
>>>>> >> >>>
>>>>> >> >>> Maybe let's start a Google document on "pandas roadmap" where we can
>>>>> >> >>> break down the 1-2 year goals and some of these infrastructure issues
>>>>> >> >>> and have our discussion there? (obviously publish this someplace once
>>>>> >> >>> we're done)
>>>>> >> >>>
>>>>> >> >>> - Wes
>>>>> >> >>>
>>>>> >> >>> On Fri, Dec 25, 2015 at 2:14 PM, Jeff Reback
>>>>> >> >>> wrote:
>>>>> >> >>> > Here are some of my thoughts about pandas Roadmap / status and some
>>>>> >> >>> > responses to Wes's thoughts.
>>>>> >> >>> >
>>>>> >> >>> > In the last few (and upcoming) major releases we have made the
>>>>> >> >>> > following changes:
>>>>> >> >>> >
>>>>> >> >>> > - dtype enhancements (Categorical, Timedelta, Datetime w/tz) & making
>>>>> >> >>> > these first class objects
>>>>> >> >>> > - code refactoring to remove subclassing of ndarrays for Series & Index
>>>>> >> >>> > - carving out / deprecating non-core parts of pandas
>>>>> >> >>> >   - datareader
>>>>> >> >>> >   - SparsePanel, WidePanel & other aliases (TimeSeries)
>>>>> >> >>> >   - rpy, rplot, irow et al.
>>>>> >> >>> >   - google-analytics
>>>>> >> >>> > - API changes to make things more consistent
>>>>> >> >>> >   - pd.rolling/expanding * -> .rolling/expanding (this is in master now)
>>>>> >> >>> >   - .resample becoming fully deferred, like groupby
>>>>> >> >>> >   - multi-index slicing along any level (obviates need for .xs) and
>>>>> >> >>> >     allows assignment
>>>>> >> >>> >   - .loc/.iloc - for the most part obviates use of .ix
>>>>> >> >>> >   - .pipe & .assign
>>>>> >> >>> >   - plotting accessors
>>>>> >> >>> >   - fixing of the sorting API
>>>>> >> >>> > - many performance enhancements both micro & macro (e.g. release GIL)
>>>>> >> >>> >
>>>>> >> >>> > Some on-deck enhancements are (meaning these are basically ready to go in):
>>>>> >> >>> > - IntervalIndex (and eventually make PeriodIndex just a sub-class of this)
>>>>> >> >>> > - RangeIndex
>>>>> >> >>> >
>>>>> >> >>> > so lots of changes, though nothing really earth-shaking, just more
>>>>> >> >>> > convenience, reducing magicness somewhat and providing flexibility.
>>>>> >> >>> >
>>>>> >> >>> > Of course we are getting more and more issues, mostly bug reports (and
>>>>> >> >>> > lots of dupes), some edge-case enhancements which add to the existing
>>>>> >> >>> > APIs, and of course requests to expand the (already) large codebase to
>>>>> >> >>> > other use cases.
>>>>> >> >>> > Balancing this are a good many pull-requests from many different users,
>>>>> >> >>> > some even deep into the internals.
>>>>> >> >>> >
>>>>> >> >>> > Here are some things that I have talked about and could be considered
>>>>> >> >>> > for the roadmap. Disclaimer: I do work for Continuum but these views
>>>>> >> >>> > are of course my own; furthermore I am obviously a bit more familiar
>>>>> >> >>> > with some of the 'sponsored' open-source libraries, but I am always
>>>>> >> >>> > open to new things.
>>>>> >> >>> >
>>>>> >> >>> > - integration / automatic deferral to numba for JIT (this would be
>>>>> >> >>> > thru .apply)
>>>>> >> >>> > - automatic deferral to dask from groupby where appropriate / maybe a
>>>>> >> >>> > .to_parallel (to simply return a dask.DataFrame object)
>>>>> >> >>> > - incorporation of quantities / units (as part of the dtype)
>>>>> >> >>> > - use of DyND to allow missing values for int dtypes
>>>>> >> >>> > - make Period a first class dtype.
>>>>> >> >>> > - provide some copy-on-write semantics to alleviate the chained-indexing
>>>>> >> >>> > issues which occasionally come up with misuse of the indexing API
>>>>> >> >>> > - allow a 'policy' to automatically provide column blocks for dict-like
>>>>> >> >>> > input (e.g. each column would be a block); this would allow a pass-thru
>>>>> >> >>> > API where you could put in numpy arrays where you have views and have
>>>>> >> >>> > them preserved rather than copied automatically.
Note that this would also allow what I call 'split',
>>>>> >> >>> > where a passed-in multi-dim numpy array could be split up into
>>>>> >> >>> > individual blocks (which actually gives a nice perf boost after the
>>>>> >> >>> > splitting costs).
>>>>> >> >>> >
>>>>> >> >>> > In working towards some of these goals, I have come to the opinion that
>>>>> >> >>> > it would make sense to have a neutral API protocol layer that would
>>>>> >> >>> > allow us to swap out different engines as needed, for particular
>>>>> >> >>> > dtypes, or *maybe* out-of-core type computations. E.g.
>>>>> >> >>> > imagine that we replaced the in-memory block structure with a bcolz /
>>>>> >> >>> > memmap type; in theory this should be 'easy' and just work.
>>>>> >> >>> > I could also see us adopting *some* of the SFrame code to allow easier
>>>>> >> >>> > interop with this API layer.
>>>>> >> >>> >
>>>>> >> >>> > In practice, I think a nice API layer would need to be created to make
>>>>> >> >>> > this clean / nice.
>>>>> >> >>> >
>>>>> >> >>> > So this comes around to Wes's point about creating a c++ library for
>>>>> >> >>> > the internals (and possibly even some of the indexing routines).
>>>>> >> >>> > In an ideal world, of course this would be desirable. Getting there is
>>>>> >> >>> > a bit non-trivial I think, and IMHO might not be worth the effort. I
>>>>> >> >>> > don't really see big performance bottlenecks. We *already* defer much
>>>>> >> >>> > of the computation to libraries like numexpr & bottleneck (where
>>>>> >> >>> > appropriate). Adding numba / dask to the list would be helpful.
>>>>> >> >>> >
>>>>> >> >>> > I think that almost all performance issues are the result of:
>>>>> >> >>> >
>>>>> >> >>> > a) gross misuse of the pandas API. How much code have you seen that
>>>>> >> >>> > does df.apply(lambda x: x.sum())
>>>>> >> >>> > b) routines which operate column-by-column rather than block-by-block
>>>>> >> >>> > and are in python space (e.g. we have an issue right now about .quantile)
>>>>> >> >>> >
>>>>> >> >>> > So I am glossing over a big goal of having a c++ library that
>>>>> >> >>> > represents the pandas internals. This would by definition have a c-API
>>>>> >> >>> > so that you *could* use pandas-like semantics in c/c++ and just have it
>>>>> >> >>> > work (and then pandas would be a thin wrapper around this library).
>>>>> >> >>> >
>>>>> >> >>> > I am not averse to this, but I think it would be quite a big effort,
>>>>> >> >>> > and not a huge perf boost IMHO. Further there are a number of API
>>>>> >> >>> > issues w.r.t. indexing which need to be clarified / worked out (e.g.
>>>>> >> >>> > should we simply deprecate []) that are much easier to test / figure
>>>>> >> >>> > out in python space.
>>>>> >> >>> >
>>>>> >> >>> > I also think that we have quite a large number of contributors.
>>>>> >> >>> > Moving to c++ might make the internals a bit more impenetrable than
>>>>> >> >>> > the current internals (though this would allow c++ people to
>>>>> >> >>> > contribute, so that might balance out).
>>>>> >> >>> >
>>>>> >> >>> > We have a limited core of devs who right now are familiar with things.
>>>>> >> >>> > If someone happened to have a starting base for a c++ library, then I
>>>>> >> >>> > might change my opinion here.
>>>>> >> >>> >
>>>>> >> >>> >
>>>>> >> >>> > my 4c.
>>>>> >> >>> >
>>>>> >> >>> > Jeff
>>>>> >> >>> >
>>>>> >> >>> >
>>>>> >> >>> >
>>>>> >> >>> >
>>>>> >> >>> > On Thu, Dec 24, 2015 at 7:18 PM, Wes McKinney
>>>>> >> >>> > wrote:
>>>>> >> >>> >>
>>>>> >> >>> >> Deep thoughts during the holidays.
>>>>> >> >>> >>
>>>>> >> >>> >> I might be out of line here, but the interpreter-heaviness of the
>>>>> >> >>> >> inside of pandas objects is likely to be a long-term liability and
>>>>> >> >>> >> source of performance problems and technical debt.
>>>>> >> >>> >>
>>>>> >> >>> >> Has anyone put any thought into planning and beginning to execute on a
>>>>> >> >>> >> rewrite that moves as much as possible of the internals into native /
>>>>> >> >>> >> compiled code? I'm talking about:
>>>>> >> >>> >>
>>>>> >> >>> >> - pandas/core/internals
>>>>> >> >>> >> - indexing and assignment
>>>>> >> >>> >> - much of pandas/core/common
>>>>> >> >>> >> - categorical and custom dtypes
>>>>> >> >>> >> - all indexing mechanisms
>>>>> >> >>> >>
>>>>> >> >>> >> I'm concerned we've already exposed too much of the internals to
>>>>> >> >>> >> users, so this might lead to a lot of API breakage, but it might be
>>>>> >> >>> >> for the Greater Good. As a first step, beginning a partial migration
>>>>> >> >>> >> of internals into some C++ classes that encapsulate the insides of
>>>>> >> >>> >> DataFrame objects and implement indexing and block-level manipulations
>>>>> >> >>> >> would be a good place to start. I think you could do this without too
>>>>> >> >>> >> much disruption.
>>>>> >> >>> >>
>>>>> >> >>> >> As part of this internal retooling we might give consideration to
>>>>> >> >>> >> alternative data structures for representing data internal to pandas
>>>>> >> >>> >> objects. Now in 2015/2016, continuing to be hamstrung by NumPy's
>>>>> >> >>> >> limitations feels somewhat anachronistic. User code is riddled with
>>>>> >> >>> >> workarounds for data type fidelity issues and the like. Like, really,
>>>>> >> >>> >> why not add a bitndarray (similar to ilanschnell/bitarray) for storing
>>>>> >> >>> >> nullness for problematic types and hide this from the user? =)
>>>>> >> >>> >>
>>>>> >> >>> >> Since we are now a NumFOCUS-sponsored project, I feel like we might
>>>>> >> >>> >> consider establishing some formal governance over pandas and
>>>>> >> >>> >> publishing roadmap documents describing plans for the project and
>>>>> >> >>> >> meeting notes from committers.
There's no real
>>>>> >> >>> >> "committer culture" for NumFOCUS projects like there is with the
>>>>> >> >>> >> Apache Software Foundation, but we might try leading by example!
>>>>> >> >>> >>
>>>>> >> >>> >> Also, I believe pandas as a project has reached a level of importance
>>>>> >> >>> >> where we ought to consider planning and execution on larger scale
>>>>> >> >>> >> undertakings such as this for safeguarding the future.
>>>>> >> >>> >>
>>>>> >> >>> >> As for myself, well, I have my hands full in Big Data-land. I wish I
>>>>> >> >>> >> could be helping more with pandas, but there are quite a few
>>>>> >> >>> >> fundamental issues (like data interoperability, nested data handling,
>>>>> >> >>> >> and file format support, e.g. Parquet; see
>>>>> >> >>> >>
>>>>> >> >>> >> http://wesmckinney.com/blog/the-problem-with-the-data-science-language-wars/)
>>>>> >> >>> >> preventing Python from being more useful in industry analytics
>>>>> >> >>> >> applications.
>>>>> >> >>> >>
>>>>> >> >>> >> Aside: one of the bigger mistakes I made with pandas's API design was
>>>>> >> >>> >> making it acceptable to call class constructors, like
>>>>> >> >>> >> pandas.DataFrame, directly (versus factory functions). Sorry about
>>>>> >> >>> >> that! If we could convince everyone to start writing pandas.data_frame
>>>>> >> >>> >> or dataframe instead of using the class reference, it would help a lot
>>>>> >> >>> >> with code cleanup. It's hard to plan for these things; NumPy
>>>>> >> >>> >> interoperability seemed a lot more important in 2008 than it does now,
>>>>> >> >>> >> so I forgive myself.
>>>>> >> >>> >>
>>>>> >> >>> >> cheers and best wishes for 2016,
>>>>> >> >>> >> Wes
>>>>> >> >>> >> _______________________________________________
>>>>> >> >>> >> Pandas-dev mailing list
>>>>> >> >>> >> Pandas-dev at python.org
>>>>> >> >>> >> https://mail.python.org/mailman/listinfo/pandas-dev
>>>>> >> >>> >
>>>>> >> >>> >
>>>>> >> >>> _______________________________________________
>>>>> >> >>> Pandas-dev mailing list
>>>>> >> >>> Pandas-dev at python.org
>>>>> >> >>> https://mail.python.org/mailman/listinfo/pandas-dev
>>>>> >> _______________________________________________
>>>>> >> Pandas-dev mailing list
>>>>> >> Pandas-dev at python.org
>>>>> >> https://mail.python.org/mailman/listinfo/pandas-dev
>>>>> >
>>>>> >
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Pandas-dev mailing list
>>>>> Pandas-dev at python.org
>>>>> https://mail.python.org/mailman/listinfo/pandas-dev
>>>>>
>>>>
>>

From jeffreback at gmail.com Fri Jan 1 20:23:02 2016
From: jeffreback at gmail.com (Jeff Reback)
Date: Fri, 1 Jan 2016 20:23:02 -0500
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: References: Message-ID:

I changed the doc so that the core dev people can edit. I *think* that
everyone should be able to view/comment though.

On Fri, Jan 1, 2016 at 8:13 PM, Wes McKinney wrote:
> Jeff -- can you require log-in for editing on this document?
>
> https://docs.google.com/document/d/151ct8jcZWwh7XStptjbLsda6h2b3C0IuiH_hfZnUA58/edit#
>
> There are a number of anonymous edits.
> > On Wed, Dec 30, 2015 at 6:04 PM, Wes McKinney wrote: > > I cobbled together an ugly start of a c++->cython->pandas toolchain here > > > > https://github.com/wesm/pandas/tree/libpandas-native-core > > > > I used a mix of Kudu, Impala, and dynd-python cmake sources, so it's a > > bit messy at the moment but it should be sufficient to run some real > > experiments with a little more work. I reckon it's like a 6 month > > project to tear out the insides of Series and DataFrame and replace it > > with a new "native core", but we should be able to get enough info to > > see whether it's a viable plan within a month or so. > > > > The end goal is to create "private" extension types in Cython that can > > be the new base classes for Series and NDFrame; these will hold a > > reference to a C++ object that contains wrappered NumPy arrays and > > other metadata (like pandas-only dtypes). > > > > It might be too hard to try to replace a single usage of block manager > > as a first experiment, so I'll try to create a minimal "SeriesLite" > > that supports 3 dtypes > > > > 1) float64 with nans > > 2) int64 with a bitmask for NAs > > 3) category type for one of these > > > > Just want to get a feel for the extensibility and offer an NA > > singleton Python object (a la None) for getting and setting NAs across > > these 3 dtypes. > > > > If we end up going down this route, any way to place a moratorium on > > invasive work on pandas internals (outside bug fixes)? > > > > Pedantic aside: I'd rather avoid shipping thirdparty C/C++ libraries > > like googletest and friends in pandas if we can. Cloudera folks have > > been working on a portable C++ library toolchain for Impala and other > > projects at https://github.com/cloudera/native-toolchain, but it is > > only being tested on Linux and OS X. Most google libraries should > > build out of the box on MSVC but it'll be something to keep an eye on. > > > > BTW thanks to the libdynd developers for pioneering the c++ lib <-> > > python-c++ lib <-> cython toolchain; being able to build Cython > > extensions directly from cmake is a godsend > > > > HNY all > > Wes > > > > On Tue, Dec 29, 2015 at 4:17 PM, Irwin Zaid wrote: > >> Yeah, that seems reasonable and I totally agree a Pandas wrapper layer > would > >> be necessary. > >> > >> I'll keep an eye on this and I'd like to help if I can. > >> > >> Irwin > >> > >> > >> On Tue, Dec 29, 2015 at 6:01 PM, Wes McKinney > wrote: > >>> > >>> I'm not suggesting a rewrite of NumPy functionality but rather pandas > >>> functionality that is currently written in a mishmash of Cython and > Python. > >>> Happy to experiment with changing the internal compute infrastructure > and > >>> data representation to DyND after this first stage of cleanup is done. > Even > >>> if we use DyND a pretty extensive pandas wrapper layer will be > necessary. > >>> > >>> > >>> On Tuesday, December 29, 2015, Irwin Zaid wrote: > >>>> > >>>> Hi Wes (and others), > >>>> > >>>> I've been following this conversation with interest. I do think it > would > >>>> be worth exploring DyND, rather than setting up yet another rewrite of > >>>> NumPy-functionality. Especially because DyND is already an optional > >>>> dependency of Pandas. > >>>> > >>>> For things like Integer NA and new dtypes, DyND is there and ready to > do > >>>> this. > >>>> > >>>> Irwin > >>>> > >>>> On Tue, Dec 29, 2015 at 5:18 PM, Wes McKinney > >>>> wrote: > >>>>> > >>>>> Can you link to the PR you're talking about? 
> >>>>>
> >>>>> I will see about spending a few hours setting up a libpandas.so as a C++
> >>>>> shared library where we can run some experiments and validate whether
> >>>>> it can solve the integer-NA problem and be a place to put new data types
> >>>>> (categorical and friends). I'm +1 on targeting
> >>>>>
> >>>>> Would it also be worth making a wish list of APIs we might consider
> >>>>> breaking in a pandas 1.0 release that also features this new "native core"?
> >>>>> Might as well right some wrongs while we're doing some invasive work
> >>>>> on the internals; some breakage might be unavoidable. We can always
> >>>>> maintain a pandas legacy 0.x.x maintenance branch (providing a conda
> >>>>> binary build) for legacy users where showstopper bugs can get fixed.
> >>>>>
> >>>>> On Tue, Dec 29, 2015 at 1:20 PM, Jeff Reback
> >>>>> wrote:
> >>>>> > Wes your last is noted as well. I *think* we can actually do this now
> >>>>> > (well there is a PR out there).
> >>>>> >
> >>>>> > On Tue, Dec 29, 2015 at 4:12 PM, Wes McKinney
> >>>>> > wrote:
> >>>>> >>
> >>>>> >> The other huge thing this will enable is copy-on-write for
> >>>>> >> various kinds of views, which should cut down on some of the
> >>>>> >> defensive copying in the library and reduce memory usage.
> >>>>> >>
> >>>>> >> On Tue, Dec 29, 2015 at 1:02 PM, Wes McKinney <wesmckinn at gmail.com>
> >>>>> >> wrote:
> >>>>> >> > Basically the approach is
> >>>>> >> >
> >>>>> >> > 1) Base dtype type
> >>>>> >> > 2) Base array type with K >= 1 dimensions
> >>>>> >> > 3) Base scalar type
> >>>>> >> > 4) Base index type
> >>>>> >> > 5) "Wrapper" subclasses for all NumPy types fitting into categories
> >>>>> >> > #1, #2, #3, #4
> >>>>> >> > 6) Subclasses for pandas-specific types like category, datetimeTZ, etc.
> >>>>> >> > 7) NDFrame as cpcloud wrote is just a list of these
> >>>>> >> >
> >>>>> >> > Indexes and axis labels / column names can get layered on top.
> >>>>> >> >
> >>>>> >> > After we do all this we can look at adding nested types (arrays, maps,
> >>>>> >> > structs) to better support JSON.
> >>>>> >> >
> >>>>> >> > - Wes
> >>>>> >> >
> >>>>> >> > On Tue, Dec 29, 2015 at 12:14 PM, Phillip Cloud <cpcloud at gmail.com>
> >>>>> >> > wrote:
> >>>>> >> >> Maybe this is saying the same thing as Wes, but how far would
> >>>>> >> >> something like this get us?
> >>>>> >> >>
> >>>>> >> >> // warning: things are probably not this simple
> >>>>> >> >>
> >>>>> >> >> struct data_array_t {
> >>>>> >> >>   void *primitive;               // scalar data
> >>>>> >> >>   data_array_t *nested;          // nested data
> >>>>> >> >>   boost::dynamic_bitset isnull;  // might have to create our own to avoid boost
> >>>>> >> >>   schema_t schema;               // not sure exactly what this looks like
> >>>>> >> >> };
> >>>>> >> >>
> >>>>> >> >> typedef std::map<std::string, data_array_t> data_frame_t;  // probably not this simple
> >>>>> >> >>
> >>>>> >> >> To answer Jeff's use-case question: I think that the use cases are
> >>>>> >> >> 1) freedom from numpy (mostly) 2) no more block manager which frees
> >>>>> >> >> us from the limitations of the block memory layout. In particular,
> >>>>> >> >> the ability to take advantage of memory-mapped IO would be a big win IMO.
> >>>>> >> >>
> >>>>> >> >>
> >>>>> >> >> On Tue, Dec 29, 2015 at 2:50 PM Wes McKinney <wesmckinn at gmail.com>
> >>>>> >> >> wrote:
> >>>>> >> >>>
> >>>>> >> >>> I will write a more detailed response to some of these things after
> >>>>> >> >>> the new year, but, in particular, re: missing values, can you or
> >>>>> >> >>> someone tell me why creating an object that contains a NumPy
> >>>>> >> >>> array and a bitmap is not sufficient? If we can add a lightweight
> >>>>> >> >>> C/C++ class layer between NumPy function calls (e.g. arithmetic)
> >>>>> >> >>> and pandas function calls, then I see no reason why we cannot have
> >>>>> >> >>>
> >>>>> >> >>> Int32Array->add
> >>>>> >> >>>
> >>>>> >> >>> and
> >>>>> >> >>>
> >>>>> >> >>> Float32Array->add
> >>>>> >> >>>
> >>>>> >> >>> do the right thing (the former would be responsible for bitmasking
> >>>>> >> >>> to propagate NA values; the latter would defer to NumPy). If we can
> >>>>> >> >>> put all the internals of pandas objects inside a black box, we can
> >>>>> >> >>> add layers of virtual function indirection without a performance
> >>>>> >> >>> penalty (whereas adding more interpreter overhead with more
> >>>>> >> >>> abstraction layers does add up to a perf penalty).
> >>>>> >> >>>
> >>>>> >> >>> I don't think this is too scary -- I would be willing to create a
> >>>>> >> >>> small POC C++ library to prototype something like what I'm talking
> >>>>> >> >>> about.
> >>>>> >> >>>
> >>>>> >> >>> Since pandas has limited points of contact with NumPy I don't think
> >>>>> >> >>> this would end up being too onerous.
> >>>>> >> >>>
> >>>>> >> >>> For the record, I'm pretty allergic to "advanced C++"; I think it is
> >>>>> >> >>> a useful tool; if you pick a sane 20% subset of the C++11 spec and
> >>>>> >> >>> follow Google C++ style, it's not very inaccessible to intermediate
> >>>>> >> >>> developers. More or less "C plus OOP and easier object lifetime
> >>>>> >> >>> management (shared/unique_ptr, etc.)". As soon as you add a lot of
> >>>>> >> >>> template metaprogramming, C++ library development quickly becomes
> >>>>> >> >>> inaccessible except to the C++-Jedi.
> >>>>> >> >>>
> >>>>> >> >>> Maybe let's start a Google document on "pandas roadmap" where we can
> >>>>> >> >>> break down the 1-2 year goals and some of these infrastructure
> >>>>> >> >>> issues and have our discussion there? (obviously publish this
> >>>>> >> >>> someplace once we're done)
> >>>>> >> >>>
> >>>>> >> >>> - Wes
> >>>>> >> >>>
> >>>>> >> >>> On Fri, Dec 25, 2015 at 2:14 PM, Jeff Reback
> >>>>> >> >>> wrote:
> >>>>> >> >>> > Here are some of my thoughts about pandas Roadmap / status and some
> >>>>> >> >>> > responses to Wes's thoughts.
> >>>>> >> >>> >
> >>>>> >> >>> > In the last few (and upcoming) major releases we have made the
> >>>>> >> >>> > following changes:
> >>>>> >> >>> >
> >>>>> >> >>> > - dtype enhancements (Categorical, Timedelta, Datetime w/tz) &
> >>>>> >> >>> > making these first class objects
> >>>>> >> >>> > - code refactoring to remove subclassing of ndarrays for Series & Index
> >>>>> >> >>> > - carving out / deprecating non-core parts of pandas
> >>>>> >> >>> >   - datareader
> >>>>> >> >>> >   - SparsePanel, WidePanel & other aliases (TimeSeries)
> >>>>> >> >>> >   - rpy, rplot, irow et al.
> >>>>> >> >>> >   - google-analytics
> >>>>> >> >>> > - API changes to make things more consistent
> >>>>> >> >>> >   - pd.rolling/expanding * -> .rolling/expanding (this is in master now)
> >>>>> >> >>> >   - .resample becoming fully deferred, like groupby
> >>>>> >> >>> >   - multi-index slicing along any level (obviates need for .xs) and
> >>>>> >> >>> >     allows assignment
> >>>>> >> >>> >   - .loc/.iloc - for the most part obviates use of .ix
> >>>>> >> >>> >   - .pipe & .assign
> >>>>> >> >>> >   - plotting accessors
> >>>>> >> >>> >   - fixing of the sorting API
> >>>>> >> >>> > - many performance enhancements both micro & macro (e.g. release GIL)
> >>>>> >> >>> >
> >>>>> >> >>> > Some on-deck enhancements are (meaning these are basically ready
> >>>>> >> >>> > to go in):
> >>>>> >> >>> > - IntervalIndex (and eventually make PeriodIndex just a sub-class
> >>>>> >> >>> > of this)
> >>>>> >> >>> > - RangeIndex
> >>>>> >> >>> >
> >>>>> >> >>> > so lots of changes, though nothing really earth-shaking, just more
> >>>>> >> >>> > convenience, reducing magicness somewhat and providing flexibility.
> >>>>> >> >>> >
> >>>>> >> >>> > Of course we are getting more and more issues, mostly bug reports
> >>>>> >> >>> > (and lots of dupes), some edge-case enhancements which add to the
> >>>>> >> >>> > existing APIs, and of course requests to expand the (already) large
> >>>>> >> >>> > codebase to other use cases.
> >>>>> >> >>> > Balancing this are a good many pull-requests from many different
> >>>>> >> >>> > users, some even deep into the internals.
> >>>>> >> >>> >
> >>>>> >> >>> > Here are some things that I have talked about and could be
> >>>>> >> >>> > considered for the roadmap. Disclaimer: I do work for Continuum
> >>>>> >> >>> > but these views are of course my own; furthermore I am obviously a
> >>>>> >> >>> > bit more familiar with some of the 'sponsored' open-source
> >>>>> >> >>> > libraries, but I am always open to new things.
> >>>>> >> >>> >
> >>>>> >> >>> > - integration / automatic deferral to numba for JIT (this would be
> >>>>> >> >>> > thru .apply)
> >>>>> >> >>> > - automatic deferral to dask from groupby where appropriate / maybe
> >>>>> >> >>> > a .to_parallel (to simply return a dask.DataFrame object)
> >>>>> >> >>> > - incorporation of quantities / units (as part of the dtype)
> >>>>> >> >>> > - use of DyND to allow missing values for int dtypes
> >>>>> >> >>> > - make Period a first class dtype.
> >>>>> >> >>> > - provide some copy-on-write semantics to alleviate the
> >>>>> >> >>> > chained-indexing issues which occasionally come up with misuse of
> >>>>> >> >>> > the indexing API
> >>>>> >> >>> > - allow a 'policy' to automatically provide column blocks for
> >>>>> >> >>> > dict-like input (e.g. each column would be a block); this would
> >>>>> >> >>> > allow a pass-thru API where you could put in numpy arrays where you
> >>>>> >> >>> > have views and have them preserved rather than copied automatically.
> >>>>> >> >>> > Note that this would also allow what I call 'split', where a passed-in
> >>>>> >> >>> > multi-dim numpy array could be split up into individual blocks (which
> >>>>> >> >>> > actually gives a nice perf boost after the splitting costs).
> >>>>> >> >>> >
> >>>>> >> >>> > In working towards some of these goals, I have come to the opinion
> >>>>> >> >>> > that it would make sense to have a neutral API protocol layer
> >>>>> >> >>> > that would allow us to swap out different engines as needed, for
> >>>>> >> >>> > particular dtypes, or *maybe* out-of-core type computations. E.g.
> >>>>> >> >>> > imagine that we replaced the in-memory block structure with a
> >>>>> >> >>> > bcolz / memmap type; in theory this should be 'easy' and just work.
> >>>>> >> >>> > I could also see us adopting *some* of the SFrame code to allow
> >>>>> >> >>> > easier interop with this API layer.
> >>>>> >> >>> >
> >>>>> >> >>> > In practice, I think a nice API layer would need to be created to
> >>>>> >> >>> > make this clean / nice.
> >>>>> >> >>> >
> >>>>> >> >>> > So this comes around to Wes's point about creating a c++ library
> >>>>> >> >>> > for the internals (and possibly even some of the indexing routines).
> >>>>> >> >>> > In an ideal world, of course this would be desirable. Getting there
> >>>>> >> >>> > is a bit non-trivial I think, and IMHO might not be worth the
> >>>>> >> >>> > effort. I don't really see big performance bottlenecks. We *already*
> >>>>> >> >>> > defer much of the computation to libraries like numexpr & bottleneck
> >>>>> >> >>> > (where appropriate). Adding numba / dask to the list would be helpful.
> >>>>> >> >>> >
> >>>>> >> >>> > I think that almost all performance issues are the result of:
> >>>>> >> >>> >
> >>>>> >> >>> > a) gross misuse of the pandas API.
How much code have you seen
> >>>>> >> >>> > that does df.apply(lambda x: x.sum())
> >>>>> >> >>> > b) routines which operate column-by-column rather than
> >>>>> >> >>> > block-by-block and are in python space (e.g. we have an issue right
> >>>>> >> >>> > now about .quantile)
> >>>>> >> >>> >
> >>>>> >> >>> > So I am glossing over a big goal of having a c++ library that
> >>>>> >> >>> > represents the pandas internals. This would by definition have a
> >>>>> >> >>> > c-API so that you *could* use pandas-like semantics in c/c++ and
> >>>>> >> >>> > just have it work (and then pandas would be a thin wrapper around
> >>>>> >> >>> > this library).
> >>>>> >> >>> >
> >>>>> >> >>> > I am not averse to this, but I think it would be quite a big
> >>>>> >> >>> > effort, and not a huge perf boost IMHO. Further there are a number
> >>>>> >> >>> > of API issues w.r.t. indexing which need to be clarified / worked
> >>>>> >> >>> > out (e.g. should we simply deprecate []) that are much easier to
> >>>>> >> >>> > test / figure out in python space.
> >>>>> >> >>> >
> >>>>> >> >>> > I also think that we have quite a large number of contributors.
> >>>>> >> >>> > Moving to c++ might make the internals a bit more impenetrable than
> >>>>> >> >>> > the current internals (though this would allow c++ people to
> >>>>> >> >>> > contribute, so that might balance out).
> >>>>> >> >>> >
> >>>>> >> >>> > We have a limited core of devs who right now are familiar with
> >>>>> >> >>> > things. If someone happened to have a starting base for a c++
> >>>>> >> >>> > library, then I might change my opinion here.
> >>>>> >> >>> >
> >>>>> >> >>> >
> >>>>> >> >>> > my 4c.
> >>>>> >> >>> >
> >>>>> >> >>> > Jeff
> >>>>> >> >>> >
> >>>>> >> >>> >
> >>>>> >> >>> >
> >>>>> >> >>> >
> >>>>> >> >>> > On Thu, Dec 24, 2015 at 7:18 PM, Wes McKinney
> >>>>> >> >>> > wrote:
> >>>>> >> >>> >>
> >>>>> >> >>> >> Deep thoughts during the holidays.
> >>>>> >> >>> >>
> >>>>> >> >>> >> I might be out of line here, but the interpreter-heaviness of the
> >>>>> >> >>> >> inside of pandas objects is likely to be a long-term liability and
> >>>>> >> >>> >> source of performance problems and technical debt.
> >>>>> >> >>> >>
> >>>>> >> >>> >> Has anyone put any thought into planning and beginning to execute
> >>>>> >> >>> >> on a rewrite that moves as much as possible of the internals into
> >>>>> >> >>> >> native / compiled code? I'm talking about:
> >>>>> >> >>> >>
> >>>>> >> >>> >> - pandas/core/internals
> >>>>> >> >>> >> - indexing and assignment
> >>>>> >> >>> >> - much of pandas/core/common
> >>>>> >> >>> >> - categorical and custom dtypes
> >>>>> >> >>> >> - all indexing mechanisms
> >>>>> >> >>> >>
> >>>>> >> >>> >> I'm concerned we've already exposed too much of the internals to
> >>>>> >> >>> >> users, so this might lead to a lot of API breakage, but it might
> >>>>> >> >>> >> be for the Greater Good.
As a first step, beginning a partial migration of
> >>>>> >> >>> >> internals into some C++ classes that encapsulate the insides of
> >>>>> >> >>> >> DataFrame objects and implement indexing and block-level
> >>>>> >> >>> >> manipulations would be a good place to start. I think you could do
> >>>>> >> >>> >> this without too much disruption.
> >>>>> >> >>> >>
> >>>>> >> >>> >> As part of this internal retooling we might give consideration to
> >>>>> >> >>> >> alternative data structures for representing data internal to
> >>>>> >> >>> >> pandas objects. Now in 2015/2016, continuing to be hamstrung by
> >>>>> >> >>> >> NumPy's limitations feels somewhat anachronistic. User code is
> >>>>> >> >>> >> riddled with workarounds for data type fidelity issues and the
> >>>>> >> >>> >> like. Like, really, why not add a bitndarray (similar to
> >>>>> >> >>> >> ilanschnell/bitarray) for storing nullness for problematic types
> >>>>> >> >>> >> and hide this from the user? =)
> >>>>> >> >>> >>
> >>>>> >> >>> >> Since we are now a NumFOCUS-sponsored project, I feel like we might
> >>>>> >> >>> >> consider establishing some formal governance over pandas and
> >>>>> >> >>> >> publishing roadmap documents describing plans for the project and
> >>>>> >> >>> >> meeting notes from committers. There's no real "committer culture"
> >>>>> >> >>> >> for NumFOCUS projects like there is with the Apache Software
> >>>>> >> >>> >> Foundation, but we might try leading by example!
> >>>>> >> >>> >>
> >>>>> >> >>> >> Also, I believe pandas as a project has reached a level of
> >>>>> >> >>> >> importance where we ought to consider planning and execution on
> >>>>> >> >>> >> larger scale undertakings such as this for safeguarding the future.
> >>>>> >> >>> >>
> >>>>> >> >>> >> As for myself, well, I have my hands full in Big Data-land. I wish
> >>>>> >> >>> >> I could be helping more with pandas, but there are quite a few
> >>>>> >> >>> >> fundamental issues (like data interoperability, nested data
> >>>>> >> >>> >> handling, and file format support, e.g. Parquet; see
> >>>>> >> >>> >>
> >>>>> >> >>> >> http://wesmckinney.com/blog/the-problem-with-the-data-science-language-wars/)
> >>>>> >> >>> >> preventing Python from being more useful in industry analytics
> >>>>> >> >>> >> applications.
> >>>>> >> >>> >>
> >>>>> >> >>> >> Aside: one of the bigger mistakes I made with pandas's API design
> >>>>> >> >>> >> was making it acceptable to call class constructors, like
> >>>>> >> >>> >> pandas.DataFrame, directly (versus factory functions). Sorry about
> >>>>> >> >>> >> that! If we could convince everyone to start writing
> >>>>> >> >>> >> pandas.data_frame or dataframe instead of using the class
> >>>>> >> >>> >> reference, it would help a lot with code cleanup. It's hard to
> >>>>> >> >>> >> plan for these things;
> NumPy > >>>>> >> >>> >> interoperability seemed a lot more important in 2008 than > it > >>>>> >> >>> >> does > >>>>> >> >>> >> now, > >>>>> >> >>> >> so I forgive myself. > >>>>> >> >>> >> > >>>>> >> >>> >> cheers and best wishes for 2016, > >>>>> >> >>> >> Wes > >>>>> >> >>> >> _______________________________________________ > >>>>> >> >>> >> Pandas-dev mailing list > >>>>> >> >>> >> Pandas-dev at python.org > >>>>> >> >>> >> https://mail.python.org/mailman/listinfo/pandas-dev > >>>>> >> >>> > > >>>>> >> >>> > > >>>>> >> >>> _______________________________________________ > >>>>> >> >>> Pandas-dev mailing list > >>>>> >> >>> Pandas-dev at python.org > >>>>> >> >>> https://mail.python.org/mailman/listinfo/pandas-dev > >>>>> >> _______________________________________________ > >>>>> >> Pandas-dev mailing list > >>>>> >> Pandas-dev at python.org > >>>>> >> https://mail.python.org/mailman/listinfo/pandas-dev > >>>>> > > >>>>> > > >>>>> > >>>>> > >>>>> _______________________________________________ > >>>>> Pandas-dev mailing list > >>>>> Pandas-dev at python.org > >>>>> https://mail.python.org/mailman/listinfo/pandas-dev > >>>>> > >>>> > >> > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wesmckinn at gmail.com Fri Jan 1 20:48:18 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Fri, 1 Jan 2016 17:48:18 -0800 Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap In-Reply-To: References: Message-ID: Thanks Jeff. Can you create and share a shared Drive folder containing this where I can put other auxiliary / follow up documents? On Fri, Jan 1, 2016 at 5:23 PM, Jeff Reback wrote: > I changed the doc so that the core dev people can edit. I *think* that > everyone should be able to view/comment though. > > On Fri, Jan 1, 2016 at 8:13 PM, Wes McKinney wrote: >> >> Jeff -- can you require log-in for editing on this document? >> >> https://docs.google.com/document/d/151ct8jcZWwh7XStptjbLsda6h2b3C0IuiH_hfZnUA58/edit# >> >> There are a number of anonymous edits. >> >> On Wed, Dec 30, 2015 at 6:04 PM, Wes McKinney wrote: >> > I cobbled together an ugly start of a c++->cython->pandas toolchain here >> > >> > https://github.com/wesm/pandas/tree/libpandas-native-core >> > >> > I used a mix of Kudu, Impala, and dynd-python cmake sources, so it's a >> > bit messy at the moment but it should be sufficient to run some real >> > experiments with a little more work. I reckon it's like a 6 month >> > project to tear out the insides of Series and DataFrame and replace it >> > with a new "native core", but we should be able to get enough info to >> > see whether it's a viable plan within a month or so. >> > >> > The end goal is to create "private" extension types in Cython that can >> > be the new base classes for Series and NDFrame; these will hold a >> > reference to a C++ object that contains wrappered NumPy arrays and >> > other metadata (like pandas-only dtypes). 
>> > >> > It might be too hard to try to replace a single usage of block manager >> > as a first experiment, so I'll try to create a minimal "SeriesLite" >> > that supports 3 dtypes >> > >> > 1) float64 with nans >> > 2) int64 with a bitmask for NAs >> > 3) category type for one of these >> > >> > Just want to get a feel for the extensibility and offer an NA >> > singleton Python object (a la None) for getting and setting NAs across >> > these 3 dtypes. >> > >> > If we end up going down this route, any way to place a moratorium on >> > invasive work on pandas internals (outside bug fixes)? >> > >> > Pedantic aside: I'd rather avoid shipping thirdparty C/C++ libraries >> > like googletest and friends in pandas if we can. Cloudera folks have >> > been working on a portable C++ library toolchain for Impala and other >> > projects at https://github.com/cloudera/native-toolchain, but it is >> > only being tested on Linux and OS X. Most google libraries should >> > build out of the box on MSVC but it'll be something to keep an eye on. >> > >> > BTW thanks to the libdynd developers for pioneering the c++ lib <-> >> > python-c++ lib <-> cython toolchain; being able to build Cython >> > extensions directly from cmake is a godsend >> > >> > HNY all >> > Wes >> > >> > On Tue, Dec 29, 2015 at 4:17 PM, Irwin Zaid wrote: >> >> Yeah, that seems reasonable and I totally agree a Pandas wrapper layer >> >> would >> >> be necessary. >> >> >> >> I'll keep an eye on this and I'd like to help if I can. >> >> >> >> Irwin >> >> >> >> >> >> On Tue, Dec 29, 2015 at 6:01 PM, Wes McKinney >> >> wrote: >> >>> >> >>> I'm not suggesting a rewrite of NumPy functionality but rather pandas >> >>> functionality that is currently written in a mishmash of Cython and >> >>> Python. >> >>> Happy to experiment with changing the internal compute infrastructure >> >>> and >> >>> data representation to DyND after this first stage of cleanup is done. >> >>> Even >> >>> if we use DyND a pretty extensive pandas wrapper layer will be >> >>> necessary. >> >>> >> >>> >> >>> On Tuesday, December 29, 2015, Irwin Zaid wrote: >> >>>> >> >>>> Hi Wes (and others), >> >>>> >> >>>> I've been following this conversation with interest. I do think it >> >>>> would >> >>>> be worth exploring DyND, rather than setting up yet another rewrite >> >>>> of >> >>>> NumPy-functionality. Especially because DyND is already an optional >> >>>> dependency of Pandas. >> >>>> >> >>>> For things like Integer NA and new dtypes, DyND is there and ready to >> >>>> do >> >>>> this. >> >>>> >> >>>> Irwin >> >>>> >> >>>> On Tue, Dec 29, 2015 at 5:18 PM, Wes McKinney >> >>>> wrote: >> >>>>> >> >>>>> Can you link to the PR you're talking about? >> >>>>> >> >>>>> I will see about spending a few hours setting up a libpandas.so as a >> >>>>> C++ >> >>>>> shared library where we can run some experiments and validate >> >>>>> whether it can >> >>>>> solve the integer-NA problem and be a place to put new data types >> >>>>> (categorical and friends). I'm +1 on targeting >> >>>>> >> >>>>> Would it also be worth making a wish list of APIs we might consider >> >>>>> breaking in a pandas 1.0 release that also features this new "native >> >>>>> core"? >> >>>>> Might as well right some wrongs while we're doing some invasive work >> >>>>> on the >> >>>>> internals; some breakage might be unavoidable. We can always >> >>>>> maintain a >> >>>>> pandas legacy 0.x.x maintenance branch (providing a conda binary >> >>>>> build) for >> >>>>> legacy users where showstopper bugs can get fixed. 
>> >>>>> On Tue, Dec 29, 2015 at 1:20 PM, Jeff Reback
>> >>>>> wrote:
>> >>>>> > Wes your last is noted as well. I *think* we can actually do this
>> >>>>> > now (well there is a PR out there).
>> >>>>> >
>> >>>>> > On Tue, Dec 29, 2015 at 4:12 PM, Wes McKinney
>> >>>>> > wrote:
>> >>>>> >>
>> >>>>> >> The other huge thing this will enable is copy-on-write for
>> >>>>> >> various kinds of views, which should cut down on some of the
>> >>>>> >> defensive copying in the library and reduce memory usage.
>> >>>>> >>
>> >>>>> >> On Tue, Dec 29, 2015 at 1:02 PM, Wes McKinney
>> >>>>> >> wrote:
>> >>>>> >> > Basically the approach is
>> >>>>> >> >
>> >>>>> >> > 1) Base dtype type
>> >>>>> >> > 2) Base array type with K >= 1 dimensions
>> >>>>> >> > 3) Base scalar type
>> >>>>> >> > 4) Base index type
>> >>>>> >> > 5) "Wrapper" subclasses for all NumPy types fitting into categories
>> >>>>> >> > #1, #2, #3, #4
>> >>>>> >> > 6) Subclasses for pandas-specific types like category, datetimeTZ, etc.
>> >>>>> >> > 7) NDFrame as cpcloud wrote is just a list of these
>> >>>>> >> >
>> >>>>> >> > Indexes and axis labels / column names can get layered on top.
>> >>>>> >> >
>> >>>>> >> > After we do all this we can look at adding nested types (arrays,
>> >>>>> >> > maps, structs) to better support JSON.
>> >>>>> >> >
>> >>>>> >> > - Wes
>> >>>>> >> >
>> >>>>> >> > On Tue, Dec 29, 2015 at 12:14 PM, Phillip Cloud
>> >>>>> >> > wrote:
>> >>>>> >> >> Maybe this is saying the same thing as Wes, but how far would
>> >>>>> >> >> something like this get us?
>> >>>>> >> >>
>> >>>>> >> >> // warning: things are probably not this simple
>> >>>>> >> >>
>> >>>>> >> >> struct data_array_t {
>> >>>>> >> >>   void *primitive;               // scalar data
>> >>>>> >> >>   data_array_t *nested;          // nested data
>> >>>>> >> >>   boost::dynamic_bitset isnull;  // might have to create our own to avoid boost
>> >>>>> >> >>   schema_t schema;               // not sure exactly what this looks like
>> >>>>> >> >> };
>> >>>>> >> >>
>> >>>>> >> >> typedef std::map<std::string, data_array_t> data_frame_t;  // probably not this simple
>> >>>>> >> >>
>> >>>>> >> >> To answer Jeff's use-case question: I think that the use cases are
>> >>>>> >> >> 1) freedom from numpy (mostly) 2) no more block manager which
>> >>>>> >> >> frees us from the limitations of the block memory layout. In
>> >>>>> >> >> particular, the ability to take advantage of memory-mapped IO
>> >>>>> >> >> would be a big win IMO.
>> >>>>> >> >>
>> >>>>> >> >>
>> >>>>> >> >> On Tue, Dec 29, 2015 at 2:50 PM Wes McKinney
>> >>>>> >> >> wrote:
>> >>>>> >> >>>
>> >>>>> >> >>> I will write a more detailed response to some of these things
>> >>>>> >> >>> after the new year, but, in particular, re: missing values, can
>> >>>>> >> >>> you or someone tell me why creating an object that contains a
>> >>>>> >> >>> NumPy array and a bitmap is not sufficient? If we can add a
>> >>>>> >> >>> lightweight C/C++ class layer between NumPy function calls (e.g.
arithmetic) and >> >>>>> >> >>> pandas >> >>>>> >> >>> function calls, then I see no reason why we cannot have >> >>>>> >> >>> >> >>>>> >> >>> Int32Array->add >> >>>>> >> >>> >> >>>>> >> >>> and >> >>>>> >> >>> >> >>>>> >> >>> Float32Array->add >> >>>>> >> >>> >> >>>>> >> >>> do the right thing (the former would be responsible for >> >>>>> >> >>> bitmasking to >> >>>>> >> >>> propagate NA values; the latter would defer to NumPy). If we >> >>>>> >> >>> can >> >>>>> >> >>> put >> >>>>> >> >>> all the internals of pandas objects inside a black box, we >> >>>>> >> >>> can >> >>>>> >> >>> add >> >>>>> >> >>> layers of virtual function indirection without a performance >> >>>>> >> >>> penalty >> >>>>> >> >>> (e.g. adding more interpreter overhead with more abstraction >> >>>>> >> >>> layers >> >>>>> >> >>> does add up to a perf penalty). >> >>>>> >> >>> >> >>>>> >> >>> I don't think this is too scary -- I would be willing to >> >>>>> >> >>> create a >> >>>>> >> >>> small POC C++ library to prototype something like what I'm >> >>>>> >> >>> talking >> >>>>> >> >>> about. >> >>>>> >> >>> >> >>>>> >> >>> Since pandas has limited points of contact with NumPy I don't >> >>>>> >> >>> think >> >>>>> >> >>> this would end up being too onerous. >> >>>>> >> >>> >> >>>>> >> >>> For the record, I'm pretty allergic to "advanced C++"; I >> >>>>> >> >>> think it >> >>>>> >> >>> is a >> >>>>> >> >>> useful tool if you pick a sane 20% subset of the C++11 spec >> >>>>> >> >>> and >> >>>>> >> >>> follow >> >>>>> >> >>> Google C++ style it's not very inaccessible to intermediate >> >>>>> >> >>> developers. More or less "C plus OOP and easier object >> >>>>> >> >>> lifetime >> >>>>> >> >>> management (shared/unique_ptr, etc.)". As soon as you add a >> >>>>> >> >>> lot >> >>>>> >> >>> of >> >>>>> >> >>> template metaprogramming C++ library development quickly >> >>>>> >> >>> becomes >> >>>>> >> >>> inaccessible except to the C++-Jedi. >> >>>>> >> >>> >> >>>>> >> >>> Maybe let's start a Google document on "pandas roadmap" where >> >>>>> >> >>> we >> >>>>> >> >>> can >> >>>>> >> >>> break down the 1-2 year goals and some of these >> >>>>> >> >>> infrastructure >> >>>>> >> >>> issues >> >>>>> >> >>> and have our discussion there? (obviously publish this >> >>>>> >> >>> someplace >> >>>>> >> >>> once >> >>>>> >> >>> we're done) >> >>>>> >> >>> >> >>>>> >> >>> - Wes >> >>>>> >> >>> >> >>>>> >> >>> On Fri, Dec 25, 2015 at 2:14 PM, Jeff Reback >> >>>>> >> >>> >> >>>>> >> >>> wrote: >> >>>>> >> >>> > Here are some of my thoughts about pandas Roadmap / status >> >>>>> >> >>> > and >> >>>>> >> >>> > some >> >>>>> >> >>> > responses to Wes's thoughts. >> >>>>> >> >>> > >> >>>>> >> >>> > In the last few (and upcoming) major releases we have been >> >>>>> >> >>> > made >> >>>>> >> >>> > the >> >>>>> >> >>> > following changes: >> >>>>> >> >>> > >> >>>>> >> >>> > - dtype enhancements (Categorical, Timedelta, Datetime >> >>>>> >> >>> > w/tz) & >> >>>>> >> >>> > making >> >>>>> >> >>> > these >> >>>>> >> >>> > first class objects >> >>>>> >> >>> > - code refactoring to remove subclassing of ndarrays for >> >>>>> >> >>> > Series >> >>>>> >> >>> > & >> >>>>> >> >>> > Index >> >>>>> >> >>> > - carving out / deprecating non-core parts of pandas >> >>>>> >> >>> > - datareader >> >>>>> >> >>> > - SparsePanel, WidePanel & other aliases (TImeSeries) >> >>>>> >> >>> > - rpy, rplot, irow et al. 
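To make the bitmask idea above concrete, here is a rough Python mock-up of the design being described (all names here -- IntArray, isnull -- are hypothetical and for illustration only; the real thing would be a C++ class behind a Cython wrapper):

    import numpy as np

    class IntArray:
        """Sketch: int64 values plus a validity mask (True = missing)."""
        def __init__(self, values, isnull=None):
            self.values = np.asarray(values, dtype=np.int64)
            if isnull is None:
                isnull = np.zeros(len(self.values), dtype=bool)
            self.isnull = isnull

        def add(self, other):
            # defer the arithmetic itself to NumPy, but OR the masks
            # together so NA propagates through the operation
            return IntArray(self.values + other.values,
                            self.isnull | other.isnull)

    a = IntArray([1, 2, 3], np.array([False, True, False]))
    b = IntArray([10, 20, 30])
    c = a.add(b)
    # c.values -> [11, 22, 33]; c.isnull -> [False, True, False]
    # i.e. integer NA without upcasting to float64

A float64 counterpart could skip the mask entirely and let NaN flow through NumPy unchanged, which is the "Float32Array->add defers to NumPy" case above.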
On Fri, Dec 25, 2015 at 2:14 PM, Jeff Reback wrote:

Here are some of my thoughts about the pandas Roadmap / status and some responses to Wes's thoughts.

In the last few (and upcoming) major releases we have made the following changes:

- dtype enhancements (Categorical, Timedelta, Datetime w/tz) & making these first class objects
- code refactoring to remove subclassing of ndarrays for Series & Index
- carving out / deprecating non-core parts of pandas
  - datareader
  - SparsePanel, WidePanel & other aliases (TimeSeries)
  - rpy, rplot, irow et al.
  - google-analytics
- API changes to make things more consistent
  - pd.rolling/expanding * -> .rolling/expanding (this is in master now)
  - .resample becoming a fully deferred operation, like groupby
  - multi-index slicing along any level (obviates need for .xs) and allows assignment
  - .loc/.iloc - for the most part obviates use of .ix (a short example follows after this list)
  - .pipe & .assign
  - plotting accessors
  - fixing of the sorting API
- many performance enhancements, both micro & macro (e.g. releasing the GIL)

Some on-deck enhancements (meaning these are basically ready to go in):

- IntervalIndex (and eventually make PeriodIndex just a sub-class of this)
- RangeIndex

So lots of changes, though nothing really earth-shaking -- just more convenience, reducing magicness somewhat, and providing flexibility.
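A short, hedged illustration of the .ix ambiguity that .loc/.iloc resolve (behavior as of the 0.17-era releases):

    import pandas as pd

    s = pd.Series(['a', 'b', 'c'], index=[2, 3, 5])

    s.loc[2]    # 'a' -- always label-based
    s.iloc[2]   # 'c' -- always position-based
    s.ix[2]     # 'a' here, because .ix falls back to labels for integer
                # indexes, but is position-based for non-integer indexes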
Of course we are getting an increasing number of issues, mostly bug reports (and lots of dupes), some edge-case enhancements which can add to the existing APIs, and of course requests to expand the (already) large code base to other use cases. Balancing this are a good many pull requests from many different users, some even deep into the internals.

Here are some things that I have talked about and could be considered for the roadmap. Disclaimer: I do work for Continuum, but these views are of course my own; furthermore, obviously I am a bit more familiar with some of the 'sponsored' open-source libraries, but always open to new things.

- integration / automatic deferral to numba for JIT (this would be thru .apply)
- automatic deferral to dask from groupby where appropriate / maybe a .to_parallel (to simply return a dask.DataFrame object)
- incorporation of quantities / units (as part of the dtype)
- use of DyND to allow missing values for int dtypes
- make Period a first class dtype
- provide some copy-on-write semantics to alleviate the chained-indexing issues which occasionally come up with the misuse of the indexing API
- allow a 'policy' to automatically provide column blocks for dict-like input (e.g. each column would be a block); this would allow a pass-thru API where you could put in numpy arrays where you have views and have them preserved rather than copied automatically. Note that this would also allow what I call 'split', where a passed-in multi-dim numpy array could be split up into individual blocks (which actually gives a nice perf boost after the splitting costs).

In working towards some of these goals, I have come to the opinion that it would make sense to have a neutral API protocol layer that would allow us to swap out different engines as needed, for particular dtypes, or *maybe* out-of-core type computations. E.g. imagine that we replaced the in-memory block structure with a bcolz / memmap type; in theory this should be 'easy' and just work. I could also see us adopting *some* of the SFrame code to allow easier interop with this API layer.

In practice, I think a nice API layer would need to be created to make this clean / nice.

So this comes around to Wes's point about creating a C++ library for the internals (and possibly even some of the indexing routines). In an ideal world, of course this would be desirable. Getting there is a bit non-trivial I think, and IMHO might not be worth the effort. I don't really see big performance bottlenecks. We *already* defer much of the computation to libraries like numexpr & bottleneck (where appropriate). Adding numba / dask to the list would be helpful.

I think that almost all performance issues are the result of:

a) gross misuse of the pandas API. How much code have you seen that does df.apply(lambda x: x.sum())? (a short example follows after this list)
b) routines which operate column-by-column rather than block-by-block and are in python space (e.g. we have an issue right now about .quantile)
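A concrete instance of the misuse in (a) -- both lines below compute the same per-column sums, but the first pushes every column through a Python-level function call while the second stays in compiled code:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.randn(100000, 4), columns=list('abcd'))

    df.apply(lambda x: x.sum())   # python-space loop over the columns
    df.sum()                      # the idiomatic, vectorized equivalent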
So I am glossing over a big goal of having a C++ library that represents the pandas internals. This would by definition have a C-API, so that you *could* use pandas-like semantics in C/C++ and just have it work (and then pandas would be a thin wrapper around this library).

I am not averse to this, but I think it would be quite a big effort, and not a huge perf boost IMHO. Further, there are a number of API issues w.r.t. indexing which need to be clarified / worked out (e.g. should we simply deprecate []?) that are much easier to test / figure out in python space.

I also think that we have quite a large number of contributors. Moving to C++ might make the internals a bit more impenetrable than the current internals (though this would allow C++ people to contribute, so that might balance out).

We have a limited core of devs who right now are familiar with things. If someone happened to have a starting base for a C++ library, then I might change opinions here.

my 4c.

Jeff

On Thu, Dec 24, 2015 at 7:18 PM, Wes McKinney wrote:

Deep thoughts during the holidays.

I might be out of line here, but the interpreter-heaviness of the inside of pandas objects is likely to be a long-term liability and source of performance problems and technical debt.

Has anyone put any thought into planning and beginning to execute on a rewrite that moves as much as possible of the internals into native / compiled code? I'm talking about:

- pandas/core/internals
- indexing and assignment
- much of pandas/core/common
- categorical and custom dtypes
- all indexing mechanisms

I'm concerned we've already exposed too much internals to users, so this might lead to a lot of API breakage, but it might be for the Greater Good.
As a first step, beginning a partial migration of internals into some C++ classes that encapsulate the insides of DataFrame objects and implement indexing and block-level manipulations would be a good place to start. I think you could do this without too much disruption.

As part of this internal retooling we might give consideration to alternative data structures for representing data internal to pandas objects. Now in 2015/2016, continuing to be hamstrung by NumPy's limitations feels somewhat anachronistic. User code is riddled with workarounds for data type fidelity issues and the like. Like, really, why not add a bitndarray (similar to ilanschnell/bitarray) for storing nullness for problematic types and hide this from the user? =)

Since we are now a NumFOCUS-sponsored project, I feel like we might consider establishing some formal governance over pandas and publishing roadmap documents describing plans for the project and meeting notes from committers. There's no real "committer culture" for NumFOCUS projects like there is with the Apache Software Foundation, but we might try leading by example!

Also, I believe pandas as a project has reached a level of importance where we ought to consider planning and executing larger-scale undertakings such as this for safeguarding the future.

As for myself, well, I have my hands full in Big Data-land. I wish I could be helping more with pandas, but there are quite a few fundamental issues (like data interoperability, nested data handling, and file format support -- e.g. Parquet, see http://wesmckinney.com/blog/the-problem-with-the-data-science-language-wars/) preventing Python from being more useful in industry analytics applications.

Aside: one of the bigger mistakes I made with pandas's API design was making it acceptable to call class constructors -- like pandas.DataFrame -- directly (versus factory functions). Sorry about that!
If we could convince everyone to start writing pandas.data_frame or dataframe instead of using the class reference, it would help a lot with code cleanup. It's hard to plan for these things -- NumPy interoperability seemed a lot more important in 2008 than it does now, so I forgive myself.

cheers and best wishes for 2016,
Wes

From jeffreback at gmail.com Fri Jan 1 21:06:35 2016
From: jeffreback at gmail.com (Jeff Reback)
Date: Fri, 1 Jan 2016 21:06:35 -0500
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: References: Message-ID:

ok I moved the document to the Pandas folder, where the same group should be able to edit/upload/etc.

lmk if any issues

On Fri, Jan 1, 2016 at 8:48 PM, Wes McKinney wrote:
> Thanks Jeff. Can you create and share a shared Drive folder containing
> this where I can put other auxiliary / follow up documents?
>
> On Fri, Jan 1, 2016 at 5:23 PM, Jeff Reback wrote:
> > I changed the doc so that the core dev people can edit. I *think* that
> > everyone should be able to view/comment though.
From wesmckinn at gmail.com Sun Jan 3 14:41:17 2016
From: wesmckinn at gmail.com (Wes McKinney)
Date: Sun, 3 Jan 2016 11:41:17 -0800
Subject: [Pandas-dev] pandas 0.18.x and pandas 1.0 roadmap
Message-ID:

Per discussions we've been having here

https://docs.google.com/document/d/151ct8jcZWwh7XStptjbLsda6h2b3C0IuiH_hfZnUA58/edit?ts=568725eb#heading=h.qm48l6dargmd

I started this document to solicit a high-level plan for the last 0.x release, and a place where we can develop a plan for what will become pandas 1.0:

https://docs.google.com/document/d/1K3uVluD9qNn9nLp6oRjIwP7qillysw820wfulJY3BiU/edit#

Let me know what you think of this idea -- I'll have more bandwidth this year to be involved, and I'm starting to look at what a 2nd edition of Python for Data Analysis should look like.

Relatedly: I'm assembling enough basic plumbing so that I can give you all a demo of how the libpandas C/C++ native core will help us better hide implementation details and fix problems like integer/boolean missing data in a clean and extensible way. It will also help establish a pattern for adding new data types to pandas (which may or may not rely on NumPy). I'll follow up about it when I get a bit more stuff working; it will probably take me a few more days at least.

thanks!
Wes

From jorisvandenbossche at gmail.com Mon Jan 4 18:30:38 2016
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Tue, 5 Jan 2016 00:30:38 +0100
Subject: [Pandas-dev] pandas 0.18.x and pandas 1.0 roadmap
In-Reply-To: References: Message-ID:

Hi all,

Interesting discussions!

My expertise as a pandas contributor is not really in the core part, so I cannot really comment on that. But for me, as we think of a pandas 1.0, a possible clean-up of the existing user-facing API is an important aspect to discuss, I think (regardless of a clean-up and rewrite of the internals, as this should not affect too much of the existing API, apart from new features) -- in the light of how to keep (or improve on) pandas easy to learn, clear to understand, consistent, and yet powerful.
There are some points listed in the Pandas Development Roadmap under 'pandas 1.0', coming from https://github.com/pydata/pandas/issues/10000, but possibly other points as well.

Probably the most prominent example is the indexing API, and specifically [] / __getitem__. Some time ago I made an overview of some of the warts that have grown over time: https://github.com/pydata/pandas/issues/9595. I don't say we have to change something about this (because it will break a lot of existing code), but we should at least discuss it a bit more thoroughly, I think.
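As a small illustration of how overloaded [] currently is (all of this is current, documented behavior):

    import pandas as pd

    df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

    df['a']          # a string selects a column
    df[['a', 'b']]   # a list of strings selects several columns
    df[0:2]          # but a slice selects *rows*
    df[df['a'] > 1]  # and a boolean Series filters rows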
As for the timeline, I like the idea of limiting the number of releases for the 0.x line. Maybe we will want to do a 0.19.x as well (e.g. to introduce some features to ease the transition to 1.0), or depending on how long it takes to shape up 1.0, but that is something that can be discussed later if it comes up, I think.

Regards,
Joris

From jeffreback at gmail.com Mon Jan 4 19:36:45 2016
From: jeffreback at gmail.com (Jeff Reback)
Date: Mon, 4 Jan 2016 19:36:45 -0500
Subject: [Pandas-dev] pandas 0.18.x and pandas 1.0 roadmap
In-Reply-To: References: Message-ID:

I agree with Joris on the schedule a bit. We have been putting out majors every 3-4 months and then a minor, so I would expect 0.18.0 in, say, February, then 0.18.1 in March. Could see 0.19.0 in the summer, then 1.0 in the fall (and we can use 0.19.x to road-test some things).

I also believe any internals changes can be achieved with limited compat breaks. I don't think anyone is proposing a big break / incompatible 1.0, which IMHO would just cause fragmentation and generally not be a good thing.

Certainly we can make major changes, but we have been pretty liberal about deprecations (though not so about removing prior deprecations), so this would also be a good time for that.

my 3c

Jeff

From wesmckinn at gmail.com Mon Jan 4 20:31:58 2016
From: wesmckinn at gmail.com (Wes McKinney)
Date: Mon, 4 Jan 2016 17:31:58 -0800
Subject: [Pandas-dev] pandas 0.18.x and pandas 1.0 roadmap
In-Reply-To: References: Message-ID:

This all makes sense. I guess there are two major areas for pandas 1.0:

- User API cleanup
- Internal cleanup

In both cases, we'll want to make sure we can maintain a pandas-1.0 branch that is regularly rebased on master in a way that is not too painful to keep up.

How about, to keep ourselves sane, we make separate roadmaps for the user API and the internals, and we can loudly mark places where there is crossover (for example: data type improvements that are user-visible, or changes in data copying semantics / copy-on-write).

As Jeff said, if we're doing it right, then the internals revamp shouldn't affect the user API work all that much. Since the idea is that it would fix various "warts" (like reindexing integers or booleans causing upcasts to occur), we'll want to collect all the affected test cases in one place, partly as a record of which APIs are effectively broken (e.g. I'm sure some users have a lot of code that assumes that reindexing an integer series results in floating-point output).
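For instance, the integer-reindexing wart just mentioned, as it behaves today:

    import pandas as pd

    s = pd.Series([1, 2, 3])     # dtype: int64
    s2 = s.reindex([0, 1, 4])    # label 4 does not exist, so a hole appears
    s2.dtype                     # float64 -- upcast so the hole can be NaN
    s2.values                    # array([ 1.,  2., nan])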
How about, to keep ourselves sane, we make separate roadmaps for the user API and the internals, and we can loudly mark places where there is crossover (for example: data type improvements that are user visible, or changes in data copying semantics / copy-on-write). As Jeff said, if we're doing it right, then the internals revamp shouldn't affect the user API work all that much. Since the idea is that it would fix various "warts" (like reindexing integers or booleans causing upcasts to occur), we'll want to collect all the affected test cases in one place partly as a record of what APIs are effectively broken (e.g. I'm sure some users have a lot of code that assumes that reindexing an integer series results in floating point output). Within the next couple weeks I'll try to make a compelling case for decommissioning the current BlockManager internals of Series and DataFrame in favor of much simpler Array and Table data structures implemented as C++ classes (with Cython wrappers, where Python glue and conveniences can live). A major part of this is inserting a "wrapper layer" in between NumPy and pandas that makes pandas less dependent on NumPy-specific implementation details. While this might seem scary, we already have an extensive NumPy wrapper layer between pandas.core.common and pandas.core.internals. So functions like common._maybe_promote will go away. This will also be a good time to review and clean up a lot of the existing Cython code. It will be really nice for Series and DataFrame to have a C API -- at some point we can figure out how to enable outside projects to access the C API. I presume Jeff and I will take responsibility for the internals overhaul -- anyone else been hacking around in there want to get down in the trenches? Joris, do you want to take point on the user API roadmapping / planning? cheers, Wes On Mon, Jan 4, 2016 at 4:36 PM, Jeff Reback wrote: > I agree with joris on schedule a bit. We have been putting out majors every > 3-4 months and then a minor. So I would expect 0.18.0 say in februrary, then > 0.18.1 march. Could see 0.19.0 in the summer, Then 1.0 in the fall (and can > use 0.19. to road test some things). > > I also believe any internals changes can be achieved with limited compat > breaks. I don't think anyone is proposing a big-break / incompat for 1.0, > which IMHO would > just cause fragmentation and generally not be a good thing. > > Certainly we can make major changes, but we have been pretty liberal about > deprecations (though not so about removing prior deprecations). So this > would also be a good time for that. > > my 3c > > Jeff > > On Mon, Jan 4, 2016 at 6:30 PM, Joris Van den Bossche > wrote: >> >> Hi all, >> >> Interesting discussions! >> >> My expertise as pandas contributor is not really in the core part, so I >> cannot really comment on that. But for me, as we think of a pandas 1.0, a >> possible clean-up of the existing user facing API is an important aspect to >> discuss I think (regardless of a clean-up and rewrite of the internals, as >> this should not affect too much of the existing API? (apart from new >> features)). >> In the light of how to keep (or improve on) pandas easy to learn, clear to >> understand, consistent and yet powerful. >> >> There are some points listed in the Pandas Development Roadmap under >> 'pandas 1.0', coming from https://github.com/pydata/pandas/issues/10000, but >> possibly other points as well. >> >> Probably the most prominent example is the indexing API, and specifically >> [] / __getitem__.
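(To make the Array / Table idea above a little more concrete: here is a minimal, purely illustrative Python sketch of a block-manager-free data model. The names PandasArray and Table are hypothetical stand-ins, not the actual design -- the real core would be C++ classes behind Cython wrappers.)

    import numpy as np
    from collections import OrderedDict

    class PandasArray(object):
        # hypothetical 1-D typed array: a NumPy array plus room for
        # pandas-specific metadata (type objects, null masks, ...)
        def __init__(self, values):
            self.values = np.asarray(values)
            if self.values.ndim != 1:
                raise ValueError("PandasArray is strictly one-dimensional")

        def __len__(self):
            return len(self.values)

    class Table(object):
        # hypothetical 2-D container: an ordered mapping of named 1-D
        # arrays, with no 2-D blocks and no dtype-based consolidation
        def __init__(self, columns):
            self.columns = OrderedDict(columns)

        def insert(self, name, arr):
            # O(1) column addition: no consolidation, no copying of the
            # other columns
            self.columns[name] = arr

        def drop(self, name):
            # O(1) column removal, again without copying anything
            del self.columns[name]

        def reorder(self, names):
            # reordering only permutes references, never data
            self.columns = OrderedDict((n, self.columns[n]) for n in names)

    t = Table([('a', PandasArray([1, 2, 3])),
               ('b', PandasArray([4.0, 5.0, 6.0]))])
    t.insert('c', PandasArray(['x', 'y', 'z']))
    t.drop('a')
    t.reorder(['c', 'b'])
    print(list(t.columns))  # ['c', 'b']

(The only point of the sketch is that column-level operations never touch the values of other columns, which is where much of the defensive copying in the current internals comes from.)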
Some time ago I made an overview of some of its warts that >> have grown over time: https://github.com/pydata/pandas/issues/9595 >> I don't say we have to change something about this (because it will break >> a lot of existing code), but we should at least discuss it a bit more >> thoroughly I think. >> >> >> As for the timeline, I like the idea of limiting the number of releases >> for the 0.x line. Maybe we will like to do a 0.19.x as well (eg to introduce >> some features to improve the transition to 1.0), or depending on how long it >> takes to shape up 1.0, but that is something that can be discussed later if >> that comes up I think. >> >> Regards, >> Joris >> >> >> 2016-01-03 20:41 GMT+01:00 Wes McKinney : >>> >>> Per discussions we've been having here >>> >>> >>> https://docs.google.com/document/d/151ct8jcZWwh7XStptjbLsda6h2b3C0IuiH_hfZnUA58/edit?ts=568725eb#heading=h.qm48l6dargmd >>> >>> I started this document to solicit a high level plan for the last 0.x >>> release and where we can develop a plan for what will become pandas >>> 1.0 >>> >>> >>> https://docs.google.com/document/d/1K3uVluD9qNn9nLp6oRjIwP7qillysw820wfulJY3BiU/edit# >>> >>> Let me know what you think of this idea -- I'll have more bandwidth >>> this year to be involved and I'm starting to look at what a 2nd ed of >>> Python for Data Analysis should look like. >>> >>> Relatedly: I'm assembling enough basic plumbing so that I can give you >>> all a demo of how the libpandas / C/C++ native core will help us >>> better hide implementation details and fix problems like >>> integer/boolean missing data in a clean and extensible way. It will >>> also help establish a pattern for adding new data types to pandas >>> (which may or may not rely on NumPy). I'll follow up about it when I >>> get a bit more stuff working; probably take me a few more days at >>> least. >>> >>> thanks! >>> Wes >>> _______________________________________________ >>> Pandas-dev mailing list >>> Pandas-dev at python.org >>> https://mail.python.org/mailman/listinfo/pandas-dev >> >> >> >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> >
From jeffreback at gmail.com Mon Jan 4 21:21:34 2016 From: jeffreback at gmail.com (Jeff Reback) Date: Mon, 4 Jan 2016 21:21:34 -0500 Subject: [Pandas-dev] GitHub/pandas Message-ID: any thoughts on claiming the pandas org in GitHub? (it's an inactive username so I think we could claim it) iow have the main repo be: pandas/pandas. could make sense for spinoffs, e.g. pandas-datareader, as well. xarray just moved to: PyData/xarray (so somewhat unified now). PyData isn't really used by anything other than pandas (except numexpr) and a number of older / much less active repos. the con on this is that pandas has existed for quite a long time and is known well as PyData/pandas; furthermore I don't think pandas.org is available. pro is that the future is much longer than the past! (same rationale as in making API breaks!) Jeff I can be reached on my cell 917-971-6387
From wesmckinn at gmail.com Mon Jan 4 21:25:51 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Mon, 4 Jan 2016 18:25:51 -0800 Subject: [Pandas-dev] GitHub/pandas In-Reply-To: References: Message-ID: I actually just contacted GitHub about this today. It's not inactive but I'm going to write a plea to the owner to see if they will transfer it to us. I'll let you know.
On Mon, Jan 4, 2016 at 6:21 PM, Jeff Reback wrote: > any thoughts on claiming the > > pandas org in GitHub (it's an inactive username so I think we could claim it) > > iow have the main repo be: pandas/pandas > > could make sense for spinoffs > eg pandas-datareader as well > > xarray just moved to: PyData/xarray > (so somewhat unified now) > > PyData isn't really used by others that pandas (except numexpr) and a number of older / much less active repos > > the con on this is that pandas has existed for quite a long time and is known well as PyData/pandas. furthermore I don't think pandas.org is available > > pro is that the future is much longer than the past! (same rationale as in making API breaks!) > > Jeff > > > > I can be reached on my cell 917-971-6387 > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev
From wesmckinn at gmail.com Tue Jan 5 13:15:51 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Tue, 5 Jan 2016 10:15:51 -0800 Subject: [Pandas-dev] pandas governance Message-ID: hi folks, I'm sorry I didn't do this 2 or 3 years ago when I first handed over release management responsibilities to Jeff, y-p and others, but it would be good for us to formalize the project governance like most other major open source projects. See IPython / Jupyter for an example set of governance documents https://github.com/jupyter/governance I don't have particular concerns over the project's direction and decision making procedure, but as I've had several people raise private concerns with me over the last few years, I think it would be good for the community to have a set of public documents on GitHub that lists people and process in simple terms. This is especially important now that we can receive financial sponsorship through NumFOCUS, so that sponsored contributions are subject to the same community process as volunteer contributions. A basic summary of how we've been informally operating is: Project committers (as will be defined and listed in the governance documents) make decisions based on consensus; in the absence of consensus (which has rarely occurred) I will reserve tie-breaking / BDFL privileges. I don't recall ever having to put on the BDFL hat but it's the "just in case" should we reach some impasse down the road. I can take a crack at assembling something based on the IPython governance docs if that sounds good. At the end of the day, an OSS project is only as strong as the individuals committing code and reviewing patches. As pandas will be 8 years old in April, with 6 years as open source, I think we have a good track record of consensus-, common-sense-, and fact/evidence-driven decision making. best, Wes
From jorisvandenbossche at gmail.com Tue Jan 5 19:29:23 2016 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Wed, 6 Jan 2016 01:29:23 +0100 Subject: [Pandas-dev] pandas governance In-Reply-To: References: Message-ID: Sounds very good! Certainly now we are a NumFOCUS supported project (and have to deal with financial things), I think this is important to do. 2016-01-05 19:15 GMT+01:00 Wes McKinney : > hi folks, > > I'm sorry I didn't do this 2 or 3 years ago when I first handed over > release management responsibilities to Jeff, y-p and others, but it > would be good for us to formalize the project governance like most > other major open source projects.
See IPython / Jupyter for an example > set of governance documents > > https://github.com/jupyter/governance > > NumPy also recently adopted a governance document, based on the Jupyter one: http://docs.scipy.org/doc/numpy-dev/dev/governance/governance.html and https://github.com/numpy/numpy/pull/6352. Maybe also worth a look (although I don't know exactly what they changed from the Jupyter one). > I don't have particular concerns over the project's direction and > decision making procedure, but as I've had several people raise > private concerns with me over the last few years, I think it would be > good for the community to have a set of public documents on GitHub > that lists people and process in simple terms. This is especially > important now that we can receive financial sponsorship through > NumFOCUS, so that sponsored contributions are subject to the same > community process as volunteer contributions. > > A basic summary of how we've been informally operating is: Project > committers (as will be defined and listed in the governance documents) > make decisions based on consensus; in the absence of consensus (which > has rarely occurred) I will reserve tie-breaking / BDFL privileges. I > don't recall having ever having to put on the BDFL hat but it's the > "just in case" should we reach some impasse down the road. > > Sounds good! > I can take a crack at assembling something based on the IPython > governance docs if that sounds good. > > At the end of the day, an OSS project is only as strong as the > individuals committing code and reviewing patches. As pandas will be 8 > years old in April, with 6 years as open source, I think we have a > good track record of consensus-, common-sense-, and > fact/evidence-driven decision making. > > best, > Wes > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev >
From jeffreback at gmail.com Wed Jan 6 08:50:49 2016 From: jeffreback at gmail.com (Jeff Reback) Date: Wed, 6 Jan 2016 08:50:49 -0500 Subject: [Pandas-dev] pandas governance In-Reply-To: References: Message-ID: yes on board with this as well. We do have a fiscal governance document w.r.t. NumFOCUS. That should at the very least be referenced by the governance docs. Certainly starting with the Jupyter docs is a good thing. I don't think we will have the long-long-long discussion that numpy had about the steering committee representation :) Jeff On Tue, Jan 5, 2016 at 7:29 PM, Joris Van den Bossche < jorisvandenbossche at gmail.com> wrote: > Sounds very good! > > Certainly now we are a NumFOCUS supported project (and have to deal with > financial things), I think this is important to do. > > 2016-01-05 19:15 GMT+01:00 Wes McKinney : > >> hi folks, >> >> I'm sorry I didn't do this 2 or 3 years ago when I first handed over >> release management responsibilities to Jeff, y-p and others, but it >> would be good for us to formalize the project governance like most >> other major open source projects. See IPython / Jupyter for an example >> set of governance documents >> >> https://github.com/jupyter/governance >> >> Numpy also recently adopted a goverance document, based on the Jupyter >> one: http://docs.scipy.org/doc/numpy-dev/dev/governance/governance.html >> and https://github.com/numpy/numpy/pull/6352. >> Maybe also worth a look (although I don't know what they exactly changed >> from the Jupyter one).
> > >> I don't have particular concerns over the project's direction and >> decision making procedure, but as I've had several people raise >> private concerns with me over the last few years, I think it would be >> good for the community to have a set of public documents on GitHub >> that lists people and process in simple terms. This is especially >> important now that we can receive financial sponsorship through >> NumFOCUS, so that sponsored contributions are subject to the same >> community process as volunteer contributions. >> >> A basic summary of how we've been informally operating is: Project >> committers (as will be defined and listed in the governance documents) >> make decisions based on consensus; in the absence of consensus (which >> has rarely occurred) I will reserve tie-breaking / BDFL privileges. I >> don't recall having ever having to put on the BDFL hat but it's the >> "just in case" should we reach some impasse down the road. >> >> Sounds good! >> >> >> I can take a crack at assembling something based on the IPython >> governance docs if that sounds good. >> >> At the end of the day, an OSS project is only as strong as the >> individuals committing code and reviewing patches. As pandas will be 8 >> years old in April, with 6 years as open source, I think we have a >> good track record of consensus-, common-sense-, and >> fact/evidence-driven decision making. >> >> best, >> Wes >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > >
From shoyer at gmail.com Wed Jan 6 13:11:55 2016 From: shoyer at gmail.com (Stephan Hoyer) Date: Wed, 6 Jan 2016 10:11:55 -0800 Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap In-Reply-To: References: Message-ID: I was asked about this off list, so I'll belatedly share my thoughts. First of all, I am really excited by Wes's renewed engagement in the project and his interest in rewriting pandas internals. This is quite an ambitious plan and nobody is better positioned to tackle it than Wes. I have mixed feelings about the details of the rewrite itself. +1 on the simpler internal data model. The block manager is confusing and leads to hard-to-predict performance issues related to copying data. If we can do all column additions/removals/re-orderings without a copy it will be a clear win. +0 on moving internals to C++. I do like the performance benefits, but it seems like a lot of work, and it may make pandas less friendly to new contributors. -0 on writing a brand new dtype system just for pandas -- this stuff really belongs in NumPy (or another array library like DyND), and I am skeptical that pandas can do a complete enough job to be useful without replicating all that functionality. More broadly, I am concerned that this rewrite may improve the tabular computation ecosystem at the cost of inter-operability with the array-based ecosystem (numpy, scipy, sklearn, xarray, etc.). The latter has been one of the strengths of pandas and it would be a shame to see that go away. We're already starting to struggle with inter-operability with the new pandas dtypes and a further rewrite would make this even harder.
For example, see categoricals and scikit-learn in Tom's recent post [1], or the fact that .values no longer always returns a numpy array. This has also been a challenge for xarray, which can't handle these new dtypes because we lack a suitable array backend for them. Personally, I would much rather leverage a full featured library like an improved NumPy or DyND for new dtypes, because that could also be used by the array-based ecosystem. At the very least, it would be good to think about zero-copy inter-operability with array-based tools. On the other hand, I wonder if maybe it would be better to write a native in-memory backend for Ibis instead of rewriting pandas. Ibis does seem to have improved/simplified API which resolves many of pandas's warts. That said, it's a pretty big change from the "DataFrame as matrix" model, and pandas won't be going away anytime soon. I do like that it would force users to be more explicit about converting between tables and arrays, which might also make distinctions between the tabular and array oriented ecosystems easier to swallow. Just my two cents, from someone who has lots of opinions but who will likely stay on the sidelines for most of this work. Cheers, Stephan [1] http://tomaugspurger.github.io/categorical-pipelines.html On Fri, Jan 1, 2016 at 6:06 PM, Jeff Reback wrote: > ok I moved the document to the Pandas folder, where the same group should > be able to edit/upload/etc. lmk if any issues > > On Fri, Jan 1, 2016 at 8:48 PM, Wes McKinney wrote: > >> Thanks Jeff. Can you create and share a shared Drive folder containing >> this where I can put other auxiliary / follow up documents? >> >> On Fri, Jan 1, 2016 at 5:23 PM, Jeff Reback wrote: >> > I changed the doc so that the core dev people can edit. I *think* that >> > everyone should be able to view/comment though. >> > >> > On Fri, Jan 1, 2016 at 8:13 PM, Wes McKinney >> wrote: >> >> >> >> Jeff -- can you require log-in for editing on this document? >> >> >> >> >> https://docs.google.com/document/d/151ct8jcZWwh7XStptjbLsda6h2b3C0IuiH_hfZnUA58/edit# >> >> >> >> There are a number of anonymous edits. >> >> >> >> On Wed, Dec 30, 2015 at 6:04 PM, Wes McKinney >> wrote: >> >> > I cobbled together an ugly start of a c++->cython->pandas toolchain >> here >> >> > >> >> > https://github.com/wesm/pandas/tree/libpandas-native-core >> >> > >> >> > I used a mix of Kudu, Impala, and dynd-python cmake sources, so it's >> a >> >> > bit messy at the moment but it should be sufficient to run some real >> >> > experiments with a little more work. I reckon it's like a 6 month >> >> > project to tear out the insides of Series and DataFrame and replace >> it >> >> > with a new "native core", but we should be able to get enough info to >> >> > see whether it's a viable plan within a month or so. >> >> > >> >> > The end goal is to create "private" extension types in Cython that >> can >> >> > be the new base classes for Series and NDFrame; these will hold a >> >> > reference to a C++ object that contains wrappered NumPy arrays and >> >> > other metadata (like pandas-only dtypes). 
>> >> > >> >> > It might be too hard to try to replace a single usage of block >> manager >> >> > as a first experiment, so I'll try to create a minimal "SeriesLite" >> >> > that supports 3 dtypes >> >> > >> >> > 1) float64 with nans >> >> > 2) int64 with a bitmask for NAs >> >> > 3) category type for one of these >> >> > >> >> > Just want to get a feel for the extensibility and offer an NA >> >> > singleton Python object (a la None) for getting and setting NAs >> across >> >> > these 3 dtypes. >> >> > >> >> > If we end up going down this route, any way to place a moratorium on >> >> > invasive work on pandas internals (outside bug fixes)? >> >> > >> >> > Pedantic aside: I'd rather avoid shipping thirdparty C/C++ libraries >> >> > like googletest and friends in pandas if we can. Cloudera folks have >> >> > been working on a portable C++ library toolchain for Impala and other >> >> > projects at https://github.com/cloudera/native-toolchain, but it is >> >> > only being tested on Linux and OS X. Most google libraries should >> >> > build out of the box on MSVC but it'll be something to keep an eye >> on. >> >> > >> >> > BTW thanks to the libdynd developers for pioneering the c++ lib <-> >> >> > python-c++ lib <-> cython toolchain; being able to build Cython >> >> > extensions directly from cmake is a godsend >> >> > >> >> > HNY all >> >> > Wes >> >> > >> >> > On Tue, Dec 29, 2015 at 4:17 PM, Irwin Zaid >> wrote: >> >> >> Yeah, that seems reasonable and I totally agree a Pandas wrapper >> layer >> >> >> would >> >> >> be necessary. >> >> >> >> >> >> I'll keep an eye on this and I'd like to help if I can. >> >> >> >> >> >> Irwin >> >> >> >> >> >> >> >> >> On Tue, Dec 29, 2015 at 6:01 PM, Wes McKinney >> >> >> wrote: >> >> >>> >> >> >>> I'm not suggesting a rewrite of NumPy functionality but rather >> pandas >> >> >>> functionality that is currently written in a mishmash of Cython and >> >> >>> Python. >> >> >>> Happy to experiment with changing the internal compute >> infrastructure >> >> >>> and >> >> >>> data representation to DyND after this first stage of cleanup is >> done. >> >> >>> Even >> >> >>> if we use DyND a pretty extensive pandas wrapper layer will be >> >> >>> necessary. >> >> >>> >> >> >>> >> >> >>> On Tuesday, December 29, 2015, Irwin Zaid >> wrote: >> >> >>>> >> >> >>>> Hi Wes (and others), >> >> >>>> >> >> >>>> I've been following this conversation with interest. I do think it >> >> >>>> would >> >> >>>> be worth exploring DyND, rather than setting up yet another >> rewrite >> >> >>>> of >> >> >>>> NumPy-functionality. Especially because DyND is already an >> optional >> >> >>>> dependency of Pandas. >> >> >>>> >> >> >>>> For things like Integer NA and new dtypes, DyND is there and >> ready to >> >> >>>> do >> >> >>>> this. >> >> >>>> >> >> >>>> Irwin >> >> >>>> >> >> >>>> On Tue, Dec 29, 2015 at 5:18 PM, Wes McKinney < >> wesmckinn at gmail.com> >> >> >>>> wrote: >> >> >>>>> >> >> >>>>> Can you link to the PR you're talking about? >> >> >>>>> >> >> >>>>> I will see about spending a few hours setting up a libpandas.so >> as a >> >> >>>>> C++ >> >> >>>>> shared library where we can run some experiments and validate >> >> >>>>> whether it can >> >> >>>>> solve the integer-NA problem and be a place to put new data types >> >> >>>>> (categorical and friends). I'm +1 on targeting >> >> >>>>> >> >> >>>>> Would it also be worth making a wish list of APIs we might >> consider >> >> >>>>> breaking in a pandas 1.0 release that also features this new >> "native >> >> >>>>> core"? 
>> >> >>>>> Might as well right some wrongs while we're doing some invasive >> work >> >> >>>>> on the >> >> >>>>> internals; some breakage might be unavoidable. We can always >> >> >>>>> maintain a >> >> >>>>> pandas legacy 0.x.x maintenance branch (providing a conda binary >> >> >>>>> build) for >> >> >>>>> legacy users where showstopper bugs can get fixed. >> >> >>>>> >> >> >>>>> On Tue, Dec 29, 2015 at 1:20 PM, Jeff Reback < >> jeffreback at gmail.com> >> >> >>>>> wrote: >> >> >>>>> > Wes your last is noted as well. I *think* we can actually do >> this >> >> >>>>> > now >> >> >>>>> > (well >> >> >>>>> > there is a PR out there). >> >> >>>>> > >> >> >>>>> > On Tue, Dec 29, 2015 at 4:12 PM, Wes McKinney >> >> >>>>> > >> >> >>>>> > wrote: >> >> >>>>> >> >> >> >>>>> >> The other huge thing this will enable is to do is >> copy-on-write >> >> >>>>> >> for >> >> >>>>> >> various kinds of views, which should cut down on some of the >> >> >>>>> >> defensive >> >> >>>>> >> copying in the library and reduce memory usage. >> >> >>>>> >> >> >> >>>>> >> On Tue, Dec 29, 2015 at 1:02 PM, Wes McKinney >> >> >>>>> >> >> >> >>>>> >> wrote: >> >> >>>>> >> > Basically the approach is >> >> >>>>> >> > >> >> >>>>> >> > 1) Base dtype type >> >> >>>>> >> > 2) Base array type with K >= 1 dimensions >> >> >>>>> >> > 3) Base scalar type >> >> >>>>> >> > 4) Base index type >> >> >>>>> >> > 5) "Wrapper" subclasses for all NumPy types fitting into >> >> >>>>> >> > categories >> >> >>>>> >> > #1, #2, #3, #4 >> >> >>>>> >> > 6) Subclasses for pandas-specific types like category, >> >> >>>>> >> > datetimeTZ, >> >> >>>>> >> > etc. >> >> >>>>> >> > 7) NDFrame as cpcloud wrote is just a list of these >> >> >>>>> >> > >> >> >>>>> >> > Indexes and axis labels / column names can get layered on >> top. >> >> >>>>> >> > >> >> >>>>> >> > After we do all this we can look at adding nested types >> >> >>>>> >> > (arrays, >> >> >>>>> >> > maps, >> >> >>>>> >> > structs) to better support JSON. >> >> >>>>> >> > >> >> >>>>> >> > - Wes >> >> >>>>> >> > >> >> >>>>> >> > On Tue, Dec 29, 2015 at 12:14 PM, Phillip Cloud >> >> >>>>> >> > >> >> >>>>> >> > wrote: >> >> >>>>> >> >> Maybe this is saying the same thing as Wes, but how far >> would >> >> >>>>> >> >> something >> >> >>>>> >> >> like >> >> >>>>> >> >> this get us? >> >> >>>>> >> >> >> >> >>>>> >> >> // warning: things are probably not this simple >> >> >>>>> >> >> >> >> >>>>> >> >> struct data_array_t { >> >> >>>>> >> >> void *primitive; // scalar data >> >> >>>>> >> >> data_array_t *nested; // nested data >> >> >>>>> >> >> boost::dynamic_bitset isnull; // might have to create >> our >> >> >>>>> >> >> own >> >> >>>>> >> >> to >> >> >>>>> >> >> avoid >> >> >>>>> >> >> boost >> >> >>>>> >> >> schema_t schema; // not sure exactly what this looks >> like >> >> >>>>> >> >> }; >> >> >>>>> >> >> >> >> >>>>> >> >> typedef std::map data_frame_t; // >> >> >>>>> >> >> probably >> >> >>>>> >> >> not >> >> >>>>> >> >> this >> >> >>>>> >> >> simple >> >> >>>>> >> >> >> >> >>>>> >> >> To answer Jeff?s use-case question: I think that the use >> cases >> >> >>>>> >> >> are >> >> >>>>> >> >> 1) >> >> >>>>> >> >> freedom from numpy (mostly) 2) no more block manager which >> >> >>>>> >> >> frees >> >> >>>>> >> >> us >> >> >>>>> >> >> from the >> >> >>>>> >> >> limitations of the block memory layout. In particular, the >> >> >>>>> >> >> ability >> >> >>>>> >> >> to >> >> >>>>> >> >> take >> >> >>>>> >> >> advantage of memory mapped IO would be a big win IMO. 
>> >> >>>>> >> >> >> >> >>>>> >> >> >> >> >>>>> >> >> On Tue, Dec 29, 2015 at 2:50 PM Wes McKinney >> >> >>>>> >> >> >> >> >>>>> >> >> wrote: >> >> >>>>> >> >>> >> >> >>>>> >> >>> I will write a more detailed response to some of these >> things >> >> >>>>> >> >>> after >> >> >>>>> >> >>> the new year, but, in particular, re: missing values, can >> you >> >> >>>>> >> >>> or >> >> >>>>> >> >>> someone tell me why creating an object that contains a >> NumPy >> >> >>>>> >> >>> array and >> >> >>>>> >> >>> a bitmap is not sufficient? If we we can add a lightweight >> >> >>>>> >> >>> C/C++ >> >> >>>>> >> >>> class >> >> >>>>> >> >>> layer between NumPy function calls (e.g. arithmetic) and >> >> >>>>> >> >>> pandas >> >> >>>>> >> >>> function calls, then I see no reason why we cannot have >> >> >>>>> >> >>> >> >> >>>>> >> >>> Int32Array->add >> >> >>>>> >> >>> >> >> >>>>> >> >>> and >> >> >>>>> >> >>> >> >> >>>>> >> >>> Float32Array->add >> >> >>>>> >> >>> >> >> >>>>> >> >>> do the right thing (the former would be responsible for >> >> >>>>> >> >>> bitmasking to >> >> >>>>> >> >>> propagate NA values; the latter would defer to NumPy). If >> we >> >> >>>>> >> >>> can >> >> >>>>> >> >>> put >> >> >>>>> >> >>> all the internals of pandas objects inside a black box, we >> >> >>>>> >> >>> can >> >> >>>>> >> >>> add >> >> >>>>> >> >>> layers of virtual function indirection without a >> performance >> >> >>>>> >> >>> penalty >> >> >>>>> >> >>> (e.g. adding more interpreter overhead with more >> abstraction >> >> >>>>> >> >>> layers >> >> >>>>> >> >>> does add up to a perf penalty). >> >> >>>>> >> >>> >> >> >>>>> >> >>> I don't think this is too scary -- I would be willing to >> >> >>>>> >> >>> create a >> >> >>>>> >> >>> small POC C++ library to prototype something like what I'm >> >> >>>>> >> >>> talking >> >> >>>>> >> >>> about. >> >> >>>>> >> >>> >> >> >>>>> >> >>> Since pandas has limited points of contact with NumPy I >> don't >> >> >>>>> >> >>> think >> >> >>>>> >> >>> this would end up being too onerous. >> >> >>>>> >> >>> >> >> >>>>> >> >>> For the record, I'm pretty allergic to "advanced C++"; I >> >> >>>>> >> >>> think it >> >> >>>>> >> >>> is a >> >> >>>>> >> >>> useful tool if you pick a sane 20% subset of the C++11 >> spec >> >> >>>>> >> >>> and >> >> >>>>> >> >>> follow >> >> >>>>> >> >>> Google C++ style it's not very inaccessible to >> intermediate >> >> >>>>> >> >>> developers. More or less "C plus OOP and easier object >> >> >>>>> >> >>> lifetime >> >> >>>>> >> >>> management (shared/unique_ptr, etc.)". As soon as you add >> a >> >> >>>>> >> >>> lot >> >> >>>>> >> >>> of >> >> >>>>> >> >>> template metaprogramming C++ library development quickly >> >> >>>>> >> >>> becomes >> >> >>>>> >> >>> inaccessible except to the C++-Jedi. >> >> >>>>> >> >>> >> >> >>>>> >> >>> Maybe let's start a Google document on "pandas roadmap" >> where >> >> >>>>> >> >>> we >> >> >>>>> >> >>> can >> >> >>>>> >> >>> break down the 1-2 year goals and some of these >> >> >>>>> >> >>> infrastructure >> >> >>>>> >> >>> issues >> >> >>>>> >> >>> and have our discussion there? 
(obviously publish this >> >> >>>>> >> >>> someplace >> >> >>>>> >> >>> once >> >> >>>>> >> >>> we're done) >> >> >>>>> >> >>> >> >> >>>>> >> >>> - Wes >> >> >>>>> >> >>> >> >> >>>>> >> >>> On Fri, Dec 25, 2015 at 2:14 PM, Jeff Reback >> >> >>>>> >> >>> >> >> >>>>> >> >>> wrote: >> >> >>>>> >> >>> > Here are some of my thoughts about pandas Roadmap / >> status >> >> >>>>> >> >>> > and >> >> >>>>> >> >>> > some >> >> >>>>> >> >>> > responses to Wes's thoughts. >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > In the last few (and upcoming) major releases we have >> been >> >> >>>>> >> >>> > made >> >> >>>>> >> >>> > the >> >> >>>>> >> >>> > following changes: >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > - dtype enhancements (Categorical, Timedelta, Datetime >> >> >>>>> >> >>> > w/tz) & >> >> >>>>> >> >>> > making >> >> >>>>> >> >>> > these >> >> >>>>> >> >>> > first class objects >> >> >>>>> >> >>> > - code refactoring to remove subclassing of ndarrays for >> >> >>>>> >> >>> > Series >> >> >>>>> >> >>> > & >> >> >>>>> >> >>> > Index >> >> >>>>> >> >>> > - carving out / deprecating non-core parts of pandas >> >> >>>>> >> >>> > - datareader >> >> >>>>> >> >>> > - SparsePanel, WidePanel & other aliases (TImeSeries) >> >> >>>>> >> >>> > - rpy, rplot, irow et al. >> >> >>>>> >> >>> > - google-analytics >> >> >>>>> >> >>> > - API changes to make things more consistent >> >> >>>>> >> >>> > - pd.rolling/expanding * -> .rolling/expanding (this >> is >> >> >>>>> >> >>> > in >> >> >>>>> >> >>> > master >> >> >>>>> >> >>> > now) >> >> >>>>> >> >>> > - .resample becoming a full defered like groupby. >> >> >>>>> >> >>> > - multi-index slicing along any level (obviates need >> for >> >> >>>>> >> >>> > .xs) >> >> >>>>> >> >>> > and >> >> >>>>> >> >>> > allows >> >> >>>>> >> >>> > assignment >> >> >>>>> >> >>> > - .loc/.iloc - for the most part obviates use of .ix >> >> >>>>> >> >>> > - .pipe & .assign >> >> >>>>> >> >>> > - plotting accessors >> >> >>>>> >> >>> > - fixing of the sorting API >> >> >>>>> >> >>> > - many performance enhancements both micro & macro (e.g. >> >> >>>>> >> >>> > release >> >> >>>>> >> >>> > GIL) >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > Some on-deck enhancements are (meaning these are >> basically >> >> >>>>> >> >>> > ready to >> >> >>>>> >> >>> > go >> >> >>>>> >> >>> > in): >> >> >>>>> >> >>> > - IntervalIndex (and eventually make PeriodIndex just >> a >> >> >>>>> >> >>> > sub-class >> >> >>>>> >> >>> > of >> >> >>>>> >> >>> > this) >> >> >>>>> >> >>> > - RangeIndex >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > so lots of changes, though nothing really earth shaking, >> >> >>>>> >> >>> > just >> >> >>>>> >> >>> > more >> >> >>>>> >> >>> > convenience, reducing magicness somewhat >> >> >>>>> >> >>> > and providing flexibility. >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > Of course we are getting increasing issues, mostly bug >> >> >>>>> >> >>> > reports >> >> >>>>> >> >>> > (and >> >> >>>>> >> >>> > lots >> >> >>>>> >> >>> > of >> >> >>>>> >> >>> > dupes), some edge case enhancements >> >> >>>>> >> >>> > which can add to the existing API's and of course, >> requests >> >> >>>>> >> >>> > to >> >> >>>>> >> >>> > expand >> >> >>>>> >> >>> > the >> >> >>>>> >> >>> > (already) large code to other usecases. >> >> >>>>> >> >>> > Balancing this are a good many pull-requests from many >> >> >>>>> >> >>> > different >> >> >>>>> >> >>> > users, >> >> >>>>> >> >>> > some >> >> >>>>> >> >>> > even deep into the internals. 
>> >> >>>>> >> >>> > >> >> >>>>> >> >>> > Here are some things that I have talked about and could >> be >> >> >>>>> >> >>> > considered >> >> >>>>> >> >>> > for >> >> >>>>> >> >>> > the roadmap. Disclaimer: I do work for Continuum >> >> >>>>> >> >>> > but these views are of course my own; furthermore >> obviously >> >> >>>>> >> >>> > I >> >> >>>>> >> >>> > am a >> >> >>>>> >> >>> > bit >> >> >>>>> >> >>> > more >> >> >>>>> >> >>> > familiar with some of the 'sponsored' open-source >> >> >>>>> >> >>> > libraries, but always open to new things. >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > - integration / automatic deferral to numba for JIT >> (this >> >> >>>>> >> >>> > would >> >> >>>>> >> >>> > be >> >> >>>>> >> >>> > thru >> >> >>>>> >> >>> > .apply) >> >> >>>>> >> >>> > - automatic deferal to dask from groubpy where >> appropriate >> >> >>>>> >> >>> > / >> >> >>>>> >> >>> > maybe a >> >> >>>>> >> >>> > .to_parallel (to simply return a dask.DataFrame object) >> >> >>>>> >> >>> > - incorporation of quantities / units (as part of the >> >> >>>>> >> >>> > dtype) >> >> >>>>> >> >>> > - use of DyND to allow missing values for int dtypes >> >> >>>>> >> >>> > - make Period a first class dtype. >> >> >>>>> >> >>> > - provide some copy-on-write semantics to alleviate the >> >> >>>>> >> >>> > chained-indexing >> >> >>>>> >> >>> > issues which occasionaly come up with the mis-use of the >> >> >>>>> >> >>> > indexing >> >> >>>>> >> >>> > API >> >> >>>>> >> >>> > - allow a 'policy' to automatically provide column >> blocks >> >> >>>>> >> >>> > for >> >> >>>>> >> >>> > dict-like >> >> >>>>> >> >>> > input (e.g. each column would be a block), this would >> allow >> >> >>>>> >> >>> > a >> >> >>>>> >> >>> > pass-thru >> >> >>>>> >> >>> > API >> >> >>>>> >> >>> > where you could >> >> >>>>> >> >>> > put in numpy arrays where you have views and have them >> >> >>>>> >> >>> > preserved >> >> >>>>> >> >>> > rather >> >> >>>>> >> >>> > than >> >> >>>>> >> >>> > copied automatically. Note that this would also allow >> what >> >> >>>>> >> >>> > I >> >> >>>>> >> >>> > call >> >> >>>>> >> >>> > 'split' >> >> >>>>> >> >>> > where a passed in >> >> >>>>> >> >>> > multi-dim numpy array could be split up to individual >> >> >>>>> >> >>> > blocks >> >> >>>>> >> >>> > (which >> >> >>>>> >> >>> > actually >> >> >>>>> >> >>> > gives a nice perf boost after the splitting costs). >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > In working towards some of these goals. I have come to >> the >> >> >>>>> >> >>> > opinion >> >> >>>>> >> >>> > that >> >> >>>>> >> >>> > it >> >> >>>>> >> >>> > would make sense to have a neutral API protocol layer >> >> >>>>> >> >>> > that would allow us to swap out different engines as >> >> >>>>> >> >>> > needed, >> >> >>>>> >> >>> > for >> >> >>>>> >> >>> > particular >> >> >>>>> >> >>> > dtypes, or *maybe* out-of-core type computations. E.g. >> >> >>>>> >> >>> > imagine that we replaced the in-memory block structure >> with >> >> >>>>> >> >>> > a >> >> >>>>> >> >>> > bclolz >> >> >>>>> >> >>> > / >> >> >>>>> >> >>> > memap >> >> >>>>> >> >>> > type; in theory this should be 'easy' and just work. >> >> >>>>> >> >>> > I could also see us adopting *some* of the SFrame code >> to >> >> >>>>> >> >>> > allow >> >> >>>>> >> >>> > easier >> >> >>>>> >> >>> > interop with this API layer. 
>> >> >>>>> >> >>> > >> >> >>>>> >> >>> > In practice, I think a nice API layer would need to be >> >> >>>>> >> >>> > created >> >> >>>>> >> >>> > to >> >> >>>>> >> >>> > make >> >> >>>>> >> >>> > this >> >> >>>>> >> >>> > clean / nice. >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > So this comes around to Wes's point about creating a c++ >> >> >>>>> >> >>> > library for >> >> >>>>> >> >>> > the >> >> >>>>> >> >>> > internals (and possibly even some of the indexing >> >> >>>>> >> >>> > routines). >> >> >>>>> >> >>> > In an ideal world, or course this would be desirable. >> >> >>>>> >> >>> > Getting >> >> >>>>> >> >>> > there >> >> >>>>> >> >>> > is a >> >> >>>>> >> >>> > bit >> >> >>>>> >> >>> > non-trivial I think, and IMHO might not be worth the >> >> >>>>> >> >>> > effort. I >> >> >>>>> >> >>> > don't >> >> >>>>> >> >>> > really see big performance bottlenecks. We *already* >> defer >> >> >>>>> >> >>> > much >> >> >>>>> >> >>> > of >> >> >>>>> >> >>> > the >> >> >>>>> >> >>> > computation to libraries like numexpr & bottleneck >> (where >> >> >>>>> >> >>> > appropriate). >> >> >>>>> >> >>> > Adding numba / dask to the list would be helpful. >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > I think that almost all performance issues are the >> result >> >> >>>>> >> >>> > of: >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > a) gross misuse of the pandas API. How much code have >> you >> >> >>>>> >> >>> > seen >> >> >>>>> >> >>> > that >> >> >>>>> >> >>> > does >> >> >>>>> >> >>> > df.apply(lambda x: x.sum()) >> >> >>>>> >> >>> > b) routines which operate column-by-column rather >> >> >>>>> >> >>> > block-by-block and >> >> >>>>> >> >>> > are >> >> >>>>> >> >>> > in >> >> >>>>> >> >>> > python space (e.g. we have an issue right now about >> >> >>>>> >> >>> > .quantile) >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > So I am glossing over a big goal of having a c++ library >> >> >>>>> >> >>> > that >> >> >>>>> >> >>> > represents >> >> >>>>> >> >>> > the >> >> >>>>> >> >>> > pandas internals. This would by definition have a c-API >> >> >>>>> >> >>> > that so >> >> >>>>> >> >>> > you *could* use pandas like semantics in c/c++ and just >> >> >>>>> >> >>> > have it >> >> >>>>> >> >>> > work >> >> >>>>> >> >>> > (and >> >> >>>>> >> >>> > then pandas would be a thin wrapper around this >> library). >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > I am not averse to this, but I think would be quite a >> big >> >> >>>>> >> >>> > effort, >> >> >>>>> >> >>> > and >> >> >>>>> >> >>> > not a >> >> >>>>> >> >>> > huge perf boost IMHO. Further there are a number of API >> >> >>>>> >> >>> > issues >> >> >>>>> >> >>> > w.r.t. >> >> >>>>> >> >>> > indexing >> >> >>>>> >> >>> > which need to be clarified / worked out (e.g. should we >> >> >>>>> >> >>> > simply >> >> >>>>> >> >>> > deprecate >> >> >>>>> >> >>> > []) >> >> >>>>> >> >>> > that are much easier to test / figure out in python >> space. >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > I also thing that we have quite a large number of >> >> >>>>> >> >>> > contributors. >> >> >>>>> >> >>> > Moving >> >> >>>>> >> >>> > to >> >> >>>>> >> >>> > c++ might make the internals a bit more impenetrable >> that >> >> >>>>> >> >>> > the >> >> >>>>> >> >>> > current >> >> >>>>> >> >>> > internals. >> >> >>>>> >> >>> > (though this would allow c++ people to contribute, so >> that >> >> >>>>> >> >>> > might >> >> >>>>> >> >>> > balance >> >> >>>>> >> >>> > out). 
>> >> >>>>> >> >>> > >> >> >>>>> >> >>> > We have a limited core of devs whom right now are >> familar >> >> >>>>> >> >>> > with >> >> >>>>> >> >>> > things. >> >> >>>>> >> >>> > If >> >> >>>>> >> >>> > someone happened to have a starting base for a c++ >> library, >> >> >>>>> >> >>> > then I >> >> >>>>> >> >>> > might >> >> >>>>> >> >>> > change >> >> >>>>> >> >>> > opinions here. >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > my 4c. >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > Jeff >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > On Thu, Dec 24, 2015 at 7:18 PM, Wes McKinney >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > wrote: >> >> >>>>> >> >>> >> >> >> >>>>> >> >>> >> Deep thoughts during the holidays. >> >> >>>>> >> >>> >> >> >> >>>>> >> >>> >> I might be out of line here, but the >> interpreter-heaviness >> >> >>>>> >> >>> >> of >> >> >>>>> >> >>> >> the >> >> >>>>> >> >>> >> inside of pandas objects is likely to be a long-term >> >> >>>>> >> >>> >> liability >> >> >>>>> >> >>> >> and >> >> >>>>> >> >>> >> source of performance problems and technical debt. >> >> >>>>> >> >>> >> >> >> >>>>> >> >>> >> Has anyone put any thought into planning and beginning >> to >> >> >>>>> >> >>> >> execute >> >> >>>>> >> >>> >> on a >> >> >>>>> >> >>> >> rewrite that moves as much as possible of the internals >> >> >>>>> >> >>> >> into >> >> >>>>> >> >>> >> native >> >> >>>>> >> >>> >> / >> >> >>>>> >> >>> >> compiled code? I'm talking about: >> >> >>>>> >> >>> >> >> >> >>>>> >> >>> >> - pandas/core/internals >> >> >>>>> >> >>> >> - indexing and assignment >> >> >>>>> >> >>> >> - much of pandas/core/common >> >> >>>>> >> >>> >> - categorical and custom dtypes >> >> >>>>> >> >>> >> - all indexing mechanisms >> >> >>>>> >> >>> >> >> >> >>>>> >> >>> >> I'm concerned we've already exposed too much internals >> to >> >> >>>>> >> >>> >> users, so >> >> >>>>> >> >>> >> this might lead to a lot of API breakage, but it might >> be >> >> >>>>> >> >>> >> for >> >> >>>>> >> >>> >> the >> >> >>>>> >> >>> >> Greater Good. As a first step, beginning a partial >> >> >>>>> >> >>> >> migration >> >> >>>>> >> >>> >> of >> >> >>>>> >> >>> >> internals into some C++ classes that encapsulate the >> >> >>>>> >> >>> >> insides >> >> >>>>> >> >>> >> of >> >> >>>>> >> >>> >> DataFrame objects and implement indexing and >> block-level >> >> >>>>> >> >>> >> manipulations >> >> >>>>> >> >>> >> would be a good place to start. I think you could do >> this >> >> >>>>> >> >>> >> wouldn't >> >> >>>>> >> >>> >> too >> >> >>>>> >> >>> >> much disruption. >> >> >>>>> >> >>> >> >> >> >>>>> >> >>> >> As part of this internal retooling we might give >> >> >>>>> >> >>> >> consideration >> >> >>>>> >> >>> >> to >> >> >>>>> >> >>> >> alternative data structures for representing data >> internal >> >> >>>>> >> >>> >> to >> >> >>>>> >> >>> >> pandas >> >> >>>>> >> >>> >> objects. Now in 2015/2016, continuing to be hamstrung >> by >> >> >>>>> >> >>> >> NumPy's >> >> >>>>> >> >>> >> limitations feels somewhat anachronistic. User code is >> >> >>>>> >> >>> >> riddled >> >> >>>>> >> >>> >> with >> >> >>>>> >> >>> >> workarounds for data type fidelity issues and the like. >> >> >>>>> >> >>> >> Like, >> >> >>>>> >> >>> >> really, >> >> >>>>> >> >>> >> why not add a bitndarray (similar to >> ilanschnell/bitarray) >> >> >>>>> >> >>> >> for >> >> >>>>> >> >>> >> storing >> >> >>>>> >> >>> >> nullness for problematic types and hide this from the >> >> >>>>> >> >>> >> user? 
=) >> >> >>>>> >> >>> >> >> >> >>>>> >> >>> >> Since we are now a NumFOCUS-sponsored project, I feel >> like >> >> >>>>> >> >>> >> we >> >> >>>>> >> >>> >> might >> >> >>>>> >> >>> >> consider establishing some formal governance over >> pandas >> >> >>>>> >> >>> >> and >> >> >>>>> >> >>> >> publishing meetings notes and roadmap documents >> describing >> >> >>>>> >> >>> >> plans >> >> >>>>> >> >>> >> for >> >> >>>>> >> >>> >> the project and meetings notes from committers. >> There's no >> >> >>>>> >> >>> >> real >> >> >>>>> >> >>> >> "committer culture" for NumFOCUS projects like there is >> >> >>>>> >> >>> >> with >> >> >>>>> >> >>> >> the >> >> >>>>> >> >>> >> Apache Software Foundation, but we might try leading by >> >> >>>>> >> >>> >> example! >> >> >>>>> >> >>> >> >> >> >>>>> >> >>> >> Also, I believe pandas as a project has reached a >> level of >> >> >>>>> >> >>> >> importance >> >> >>>>> >> >>> >> where we ought to consider planning and execution on >> >> >>>>> >> >>> >> larger >> >> >>>>> >> >>> >> scale >> >> >>>>> >> >>> >> undertakings such as this for safeguarding the future. >> >> >>>>> >> >>> >> >> >> >>>>> >> >>> >> As for myself, well, I have my hands full in Big >> >> >>>>> >> >>> >> Data-land. I >> >> >>>>> >> >>> >> wish >> >> >>>>> >> >>> >> I >> >> >>>>> >> >>> >> could be helping more with pandas, but there a quite a >> few >> >> >>>>> >> >>> >> fundamental >> >> >>>>> >> >>> >> issues (like data interoperability nested data handling >> >> >>>>> >> >>> >> and >> >> >>>>> >> >>> >> file >> >> >>>>> >> >>> >> format support ? e.g. Parquet, see >> >> >>>>> >> >>> >> >> >> >>>>> >> >>> >> >> >> >>>>> >> >>> >> >> >> >>>>> >> >>> >> >> >> >>>>> >> >>> >> >> >> >>>>> >> >>> >> >> http://wesmckinney.com/blog/the-problem-with-the-data-science-language-wars/ >> ) >> >> >>>>> >> >>> >> preventing Python from being more useful in industry >> >> >>>>> >> >>> >> analytics >> >> >>>>> >> >>> >> applications. >> >> >>>>> >> >>> >> >> >> >>>>> >> >>> >> Aside: one of the bigger mistakes I made with pandas's >> API >> >> >>>>> >> >>> >> design >> >> >>>>> >> >>> >> was >> >> >>>>> >> >>> >> making it acceptable to call class constructors ? like >> >> >>>>> >> >>> >> pandas.DataFrame ? directly (versus factory functions). >> >> >>>>> >> >>> >> Sorry >> >> >>>>> >> >>> >> about >> >> >>>>> >> >>> >> that! If we could convince everyone to start writing >> >> >>>>> >> >>> >> pandas.data_frame >> >> >>>>> >> >>> >> or dataframe instead of using the class reference it >> would >> >> >>>>> >> >>> >> help a >> >> >>>>> >> >>> >> lot >> >> >>>>> >> >>> >> with code cleanup. It's hard to plan for these things ? >> >> >>>>> >> >>> >> NumPy >> >> >>>>> >> >>> >> interoperability seemed a lot more important in 2008 >> than >> >> >>>>> >> >>> >> it >> >> >>>>> >> >>> >> does >> >> >>>>> >> >>> >> now, >> >> >>>>> >> >>> >> so I forgive myself. 
>> >> >>>>> >> >>> >> >> >> >>>>> >> >>> >> cheers and best wishes for 2016, >> >> >>>>> >> >>> >> Wes >> >> >>>>> >> >>> >> _______________________________________________ >> >> >>>>> >> >>> >> Pandas-dev mailing list >> >> >>>>> >> >>> >> Pandas-dev at python.org >> >> >>>>> >> >>> >> https://mail.python.org/mailman/listinfo/pandas-dev >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > >> >> >>>>> >> >>> _______________________________________________ >> >> >>>>> >> >>> Pandas-dev mailing list >> >> >>>>> >> >>> Pandas-dev at python.org >> >> >>>>> >> >>> https://mail.python.org/mailman/listinfo/pandas-dev >> >> >>>>> >> _______________________________________________ >> >> >>>>> >> Pandas-dev mailing list >> >> >>>>> >> Pandas-dev at python.org >> >> >>>>> >> https://mail.python.org/mailman/listinfo/pandas-dev >> >> >>>>> > >> >> >>>>> > >> >> >>>>> >> >> >>>>> >> >> >>>>> _______________________________________________ >> >> >>>>> Pandas-dev mailing list >> >> >>>>> Pandas-dev at python.org >> >> >>>>> https://mail.python.org/mailman/listinfo/pandas-dev >> >> >>>>> >> >> >>>> >> >> >> >> >> _______________________________________________ >> >> Pandas-dev mailing list >> >> Pandas-dev at python.org >> >> https://mail.python.org/mailman/listinfo/pandas-dev >> > >> > >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > >
From shoyer at gmail.com Wed Jan 6 13:30:46 2016 From: shoyer at gmail.com (Stephan Hoyer) Date: Wed, 6 Jan 2016 10:30:46 -0800 Subject: [Pandas-dev] pandas governance In-Reply-To: References: Message-ID: I'm also supportive of formalizing pandas governance like this. It's definitely the right call for a mature project. I agree that we can probably just use the Jupyter governance docs with minor adjustments. Cheers, Stephan On Wed, Jan 6, 2016 at 5:50 AM, Jeff Reback wrote: > yes on board with this as well. We do have a fiscal governance document > w.r.t. NUMFocus. That should at the very least be reference by > the governance docs. Certainly starting with the jupyter docs is a good > think. > > I don't think we will have the long-long-long discussion that numpy had > about the steering committee representation :) > > Jeff > > On Tue, Jan 5, 2016 at 7:29 PM, Joris Van den Bossche < > jorisvandenbossche at gmail.com> wrote: > >> Sounds very good! >> >> Certainly now we are a NumFOCUS supported project (and have to deal with >> financial things), I think this is important to do. >> >> 2016-01-05 19:15 GMT+01:00 Wes McKinney : >> >>> hi folks, >>> >>> I'm sorry I didn't do this 2 or 3 years ago when I first handed over >>> release management responsibilities to Jeff, y-p and others, but it >>> would be good for us to formalize the project governance like most >>> other major open source projects. See IPython / Jupyter for an example >>> set of governance documents >>> >>> https://github.com/jupyter/governance >>> >>> Numpy also recently adopted a goverance document, based on the Jupyter >> one: http://docs.scipy.org/doc/numpy-dev/dev/governance/governance.html >> and https://github.com/numpy/numpy/pull/6352. >> Maybe also worth a look (although I don't know what they exactly changed >> from the Jupyter one).
>> >> >>> I don't have particular concerns over the project's direction and >>> decision making procedure, but as I've had several people raise >>> private concerns with me over the last few years, I think it would be >>> good for the community to have a set of public documents on GitHub >>> that lists people and process in simple terms. This is especially >>> important now that we can receive financial sponsorship through >>> NumFOCUS, so that sponsored contributions are subject to the same >>> community process as volunteer contributions. >>> >>> A basic summary of how we've been informally operating is: Project >>> committers (as will be defined and listed in the governance documents) >>> make decisions based on consensus; in the absence of consensus (which >>> has rarely occurred) I will reserve tie-breaking / BDFL privileges. I >>> don't recall having ever having to put on the BDFL hat but it's the >>> "just in case" should we reach some impasse down the road. >>> >>> Sounds good! >> >> >>> I can take a crack at assembling something based on the IPython >>> governance docs if that sounds good. >>> >>> At the end of the day, an OSS project is only as strong as the >>> individuals committing code and reviewing patches. As pandas will be 8 >>> years old in April, with 6 years as open source, I think we have a >>> good track record of consensus-, common-sense-, and >>> fact/evidence-driven decision making. >>> >>> best, >>> Wes >>> _______________________________________________ >>> Pandas-dev mailing list >>> Pandas-dev at python.org >>> https://mail.python.org/mailman/listinfo/pandas-dev >>> >> >> >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> >> > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wesmckinn at gmail.com Wed Jan 6 14:26:49 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Wed, 6 Jan 2016 11:26:49 -0800 Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap In-Reply-To: References: Message-ID: hey Stephan, Thanks for all the thoughts. Let me make a few off-the-cuff comments. On Wed, Jan 6, 2016 at 10:11 AM, Stephan Hoyer wrote: > I was asked about this off list, so I'll belatedly share my thoughts. > > First of all, I am really excited by Wes's renewed engagement in the project > and his interest in rewriting pandas internals. This is quite an ambitious > plan and nobody is better positioned to tackle it than Wes. > > I have mixed feelings about the details of the rewrite itself. > > +1 on the simpler internal data model. The block manager is confusing and > leads to hard to predict performance issues related to copying data. If we > can do all column additions/removals/re-orderings without a copy it will be > a clear win. > > +0 on moving internals to C++. I do like the performance benefits, but it > seems like a lot of work, and it may make pandas less friendly to new > contributors. > It really goes beyond performance benefits. If you go back to my 2013 talk http://www.slideshare.net/wesm/practical-medium-data-analytics-with-python there's a long list of architectural problems that now in 2016 haven't found solutions. 
The only way (that I can fully reason through -- I am happy to look at alternate proposals) to move the internals of pandas closer to the metal is to give Series and DataFrame a C/C++ API -- this is the "libpandas native core" as I've been describing. > -0 on writing a brand new dtype system just for pandas -- this stuff really > belongs in NumPy (or another array library like DyND), and I am skeptical > that pandas can do a complete enough job to be useful without replicating > all that functionality. > I'm curious what "a brand new dtype system" means to you. pandas already has its own data type system, but it's a potpourri of inconsistencies and rough edges with self-evident problems for both users and developers. Some indicators:

- Some pandas types use NaN for missing data, others None (or both), others nothing at all. We lose data (integers) or bloat memory (booleans) by upcasting to float-NaN or object-None.
- Internal code full of is_XXX_dtype functions: pandas.core.common, pandas.core.algorithms, etc.
- Series.values on synthetic dtypes like Categorical
- We use arrays of Python objects for string data

The biggest cause IMHO is that pandas is too tightly coupled to NumPy, and it's coupled in a way that makes development and extensibility difficult. We've already allowed NumPy-specific details to taint the pandas user API in many unpleasant ways. This isn't to say "NumPy is bad" but rather "pandas tries to layer domain-specific functionality [that NumPy was not designed for] on top". Some things I'm advocating with the internals refactor:

1) First class "pandas type" objects. This is not the same as a NumPy dtype, which has some pretty loaded implications -- in particular, NumPy dtypes are implicitly coupled to an array computing framework (see the function table that is attached to the PyArray_Descr object)
2) Pandas array container types that map user-land API calls to implementation-land API calls (in NumPy, DyND, or pandas-native code like pandas.core.algorithms etc.). This will make it much easier to leverage innovations in NumPy and DyND without those implementation details spilling over into the pandas user API
3) Adding a single pandas.NA singleton to have one library-wide notion of a scalar null value (obviously, we can automatically map NaN and None to NA for backwards compatibility)
4) Layering a bitmask internally on NumPy arrays (especially integer and boolean) to add null-ness to types that need it. Note that this does not prevent us from switching to DyND arrays with option dtype in the future. If the details of how we are implementing NULL are visible to the user, we have failed.
5) Removing the block manager in favor of simpler pandas Array (1D) and Table (2D -- vector of Array) data structures

I believe you can do all this without harming interoperability with the ecosystem of projects that people currently use in conjunction with pandas. > More broadly, I am concerned that this rewrite may improve the tabular > computation ecosystem at the cost of inter-operability with the array-based > ecosystem (numpy, scipy, sklearn, xarray, etc.). The later has been one of > the strengths of pandas and it would be a shame to see that go away. > I have no intention of letting this happen. What I am asking from you (and others reading) is to help define what constitutes interoperability. What guarantees do we make the user?
For example, we should have very strict guidelines for the output of:

np.asarray(pandas_obj)

For example

In [3]: s = pd.Series([1,2,3]*10).astype('category')

In [4]: np.asarray(s)
Out[4]:
array([1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2,
       3, 1, 2, 3, 1, 2, 3])

I see no reason why this should necessarily behave any differently.
The problem will come in when there is pandas data that is not
precisely representable in a NumPy array. Example:

In [5]: s = pd.Series([1,2,3, 4])

In [6]: s.dtype
Out[6]: dtype('int64')

In [7]: s2 = s.reindex(np.arange(10))

In [8]: s2.dtype
Out[8]: dtype('float64')

In [9]: np.asarray(s2)
Out[9]: array([ 1.,  2.,  3.,  4., nan, nan, nan, nan, nan, nan])

With the "new internals", s2 will still be int64 type, but we may
decide that np.asarray(s2) should raise an exception rather than
implicitly make a decision about how to perform a "lossy" conversion
to a NumPy array. If you are using DyND with pandas, then the
equivalent function would be able to implicitly convert without data
loss.

> We're already starting to struggle with inter-operability with the new
> pandas dtypes and a further rewrite would make this even harder.
> For example, see categoricals and scikit-learn in Tom's recent post [1], or the
> fact that .values no longer always returns a numpy array. This has also been
> a challenge for xarray, which can't handle these new dtypes because we lack
> a suitable array backend for them.

I'm definitely motivated in this initiative by these challenges. The
idea here is that with the new internals, Series.values will always
return the same type of object, and there will be one consistent code
path for getting a NumPy array out. For example, rather than:

    if isinstance(s.values, Categorical):
        # pandas
        ...
    else:
        # NumPy
        ...

We could have (just an idea)

    s.values.to_numpy()

Or simply

    np.asarray(s.values)

> Personally, I would much rather leverage a full-featured library like an
> improved NumPy or DyND for new dtypes, because that could also be used by
> the array-based ecosystem. At the very least, it would be good to think
> about zero-copy inter-operability with array-based tools.
>

I'm all for zero-copy interoperability when possible, but my gut
feeling is that exposing the data type system of an array library (the
choice of which is an implementation detail) to pandas users is an
inherently leaky abstraction that will continue to cause problems if
we plan to keep innovating inside pandas. By better hiding NumPy
details and types from the user we will make it much easier to swap
out new low-level array data structures and compute components (e.g.
DyND), or add custom data structures or out-of-core tools (memory
maps, bcolz, etc.)

I'm additionally offering to do nearly all of this replumbing of
pandas internals myself, and completely in my free time. What I will
expect in return from you all is to help enumerate our contracts with
the pandas user (i.e. interoperability) and to hold me accountable to
not break them. I know I haven't been committing code on pandas since
mid-2013 (after a 5-year marathon), but these architectural problems
have been on my mind almost constantly since then; I just haven't had
the bandwidth to start tackling them.

cheers,
Wes

> On the other hand, I wonder if maybe it would be better to write a native
> in-memory backend for Ibis instead of rewriting pandas. Ibis does seem to
> have an improved/simplified API which resolves many of pandas's warts.
> That said, it's a pretty big change from the "DataFrame as matrix" model,
> and pandas won't be going away anytime soon. I do like that it would force
> users to be more explicit about converting between tables and arrays, which
> might also make distinctions between the tabular and array-oriented
> ecosystems easier to swallow.
>
> Just my two cents, from someone who has lots of opinions but who will likely
> stay on the sidelines for most of this work.
>
> Cheers,
> Stephan
>
> [1] http://tomaugspurger.github.io/categorical-pipelines.html
>
> On Fri, Jan 1, 2016 at 6:06 PM, Jeff Reback wrote:
>>
>> ok I moved the document to the Pandas folder, where the same group should
>> be able to edit/upload/etc. lmk if any issues
>>
>> On Fri, Jan 1, 2016 at 8:48 PM, Wes McKinney wrote:
>>>
>>> Thanks Jeff. Can you create and share a shared Drive folder containing
>>> this where I can put other auxiliary / follow-up documents?
>>>
>>> On Fri, Jan 1, 2016 at 5:23 PM, Jeff Reback wrote:
>>> > I changed the doc so that the core dev people can edit. I *think* that
>>> > everyone should be able to view/comment though.
>>> >> >>>>> >> >> On Tue, Dec 29, 2015 at 2:50 PM Wes McKinney wrote:
>>> >> >>>>> >> >>>
>>> >> >>>>> >> >>> I will write a more detailed response to some of these things after
>>> >> >>>>> >> >>> the new year, but, in particular, re: missing values, can you or
>>> >> >>>>> >> >>> someone tell me why creating an object that contains a NumPy array
>>> >> >>>>> >> >>> and a bitmap is not sufficient? If we can add a lightweight C/C++
>>> >> >>>>> >> >>> class layer between NumPy function calls (e.g. arithmetic) and
>>> >> >>>>> >> >>> pandas function calls, then I see no reason why we cannot have
>>> >> >>>>> >> >>>
>>> >> >>>>> >> >>> Int32Array->add
>>> >> >>>>> >> >>>
>>> >> >>>>> >> >>> and
>>> >> >>>>> >> >>>
>>> >> >>>>> >> >>> Float32Array->add
>>> >> >>>>> >> >>>
>>> >> >>>>> >> >>> do the right thing (the former would be responsible for bitmasking
>>> >> >>>>> >> >>> to propagate NA values; the latter would defer to NumPy). If we can
>>> >> >>>>> >> >>> put all the internals of pandas objects inside a black box, we can
>>> >> >>>>> >> >>> add layers of virtual function indirection without a performance
>>> >> >>>>> >> >>> penalty (whereas in interpreted code, adding more abstraction layers
>>> >> >>>>> >> >>> does add up to a perf penalty from the extra interpreter overhead).
>>> >> >>>>> >> >>>
>>> >> >>>>> >> >>> I don't think this is too scary -- I would be willing to create a
>>> >> >>>>> >> >>> small POC C++ library to prototype something like what I'm talking
>>> >> >>>>> >> >>> about.
>>> >> >>>>> >> >>>
>>> >> >>>>> >> >>> Since pandas has limited points of contact with NumPy I don't think
>>> >> >>>>> >> >>> this would end up being too onerous.
>>> >> >>>>> >> >>>
>>> >> >>>>> >> >>> For the record, I'm pretty allergic to "advanced C++"; I think it is
>>> >> >>>>> >> >>> a useful tool. If you pick a sane 20% subset of the C++11 spec and
>>> >> >>>>> >> >>> follow Google C++ style, it's not inaccessible to intermediate
>>> >> >>>>> >> >>> developers. More or less "C plus OOP and easier object lifetime
>>> >> >>>>> >> >>> management (shared/unique_ptr, etc.)". As soon as you add a lot of
>>> >> >>>>> >> >>> template metaprogramming, C++ library development quickly becomes
>>> >> >>>>> >> >>> inaccessible except to the C++ Jedi.
>>> >> >>>>> >> >>>
>>> >> >>>>> >> >>> Maybe let's start a Google document on "pandas roadmap" where we can
>>> >> >>>>> >> >>> break down the 1-2 year goals and some of these infrastructure
>>> >> >>>>> >> >>> issues and have our discussion there? (obviously publish this
>>> >> >>>>> >> >>> someplace once we're done)
>>> >> >>>>> >> >>>
>>> >> >>>>> >> >>> - Wes
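A toy Python model of the kind of dispatch described above -- the real
layer would be C++, and these class names are illustrative, not actual
pandas API:

    import numpy as np

    class Int64Array(object):
        # integers need a mask: add() is responsible for propagating NAs
        def __init__(self, values, valid):
            self.values = np.asarray(values, dtype=np.int64)
            self.valid = np.asarray(valid, dtype=bool)

        def add(self, other):
            return Int64Array(self.values + other.values,
                              self.valid & other.valid)

    class Float64Array(object):
        # floats can defer wholesale to NumPy: NaN already propagates
        def __init__(self, values):
            self.values = np.asarray(values, dtype=np.float64)

        def add(self, other):
            return Float64Array(self.values + other.values)

    a = Int64Array([1, 2, 3], [True, False, True])
    b = Int64Array([10, 20, 30], [True, True, True])
    c = a.add(b)   # values [11, 22, 33], valid [True, False, True]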
>>> >> >>>>> >> >>> On Fri, Dec 25, 2015 at 2:14 PM, Jeff Reback wrote:
>>> >> >>>>> >> >>> > Here are some of my thoughts about pandas Roadmap / status and
>>> >> >>>>> >> >>> > some responses to Wes's thoughts.
>>> >> >>>>> >> >>> >
>>> >> >>>>> >> >>> > In the last few (and upcoming) major releases we have made the
>>> >> >>>>> >> >>> > following changes:
>>> >> >>>>> >> >>> >
>>> >> >>>>> >> >>> > - dtype enhancements (Categorical, Timedelta, Datetime w/tz) &
>>> >> >>>>> >> >>> > making these first-class objects
>>> >> >>>>> >> >>> > - code refactoring to remove subclassing of ndarrays for Series
>>> >> >>>>> >> >>> > & Index
>>> >> >>>>> >> >>> > - carving out / deprecating non-core parts of pandas
>>> >> >>>>> >> >>> >   - datareader
>>> >> >>>>> >> >>> >   - SparsePanel, WidePanel & other aliases (TimeSeries)
>>> >> >>>>> >> >>> >   - rpy, rplot, irow et al.
>>> >> >>>>> >> >>> >   - google-analytics
>>> >> >>>>> >> >>> > - API changes to make things more consistent
>>> >> >>>>> >> >>> >   - pd.rolling_*/expanding_* -> .rolling/.expanding (this is in
>>> >> >>>>> >> >>> >   master now)
>>> >> >>>>> >> >>> >   - .resample becoming fully deferred, like groupby
>>> >> >>>>> >> >>> > - multi-index slicing along any level (obviates need for .xs)
>>> >> >>>>> >> >>> > and allows assignment
>>> >> >>>>> >> >>> > - .loc/.iloc - for the most part obviates use of .ix
>>> >> >>>>> >> >>> > - .pipe & .assign
>>> >> >>>>> >> >>> > - plotting accessors
>>> >> >>>>> >> >>> > - fixing of the sorting API
>>> >> >>>>> >> >>> > - many performance enhancements both micro & macro (e.g.
>>> >> >>>>> >> >>> > releasing the GIL)
>>> >> >>>>> >> >>> >
>>> >> >>>>> >> >>> > Some on-deck enhancements are (meaning these are basically ready
>>> >> >>>>> >> >>> > to go in):
>>> >> >>>>> >> >>> > - IntervalIndex (and eventually make PeriodIndex just a
>>> >> >>>>> >> >>> > sub-class of this)
>>> >> >>>>> >> >>> > - RangeIndex
>>> >> >>>>> >> >>> >
>>> >> >>>>> >> >>> > so lots of changes, though nothing really earth-shaking, just
>>> >> >>>>> >> >>> > more convenience, reducing magicness somewhat and providing
>>> >> >>>>> >> >>> > flexibility.
>>> >> >>>>> >> >>> >
>>> >> >>>>> >> >>> > Of course we are getting increasing issues, mostly bug reports
>>> >> >>>>> >> >>> > (and lots of dupes), some edge-case enhancements which can add
>>> >> >>>>> >> >>> > to the existing APIs and, of course, requests to expand the
>>> >> >>>>> >> >>> > (already) large code to other use cases.
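(For reference, that rolling/expanding/resample change looks roughly like
this -- method chaining replacing the module-level functions; a sketch
against the 0.18-era API, not a spec:)

    import numpy as np
    import pandas as pd

    s = pd.Series(np.random.randn(100),
                  index=pd.date_range('2015-01-01', periods=100))

    # pd.rolling_mean(s, window=5) becomes:
    s.rolling(window=5).mean()

    # pd.expanding_sum(s) becomes:
    s.expanding().sum()

    # s.resample('M', how='mean') becomes (deferred, like groupby):
    s.resample('M').mean()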
>>> >> >>>>> >> >>> > Balancing this are a good many pull-requests from many different
>>> >> >>>>> >> >>> > users, some even deep into the internals.
>>> >> >>>>> >> >>> >
>>> >> >>>>> >> >>> > Here are some things that I have talked about and could be
>>> >> >>>>> >> >>> > considered for the roadmap. Disclaimer: I do work for Continuum
>>> >> >>>>> >> >>> > but these views are of course my own; furthermore obviously I am
>>> >> >>>>> >> >>> > a bit more familiar with some of the 'sponsored' open-source
>>> >> >>>>> >> >>> > libraries, but always open to new things.
>>> >> >>>>> >> >>> >
>>> >> >>>>> >> >>> > - integration / automatic deferral to numba for JIT (this would
>>> >> >>>>> >> >>> > be thru .apply)
>>> >> >>>>> >> >>> > - automatic deferral to dask from groupby where appropriate /
>>> >> >>>>> >> >>> > maybe a .to_parallel (to simply return a dask.DataFrame object)
>>> >> >>>>> >> >>> > - incorporation of quantities / units (as part of the dtype)
>>> >> >>>>> >> >>> > - use of DyND to allow missing values for int dtypes
>>> >> >>>>> >> >>> > - make Period a first-class dtype.
>>> >> >>>>> >> >>> > - provide some copy-on-write semantics to alleviate the
>>> >> >>>>> >> >>> > chained-indexing issues which occasionally come up with the
>>> >> >>>>> >> >>> > misuse of the indexing API
>>> >> >>>>> >> >>> > - allow a 'policy' to automatically provide column blocks for
>>> >> >>>>> >> >>> > dict-like input (e.g. each column would be a block); this would
>>> >> >>>>> >> >>> > allow a pass-thru API where you could put in numpy arrays where
>>> >> >>>>> >> >>> > you have views and have them preserved rather than copied
>>> >> >>>>> >> >>> > automatically. Note that this would also allow what I call
>>> >> >>>>> >> >>> > 'split', where a passed-in multi-dim numpy array could be split
>>> >> >>>>> >> >>> > up into individual blocks (which actually gives a nice perf
>>> >> >>>>> >> >>> > boost after the splitting costs).
>>> >> >>>>> >> >>> >
>>> >> >>>>> >> >>> > In working towards some of these goals, I have come to the
>>> >> >>>>> >> >>> > opinion that it would make sense to have a neutral API protocol
>>> >> >>>>> >> >>> > layer that would allow us to swap out different engines as
>>> >> >>>>> >> >>> > needed, for particular dtypes, or *maybe* out-of-core type
>>> >> >>>>> >> >>> > computations. E.g. imagine that we replaced the in-memory block
>>> >> >>>>> >> >>> > structure with a bcolz / memmap type; in theory this should be
>>> >> >>>>> >> >>> > 'easy' and just work. I could also see us adopting *some* of the
>>> >> >>>>> >> >>> > SFrame code to allow easier interop with this API layer.
>>> >> >>>>> >> >>> >
>>> >> >>>>> >> >>> > In practice, I think a nice API layer would need to be created
>>> >> >>>>> >> >>> > to make this clean / nice.
>>> >> >>>>> >> >>> >
>>> >> >>>>> >> >>> > So this comes around to Wes's point about creating a c++ library
>>> >> >>>>> >> >>> > for the internals (and possibly even some of the indexing
>>> >> >>>>> >> >>> > routines). In an ideal world, of course this would be desirable.
>>> >> >>>>> >> >>> > Getting there is a bit non-trivial I think, and IMHO might not
>>> >> >>>>> >> >>> > be worth the effort. I don't really see big performance
>>> >> >>>>> >> >>> > bottlenecks. We *already* defer much of the computation to
>>> >> >>>>> >> >>> > libraries like numexpr & bottleneck (where appropriate). Adding
>>> >> >>>>> >> >>> > numba / dask to the list would be helpful.
>>> >> >>>>> >> >>> >
>>> >> >>>>> >> >>> > I think that almost all performance issues are the result of:
>>> >> >>>>> >> >>> >
>>> >> >>>>> >> >>> > a) gross misuse of the pandas API. How much code have you seen
>>> >> >>>>> >> >>> > that does df.apply(lambda x: x.sum())
>>> >> >>>>> >> >>> > b) routines which operate column-by-column rather than
>>> >> >>>>> >> >>> > block-by-block and are in python space (e.g. we have an issue
>>> >> >>>>> >> >>> > right now about .quantile)
>>> >> >>>>> >> >>> >
>>> >> >>>>> >> >>> > So I am glossing over a big goal of having a c++ library that
>>> >> >>>>> >> >>> > represents the pandas internals. This would by definition have a
>>> >> >>>>> >> >>> > C API, so you *could* use pandas-like semantics in c/c++ and
>>> >> >>>>> >> >>> > just have it work (and then pandas would be a thin wrapper
>>> >> >>>>> >> >>> > around this library).
>>> >> >>>>> >> >>> >
>>> >> >>>>> >> >>> > I am not averse to this, but I think it would be quite a big
>>> >> >>>>> >> >>> > effort, and not a huge perf boost IMHO. Further there are a
>>> >> >>>>> >> >>> > number of API issues w.r.t. indexing which need to be clarified
>>> >> >>>>> >> >>> > / worked out
should we >>> >> >>>>> >> >>> > simply >>> >> >>>>> >> >>> > deprecate >>> >> >>>>> >> >>> > []) >>> >> >>>>> >> >>> > that are much easier to test / figure out in python >>> >> >>>>> >> >>> > space. >>> >> >>>>> >> >>> > >>> >> >>>>> >> >>> > I also thing that we have quite a large number of >>> >> >>>>> >> >>> > contributors. >>> >> >>>>> >> >>> > Moving >>> >> >>>>> >> >>> > to >>> >> >>>>> >> >>> > c++ might make the internals a bit more impenetrable >>> >> >>>>> >> >>> > that >>> >> >>>>> >> >>> > the >>> >> >>>>> >> >>> > current >>> >> >>>>> >> >>> > internals. >>> >> >>>>> >> >>> > (though this would allow c++ people to contribute, so >>> >> >>>>> >> >>> > that >>> >> >>>>> >> >>> > might >>> >> >>>>> >> >>> > balance >>> >> >>>>> >> >>> > out). >>> >> >>>>> >> >>> > >>> >> >>>>> >> >>> > We have a limited core of devs whom right now are >>> >> >>>>> >> >>> > familar >>> >> >>>>> >> >>> > with >>> >> >>>>> >> >>> > things. >>> >> >>>>> >> >>> > If >>> >> >>>>> >> >>> > someone happened to have a starting base for a c++ >>> >> >>>>> >> >>> > library, >>> >> >>>>> >> >>> > then I >>> >> >>>>> >> >>> > might >>> >> >>>>> >> >>> > change >>> >> >>>>> >> >>> > opinions here. >>> >> >>>>> >> >>> > >>> >> >>>>> >> >>> > >>> >> >>>>> >> >>> > my 4c. >>> >> >>>>> >> >>> > >>> >> >>>>> >> >>> > Jeff >>> >> >>>>> >> >>> > >>> >> >>>>> >> >>> > >>> >> >>>>> >> >>> > >>> >> >>>>> >> >>> > >>> >> >>>>> >> >>> > On Thu, Dec 24, 2015 at 7:18 PM, Wes McKinney >>> >> >>>>> >> >>> > >>> >> >>>>> >> >>> > wrote: >>> >> >>>>> >> >>> >> >>> >> >>>>> >> >>> >> Deep thoughts during the holidays. >>> >> >>>>> >> >>> >> >>> >> >>>>> >> >>> >> I might be out of line here, but the >>> >> >>>>> >> >>> >> interpreter-heaviness >>> >> >>>>> >> >>> >> of >>> >> >>>>> >> >>> >> the >>> >> >>>>> >> >>> >> inside of pandas objects is likely to be a long-term >>> >> >>>>> >> >>> >> liability >>> >> >>>>> >> >>> >> and >>> >> >>>>> >> >>> >> source of performance problems and technical debt. >>> >> >>>>> >> >>> >> >>> >> >>>>> >> >>> >> Has anyone put any thought into planning and beginning >>> >> >>>>> >> >>> >> to >>> >> >>>>> >> >>> >> execute >>> >> >>>>> >> >>> >> on a >>> >> >>>>> >> >>> >> rewrite that moves as much as possible of the >>> >> >>>>> >> >>> >> internals >>> >> >>>>> >> >>> >> into >>> >> >>>>> >> >>> >> native >>> >> >>>>> >> >>> >> / >>> >> >>>>> >> >>> >> compiled code? I'm talking about: >>> >> >>>>> >> >>> >> >>> >> >>>>> >> >>> >> - pandas/core/internals >>> >> >>>>> >> >>> >> - indexing and assignment >>> >> >>>>> >> >>> >> - much of pandas/core/common >>> >> >>>>> >> >>> >> - categorical and custom dtypes >>> >> >>>>> >> >>> >> - all indexing mechanisms >>> >> >>>>> >> >>> >> >>> >> >>>>> >> >>> >> I'm concerned we've already exposed too much internals >>> >> >>>>> >> >>> >> to >>> >> >>>>> >> >>> >> users, so >>> >> >>>>> >> >>> >> this might lead to a lot of API breakage, but it might >>> >> >>>>> >> >>> >> be >>> >> >>>>> >> >>> >> for >>> >> >>>>> >> >>> >> the >>> >> >>>>> >> >>> >> Greater Good. As a first step, beginning a partial >>> >> >>>>> >> >>> >> migration >>> >> >>>>> >> >>> >> of >>> >> >>>>> >> >>> >> internals into some C++ classes that encapsulate the >>> >> >>>>> >> >>> >> insides >>> >> >>>>> >> >>> >> of >>> >> >>>>> >> >>> >> DataFrame objects and implement indexing and >>> >> >>>>> >> >>> >> block-level >>> >> >>>>> >> >>> >> manipulations >>> >> >>>>> >> >>> >> would be a good place to start. 
I think you could do >>> >> >>>>> >> >>> >> this >>> >> >>>>> >> >>> >> wouldn't >>> >> >>>>> >> >>> >> too >>> >> >>>>> >> >>> >> much disruption. >>> >> >>>>> >> >>> >> >>> >> >>>>> >> >>> >> As part of this internal retooling we might give >>> >> >>>>> >> >>> >> consideration >>> >> >>>>> >> >>> >> to >>> >> >>>>> >> >>> >> alternative data structures for representing data >>> >> >>>>> >> >>> >> internal >>> >> >>>>> >> >>> >> to >>> >> >>>>> >> >>> >> pandas >>> >> >>>>> >> >>> >> objects. Now in 2015/2016, continuing to be hamstrung >>> >> >>>>> >> >>> >> by >>> >> >>>>> >> >>> >> NumPy's >>> >> >>>>> >> >>> >> limitations feels somewhat anachronistic. User code is >>> >> >>>>> >> >>> >> riddled >>> >> >>>>> >> >>> >> with >>> >> >>>>> >> >>> >> workarounds for data type fidelity issues and the >>> >> >>>>> >> >>> >> like. >>> >> >>>>> >> >>> >> Like, >>> >> >>>>> >> >>> >> really, >>> >> >>>>> >> >>> >> why not add a bitndarray (similar to >>> >> >>>>> >> >>> >> ilanschnell/bitarray) >>> >> >>>>> >> >>> >> for >>> >> >>>>> >> >>> >> storing >>> >> >>>>> >> >>> >> nullness for problematic types and hide this from the >>> >> >>>>> >> >>> >> user? =) >>> >> >>>>> >> >>> >> >>> >> >>>>> >> >>> >> Since we are now a NumFOCUS-sponsored project, I feel >>> >> >>>>> >> >>> >> like >>> >> >>>>> >> >>> >> we >>> >> >>>>> >> >>> >> might >>> >> >>>>> >> >>> >> consider establishing some formal governance over >>> >> >>>>> >> >>> >> pandas >>> >> >>>>> >> >>> >> and >>> >> >>>>> >> >>> >> publishing meetings notes and roadmap documents >>> >> >>>>> >> >>> >> describing >>> >> >>>>> >> >>> >> plans >>> >> >>>>> >> >>> >> for >>> >> >>>>> >> >>> >> the project and meetings notes from committers. >>> >> >>>>> >> >>> >> There's no >>> >> >>>>> >> >>> >> real >>> >> >>>>> >> >>> >> "committer culture" for NumFOCUS projects like there >>> >> >>>>> >> >>> >> is >>> >> >>>>> >> >>> >> with >>> >> >>>>> >> >>> >> the >>> >> >>>>> >> >>> >> Apache Software Foundation, but we might try leading >>> >> >>>>> >> >>> >> by >>> >> >>>>> >> >>> >> example! >>> >> >>>>> >> >>> >> >>> >> >>>>> >> >>> >> Also, I believe pandas as a project has reached a >>> >> >>>>> >> >>> >> level of >>> >> >>>>> >> >>> >> importance >>> >> >>>>> >> >>> >> where we ought to consider planning and execution on >>> >> >>>>> >> >>> >> larger >>> >> >>>>> >> >>> >> scale >>> >> >>>>> >> >>> >> undertakings such as this for safeguarding the future. >>> >> >>>>> >> >>> >> >>> >> >>>>> >> >>> >> As for myself, well, I have my hands full in Big >>> >> >>>>> >> >>> >> Data-land. I >>> >> >>>>> >> >>> >> wish >>> >> >>>>> >> >>> >> I >>> >> >>>>> >> >>> >> could be helping more with pandas, but there a quite a >>> >> >>>>> >> >>> >> few >>> >> >>>>> >> >>> >> fundamental >>> >> >>>>> >> >>> >> issues (like data interoperability nested data >>> >> >>>>> >> >>> >> handling >>> >> >>>>> >> >>> >> and >>> >> >>>>> >> >>> >> file >>> >> >>>>> >> >>> >> format support ? e.g. Parquet, see >>> >> >>>>> >> >>> >> >>> >> >>>>> >> >>> >> >>> >> >>>>> >> >>> >> >>> >> >>>>> >> >>> >> >>> >> >>>>> >> >>> >> >>> >> >>>>> >> >>> >> >>> >> >>>>> >> >>> >> http://wesmckinney.com/blog/the-problem-with-the-data-science-language-wars/) >>> >> >>>>> >> >>> >> preventing Python from being more useful in industry >>> >> >>>>> >> >>> >> analytics >>> >> >>>>> >> >>> >> applications. 
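(A quick illustration of the storage math behind that bitndarray idea,
using plain NumPy -- np.packbits -- rather than a dedicated type:)

    import numpy as np

    n = 1000
    valid = np.ones(n, dtype=bool)
    valid[::7] = False               # mark every 7th value as null

    bits = np.packbits(valid)        # 1 bit per value, padded to bytes
    bits.nbytes                      # 125 bytes, vs. 1000 for a bool mask

    # round-trip: recover the boolean mask, trimming byte padding
    mask = np.unpackbits(bits)[:n].astype(bool)
    assert (mask == valid).all()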
>>> >> >>>>> >> >>> >> Aside: one of the bigger mistakes I made with pandas's API
>>> >> >>>>> >> >>> >> design was making it acceptable to call class constructors --
>>> >> >>>>> >> >>> >> like pandas.DataFrame -- directly (versus factory functions).
>>> >> >>>>> >> >>> >> Sorry about that! If we could convince everyone to start
>>> >> >>>>> >> >>> >> writing pandas.data_frame or dataframe instead of using the
>>> >> >>>>> >> >>> >> class reference it would help a lot with code cleanup. It's
>>> >> >>>>> >> >>> >> hard to plan for these things -- NumPy interoperability seemed
>>> >> >>>>> >> >>> >> a lot more important in 2008 than it does now, so I forgive
>>> >> >>>>> >> >>> >> myself.
>>> >> >>>>> >> >>> >>
>>> >> >>>>> >> >>> >> cheers and best wishes for 2016,
>>> >> >>>>> >> >>> >> Wes

From wesmckinn at gmail.com  Wed Jan  6 14:37:11 2016
From: wesmckinn at gmail.com (Wes McKinney)
Date: Wed, 6 Jan 2016 11:37:11 -0800
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: References: Message-ID:

On Wed, Jan 6, 2016 at 11:26 AM, Wes McKinney wrote:
> hey Stephan,
>
> Thanks for all the thoughts. Let me make a few off-the-cuff comments.
>
> On Wed, Jan 6, 2016 at 10:11 AM, Stephan Hoyer wrote:
>> I was asked about this off list, so I'll belatedly share my thoughts.
>>
>> First of all, I am really excited by Wes's renewed engagement in the
>> project and his interest in rewriting pandas internals. This is quite an
>> ambitious plan and nobody is better positioned to tackle it than Wes.
>>
>> I have mixed feelings about the details of the rewrite itself.
>>
>> +1 on the simpler internal data model. The block manager is confusing and
>> leads to hard-to-predict performance issues related to copying data. If we
>> can do all column additions/removals/re-orderings without a copy it will be
>> a clear win.
>>
>> +0 on moving internals to C++. I do like the performance benefits, but it
>> seems like a lot of work, and it may make pandas less friendly to new
>> contributors.
>>
>
> It really goes beyond performance benefits. If you go back to my 2013
> talk http://www.slideshare.net/wesm/practical-medium-data-analytics-with-python
> there's a long list of architectural problems that now, in 2016, still
> haven't found solutions. The only way (that I can fully reason through --
> I am happy to look at alternate proposals) to move the internals of
> pandas closer to the metal is to give Series and DataFrame a C/C++ API --
> this is the "libpandas native core" as I've been describing.

I should point out that the main thing that's changed since that preso
is "synthetic" data types like Categorical. But seeing what it took for
Jeff et al. to build that is a prime motivation for this internals
refactoring plan.
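To make that concrete: what building a synthetic dtype fights against
today is the lack of a single dispatch point. A toy Python sketch of the
alternative -- all names here (PandasType, take) are hypothetical, an
illustration rather than a design -- where ops route through one table
keyed on a pandas type object instead of is_XXX_dtype branches scattered
through the codebase:

    import numpy as np

    class PandasType(object):
        # a pandas-level type object, deliberately not a NumPy dtype
        def __init__(self, name):
            self.name = name

    INT64 = PandasType('int64')
    CATEGORY = PandasType('category')

    # user-land 'take' routed to per-type implementations
    _TAKE_IMPL = {
        INT64: lambda data, idx: data.take(idx),
        # categorical data is (codes, categories); take touches codes only
        CATEGORY: lambda data, idx: (data[0].take(idx), data[1]),
    }

    def take(pandas_type, data, indexer):
        return _TAKE_IMPL[pandas_type](data, np.asarray(indexer))

    codes = np.array([0, 1, 0, 2], dtype=np.int8)
    categories = np.array(['a', 'b', 'c'])
    take(CATEGORY, (codes, categories), [3, 0])  # codes [2, 0], same categories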
Note that this > does not prevent us from switching to DyND arrays with option dtype in > the future. If the details of how we are implementing NULL are visible > to the user, we have failed. > > 5) Removing the block manager in favor of simpler pandas Array (1D) > and Table (2D -- vector of Array) data structures > > I believe you can do all this without harming interoperability with > the ecosystem of projects that people currently use in conjunction > with pandas. > >> More broadly, I am concerned that this rewrite may improve the tabular >> computation ecosystem at the cost of inter-operability with the array-based >> ecosystem (numpy, scipy, sklearn, xarray, etc.). The later has been one of >> the strengths of pandas and it would be a shame to see that go away. >> > > I have no intention of letting this happen. What I've am asking from > you (and others reading) is to help define what constitutes > interoperability. What guarantees do we make the user? > > For example, we should have very strict guidelines for the output of: > > np.asarray(pandas_obj) > > For example > > In [3]: s = pd.Series([1,2,3]*10).astype('category') > > In [4]: np.asarray(s) > Out[4]: > array([1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, > 3, 1, 2, 3, 1, 2, 3]) > > I see no reason why this should necessarily behave any differently. > The problem will come in when there is pandas data that is not > precisely representable in a NumPy array. Example: > > In [5]: s = pd.Series([1,2,3, 4]) > > In [6]: s.dtype > Out[6]: dtype('int64') > > In [7]: s2 = s.reindex(np.arange(10)) > > In [8]: s2.dtype > Out[8]: dtype('float64') > > In [9]: np.asarray(s2) > Out[9]: array([ 1., 2., 3., 4., nan, nan, nan, nan, nan, nan]) > > With the "new internals", s2 will still be int64 type, but we may > decide that np.asarray(s2) should raise an exception rather than > implicitly make a decision about how to perform a "lossy" conversion > to a NumPy array. If you are using DyND with pandas, then the > equivalent function would be able to implicitly convert without data > loss. > >> We're already starting to struggle with inter-operability with the new >> pandas dtypes and a further rewrite would make this even harder. >> For example, see categoricals and scikit-learn in Tom's recent post [1], or the >> fact that .values no longer always returns a numpy array. This has also been >> a challenge for xarray, which can't handle these new dtypes because we lack >> a suitable array backend for them. > > I'm definitely motivated in this initiative by these challenges. The > idea here is that with the new internals, Series.values will always > return the same type of object, and there will be one consistent code > path for getting a NumPy array out. For example, rather than: > > if isinstance(s.values, Categorical): > # pandas > ... > else: > # NumPy > ... > > We could have (just an idea) > > s.values.to_numpy() > > Or simply > > np.asarray(s.values) > >> >> Personally, I would much rather leverage a full featured library like an >> improved NumPy or DyND for new dtypes, because that could also be used by >> the array-based ecosystem. At the very least, it would be good to think >> about zero-copy inter-operability with array-based tools. 
>> > > I'm all for zero-copy interoperability when possible, but my gut > feeling is that exposing the data type system of an array library (the > choice of which is an implementation detail) to pandas users is an > inherent leaky abstraction that will continue to cause problems if we > plan to keep innovating inside pandas. By better hiding NumPy details > and types from the user we will make it much easier to swap out new > low level array data structures and compute components (e.g. DyND), or > add custom data structures or out-of-core tools (memory maps, bcolz, > etc.) > > I'm additionally offering to do nearly all of this replumbing of > pandas internals myself, and completely in my free time. What I will > expect in return from you all is to help enumerate our contracts with > the pandas user (i.e. interoperability) and to hold me accountable to > not break them. I know I haven't been committing code on pandas since > mid-2013 (after a 5 year marathon), but these architectural problems > have been on my mind almost constantly since then, I just haven't had > the bandwidth to start tackling them. > > cheers, > Wes > >> On the other hand, I wonder if maybe it would be better to write a native >> in-memory backend for Ibis instead of rewriting pandas. Ibis does seem to >> have improved/simplified API which resolves many of pandas's warts. That >> said, it's a pretty big change from the "DataFrame as matrix" model, and >> pandas won't be going away anytime soon. I do like that it would force users >> to be more explicit about converting between tables and arrays, which might >> also make distinctions between the tabular and array oriented ecosystems >> easier to swallow. >> >> Just my two cents, from someone who has lots of opinions but who will likely >> stay on the sidelines for most of this work. >> >> Cheers, >> Stephan >> >> [1] http://tomaugspurger.github.io/categorical-pipelines.html >> >> On Fri, Jan 1, 2016 at 6:06 PM, Jeff Reback wrote: >>> >>> ok I moved the document to the Pandas folder, where the same group should >>> be able to edit/upload/etc. lmk if any issues >>> >>> On Fri, Jan 1, 2016 at 8:48 PM, Wes McKinney wrote: >>>> >>>> Thanks Jeff. Can you create and share a shared Drive folder containing >>>> this where I can put other auxiliary / follow up documents? >>>> >>>> On Fri, Jan 1, 2016 at 5:23 PM, Jeff Reback wrote: >>>> > I changed the doc so that the core dev people can edit. I *think* that >>>> > everyone should be able to view/comment though. >>>> > >>>> > On Fri, Jan 1, 2016 at 8:13 PM, Wes McKinney >>>> > wrote: >>>> >> >>>> >> Jeff -- can you require log-in for editing on this document? >>>> >> >>>> >> >>>> >> https://docs.google.com/document/d/151ct8jcZWwh7XStptjbLsda6h2b3C0IuiH_hfZnUA58/edit# >>>> >> >>>> >> There are a number of anonymous edits. >>>> >> >>>> >> On Wed, Dec 30, 2015 at 6:04 PM, Wes McKinney >>>> >> wrote: >>>> >> > I cobbled together an ugly start of a c++->cython->pandas toolchain >>>> >> > here >>>> >> > >>>> >> > https://github.com/wesm/pandas/tree/libpandas-native-core >>>> >> > >>>> >> > I used a mix of Kudu, Impala, and dynd-python cmake sources, so it's >>>> >> > a >>>> >> > bit messy at the moment but it should be sufficient to run some real >>>> >> > experiments with a little more work. 
I reckon it's like a 6 month >>>> >> > project to tear out the insides of Series and DataFrame and replace >>>> >> > it >>>> >> > with a new "native core", but we should be able to get enough info >>>> >> > to >>>> >> > see whether it's a viable plan within a month or so. >>>> >> > >>>> >> > The end goal is to create "private" extension types in Cython that >>>> >> > can >>>> >> > be the new base classes for Series and NDFrame; these will hold a >>>> >> > reference to a C++ object that contains wrappered NumPy arrays and >>>> >> > other metadata (like pandas-only dtypes). >>>> >> > >>>> >> > It might be too hard to try to replace a single usage of block >>>> >> > manager >>>> >> > as a first experiment, so I'll try to create a minimal "SeriesLite" >>>> >> > that supports 3 dtypes >>>> >> > >>>> >> > 1) float64 with nans >>>> >> > 2) int64 with a bitmask for NAs >>>> >> > 3) category type for one of these >>>> >> > >>>> >> > Just want to get a feel for the extensibility and offer an NA >>>> >> > singleton Python object (a la None) for getting and setting NAs >>>> >> > across >>>> >> > these 3 dtypes. >>>> >> > >>>> >> > If we end up going down this route, any way to place a moratorium on >>>> >> > invasive work on pandas internals (outside bug fixes)? >>>> >> > >>>> >> > Pedantic aside: I'd rather avoid shipping thirdparty C/C++ libraries >>>> >> > like googletest and friends in pandas if we can. Cloudera folks have >>>> >> > been working on a portable C++ library toolchain for Impala and >>>> >> > other >>>> >> > projects at https://github.com/cloudera/native-toolchain, but it is >>>> >> > only being tested on Linux and OS X. Most google libraries should >>>> >> > build out of the box on MSVC but it'll be something to keep an eye >>>> >> > on. >>>> >> > >>>> >> > BTW thanks to the libdynd developers for pioneering the c++ lib <-> >>>> >> > python-c++ lib <-> cython toolchain; being able to build Cython >>>> >> > extensions directly from cmake is a godsend >>>> >> > >>>> >> > HNY all >>>> >> > Wes >>>> >> > >>>> >> > On Tue, Dec 29, 2015 at 4:17 PM, Irwin Zaid >>>> >> > wrote: >>>> >> >> Yeah, that seems reasonable and I totally agree a Pandas wrapper >>>> >> >> layer >>>> >> >> would >>>> >> >> be necessary. >>>> >> >> >>>> >> >> I'll keep an eye on this and I'd like to help if I can. >>>> >> >> >>>> >> >> Irwin >>>> >> >> >>>> >> >> >>>> >> >> On Tue, Dec 29, 2015 at 6:01 PM, Wes McKinney >>>> >> >> wrote: >>>> >> >>> >>>> >> >>> I'm not suggesting a rewrite of NumPy functionality but rather >>>> >> >>> pandas >>>> >> >>> functionality that is currently written in a mishmash of Cython >>>> >> >>> and >>>> >> >>> Python. >>>> >> >>> Happy to experiment with changing the internal compute >>>> >> >>> infrastructure >>>> >> >>> and >>>> >> >>> data representation to DyND after this first stage of cleanup is >>>> >> >>> done. >>>> >> >>> Even >>>> >> >>> if we use DyND a pretty extensive pandas wrapper layer will be >>>> >> >>> necessary. >>>> >> >>> >>>> >> >>> >>>> >> >>> On Tuesday, December 29, 2015, Irwin Zaid >>>> >> >>> wrote: >>>> >> >>>> >>>> >> >>>> Hi Wes (and others), >>>> >> >>>> >>>> >> >>>> I've been following this conversation with interest. I do think >>>> >> >>>> it >>>> >> >>>> would >>>> >> >>>> be worth exploring DyND, rather than setting up yet another >>>> >> >>>> rewrite >>>> >> >>>> of >>>> >> >>>> NumPy-functionality. Especially because DyND is already an >>>> >> >>>> optional >>>> >> >>>> dependency of Pandas. 
>>>> >> >>>> >>>> >> >>>> For things like Integer NA and new dtypes, DyND is there and >>>> >> >>>> ready to >>>> >> >>>> do >>>> >> >>>> this. >>>> >> >>>> >>>> >> >>>> Irwin >>>> >> >>>> >>>> >> >>>> On Tue, Dec 29, 2015 at 5:18 PM, Wes McKinney >>>> >> >>>> >>>> >> >>>> wrote: >>>> >> >>>>> >>>> >> >>>>> Can you link to the PR you're talking about? >>>> >> >>>>> >>>> >> >>>>> I will see about spending a few hours setting up a libpandas.so >>>> >> >>>>> as a >>>> >> >>>>> C++ >>>> >> >>>>> shared library where we can run some experiments and validate >>>> >> >>>>> whether it can >>>> >> >>>>> solve the integer-NA problem and be a place to put new data >>>> >> >>>>> types >>>> >> >>>>> (categorical and friends). I'm +1 on targeting >>>> >> >>>>> >>>> >> >>>>> Would it also be worth making a wish list of APIs we might >>>> >> >>>>> consider >>>> >> >>>>> breaking in a pandas 1.0 release that also features this new >>>> >> >>>>> "native >>>> >> >>>>> core"? >>>> >> >>>>> Might as well right some wrongs while we're doing some invasive >>>> >> >>>>> work >>>> >> >>>>> on the >>>> >> >>>>> internals; some breakage might be unavoidable. We can always >>>> >> >>>>> maintain a >>>> >> >>>>> pandas legacy 0.x.x maintenance branch (providing a conda binary >>>> >> >>>>> build) for >>>> >> >>>>> legacy users where showstopper bugs can get fixed. >>>> >> >>>>> >>>> >> >>>>> On Tue, Dec 29, 2015 at 1:20 PM, Jeff Reback >>>> >> >>>>> >>>> >> >>>>> wrote: >>>> >> >>>>> > Wes your last is noted as well. I *think* we can actually do >>>> >> >>>>> > this >>>> >> >>>>> > now >>>> >> >>>>> > (well >>>> >> >>>>> > there is a PR out there). >>>> >> >>>>> > >>>> >> >>>>> > On Tue, Dec 29, 2015 at 4:12 PM, Wes McKinney >>>> >> >>>>> > >>>> >> >>>>> > wrote: >>>> >> >>>>> >> >>>> >> >>>>> >> The other huge thing this will enable is to do is >>>> >> >>>>> >> copy-on-write >>>> >> >>>>> >> for >>>> >> >>>>> >> various kinds of views, which should cut down on some of the >>>> >> >>>>> >> defensive >>>> >> >>>>> >> copying in the library and reduce memory usage. >>>> >> >>>>> >> >>>> >> >>>>> >> On Tue, Dec 29, 2015 at 1:02 PM, Wes McKinney >>>> >> >>>>> >> >>>> >> >>>>> >> wrote: >>>> >> >>>>> >> > Basically the approach is >>>> >> >>>>> >> > >>>> >> >>>>> >> > 1) Base dtype type >>>> >> >>>>> >> > 2) Base array type with K >= 1 dimensions >>>> >> >>>>> >> > 3) Base scalar type >>>> >> >>>>> >> > 4) Base index type >>>> >> >>>>> >> > 5) "Wrapper" subclasses for all NumPy types fitting into >>>> >> >>>>> >> > categories >>>> >> >>>>> >> > #1, #2, #3, #4 >>>> >> >>>>> >> > 6) Subclasses for pandas-specific types like category, >>>> >> >>>>> >> > datetimeTZ, >>>> >> >>>>> >> > etc. >>>> >> >>>>> >> > 7) NDFrame as cpcloud wrote is just a list of these >>>> >> >>>>> >> > >>>> >> >>>>> >> > Indexes and axis labels / column names can get layered on >>>> >> >>>>> >> > top. >>>> >> >>>>> >> > >>>> >> >>>>> >> > After we do all this we can look at adding nested types >>>> >> >>>>> >> > (arrays, >>>> >> >>>>> >> > maps, >>>> >> >>>>> >> > structs) to better support JSON. >>>> >> >>>>> >> > >>>> >> >>>>> >> > - Wes >>>> >> >>>>> >> > >>>> >> >>>>> >> > On Tue, Dec 29, 2015 at 12:14 PM, Phillip Cloud >>>> >> >>>>> >> > >>>> >> >>>>> >> > wrote: >>>> >> >>>>> >> >> Maybe this is saying the same thing as Wes, but how far >>>> >> >>>>> >> >> would >>>> >> >>>>> >> >> something >>>> >> >>>>> >> >> like >>>> >> >>>>> >> >> this get us? 
>>>> >> >>>>> >> >> >>>> >> >>>>> >> >> // warning: things are probably not this simple >>>> >> >>>>> >> >> >>>> >> >>>>> >> >> struct data_array_t { >>>> >> >>>>> >> >> void *primitive; // scalar data >>>> >> >>>>> >> >> data_array_t *nested; // nested data >>>> >> >>>>> >> >> boost::dynamic_bitset isnull; // might have to create >>>> >> >>>>> >> >> our >>>> >> >>>>> >> >> own >>>> >> >>>>> >> >> to >>>> >> >>>>> >> >> avoid >>>> >> >>>>> >> >> boost >>>> >> >>>>> >> >> schema_t schema; // not sure exactly what this looks >>>> >> >>>>> >> >> like >>>> >> >>>>> >> >> }; >>>> >> >>>>> >> >> >>>> >> >>>>> >> >> typedef std::map data_frame_t; // >>>> >> >>>>> >> >> probably >>>> >> >>>>> >> >> not >>>> >> >>>>> >> >> this >>>> >> >>>>> >> >> simple >>>> >> >>>>> >> >> >>>> >> >>>>> >> >> To answer Jeff?s use-case question: I think that the use >>>> >> >>>>> >> >> cases >>>> >> >>>>> >> >> are >>>> >> >>>>> >> >> 1) >>>> >> >>>>> >> >> freedom from numpy (mostly) 2) no more block manager which >>>> >> >>>>> >> >> frees >>>> >> >>>>> >> >> us >>>> >> >>>>> >> >> from the >>>> >> >>>>> >> >> limitations of the block memory layout. In particular, the >>>> >> >>>>> >> >> ability >>>> >> >>>>> >> >> to >>>> >> >>>>> >> >> take >>>> >> >>>>> >> >> advantage of memory mapped IO would be a big win IMO. >>>> >> >>>>> >> >> >>>> >> >>>>> >> >> >>>> >> >>>>> >> >> On Tue, Dec 29, 2015 at 2:50 PM Wes McKinney >>>> >> >>>>> >> >> >>>> >> >>>>> >> >> wrote: >>>> >> >>>>> >> >>> >>>> >> >>>>> >> >>> I will write a more detailed response to some of these >>>> >> >>>>> >> >>> things >>>> >> >>>>> >> >>> after >>>> >> >>>>> >> >>> the new year, but, in particular, re: missing values, can >>>> >> >>>>> >> >>> you >>>> >> >>>>> >> >>> or >>>> >> >>>>> >> >>> someone tell me why creating an object that contains a >>>> >> >>>>> >> >>> NumPy >>>> >> >>>>> >> >>> array and >>>> >> >>>>> >> >>> a bitmap is not sufficient? If we we can add a >>>> >> >>>>> >> >>> lightweight >>>> >> >>>>> >> >>> C/C++ >>>> >> >>>>> >> >>> class >>>> >> >>>>> >> >>> layer between NumPy function calls (e.g. arithmetic) and >>>> >> >>>>> >> >>> pandas >>>> >> >>>>> >> >>> function calls, then I see no reason why we cannot have >>>> >> >>>>> >> >>> >>>> >> >>>>> >> >>> Int32Array->add >>>> >> >>>>> >> >>> >>>> >> >>>>> >> >>> and >>>> >> >>>>> >> >>> >>>> >> >>>>> >> >>> Float32Array->add >>>> >> >>>>> >> >>> >>>> >> >>>>> >> >>> do the right thing (the former would be responsible for >>>> >> >>>>> >> >>> bitmasking to >>>> >> >>>>> >> >>> propagate NA values; the latter would defer to NumPy). If >>>> >> >>>>> >> >>> we >>>> >> >>>>> >> >>> can >>>> >> >>>>> >> >>> put >>>> >> >>>>> >> >>> all the internals of pandas objects inside a black box, >>>> >> >>>>> >> >>> we >>>> >> >>>>> >> >>> can >>>> >> >>>>> >> >>> add >>>> >> >>>>> >> >>> layers of virtual function indirection without a >>>> >> >>>>> >> >>> performance >>>> >> >>>>> >> >>> penalty >>>> >> >>>>> >> >>> (e.g. adding more interpreter overhead with more >>>> >> >>>>> >> >>> abstraction >>>> >> >>>>> >> >>> layers >>>> >> >>>>> >> >>> does add up to a perf penalty). >>>> >> >>>>> >> >>> >>>> >> >>>>> >> >>> I don't think this is too scary -- I would be willing to >>>> >> >>>>> >> >>> create a >>>> >> >>>>> >> >>> small POC C++ library to prototype something like what >>>> >> >>>>> >> >>> I'm >>>> >> >>>>> >> >>> talking >>>> >> >>>>> >> >>> about. 
>>>> >> >>>>> >> >>>
>>>> >> >>>>> >> >>> Since pandas has limited points of contact with NumPy I don't think
>>>> >> >>>>> >> >>> this would end up being too onerous.
>>>> >> >>>>> >> >>>
>>>> >> >>>>> >> >>> For the record, I'm pretty allergic to "advanced C++"; I think it is
>>>> >> >>>>> >> >>> a useful tool: if you pick a sane 20% subset of the C++11 spec and
>>>> >> >>>>> >> >>> follow Google C++ style, it's not very inaccessible to intermediate
>>>> >> >>>>> >> >>> developers. More or less "C plus OOP and easier object lifetime
>>>> >> >>>>> >> >>> management (shared/unique_ptr, etc.)". As soon as you add a lot of
>>>> >> >>>>> >> >>> template metaprogramming, C++ library development quickly becomes
>>>> >> >>>>> >> >>> inaccessible except to the C++-Jedi.
>>>> >> >>>>> >> >>>
>>>> >> >>>>> >> >>> Maybe let's start a Google document on "pandas roadmap" where we can
>>>> >> >>>>> >> >>> break down the 1-2 year goals and some of these infrastructure issues
>>>> >> >>>>> >> >>> and have our discussion there? (obviously publish this someplace once
>>>> >> >>>>> >> >>> we're done)
>>>> >> >>>>> >> >>>
>>>> >> >>>>> >> >>> - Wes
>>>> >> >>>>> >> >>>
>>>> >> >>>>> >> >>> On Fri, Dec 25, 2015 at 2:14 PM, Jeff Reback wrote:
>>>> >> >>>>> >> >>> > Here are some of my thoughts about the pandas Roadmap / status and
>>>> >> >>>>> >> >>> > some responses to Wes's thoughts.
>>>> >> >>>>> >> >>> >
>>>> >> >>>>> >> >>> > In the last few (and upcoming) major releases we have made the
>>>> >> >>>>> >> >>> > following changes:
>>>> >> >>>>> >> >>> >
>>>> >> >>>>> >> >>> > - dtype enhancements (Categorical, Timedelta, Datetime w/tz) &
>>>> >> >>>>> >> >>> >   making these first class objects
>>>> >> >>>>> >> >>> > - code refactoring to remove subclassing of ndarrays for Series &
>>>> >> >>>>> >> >>> >   Index
>>>> >> >>>>> >> >>> > - carving out / deprecating non-core parts of pandas
>>>> >> >>>>> >> >>> >   - datareader
>>>> >> >>>>> >> >>> >   - SparsePanel, WidePanel & other aliases (TimeSeries)
>>>> >> >>>>> >> >>> >   - rpy, rplot, irow et al.
>>>> >> >>>>> >> >>> >   - google-analytics
>>>> >> >>>>> >> >>> > - API changes to make things more consistent
>>>> >> >>>>> >> >>> >   - pd.rolling/expanding * -> .rolling/expanding (this is in
>>>> >> >>>>> >> >>> >     master now)
>>>> >> >>>>> >> >>> >   - .resample becoming a fully deferred operation, like groupby
>>>> >> >>>>> >> >>> >   - multi-index slicing along any level (obviates the need for
>>>> >> >>>>> >> >>> >     .xs) and allows assignment
>>>> >> >>>>> >> >>> >   - .loc/.iloc - for the most part obviates use of .ix
>>>> >> >>>>> >> >>> >   - .pipe & .assign
>>>> >> >>>>> >> >>> >   - plotting accessors
>>>> >> >>>>> >> >>> >   - fixing of the sorting API
>>>> >> >>>>> >> >>> > - many performance enhancements both micro & macro (e.g. release
>>>> >> >>>>> >> >>> >   GIL)
>>>> >> >>>>> >> >>> >
>>>> >> >>>>> >> >>> > Some on-deck enhancements are (meaning these are basically ready
>>>> >> >>>>> >> >>> > to go in):
>>>> >> >>>>> >> >>> > - IntervalIndex (and eventually make PeriodIndex just a sub-class
>>>> >> >>>>> >> >>> >   of this)
>>>> >> >>>>> >> >>> > - RangeIndex
>>>> >> >>>>> >> >>> >
>>>> >> >>>>> >> >>> > so lots of changes, though nothing really earth-shaking, just more
>>>> >> >>>>> >> >>> > convenience, reducing magicness somewhat and providing
>>>> >> >>>>> >> >>> > flexibility.
>>>> >> >>>>> >> >>> >
>>>> >> >>>>> >> >>> > Of course we are getting increasing issues, mostly bug reports
>>>> >> >>>>> >> >>> > (and lots of dupes), some edge case enhancements which can add to
>>>> >> >>>>> >> >>> > the existing APIs and, of course, requests to expand the (already)
>>>> >> >>>>> >> >>> > large code base to other use cases. Balancing this are a good many
>>>> >> >>>>> >> >>> > pull-requests from many different users, some even deep into the
>>>> >> >>>>> >> >>> > internals.
>>>> >> >>>>> >> >>> >
>>>> >> >>>>> >> >>> > Here are some things that I have talked about and that could be
>>>> >> >>>>> >> >>> > considered for the roadmap. Disclaimer: I do work for Continuum,
>>>> >> >>>>> >> >>> > but these views are of course my own; furthermore I am obviously a
>>>> >> >>>>> >> >>> > bit more familiar with some of the 'sponsored' open-source
>>>> >> >>>>> >> >>> > libraries, but always open to new things.
>>>> >> >>>>> >> >>> >
>>>> >> >>>>> >> >>> > - integration / automatic deferral to numba for JIT (this would be
>>>> >> >>>>> >> >>> >   through .apply)
>>>> >> >>>>> >> >>> > - automatic deferral to dask from groupby where appropriate /
>>>> >> >>>>> >> >>> >   maybe a .to_parallel (to simply return a dask.DataFrame object)
>>>> >> >>>>> >> >>> > - incorporation of quantities / units (as part of the dtype)
>>>> >> >>>>> >> >>> > - use of DyND to allow missing values for int dtypes
>>>> >> >>>>> >> >>> > - make Period a first class dtype
>>>> >> >>>>> >> >>> > - provide some copy-on-write semantics to alleviate the
>>>> >> >>>>> >> >>> >   chained-indexing issues which occasionally come up with mis-use
>>>> >> >>>>> >> >>> >   of the indexing API
>>>> >> >>>>> >> >>> > - allow a 'policy' to automatically provide column blocks for
>>>> >> >>>>> >> >>> >   dict-like input (e.g. each column would be a block); this would
>>>> >> >>>>> >> >>> >   allow a pass-thru API where you could put in numpy arrays where
>>>> >> >>>>> >> >>> >   you have views and have them preserved rather than copied
>>>> >> >>>>> >> >>> >   automatically. Note that this would also allow what I call
>>>> >> >>>>> >> >>> >   'split', where a passed-in multi-dim numpy array could be split
>>>> >> >>>>> >> >>> >   up into individual blocks (which actually gives a nice perf
>>>> >> >>>>> >> >>> >   boost after the splitting costs).
>>>> >> >>>>> >> >>> >
>>>> >> >>>>> >> >>> > In working towards some of these goals, I have come to the opinion
>>>> >> >>>>> >> >>> > that it would make sense to have a neutral API protocol layer that
>>>> >> >>>>> >> >>> > would allow us to swap out different engines as needed, for
>>>> >> >>>>> >> >>> > particular dtypes, or *maybe* out-of-core type computations. E.g.
>>>> >> >>>>> >> >>> > imagine that we replaced the in-memory block structure with a
>>>> >> >>>>> >> >>> > bcolz / memmap type; in theory this should be 'easy' and just
>>>> >> >>>>> >> >>> > work. I could also see us adopting *some* of the SFrame code to
>>>> >> >>>>> >> >>> > allow easier interop with this API layer.
>>>> >> >>>>> >> >>> >
>>>> >> >>>>> >> >>> > In practice, I think a nice API layer would need to be created to
>>>> >> >>>>> >> >>> > make this clean / nice.
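
[A rough Python sketch of the sort of neutral, engine-swappable layer Jeff is describing; ColumnEngine and NumPyEngine are hypothetical names, for illustration only:]

    import numpy as np

    class ColumnEngine(object):
        """Minimal protocol a swappable column backend might satisfy."""
        def take(self, indexer):
            raise NotImplementedError
        def reduce(self, op):
            raise NotImplementedError

    class NumPyEngine(ColumnEngine):
        def __init__(self, values):
            self.values = np.asarray(values)
        def take(self, indexer):
            return NumPyEngine(self.values.take(indexer))
        def reduce(self, op):
            # e.g. op='sum' dispatches to np.sum
            return getattr(np, op)(self.values)

    # A bcolz- or memmap-backed engine could implement the same protocol,
    # letting the container swap storage without touching user-facing code.
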
>>>> >> >>>>> >> >>> >
>>>> >> >>>>> >> >>> > So this comes around to Wes's point about creating a c++ library
>>>> >> >>>>> >> >>> > for the internals (and possibly even some of the indexing
>>>> >> >>>>> >> >>> > routines). In an ideal world, of course this would be desirable.
>>>> >> >>>>> >> >>> > Getting there is a bit non-trivial I think, and IMHO might not be
>>>> >> >>>>> >> >>> > worth the effort. I don't really see big performance bottlenecks.
>>>> >> >>>>> >> >>> > We *already* defer much of the computation to libraries like
>>>> >> >>>>> >> >>> > numexpr & bottleneck (where appropriate). Adding numba / dask to
>>>> >> >>>>> >> >>> > the list would be helpful.
>>>> >> >>>>> >> >>> >
>>>> >> >>>>> >> >>> > I think that almost all performance issues are the result of:
>>>> >> >>>>> >> >>> >
>>>> >> >>>>> >> >>> > a) gross misuse of the pandas API. How much code have you seen
>>>> >> >>>>> >> >>> >    that does df.apply(lambda x: x.sum())
>>>> >> >>>>> >> >>> > b) routines which operate column-by-column rather than
>>>> >> >>>>> >> >>> >    block-by-block and are in python space (e.g. we have an issue
>>>> >> >>>>> >> >>> >    right now about .quantile)
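
[To make point (a) concrete, a small illustration; exact timings vary by machine and version, but the vectorized form avoids a Python-level loop:]

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.randn(1000000, 4), columns=list('abcd'))

    slow = df.apply(lambda x: x.sum())  # Python-level loop over columns
    fast = df.sum()                     # dispatches to optimized internals

    assert np.allclose(slow, fast)      # same answer, very different cost
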
>>>> >> >>>>> >> >>> >
>>>> >> >>>>> >> >>> > So I am glossing over a big goal of having a c++ library that
>>>> >> >>>>> >> >>> > represents the pandas internals. This would by definition have a
>>>> >> >>>>> >> >>> > C API, so that you *could* use pandas-like semantics in c/c++ and
>>>> >> >>>>> >> >>> > just have it work (and then pandas would be a thin wrapper around
>>>> >> >>>>> >> >>> > this library).
>>>> >> >>>>> >> >>> >
>>>> >> >>>>> >> >>> > I am not averse to this, but I think it would be quite a big
>>>> >> >>>>> >> >>> > effort, and not a huge perf boost IMHO. Further, there are a
>>>> >> >>>>> >> >>> > number of API issues w.r.t. indexing which need to be clarified /
>>>> >> >>>>> >> >>> > worked out (e.g. should we simply deprecate []) that are much
>>>> >> >>>>> >> >>> > easier to test / figure out in python space.
>>>> >> >>>>> >> >>> >
>>>> >> >>>>> >> >>> > I also think that we have quite a large number of contributors.
>>>> >> >>>>> >> >>> > Moving to c++ might make the internals a bit more impenetrable
>>>> >> >>>>> >> >>> > than the current internals (though this would allow c++ people to
>>>> >> >>>>> >> >>> > contribute, so that might balance out).
>>>> >> >>>>> >> >>> >
>>>> >> >>>>> >> >>> > We have a limited core of devs who right now are familiar with
>>>> >> >>>>> >> >>> > things. If someone happened to have a starting base for a c++
>>>> >> >>>>> >> >>> > library, then I might change opinions here.
>>>> >> >>>>> >> >>> >
>>>> >> >>>>> >> >>> > my 4c.
>>>> >> >>>>> >> >>> >
>>>> >> >>>>> >> >>> > Jeff
>>>> >> >>>>> >> >>> >
>>>> >> >>>>> >> >>> > On Thu, Dec 24, 2015 at 7:18 PM, Wes McKinney wrote:
>>>> >> >>>>> >> >>> >>
>>>> >> >>>>> >> >>> >> Deep thoughts during the holidays.
>>>> >> >>>>> >> >>> >>
>>>> >> >>>>> >> >>> >> I might be out of line here, but the interpreter-heaviness of
>>>> >> >>>>> >> >>> >> the inside of pandas objects is likely to be a long-term
>>>> >> >>>>> >> >>> >> liability and source of performance problems and technical debt.
>>>> >> >>>>> >> >>> >>
>>>> >> >>>>> >> >>> >> Has anyone put any thought into planning and beginning to
>>>> >> >>>>> >> >>> >> execute on a rewrite that moves as much as possible of the
>>>> >> >>>>> >> >>> >> internals into native / compiled code? I'm talking about:
>>>> >> >>>>> >> >>> >>
>>>> >> >>>>> >> >>> >> - pandas/core/internals
>>>> >> >>>>> >> >>> >> - indexing and assignment
>>>> >> >>>>> >> >>> >> - much of pandas/core/common
>>>> >> >>>>> >> >>> >> - categorical and custom dtypes
>>>> >> >>>>> >> >>> >> - all indexing mechanisms
>>>> >> >>>>> >> >>> >>
>>>> >> >>>>> >> >>> >> I'm concerned we've already exposed too much of the internals to
>>>> >> >>>>> >> >>> >> users, so this might lead to a lot of API breakage, but it might
>>>> >> >>>>> >> >>> >> be for the Greater Good. As a first step, beginning a partial
>>>> >> >>>>> >> >>> >> migration of internals into some C++ classes that encapsulate
>>>> >> >>>>> >> >>> >> the insides of DataFrame objects and implement indexing and
>>>> >> >>>>> >> >>> >> block-level manipulations would be a good place to start. I
>>>> >> >>>>> >> >>> >> think you could do this without too much disruption.
>>>> >> >>>>> >> >>> >>
>>>> >> >>>>> >> >>> >> As part of this internal retooling we might give consideration
>>>> >> >>>>> >> >>> >> to alternative data structures for representing data internal to
>>>> >> >>>>> >> >>> >> pandas objects. Now in 2015/2016, continuing to be hamstrung by
>>>> >> >>>>> >> >>> >> NumPy's limitations feels somewhat anachronistic. User code is
>>>> >> >>>>> >> >>> >> riddled with workarounds for data type fidelity issues and the
>>>> >> >>>>> >> >>> >> like. Like, really, why not add a bitndarray (similar to
>>>> >> >>>>> >> >>> >> ilanschnell/bitarray) for storing nullness for problematic types
>>>> >> >>>>> >> >>> >> and hide this from the user? =)
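
[Back-of-the-envelope arithmetic for the packed-bitmask idea, assuming one validity bit per value as in ilanschnell/bitarray:]

    n = 10**7                         # 10 million int64 values
    data_mb    = 8 * n / 2.**20       # ~76.3 MiB of data
    bitmask_mb = n / 8 / 2.**20       # ~1.2 MiB packed mask, ~1.6% overhead
    bytemask_mb = n / 2.**20          # a numpy bool byte-mask: ~9.5 MiB, 12.5%
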
>>>> >> >>>>> >> >>> >>
>>>> >> >>>>> >> >>> >> Since we are now a NumFOCUS-sponsored project, I feel like we
>>>> >> >>>>> >> >>> >> might consider establishing some formal governance over pandas
>>>> >> >>>>> >> >>> >> and publishing meeting notes from committers and roadmap
>>>> >> >>>>> >> >>> >> documents describing plans for the project. There's no real
>>>> >> >>>>> >> >>> >> "committer culture" for NumFOCUS projects like there is with the
>>>> >> >>>>> >> >>> >> Apache Software Foundation, but we might try leading by example!
>>>> >> >>>>> >> >>> >>
>>>> >> >>>>> >> >>> >> Also, I believe pandas as a project has reached a level of
>>>> >> >>>>> >> >>> >> importance where we ought to consider planning and execution on
>>>> >> >>>>> >> >>> >> larger scale undertakings such as this for safeguarding the
>>>> >> >>>>> >> >>> >> future.
>>>> >> >>>>> >> >>> >>
>>>> >> >>>>> >> >>> >> As for myself, well, I have my hands full in Big Data-land. I
>>>> >> >>>>> >> >>> >> wish I could be helping more with pandas, but there are quite a
>>>> >> >>>>> >> >>> >> few fundamental issues (like data interoperability, nested data
>>>> >> >>>>> >> >>> >> handling, and file format support -- e.g. Parquet, see
>>>> >> >>>>> >> >>> >> http://wesmckinney.com/blog/the-problem-with-the-data-science-language-wars/)
>>>> >> >>>>> >> >>> >> preventing Python from being more useful in industry analytics
>>>> >> >>>>> >> >>> >> applications.
>>>> >> >>>>> >> >>> >>
>>>> >> >>>>> >> >>> >> Aside: one of the bigger mistakes I made with pandas's API
>>>> >> >>>>> >> >>> >> design was making it acceptable to call class constructors --
>>>> >> >>>>> >> >>> >> like pandas.DataFrame -- directly (versus factory functions).
>>>> >> >>>>> >> >>> >> Sorry about that! If we could convince everyone to start writing
>>>> >> >>>>> >> >>> >> pandas.data_frame or dataframe instead of using the class
>>>> >> >>>>> >> >>> >> reference it would help a lot with code cleanup. It's hard to
>>>> >> >>>>> >> >>> >> plan for these things -- NumPy interoperability seemed a lot
>>>> >> >>>>> >> >>> >> more important in 2008 than it does now, so I forgive myself.
>>>> >> >>>>> >> >>> >>
>>>> >> >>>>> >> >>> >> cheers and best wishes for 2016,
>>>> >> >>>>> >> >>> >> Wes
_______________________________________________
Pandas-dev mailing list
Pandas-dev at python.org
https://mail.python.org/mailman/listinfo/pandas-dev
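
[The factory-function pattern from Wes's aside, in miniature; pandas.data_frame does not exist -- the point is only that a function can change what it returns later, while a class reference cannot:]

    import pandas as pd

    def data_frame(data=None, index=None, columns=None):
        # A factory keeps construction logic in one place and stays free to
        # return a different concrete class later without breaking callers.
        return pd.DataFrame(data=data, index=index, columns=columns)

    df = data_frame({'a': [1, 2, 3]})
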
From jeffreback at gmail.com  Wed Jan  6 14:45:45 2016
From: jeffreback at gmail.com (Jeff Reback)
Date: Wed, 6 Jan 2016 14:45:45 -0500
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To:
References:
Message-ID:

I'll just apologize right up front! hahah. No, I think I have been
pushing on these extras in pandas to help move it forward. I have
commented a bit on Stephan's issue here about why I didn't push for
these in numpy. numpy is fairly slow moving (though it moves faster
lately; I suspect the pace when Wes was developing pandas was not much
faster). So pandas was essentially 'fixing' lots of bug / compat issues
in numpy.

To the extent that we can keep the current user-facing API the same
(high likelihood I think), I am willing to accept *some* breakage with
the pandas->duck-like array container API in order to provide swappable
containers.

For example, I recall that in doing datetime w/tz we wanted
Series.values to return a numpy array (which it DOES!) but it is
actually lossy (it loses the tz). Same thing with the Categorical
example Wes gave. I don't think these requirements should hold pandas
back!

People are increasingly using pandas as the API for their work. That
makes it very important that we can handle lots of input properly, w/o
the handcuffs of numpy.

All this said, I'll reiterate Wes's (and others') point that
back-compat is extremely important. (I in fact try to bend over
backwards to provide this; sometimes it's too much, of course!) E.g.
take the resample changes to the API: I was originally going to just do
a hard break, but this turns off people when they have to update their
code or else.

my 4c (incrementing!)

Jeff
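
[The lossiness Jeff mentions, demonstrated; output is illustrative of pandas around 0.17 and may differ by version:]

    import pandas as pd

    s = pd.Series(pd.date_range('2016-01-01', periods=3, tz='US/Eastern'))
    s.dtype    # datetime64[ns, US/Eastern]
    s.values   # numpy datetime64[ns] values in UTC -- the tz is gone

    c = pd.Series(list('abca')).astype('category')
    c.values   # a pandas Categorical, not a numpy ndarray
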
On Wed, Jan 6, 2016 at 2:37 PM, Wes McKinney wrote:
> On Wed, Jan 6, 2016 at 11:26 AM, Wes McKinney wrote:
> > hey Stephan,
> >
> > Thanks for all the thoughts. Let me make a few off-the-cuff comments.
> >
> > On Wed, Jan 6, 2016 at 10:11 AM, Stephan Hoyer wrote:
> >> I was asked about this off list, so I'll belatedly share my thoughts.
> >>
> >> First of all, I am really excited by Wes's renewed engagement in the
> >> project and his interest in rewriting pandas internals. This is quite
> >> an ambitious plan and nobody is better positioned to tackle it than
> >> Wes.
> >>
> >> I have mixed feelings about the details of the rewrite itself.
> >>
> >> +1 on the simpler internal data model. The block manager is confusing
> >> and leads to hard-to-predict performance issues related to copying
> >> data. If we can do all column additions/removals/re-orderings without
> >> a copy it will be a clear win.
> >>
> >> +0 on moving internals to C++. I do like the performance benefits,
> >> but it seems like a lot of work, and it may make pandas less friendly
> >> to new contributors.
> >
> > It really goes beyond performance benefits. If you go back to my 2013
> > talk
> > http://www.slideshare.net/wesm/practical-medium-data-analytics-with-python
> > there's a long list of architectural problems that now in 2016 haven't
> > found solutions. The only way (that I can fully reason through -- I am
> > happy to look at alternate proposals) to move the internals of pandas
> > closer to the metal is to give Series and DataFrame a C/C++ API --
> > this is the "libpandas native core" as I've been describing.
>
> I should point out that the main thing that's changed since that preso
> is "synthetic" data types like Categorical. But seeing what it took for
> Jeff et al to build that is a prime motivation for this internals
> refactoring plan.
>
> >> -0 on writing a brand new dtype system just for pandas -- this stuff
> >> really belongs in NumPy (or another array library like DyND), and I
> >> am skeptical that pandas can do a complete enough job to be useful
> >> without replicating all that functionality.
> >
> > I'm curious what "a brand new dtype system" means to you. pandas
> > already has its own data type system, but it's a potpourri of
> > inconsistencies and rough edges with self-evident problems for both
> > users and developers. Some indicators:
> >
> > - Some pandas types use NaN for missing data, others None (or both),
> >   others nothing at all. We lose data (integers) or bloat memory
> >   (booleans) by upcasting to float-NaN or object-None.
> > - Internal code full of is_XXX_dtype functions: pandas.core.common,
> >   pandas.core.algorithms, etc.
> > - Series.values on synthetic dtypes like Categorical
> > - We use arrays of Python objects for string data
> >
> > The biggest cause IMHO is that pandas is too tightly coupled to NumPy,
> > but it's coupled in a way that makes development and extensibility
> > difficult. We've already allowed NumPy-specific details to taint the
> > pandas user API in many unpleasant ways. This isn't to say "NumPy is
> > bad" but rather "pandas tries to layer domain-specific functionality
> > [that NumPy was not designed for] on top".
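
[Two of the indicators above, demonstrated; behavior shown is pandas circa 0.17:]

    import pandas as pd

    pd.Series(['a', 'b', 'c']).dtype
    # dtype('O') -- strings stored as Python objects

    pd.Series([True, False]).reindex([0, 1, 2]).dtype
    # dtype('O') -- booleans upcast to object to accommodate a null
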
> >
> > Some things I'm advocating with the internals refactor:
> >
> > 1) First class "pandas type" objects. This is not the same as a NumPy
> >    dtype, which has some pretty loaded implications -- in particular,
> >    NumPy dtypes are implicitly coupled to an array computing framework
> >    (see the function table that is attached to the PyArray_Descr
> >    object)
> >
> > 2) Pandas array container types that map user-land API calls to
> >    implementation-land API calls (in NumPy, DyND, or pandas-native
> >    code like pandas.core.algorithms etc.). This will make it much
> >    easier to leverage innovations in NumPy and DyND without those
> >    implementation details spilling over into the pandas user API
> >
> > 3) Adding a single pandas.NA singleton to have one library-wide notion
> >    of a scalar null value (obviously, we can automatically map NaN and
> >    None to NA for backwards compatibility).
> >
> > 4) Layering a bitmask internally on NumPy arrays (especially integer
> >    and boolean) to add null-ness to types that need it. Note that this
> >    does not prevent us from switching to DyND arrays with option dtype
> >    in the future. If the details of how we are implementing NULL are
> >    visible to the user, we have failed.
> >
> > 5) Removing the block manager in favor of simpler pandas Array (1D)
> >    and Table (2D -- vector of Array) data structures
> >
> > I believe you can do all this without harming interoperability with
> > the ecosystem of projects that people currently use in conjunction
> > with pandas.
> >
> >> More broadly, I am concerned that this rewrite may improve the
> >> tabular computation ecosystem at the cost of inter-operability with
> >> the array-based ecosystem (numpy, scipy, sklearn, xarray, etc.). The
> >> latter has been one of the strengths of pandas and it would be a
> >> shame to see that go away.
> >
> > I have no intention of letting this happen. What I am asking from you
> > (and others reading) is to help define what constitutes
> > interoperability. What guarantees do we make the user?
> >
> > For example, we should have very strict guidelines for the output of:
> >
> > np.asarray(pandas_obj)
> >
> > For example
> >
> > In [3]: s = pd.Series([1,2,3]*10).astype('category')
> >
> > In [4]: np.asarray(s)
> > Out[4]:
> > array([1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3,
> >        1, 2, 3, 1, 2, 3, 1, 2, 3])
> >
> > I see no reason why this should necessarily behave any differently.
> > The problem will come in when there is pandas data that is not
> > precisely representable in a NumPy array. Example:
> >
> > In [5]: s = pd.Series([1,2,3, 4])
> >
> > In [6]: s.dtype
> > Out[6]: dtype('int64')
> >
> > In [7]: s2 = s.reindex(np.arange(10))
> >
> > In [8]: s2.dtype
> > Out[8]: dtype('float64')
> >
> > In [9]: np.asarray(s2)
> > Out[9]: array([  1.,   2.,   3.,   4.,  nan,  nan,  nan,  nan,  nan,  nan])
> >
> > With the "new internals", s2 will still be int64 type, but we may
> > decide that np.asarray(s2) should raise an exception rather than
> > implicitly make a decision about how to perform a "lossy" conversion
> > to a NumPy array. If you are using DyND with pandas, then the
> > equivalent function would be able to implicitly convert without data
> > loss.
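
[A sketch of what such a strict conversion contract could look like; to_numpy_strict is a hypothetical name, and values/isnull stand in for the proposed array-plus-bitmask representation:]

    import numpy as np

    def to_numpy_strict(values, isnull):
        # Refuse lossy conversions instead of silently upcasting: int64
        # data with NAs stays int64 inside pandas, and asking for a plain
        # ndarray becomes an explicit, checkable step.
        if isnull.any():
            raise ValueError("data contains NA values not representable "
                             "in a NumPy array of dtype %s" % values.dtype)
        return np.asarray(values)
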
> >> We're already starting to struggle with inter-operability with the
> >> new pandas dtypes and a further rewrite would make this even harder.
> >> For example, see categoricals and scikit-learn in Tom's recent post
> >> [1], or the fact that .values no longer always returns a numpy array.
> >> This has also been a challenge for xarray, which can't handle these
> >> new dtypes because we lack a suitable array backend for them.
> >
> > I'm definitely motivated in this initiative by these challenges. The
> > idea here is that with the new internals, Series.values will always
> > return the same type of object, and there will be one consistent code
> > path for getting a NumPy array out. For example, rather than:
> >
> > if isinstance(s.values, Categorical):
> >     # pandas
> >     ...
> > else:
> >     # NumPy
> >     ...
> >
> > We could have (just an idea)
> >
> > s.values.to_numpy()
> >
> > Or simply
> >
> > np.asarray(s.values)
> >
> >> Personally, I would much rather leverage a full-featured library like
> >> an improved NumPy or DyND for new dtypes, because that could also be
> >> used by the array-based ecosystem. At the very least, it would be
> >> good to think about zero-copy inter-operability with array-based
> >> tools.
> >
> > I'm all for zero-copy interoperability when possible, but my gut
> > feeling is that exposing the data type system of an array library (the
> > choice of which is an implementation detail) to pandas users is an
> > inherent leaky abstraction that will continue to cause problems if we
> > plan to keep innovating inside pandas. By better hiding NumPy details
> > and types from the user we will make it much easier to swap out new
> > low-level array data structures and compute components (e.g. DyND), or
> > add custom data structures or out-of-core tools (memory maps, bcolz,
> > etc.)
> >
> > I'm additionally offering to do nearly all of this replumbing of
> > pandas internals myself, and completely in my free time. What I will
> > expect in return from you all is to help enumerate our contracts with
> > the pandas user (i.e. interoperability) and to hold me accountable to
> > not break them. I know I haven't been committing code on pandas since
> > mid-2013 (after a 5 year marathon), but these architectural problems
> > have been on my mind almost constantly since then, I just haven't had
> > the bandwidth to start tackling them.
> >
> > cheers,
> > Wes
> >
> >> On the other hand, I wonder if maybe it would be better to write a
> >> native in-memory backend for Ibis instead of rewriting pandas. Ibis
> >> does seem to have an improved/simplified API which resolves many of
> >> pandas's warts. That said, it's a pretty big change from the
> >> "DataFrame as matrix" model, and pandas won't be going away anytime
> >> soon. I do like that it would force users to be more explicit about
> >> converting between tables and arrays, which might also make
> >> distinctions between the tabular and array-oriented ecosystems easier
> >> to swallow.
> >>
> >> Just my two cents, from someone who has lots of opinions but who will
> >> likely stay on the sidelines for most of this work.
> >>
> >> Cheers,
> >> Stephan
> >>
> >> [1] http://tomaugspurger.github.io/categorical-pipelines.html
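
[A toy version of the "one consistent path out to NumPy" idea from Wes's reply above; CategoricalValues and to_numpy are illustrative names ("just an idea" in his words), not an existing API:]

    import numpy as np

    class CategoricalValues(object):
        """Stand-in for a uniform pandas array type."""
        def __init__(self, codes, categories):
            self.codes = np.asarray(codes)
            self.categories = np.asarray(categories)

        def to_numpy(self):
            # Materialize the codes against the categories.
            return self.categories.take(self.codes)

        def __array__(self):
            # So np.asarray(values) routes through the same conversion.
            return self.to_numpy()

    v = CategoricalValues([0, 1, 2, 0], ['a', 'b', 'c'])
    np.asarray(v)   # -> array(['a', 'b', 'c', 'a'], dtype=...)
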
I *think* > that > >>>> > everyone should be able to view/comment though. > >>>> > > >>>> > On Fri, Jan 1, 2016 at 8:13 PM, Wes McKinney > >>>> > wrote: > >>>> >> > >>>> >> Jeff -- can you require log-in for editing on this document? > >>>> >> > >>>> >> > >>>> >> > https://docs.google.com/document/d/151ct8jcZWwh7XStptjbLsda6h2b3C0IuiH_hfZnUA58/edit# > >>>> >> > >>>> >> There are a number of anonymous edits. > >>>> >> > >>>> >> On Wed, Dec 30, 2015 at 6:04 PM, Wes McKinney > > >>>> >> wrote: > >>>> >> > I cobbled together an ugly start of a c++->cython->pandas > toolchain > >>>> >> > here > >>>> >> > > >>>> >> > https://github.com/wesm/pandas/tree/libpandas-native-core > >>>> >> > > >>>> >> > I used a mix of Kudu, Impala, and dynd-python cmake sources, so > it's > >>>> >> > a > >>>> >> > bit messy at the moment but it should be sufficient to run some > real > >>>> >> > experiments with a little more work. I reckon it's like a 6 month > >>>> >> > project to tear out the insides of Series and DataFrame and > replace > >>>> >> > it > >>>> >> > with a new "native core", but we should be able to get enough > info > >>>> >> > to > >>>> >> > see whether it's a viable plan within a month or so. > >>>> >> > > >>>> >> > The end goal is to create "private" extension types in Cython > that > >>>> >> > can > >>>> >> > be the new base classes for Series and NDFrame; these will hold a > >>>> >> > reference to a C++ object that contains wrappered NumPy arrays > and > >>>> >> > other metadata (like pandas-only dtypes). > >>>> >> > > >>>> >> > It might be too hard to try to replace a single usage of block > >>>> >> > manager > >>>> >> > as a first experiment, so I'll try to create a minimal > "SeriesLite" > >>>> >> > that supports 3 dtypes > >>>> >> > > >>>> >> > 1) float64 with nans > >>>> >> > 2) int64 with a bitmask for NAs > >>>> >> > 3) category type for one of these > >>>> >> > > >>>> >> > Just want to get a feel for the extensibility and offer an NA > >>>> >> > singleton Python object (a la None) for getting and setting NAs > >>>> >> > across > >>>> >> > these 3 dtypes. > >>>> >> > > >>>> >> > If we end up going down this route, any way to place a > moratorium on > >>>> >> > invasive work on pandas internals (outside bug fixes)? > >>>> >> > > >>>> >> > Pedantic aside: I'd rather avoid shipping thirdparty C/C++ > libraries > >>>> >> > like googletest and friends in pandas if we can. Cloudera folks > have > >>>> >> > been working on a portable C++ library toolchain for Impala and > >>>> >> > other > >>>> >> > projects at https://github.com/cloudera/native-toolchain, but > it is > >>>> >> > only being tested on Linux and OS X. Most google libraries should > >>>> >> > build out of the box on MSVC but it'll be something to keep an > eye > >>>> >> > on. > >>>> >> > > >>>> >> > BTW thanks to the libdynd developers for pioneering the c++ lib > <-> > >>>> >> > python-c++ lib <-> cython toolchain; being able to build Cython > >>>> >> > extensions directly from cmake is a godsend > >>>> >> > > >>>> >> > HNY all > >>>> >> > Wes > >>>> >> > > >>>> >> > On Tue, Dec 29, 2015 at 4:17 PM, Irwin Zaid > >>>> >> > wrote: > >>>> >> >> Yeah, that seems reasonable and I totally agree a Pandas wrapper > >>>> >> >> layer > >>>> >> >> would > >>>> >> >> be necessary. > >>>> >> >> > >>>> >> >> I'll keep an eye on this and I'd like to help if I can. 
> >>>> >> >> > >>>> >> >> Irwin > >>>> >> >> > >>>> >> >> > >>>> >> >> On Tue, Dec 29, 2015 at 6:01 PM, Wes McKinney < > wesmckinn at gmail.com> > >>>> >> >> wrote: > >>>> >> >>> > >>>> >> >>> I'm not suggesting a rewrite of NumPy functionality but rather > >>>> >> >>> pandas > >>>> >> >>> functionality that is currently written in a mishmash of Cython > >>>> >> >>> and > >>>> >> >>> Python. > >>>> >> >>> Happy to experiment with changing the internal compute > >>>> >> >>> infrastructure > >>>> >> >>> and > >>>> >> >>> data representation to DyND after this first stage of cleanup > is > >>>> >> >>> done. > >>>> >> >>> Even > >>>> >> >>> if we use DyND a pretty extensive pandas wrapper layer will be > >>>> >> >>> necessary. > >>>> >> >>> > >>>> >> >>> > >>>> >> >>> On Tuesday, December 29, 2015, Irwin Zaid > >>>> >> >>> wrote: > >>>> >> >>>> > >>>> >> >>>> Hi Wes (and others), > >>>> >> >>>> > >>>> >> >>>> I've been following this conversation with interest. I do > think > >>>> >> >>>> it > >>>> >> >>>> would > >>>> >> >>>> be worth exploring DyND, rather than setting up yet another > >>>> >> >>>> rewrite > >>>> >> >>>> of > >>>> >> >>>> NumPy-functionality. Especially because DyND is already an > >>>> >> >>>> optional > >>>> >> >>>> dependency of Pandas. > >>>> >> >>>> > >>>> >> >>>> For things like Integer NA and new dtypes, DyND is there and > >>>> >> >>>> ready to > >>>> >> >>>> do > >>>> >> >>>> this. > >>>> >> >>>> > >>>> >> >>>> Irwin > >>>> >> >>>> > >>>> >> >>>> On Tue, Dec 29, 2015 at 5:18 PM, Wes McKinney > >>>> >> >>>> > >>>> >> >>>> wrote: > >>>> >> >>>>> > >>>> >> >>>>> Can you link to the PR you're talking about? > >>>> >> >>>>> > >>>> >> >>>>> I will see about spending a few hours setting up a > libpandas.so > >>>> >> >>>>> as a > >>>> >> >>>>> C++ > >>>> >> >>>>> shared library where we can run some experiments and validate > >>>> >> >>>>> whether it can > >>>> >> >>>>> solve the integer-NA problem and be a place to put new data > >>>> >> >>>>> types > >>>> >> >>>>> (categorical and friends). I'm +1 on targeting > >>>> >> >>>>> > >>>> >> >>>>> Would it also be worth making a wish list of APIs we might > >>>> >> >>>>> consider > >>>> >> >>>>> breaking in a pandas 1.0 release that also features this new > >>>> >> >>>>> "native > >>>> >> >>>>> core"? > >>>> >> >>>>> Might as well right some wrongs while we're doing some > invasive > >>>> >> >>>>> work > >>>> >> >>>>> on the > >>>> >> >>>>> internals; some breakage might be unavoidable. We can always > >>>> >> >>>>> maintain a > >>>> >> >>>>> pandas legacy 0.x.x maintenance branch (providing a conda > binary > >>>> >> >>>>> build) for > >>>> >> >>>>> legacy users where showstopper bugs can get fixed. > >>>> >> >>>>> > >>>> >> >>>>> On Tue, Dec 29, 2015 at 1:20 PM, Jeff Reback > >>>> >> >>>>> > >>>> >> >>>>> wrote: > >>>> >> >>>>> > Wes your last is noted as well. I *think* we can actually > do > >>>> >> >>>>> > this > >>>> >> >>>>> > now > >>>> >> >>>>> > (well > >>>> >> >>>>> > there is a PR out there). > >>>> >> >>>>> > > >>>> >> >>>>> > On Tue, Dec 29, 2015 at 4:12 PM, Wes McKinney > >>>> >> >>>>> > > >>>> >> >>>>> > wrote: > >>>> >> >>>>> >> > >>>> >> >>>>> >> The other huge thing this will enable is to do is > >>>> >> >>>>> >> copy-on-write > >>>> >> >>>>> >> for > >>>> >> >>>>> >> various kinds of views, which should cut down on some of > the > >>>> >> >>>>> >> defensive > >>>> >> >>>>> >> copying in the library and reduce memory usage. 
> >>>> >> >>>>> >> > >>>> >> >>>>> >> On Tue, Dec 29, 2015 at 1:02 PM, Wes McKinney > >>>> >> >>>>> >> > >>>> >> >>>>> >> wrote: > >>>> >> >>>>> >> > Basically the approach is > >>>> >> >>>>> >> > > >>>> >> >>>>> >> > 1) Base dtype type > >>>> >> >>>>> >> > 2) Base array type with K >= 1 dimensions > >>>> >> >>>>> >> > 3) Base scalar type > >>>> >> >>>>> >> > 4) Base index type > >>>> >> >>>>> >> > 5) "Wrapper" subclasses for all NumPy types fitting into > >>>> >> >>>>> >> > categories > >>>> >> >>>>> >> > #1, #2, #3, #4 > >>>> >> >>>>> >> > 6) Subclasses for pandas-specific types like category, > >>>> >> >>>>> >> > datetimeTZ, > >>>> >> >>>>> >> > etc. > >>>> >> >>>>> >> > 7) NDFrame as cpcloud wrote is just a list of these > >>>> >> >>>>> >> > > >>>> >> >>>>> >> > Indexes and axis labels / column names can get layered > on > >>>> >> >>>>> >> > top. > >>>> >> >>>>> >> > > >>>> >> >>>>> >> > After we do all this we can look at adding nested types > >>>> >> >>>>> >> > (arrays, > >>>> >> >>>>> >> > maps, > >>>> >> >>>>> >> > structs) to better support JSON. > >>>> >> >>>>> >> > > >>>> >> >>>>> >> > - Wes > >>>> >> >>>>> >> > > >>>> >> >>>>> >> > On Tue, Dec 29, 2015 at 12:14 PM, Phillip Cloud > >>>> >> >>>>> >> > > >>>> >> >>>>> >> > wrote: > >>>> >> >>>>> >> >> Maybe this is saying the same thing as Wes, but how far > >>>> >> >>>>> >> >> would > >>>> >> >>>>> >> >> something > >>>> >> >>>>> >> >> like > >>>> >> >>>>> >> >> this get us? > >>>> >> >>>>> >> >> > >>>> >> >>>>> >> >> // warning: things are probably not this simple > >>>> >> >>>>> >> >> > >>>> >> >>>>> >> >> struct data_array_t { > >>>> >> >>>>> >> >> void *primitive; // scalar data > >>>> >> >>>>> >> >> data_array_t *nested; // nested data > >>>> >> >>>>> >> >> boost::dynamic_bitset isnull; // might have to > create > >>>> >> >>>>> >> >> our > >>>> >> >>>>> >> >> own > >>>> >> >>>>> >> >> to > >>>> >> >>>>> >> >> avoid > >>>> >> >>>>> >> >> boost > >>>> >> >>>>> >> >> schema_t schema; // not sure exactly what this > looks > >>>> >> >>>>> >> >> like > >>>> >> >>>>> >> >> }; > >>>> >> >>>>> >> >> > >>>> >> >>>>> >> >> typedef std::map data_frame_t; > // > >>>> >> >>>>> >> >> probably > >>>> >> >>>>> >> >> not > >>>> >> >>>>> >> >> this > >>>> >> >>>>> >> >> simple > >>>> >> >>>>> >> >> > >>>> >> >>>>> >> >> To answer Jeff?s use-case question: I think that the > use > >>>> >> >>>>> >> >> cases > >>>> >> >>>>> >> >> are > >>>> >> >>>>> >> >> 1) > >>>> >> >>>>> >> >> freedom from numpy (mostly) 2) no more block manager > which > >>>> >> >>>>> >> >> frees > >>>> >> >>>>> >> >> us > >>>> >> >>>>> >> >> from the > >>>> >> >>>>> >> >> limitations of the block memory layout. In particular, > the > >>>> >> >>>>> >> >> ability > >>>> >> >>>>> >> >> to > >>>> >> >>>>> >> >> take > >>>> >> >>>>> >> >> advantage of memory mapped IO would be a big win IMO. > >>>> >> >>>>> >> >> > >>>> >> >>>>> >> >> > >>>> >> >>>>> >> >> On Tue, Dec 29, 2015 at 2:50 PM Wes McKinney > >>>> >> >>>>> >> >> > >>>> >> >>>>> >> >> wrote: > >>>> >> >>>>> >> >>> > >>>> >> >>>>> >> >>> I will write a more detailed response to some of these > >>>> >> >>>>> >> >>> things > >>>> >> >>>>> >> >>> after > >>>> >> >>>>> >> >>> the new year, but, in particular, re: missing values, > can > >>>> >> >>>>> >> >>> you > >>>> >> >>>>> >> >>> or > >>>> >> >>>>> >> >>> someone tell me why creating an object that contains a > >>>> >> >>>>> >> >>> NumPy > >>>> >> >>>>> >> >>> array and > >>>> >> >>>>> >> >>> a bitmap is not sufficient? 
If we we can add a > >>>> >> >>>>> >> >>> lightweight > >>>> >> >>>>> >> >>> C/C++ > >>>> >> >>>>> >> >>> class > >>>> >> >>>>> >> >>> layer between NumPy function calls (e.g. arithmetic) > and > >>>> >> >>>>> >> >>> pandas > >>>> >> >>>>> >> >>> function calls, then I see no reason why we cannot > have > >>>> >> >>>>> >> >>> > >>>> >> >>>>> >> >>> Int32Array->add > >>>> >> >>>>> >> >>> > >>>> >> >>>>> >> >>> and > >>>> >> >>>>> >> >>> > >>>> >> >>>>> >> >>> Float32Array->add > >>>> >> >>>>> >> >>> > >>>> >> >>>>> >> >>> do the right thing (the former would be responsible > for > >>>> >> >>>>> >> >>> bitmasking to > >>>> >> >>>>> >> >>> propagate NA values; the latter would defer to > NumPy). If > >>>> >> >>>>> >> >>> we > >>>> >> >>>>> >> >>> can > >>>> >> >>>>> >> >>> put > >>>> >> >>>>> >> >>> all the internals of pandas objects inside a black > box, > >>>> >> >>>>> >> >>> we > >>>> >> >>>>> >> >>> can > >>>> >> >>>>> >> >>> add > >>>> >> >>>>> >> >>> layers of virtual function indirection without a > >>>> >> >>>>> >> >>> performance > >>>> >> >>>>> >> >>> penalty > >>>> >> >>>>> >> >>> (e.g. adding more interpreter overhead with more > >>>> >> >>>>> >> >>> abstraction > >>>> >> >>>>> >> >>> layers > >>>> >> >>>>> >> >>> does add up to a perf penalty). > >>>> >> >>>>> >> >>> > >>>> >> >>>>> >> >>> I don't think this is too scary -- I would be willing > to > >>>> >> >>>>> >> >>> create a > >>>> >> >>>>> >> >>> small POC C++ library to prototype something like what > >>>> >> >>>>> >> >>> I'm > >>>> >> >>>>> >> >>> talking > >>>> >> >>>>> >> >>> about. > >>>> >> >>>>> >> >>> > >>>> >> >>>>> >> >>> Since pandas has limited points of contact with NumPy > I > >>>> >> >>>>> >> >>> don't > >>>> >> >>>>> >> >>> think > >>>> >> >>>>> >> >>> this would end up being too onerous. > >>>> >> >>>>> >> >>> > >>>> >> >>>>> >> >>> For the record, I'm pretty allergic to "advanced > C++"; I > >>>> >> >>>>> >> >>> think it > >>>> >> >>>>> >> >>> is a > >>>> >> >>>>> >> >>> useful tool if you pick a sane 20% subset of the C++11 > >>>> >> >>>>> >> >>> spec > >>>> >> >>>>> >> >>> and > >>>> >> >>>>> >> >>> follow > >>>> >> >>>>> >> >>> Google C++ style it's not very inaccessible to > >>>> >> >>>>> >> >>> intermediate > >>>> >> >>>>> >> >>> developers. More or less "C plus OOP and easier object > >>>> >> >>>>> >> >>> lifetime > >>>> >> >>>>> >> >>> management (shared/unique_ptr, etc.)". As soon as you > add > >>>> >> >>>>> >> >>> a > >>>> >> >>>>> >> >>> lot > >>>> >> >>>>> >> >>> of > >>>> >> >>>>> >> >>> template metaprogramming C++ library development > quickly > >>>> >> >>>>> >> >>> becomes > >>>> >> >>>>> >> >>> inaccessible except to the C++-Jedi. > >>>> >> >>>>> >> >>> > >>>> >> >>>>> >> >>> Maybe let's start a Google document on "pandas > roadmap" > >>>> >> >>>>> >> >>> where > >>>> >> >>>>> >> >>> we > >>>> >> >>>>> >> >>> can > >>>> >> >>>>> >> >>> break down the 1-2 year goals and some of these > >>>> >> >>>>> >> >>> infrastructure > >>>> >> >>>>> >> >>> issues > >>>> >> >>>>> >> >>> and have our discussion there? 
(obviously publish this > >>>> >> >>>>> >> >>> someplace > >>>> >> >>>>> >> >>> once > >>>> >> >>>>> >> >>> we're done) > >>>> >> >>>>> >> >>> > >>>> >> >>>>> >> >>> - Wes > >>>> >> >>>>> >> >>> > >>>> >> >>>>> >> >>> On Fri, Dec 25, 2015 at 2:14 PM, Jeff Reback > >>>> >> >>>>> >> >>> > >>>> >> >>>>> >> >>> wrote: > >>>> >> >>>>> >> >>> > Here are some of my thoughts about pandas Roadmap / > >>>> >> >>>>> >> >>> > status > >>>> >> >>>>> >> >>> > and > >>>> >> >>>>> >> >>> > some > >>>> >> >>>>> >> >>> > responses to Wes's thoughts. > >>>> >> >>>>> >> >>> > > >>>> >> >>>>> >> >>> > In the last few (and upcoming) major releases we > have > >>>> >> >>>>> >> >>> > been > >>>> >> >>>>> >> >>> > made > >>>> >> >>>>> >> >>> > the > >>>> >> >>>>> >> >>> > following changes: > >>>> >> >>>>> >> >>> > > >>>> >> >>>>> >> >>> > - dtype enhancements (Categorical, Timedelta, > Datetime > >>>> >> >>>>> >> >>> > w/tz) & > >>>> >> >>>>> >> >>> > making > >>>> >> >>>>> >> >>> > these > >>>> >> >>>>> >> >>> > first class objects > >>>> >> >>>>> >> >>> > - code refactoring to remove subclassing of ndarrays > >>>> >> >>>>> >> >>> > for > >>>> >> >>>>> >> >>> > Series > >>>> >> >>>>> >> >>> > & > >>>> >> >>>>> >> >>> > Index > >>>> >> >>>>> >> >>> > - carving out / deprecating non-core parts of pandas > >>>> >> >>>>> >> >>> > - datareader > >>>> >> >>>>> >> >>> > - SparsePanel, WidePanel & other aliases > (TImeSeries) > >>>> >> >>>>> >> >>> > - rpy, rplot, irow et al. > >>>> >> >>>>> >> >>> > - google-analytics > >>>> >> >>>>> >> >>> > - API changes to make things more consistent > >>>> >> >>>>> >> >>> > - pd.rolling/expanding * -> .rolling/expanding > (this > >>>> >> >>>>> >> >>> > is > >>>> >> >>>>> >> >>> > in > >>>> >> >>>>> >> >>> > master > >>>> >> >>>>> >> >>> > now) > >>>> >> >>>>> >> >>> > - .resample becoming a full defered like groupby. > >>>> >> >>>>> >> >>> > - multi-index slicing along any level (obviates > need > >>>> >> >>>>> >> >>> > for > >>>> >> >>>>> >> >>> > .xs) > >>>> >> >>>>> >> >>> > and > >>>> >> >>>>> >> >>> > allows > >>>> >> >>>>> >> >>> > assignment > >>>> >> >>>>> >> >>> > - .loc/.iloc - for the most part obviates use of > .ix > >>>> >> >>>>> >> >>> > - .pipe & .assign > >>>> >> >>>>> >> >>> > - plotting accessors > >>>> >> >>>>> >> >>> > - fixing of the sorting API > >>>> >> >>>>> >> >>> > - many performance enhancements both micro & macro > >>>> >> >>>>> >> >>> > (e.g. > >>>> >> >>>>> >> >>> > release > >>>> >> >>>>> >> >>> > GIL) > >>>> >> >>>>> >> >>> > > >>>> >> >>>>> >> >>> > Some on-deck enhancements are (meaning these are > >>>> >> >>>>> >> >>> > basically > >>>> >> >>>>> >> >>> > ready to > >>>> >> >>>>> >> >>> > go > >>>> >> >>>>> >> >>> > in): > >>>> >> >>>>> >> >>> > - IntervalIndex (and eventually make PeriodIndex > just > >>>> >> >>>>> >> >>> > a > >>>> >> >>>>> >> >>> > sub-class > >>>> >> >>>>> >> >>> > of > >>>> >> >>>>> >> >>> > this) > >>>> >> >>>>> >> >>> > - RangeIndex > >>>> >> >>>>> >> >>> > > >>>> >> >>>>> >> >>> > so lots of changes, though nothing really earth > >>>> >> >>>>> >> >>> > shaking, > >>>> >> >>>>> >> >>> > just > >>>> >> >>>>> >> >>> > more > >>>> >> >>>>> >> >>> > convenience, reducing magicness somewhat > >>>> >> >>>>> >> >>> > and providing flexibility. 
> >>>> >> >>>>> >> >>> > > >>>> >> >>>>> >> >>> > Of course we are getting increasing issues, mostly > bug > >>>> >> >>>>> >> >>> > reports > >>>> >> >>>>> >> >>> > (and > >>>> >> >>>>> >> >>> > lots > >>>> >> >>>>> >> >>> > of > >>>> >> >>>>> >> >>> > dupes), some edge case enhancements > >>>> >> >>>>> >> >>> > which can add to the existing API's and of course, > >>>> >> >>>>> >> >>> > requests > >>>> >> >>>>> >> >>> > to > >>>> >> >>>>> >> >>> > expand > >>>> >> >>>>> >> >>> > the > >>>> >> >>>>> >> >>> > (already) large code to other usecases. > >>>> >> >>>>> >> >>> > Balancing this are a good many pull-requests from > many > >>>> >> >>>>> >> >>> > different > >>>> >> >>>>> >> >>> > users, > >>>> >> >>>>> >> >>> > some > >>>> >> >>>>> >> >>> > even deep into the internals. > >>>> >> >>>>> >> >>> > > >>>> >> >>>>> >> >>> > Here are some things that I have talked about and > could > >>>> >> >>>>> >> >>> > be > >>>> >> >>>>> >> >>> > considered > >>>> >> >>>>> >> >>> > for > >>>> >> >>>>> >> >>> > the roadmap. Disclaimer: I do work for Continuum > >>>> >> >>>>> >> >>> > but these views are of course my own; furthermore > >>>> >> >>>>> >> >>> > obviously > >>>> >> >>>>> >> >>> > I > >>>> >> >>>>> >> >>> > am a > >>>> >> >>>>> >> >>> > bit > >>>> >> >>>>> >> >>> > more > >>>> >> >>>>> >> >>> > familiar with some of the 'sponsored' open-source > >>>> >> >>>>> >> >>> > libraries, but always open to new things. > >>>> >> >>>>> >> >>> > > >>>> >> >>>>> >> >>> > - integration / automatic deferral to numba for JIT > >>>> >> >>>>> >> >>> > (this > >>>> >> >>>>> >> >>> > would > >>>> >> >>>>> >> >>> > be > >>>> >> >>>>> >> >>> > thru > >>>> >> >>>>> >> >>> > .apply) > >>>> >> >>>>> >> >>> > - automatic deferal to dask from groubpy where > >>>> >> >>>>> >> >>> > appropriate > >>>> >> >>>>> >> >>> > / > >>>> >> >>>>> >> >>> > maybe a > >>>> >> >>>>> >> >>> > .to_parallel (to simply return a dask.DataFrame > object) > >>>> >> >>>>> >> >>> > - incorporation of quantities / units (as part of > the > >>>> >> >>>>> >> >>> > dtype) > >>>> >> >>>>> >> >>> > - use of DyND to allow missing values for int dtypes > >>>> >> >>>>> >> >>> > - make Period a first class dtype. > >>>> >> >>>>> >> >>> > - provide some copy-on-write semantics to alleviate > the > >>>> >> >>>>> >> >>> > chained-indexing > >>>> >> >>>>> >> >>> > issues which occasionaly come up with the mis-use of > >>>> >> >>>>> >> >>> > the > >>>> >> >>>>> >> >>> > indexing > >>>> >> >>>>> >> >>> > API > >>>> >> >>>>> >> >>> > - allow a 'policy' to automatically provide column > >>>> >> >>>>> >> >>> > blocks > >>>> >> >>>>> >> >>> > for > >>>> >> >>>>> >> >>> > dict-like > >>>> >> >>>>> >> >>> > input (e.g. each column would be a block), this > would > >>>> >> >>>>> >> >>> > allow > >>>> >> >>>>> >> >>> > a > >>>> >> >>>>> >> >>> > pass-thru > >>>> >> >>>>> >> >>> > API > >>>> >> >>>>> >> >>> > where you could > >>>> >> >>>>> >> >>> > put in numpy arrays where you have views and have > them > >>>> >> >>>>> >> >>> > preserved > >>>> >> >>>>> >> >>> > rather > >>>> >> >>>>> >> >>> > than > >>>> >> >>>>> >> >>> > copied automatically. 
Note that this would also > allow > >>>> >> >>>>> >> >>> > what > >>>> >> >>>>> >> >>> > I > >>>> >> >>>>> >> >>> > call > >>>> >> >>>>> >> >>> > 'split' > >>>> >> >>>>> >> >>> > where a passed in > >>>> >> >>>>> >> >>> > multi-dim numpy array could be split up to > individual > >>>> >> >>>>> >> >>> > blocks > >>>> >> >>>>> >> >>> > (which > >>>> >> >>>>> >> >>> > actually > >>>> >> >>>>> >> >>> > gives a nice perf boost after the splitting costs). > >>>> >> >>>>> >> >>> > > >>>> >> >>>>> >> >>> > In working towards some of these goals. I have come > to > >>>> >> >>>>> >> >>> > the > >>>> >> >>>>> >> >>> > opinion > >>>> >> >>>>> >> >>> > that > >>>> >> >>>>> >> >>> > it > >>>> >> >>>>> >> >>> > would make sense to have a neutral API protocol > layer > >>>> >> >>>>> >> >>> > that would allow us to swap out different engines as > >>>> >> >>>>> >> >>> > needed, > >>>> >> >>>>> >> >>> > for > >>>> >> >>>>> >> >>> > particular > >>>> >> >>>>> >> >>> > dtypes, or *maybe* out-of-core type computations. > E.g. > >>>> >> >>>>> >> >>> > imagine that we replaced the in-memory block > structure > >>>> >> >>>>> >> >>> > with > >>>> >> >>>>> >> >>> > a > >>>> >> >>>>> >> >>> > bclolz > >>>> >> >>>>> >> >>> > / > >>>> >> >>>>> >> >>> > memap > >>>> >> >>>>> >> >>> > type; in theory this should be 'easy' and just work. > >>>> >> >>>>> >> >>> > I could also see us adopting *some* of the SFrame > code > >>>> >> >>>>> >> >>> > to > >>>> >> >>>>> >> >>> > allow > >>>> >> >>>>> >> >>> > easier > >>>> >> >>>>> >> >>> > interop with this API layer. > >>>> >> >>>>> >> >>> > > >>>> >> >>>>> >> >>> > In practice, I think a nice API layer would need to > be > >>>> >> >>>>> >> >>> > created > >>>> >> >>>>> >> >>> > to > >>>> >> >>>>> >> >>> > make > >>>> >> >>>>> >> >>> > this > >>>> >> >>>>> >> >>> > clean / nice. > >>>> >> >>>>> >> >>> > > >>>> >> >>>>> >> >>> > So this comes around to Wes's point about creating a > >>>> >> >>>>> >> >>> > c++ > >>>> >> >>>>> >> >>> > library for > >>>> >> >>>>> >> >>> > the > >>>> >> >>>>> >> >>> > internals (and possibly even some of the indexing > >>>> >> >>>>> >> >>> > routines). > >>>> >> >>>>> >> >>> > In an ideal world, or course this would be > desirable. > >>>> >> >>>>> >> >>> > Getting > >>>> >> >>>>> >> >>> > there > >>>> >> >>>>> >> >>> > is a > >>>> >> >>>>> >> >>> > bit > >>>> >> >>>>> >> >>> > non-trivial I think, and IMHO might not be worth the > >>>> >> >>>>> >> >>> > effort. I > >>>> >> >>>>> >> >>> > don't > >>>> >> >>>>> >> >>> > really see big performance bottlenecks. We *already* > >>>> >> >>>>> >> >>> > defer > >>>> >> >>>>> >> >>> > much > >>>> >> >>>>> >> >>> > of > >>>> >> >>>>> >> >>> > the > >>>> >> >>>>> >> >>> > computation to libraries like numexpr & bottleneck > >>>> >> >>>>> >> >>> > (where > >>>> >> >>>>> >> >>> > appropriate). > >>>> >> >>>>> >> >>> > Adding numba / dask to the list would be helpful. > >>>> >> >>>>> >> >>> > > >>>> >> >>>>> >> >>> > I think that almost all performance issues are the > >>>> >> >>>>> >> >>> > result > >>>> >> >>>>> >> >>> > of: > >>>> >> >>>>> >> >>> > > >>>> >> >>>>> >> >>> > a) gross misuse of the pandas API. How much code > have > >>>> >> >>>>> >> >>> > you > >>>> >> >>>>> >> >>> > seen > >>>> >> >>>>> >> >>> > that > >>>> >> >>>>> >> >>> > does > >>>> >> >>>>> >> >>> > df.apply(lambda x: x.sum()) > >>>> >> >>>>> >> >>> > b) routines which operate column-by-column rather > >>>> >> >>>>> >> >>> > block-by-block and > >>>> >> >>>>> >> >>> > are > >>>> >> >>>>> >> >>> > in > >>>> >> >>>>> >> >>> > python space (e.g. 
> So I am glossing over a big goal of having a c++ library that
> represents the pandas internals. This would by definition have a C API,
> so that you *could* use pandas-like semantics in c/c++ and just have it
> work (and then pandas would be a thin wrapper around this library).
>
> I am not averse to this, but I think it would be quite a big effort,
> and not a huge perf boost IMHO. Further there are a number of API
> issues w.r.t. indexing which need to be clarified / worked out (e.g.
> should we simply deprecate []?) that are much easier to test / figure
> out in python space.
>
> I also think that we have quite a large number of contributors. Moving
> to c++ might make the internals a bit more impenetrable than the
> current internals (though this would allow c++ people to contribute, so
> that might balance out).
>
> We have a limited core of devs who right now are familiar with things.
> If someone happened to have a starting base for a c++ library, then I
> might change opinions here.
>
> my 4c.
>
> Jeff
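A rough sketch of the kind of neutral protocol layer Jeff describes above -- all names here are hypothetical; nothing like this existed in pandas at the time:

    import numpy as np

    class ArrayEngine(object):
        # hypothetical interface: containers talk only to this protocol,
        # so the storage backend (numpy blocks, bcolz, memmap, ...) can
        # be swapped out per dtype or for out-of-core computation
        def take(self, indexer):
            raise NotImplementedError

        def reduce(self, op):
            raise NotImplementedError

    class NumpyEngine(ArrayEngine):
        def __init__(self, values):
            self.values = np.asarray(values)

        def take(self, indexer):
            return NumpyEngine(self.values.take(indexer))

        def reduce(self, op):
            return getattr(np, op)(self.values)

    # an out-of-core or memmap engine would implement the same two methods
    engine = NumpyEngine([1.0, 2.0, 3.0])
    engine.reduce('sum')  # 6.0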
> On Thu, Dec 24, 2015 at 7:18 PM, Wes McKinney wrote:
>>
>> Deep thoughts during the holidays.
>>
>> I might be out of line here, but the interpreter-heaviness of the
>> inside of pandas objects is likely to be a long-term liability and
>> source of performance problems and technical debt.
>>
>> Has anyone put any thought into planning and beginning to execute on a
>> rewrite that moves as much as possible of the internals into native /
>> compiled code? I'm talking about:
>>
>> - pandas/core/internals
>> - indexing and assignment
>> - much of pandas/core/common
>> - categorical and custom dtypes
>> - all indexing mechanisms
>>
>> I'm concerned we've already exposed too much internals to users, so
>> this might lead to a lot of API breakage, but it might be for the
>> Greater Good. As a first step, beginning a partial migration of
>> internals into some C++ classes that encapsulate the insides of
>> DataFrame objects and implement indexing and block-level manipulations
>> would be a good place to start. I think you could do this without too
>> much disruption.
>>
>> As part of this internal retooling we might give consideration to
>> alternative data structures for representing data internal to pandas
>> objects. Now in 2015/2016, continuing to be hamstrung by NumPy's
>> limitations feels somewhat anachronistic. User code is riddled with
>> workarounds for data type fidelity issues and the like. Like, really,
>> why not add a bitndarray (similar to ilanschnell/bitarray) for storing
>> nullness for problematic types and hide this from the user? =)
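A sketch of that bitmask idea, hand-rolled on top of numpy's packbits -- a real bitndarray would be a proper array type; this is only to make the concept concrete:

    import numpy as np

    class MaskedInt64(object):
        # hypothetical: int64 data plus a packed validity bitmask, so no
        # sentinel is stolen from the int64 range and no upcast to float64
        def __init__(self, values):
            self.data = np.asarray(values, dtype=np.int64)
            self.valid = np.packbits(np.ones(len(self.data), dtype=np.uint8))

        def set_na(self, i):
            # clear the validity bit for element i (bits are packed MSB-first)
            self.valid[i // 8] &= ~np.uint8(1 << (7 - i % 8))

        def is_na(self, i):
            return not (self.valid[i // 8] >> (7 - i % 8)) & 1

    arr = MaskedInt64([1, 2, 3])
    arr.set_na(1)
    arr.is_na(1)  # True, while arr.data[1] still holds a plain int64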
>> Since we are now a NumFOCUS-sponsored project, I feel like we might
>> consider establishing some formal governance over pandas and
>> publishing meeting notes from committers and roadmap documents
>> describing plans for the project. There's no real "committer culture"
>> for NumFOCUS projects like there is with the Apache Software
>> Foundation, but we might try leading by example!
>>
>> Also, I believe pandas as a project has reached a level of importance
>> where we ought to consider planning and execution on larger scale
>> undertakings such as this for safeguarding the future.
>>
>> As for myself, well, I have my hands full in Big Data-land. I wish I
>> could be helping more with pandas, but there are quite a few
>> fundamental issues (like data interoperability, nested data handling,
>> and file format support -- e.g. Parquet, see
>> http://wesmckinney.com/blog/the-problem-with-the-data-science-language-wars/ )
>> preventing Python from being more useful in industry analytics
>> applications.
>>
>> Aside: one of the bigger mistakes I made with pandas's API design was
>> making it acceptable to call class constructors -- like
>> pandas.DataFrame -- directly (versus factory functions). Sorry about
>> that! If we could convince everyone to start writing pandas.data_frame
>> or dataframe instead of using the class reference it would help a lot
>> with code cleanup. It's hard to plan for these things -- NumPy
>> interoperability seemed a lot more important in 2008 than it does now,
>> so I forgive myself.
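The factory-function indirection would look something like this (pandas.data_frame is hypothetical -- it has never existed):

    import pandas as pd

    def data_frame(data=None, index=None, columns=None):
        # hypothetical factory: callers bind to a function rather than the
        # class, so the concrete return type can evolve without breaking them
        return pd.DataFrame(data, index=index, columns=columns)

    df = data_frame({'a': [1, 2, 3]})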
>> cheers and best wishes for 2016,
>> Wes

_______________________________________________
Pandas-dev mailing list
Pandas-dev at python.org
https://mail.python.org/mailman/listinfo/pandas-dev

From wesmckinn at gmail.com Wed Jan 6 15:15:38 2016
From: wesmckinn at gmail.com (Wes McKinney)
Date: Wed, 6 Jan 2016 12:15:38 -0800
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: References: Message-ID:

I also will add that there is an ideology that has existed in the scientific Python community since 2011 at least which is this: pandas should not have existed; it should be part of NumPy instead.

In my opinion, that misses the point of pandas, both then and now.

There's a large and mostly new class of Python users working on domain-specific industry analytics problems for whom pandas is the most important tool that they use on a daily basis. Their knowledge of NumPy is limited, beyond the aspects of the ndarray API that are the same in pandas. High level APIs and accessibility for them is extremely important. But their skill sets and the problems they are solving are not, on the whole, the same ones you would have heard discussed at SciPy 2010.

Sometime in 2015, "Python for Data Analysis" sold its 100,000th copy. I have 5 foreign translations sitting on my shelf -- this represents a very large group of people that we have all collectively enabled by developing pandas -- for a lot of people, pandas is the main reason they use Python!

So the summary of all this is: pandas is much more important as a project now than it was 5 years ago. Our relationship with our library dependencies like NumPy should reflect that.
Downstream pandas consumers should similarly eventually concern themselves more with pandas compatibility (rather than always assuming that NumPy arrays are the only intermediary). This is a philosophical shift, but one that will ultimately benefit the usability of the stack.

On Wed, Jan 6, 2016 at 11:45 AM, Jeff Reback wrote:
> I'll just apologize right up front! hahah.
>
> No, I think I have been pushing on these extras in pandas to help move
> it forward. I have commented a bit on Stephan's issue here about why I
> didn't push for these in numpy. numpy is fairly slow moving (though it
> moves faster lately; I suspect the pace when Wes was developing pandas
> was not much faster).
>
> So pandas was essentially 'fixing' lots of bug / compat issues in numpy.
>
> To the extent that we can keep the current user-facing API the same
> (high likelihood I think), I am willing to accept *some* breakage with
> the pandas->duck-like array container API in order to provide swappable
> containers.
>
> For example I recall that in doing datetime w/tz, we wanted
> Series.values to return a numpy array (which it DOES!) but it is
> actually lossy (it loses the tz). Same thing with the Categorical
> example Wes gave. I don't think these requirements should hold pandas
> back!
>
> People are increasingly using pandas as the API for their work. That
> makes it very important that we can handle lots of input properly, w/o
> the handcuffs of numpy.
>
> All this said, I'll reiterate Wes's (and others') point: back-compat is
> extremely important. (I in fact try to bend over backwards to provide
> it; sometimes it's too much of course!) E.g. take the resample changes
> to the API -- I was originally going to just do a hard break, but this
> turns off people when they have to update their code or else.
>
> my 4c (incrementing!)
>
> Jeff
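Jeff's datetime-with-tz example, spelled out -- a sketch of the behavior as of the pandas 0.17 era, not from the original message:

    import pandas as pd

    s = pd.Series(pd.date_range('2016-01-01', periods=3, tz='US/Eastern'))

    s.dtype   # datetime64[ns, US/Eastern] -- a pandas-only dtype
    s.values  # a plain numpy datetime64[ns] array in UTC: the tz is dropped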
From wesmckinn at gmail.com Fri Jan 8 20:34:05 2016
From: wesmckinn at gmail.com (Wes McKinney)
Date: Fri, 8 Jan 2016 17:34:05 -0800
Subject: [Pandas-dev] Unit test reorganization
Message-ID:

hi folks,

I have a few questions about the test suite. As context, I note that test_series.py is now 8200 lines and test_frame.py 17000 lines.

Big #1 question is, how strongly do you feel about *shipping* the test suite in site-packages? Some other libraries with sprawling and complex test suites have chosen not to ship them: https://github.com/zzzeek/sqlalchemy

Independently, I would support and help with starting a judicious reorganization of the contents of pandas/tests. So I'm thinking like

tests/
  dataframe/
  series/
  algorithms/
  internals/
  tseries/

and so forth.

Thoughts?

- Wes

From wesmckinn at gmail.com Fri Jan 8 20:47:48 2016
From: wesmckinn at gmail.com (Wes McKinney)
Date: Fri, 8 Jan 2016 17:47:48 -0800
Subject: [Pandas-dev] Unit test reorganization
In-Reply-To: References: Message-ID:

+ mailing list

Do the distros run them _after_ installation? I'm talking about installing the unit tests during `python setup.py install`, but still including them in the tarball.

On Fri, Jan 8, 2016 at 5:43 PM, Jeff Reback wrote:
> all for reorging into subdirs as these have grown pretty big
>
> what's the big deal with shipping the tests?
>
> I suspect some of the Linux distros do run them
>
> and just merged https://github.com/pydata/pandas/pull/11913
> though we could configure a subset that ships I suppose
>
>> On Jan 8, 2016, at 8:34 PM, Wes McKinney wrote:
>> [...]

From jeffreback at gmail.com Fri Jan 8 20:53:51 2016
From: jeffreback at gmail.com (Jeff Reback)
Date: Fri, 8 Jan 2016 20:53:51 -0500
Subject: [Pandas-dev] Unit test reorganization
In-Reply-To: References: Message-ID: <5E86966B-F624-4ECB-AC72-2F9DCEBC7B14@gmail.com>

no idea

> On Jan 8, 2016, at 8:47 PM, Wes McKinney wrote:
> [...]
From wesmckinn at gmail.com Fri Jan 8 21:04:13 2016
From: wesmckinn at gmail.com (Wes McKinney)
Date: Fri, 8 Jan 2016 18:04:13 -0800
Subject: [Pandas-dev] Unit test reorganization
In-Reply-To: <5E86966B-F624-4ECB-AC72-2F9DCEBC7B14@gmail.com> References: <5E86966B-F624-4ECB-AC72-2F9DCEBC7B14@gmail.com> Message-ID:

It looks like the debian packaging scripts would need to change. + Yaroslav to see if this would be onerous

On Fri, Jan 8, 2016 at 5:53 PM, Jeff Reback wrote:
> no idea
> [...]

From shoyer at gmail.com Sun Jan 10 21:06:56 2016
From: shoyer at gmail.com (Stephan Hoyer)
Date: Sun, 10 Jan 2016 18:06:56 -0800
Subject: [Pandas-dev] Unit test reorganization
In-Reply-To: References: Message-ID:

On Fri, Jan 8, 2016 at 5:34 PM, Wes McKinney wrote:
> Big #1 question is, how strongly do you feel about *shipping* the test
> suite in site-packages? Some other libraries with sprawling and
> complex test suites have chosen not to ship them:
> https://github.com/zzzeek/sqlalchemy

I would prefer to include the test suite if possible, because the ability to type "nosetests pandas" makes it easy both for users to verify installations are working properly and for downstream distributors to identify and report bugs. The complete pandas test suite still runs in 20-30 minutes, so I think it's still fairly reasonable to use it for these purposes.

> Independently, I would support and help with starting a judicious
> reorganization of the contents of pandas/tests. So I'm thinking like
>
> tests/
>   dataframe/
>   series/
>   algorithms/
>   internals/
>   tseries/
>
> and so forth.

This sounds like a great idea -- these files have really gotten out of control!

Cheers,
Stephan

From wesmckinn at gmail.com Mon Jan 11 11:47:47 2016
From: wesmckinn at gmail.com (Wes McKinney)
Date: Mon, 11 Jan 2016 08:47:47 -0800
Subject: [Pandas-dev] Unit test reorganization
In-Reply-To: References: Message-ID:

On Sun, Jan 10, 2016 at 6:06 PM, Stephan Hoyer wrote:
> I would prefer to include the test suite if possible, because the
> ability to type "nosetests pandas" makes it easy both for users to
> verify installations are working properly and for downstream
> distributors to identify and report bugs. The complete pandas test
> suite still runs in 20-30 minutes, so I think it's still fairly
> reasonable to use it for these purposes.

Got it. I wasn't sure if this was something people still wanted to do in practice with the burgeoning test suite.
>> Independently, I would support and help with starting a judicious
>> reorganization of the contents of pandas/tests. So I'm thinking like
>>
>> tests/
>>   dataframe/
>>   series/
>>   algorithms/
>>   internals/
>>   tseries/
>>
>> and so forth.

> This sounds like a great idea -- these files have really gotten out of
> control!

Sounds good. I've been sorting through points of contact between Series/DataFrame's implementation and internal matters (e.g. the BlockManager) and figured it would be good to "quarantine" code that makes assumptions about what's under the hood. I'll get the first couple patches started and it can be a slow burn to break apart these large files.

> Cheers,
> Stephan

From shoyer at gmail.com Mon Jan 11 12:36:42 2016
From: shoyer at gmail.com (Stephan Hoyer)
Date: Mon, 11 Jan 2016 09:36:42 -0800
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: References: Message-ID:

Hi Wes,

You raise some important points.

I agree that pandas's patched version of the numpy dtype system is a mess. But despite its issues, its leaky abstraction on top of NumPy provides benefits. In particular, it makes pandas easy to emulate (e.g., xarray), extend (e.g., geopandas) and integrate with other libraries (e.g., patsy, Scikit-Learn, matplotlib).

You are right that pandas has started to supplant numpy as a high level API for data analysis, but of course the robust (and often numpy based) Python ecosystem is part of what has made pandas so successful. In practice, ecosystem projects often want to work with more primitive objects than series/dataframes in their internal data structures, and without numpy this becomes more difficult. For example, how do you concatenate a list of categoricals? If these were numpy arrays, we could use np.concatenate, but the current implementation of categorical would require a custom solution. First class compatibility with pandas is harder when pandas data cannot be used with a full ndarray API.
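The custom solution Stephan alludes to might look roughly like this -- a hypothetical helper; pandas had no built-in union_categoricals at the time:

    import numpy as np
    import pandas as pd

    def concat_categoricals(pieces):
        # hypothetical: union the categories, then concatenate integer codes
        categories = pd.Index(
            sorted(set().union(*(set(p.categories) for p in pieces))))
        codes = np.concatenate([categories.get_indexer(np.asarray(p))
                                for p in pieces])
        return pd.Categorical.from_codes(codes, categories)

    a = pd.Categorical(['x', 'y'])
    b = pd.Categorical(['y', 'z'])
    concat_categoricals([a, b])  # [x, y, y, z], categories [x, y, z]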
Likewise, hiding implementation details retains some flexibility for us (as developers), but in an ideal world, we would know we have the right abstraction, and could then expose the implementation as an advanced API! This is the case for some very mature projects, such as NumPy. Pandas is not really there yet (with the block manager), but it might be something to strive towards in this rewrite.

At this point, I suppose the ship has sailed (e.g., with categorical in .values) on full numpy compatibility. So we absolutely do need explicit interfaces for converting to NumPy, rather than the current implicit guarantees about .values -- which we violated with categorical. Something like your suggested .to_numpy() method would indeed be an improvement over the current state, where we half-pretend that NumPy could be used as an advanced API for pandas, even though it doesn't really work.
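Concretely, a sketch of the difference -- .to_numpy() here is Stephan's suggested method, not an API that existed; the behavior shown is the 0.17-era behavior:

    import numpy as np
    import pandas as pd

    s = pd.Series(pd.Categorical(['a', 'b', 'a']))

    s.values       # a Categorical, not an ndarray -- the implicit guarantee broke
    np.asarray(s)  # object ndarray ['a', 'b', 'a']: the explicit, documented
                   # conversion a hypothetical s.to_numpy() could promise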
I'm sure you would agree that -- at least in theory -- it would be nice to push dtype improvements upstream to numpy, but that is obviously more work (for a variety of reasons) than starting from scratch in pandas. Of course, I think pandas has a need and right to exist as a separate library. But I do think building off of NumPy made it stronger, and pushing improvements upstream would be a better way to go. This has been my approach, and is why I've worked on both pandas and NumPy.

The bottom line is that I don't agree that this is the most productive path forward -- I would opt for improving NumPy or DyND instead, which I believe would cause much less pain downstream -- but given that I'm not going to be the person doing the work, I will defer to your judgment. Pandas is certainly in need of holistic improvements and the maturity of a v1.0 release, and that's not something I'm in a position to push myself.

Best,
Stephan

P.S. apologies for the delay -- it's been a busy week.

On Wed, Jan 6, 2016 at 12:15 PM, Wes McKinney wrote:
> [...]

From wesmckinn at gmail.com Mon Jan 11 13:45:24 2016
From: wesmckinn at gmail.com (Wes McKinney)
Date: Mon, 11 Jan 2016 10:45:24 -0800
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: References: Message-ID:

On Mon, Jan 11, 2016 at 9:36 AM, Stephan Hoyer wrote:
> [...]
This seems like a false dichotomy to me. I'm not arguing for forging a NumPy-free or DyND-free path, but rather making DyND's or NumPy's physical memory representation and array computing infrastructure more clearly implementation details of pandas that have limited user-visibility (except when using NumPy / DyND-based tools is necessary).

The main problems we have faced with NumPy are:

- Much more difficult to extend
- Legacy code makes major changes difficult or impossible
- pandas users likely represent a minority (but perhaps a plurality, at this point) of NumPy users

DyND's scope, as I understand it, is to be used for more use cases than an internal detail of pandas objects. It doesn't have the legacy baggage, but it will face similar challenges around being a general purpose array library versus a more domain-specific analytics and data preparation library.

pandas already has what can be called a "logical type system" (see e.g. https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md for other examples of logical type representations). We use NumPy dtypes for the physical memory representation along with various conventions for pandas-specific behavior like missing data, but they are weakly abstracted in a way that's definitely harmful for users. What I am arguing is:

1) Introduce a proper (from a software engineering perspective) logical data type abstraction that models the way that pandas already works, but cleaning up all the mess (implicit upcasts, lack of a real "NA" scalar value, making pandas-specific methods like unique, factorize, match, etc. true "array methods")

2) Use NumPy physical dtypes (for now) as the primary target physical representation

3) Layer new machinery (like bitmasks) on top of raw NumPy arrays to add new features to pandas

4) Give pandas objects a real C API so that users can manipulate and create pandas objects with their own native (C/C++/Cython) code.

5) Yes, absolutely improve NumPy and DyND and transition to improved NumPy and DyND facilities as soon as they are available and shipped

I don't see alternative ways for pandas to have a truly healthy relationship with more general purpose array / scientific computing libraries without being able to add new pandas functionality in a clean way, and without requiring us to get patches accepted (and released) in NumPy or DyND.

Can you clarify what aspects of this plan are disagreeable / contentious? Are you arguing for pandas becoming more of a companion tool / user interface layer for NumPy or DyND?

cheers,
Wes
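A bare-bones sketch of what points (1)-(3) might look like -- hypothetical classes, just to make the shape of the proposal concrete:

    import numpy as np

    class LogicalType(object):
        # hypothetical: semantics (NA handling, category metadata, ...)
        # live here, decoupled from the physical storage
        def physical_dtype(self):
            raise NotImplementedError  # point 2: storage stays a NumPy dtype

    class Int64NA(LogicalType):
        def physical_dtype(self):
            return np.dtype('int64')
        # point 3: NAs come from a separate bitmask layered on the raw
        # array, not from a sentinel or an implicit upcast to float64

    class Category(LogicalType):
        def __init__(self, categories):
            self.categories = categories  # pandas-only metadata
        def physical_dtype(self):
            return np.dtype('int8')  # the integer codes array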
From wesmckinn at gmail.com Mon Jan 11 14:33:39 2016
From: wesmckinn at gmail.com (Wes McKinney)
Date: Mon, 11 Jan 2016 11:33:39 -0800
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: References: Message-ID:

On Mon, Jan 11, 2016 at 10:45 AM, Wes McKinney wrote:
> [...]
> I don't see alternative ways for pandas to have a truly healthy
> relationship with more general purpose array / scientific computing
> libraries without being able to add new pandas functionality in a
> clean way, and without requiring us to get patches accepted (and
> released) in NumPy or DyND.

Just to be clear on my stance re: pushing more code upstream into array libraries: if we introduce the right level of coupling / abstraction between pandas and NumPy/DyND, it will be much easier for us to use libpandas as a staging area for code that we are proposing to push upstream into one of those libraries. That's not really possible right now because pandas's internals are not easily portable to other C/C++ codebases (being written in a mix of pure Python and Cython).

> [...]
From shoyer at gmail.com Mon Jan 11 14:55:21 2016
From: shoyer at gmail.com (Stephan Hoyer)
Date: Mon, 11 Jan 2016 11:55:21 -0800
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: References: Message-ID:

On Mon, Jan 11, 2016 at 11:33 AM, Wes McKinney wrote:
> Just to be clear on my stance re: pushing more code upstream into
> array libraries: [...]

Yep, also agreed. I think DyND is probably a better target than NumPy here, if only because it's also written in C++. NumPy, of course, has been a beast to extend.

From shoyer at gmail.com Mon Jan 11 14:55:24 2016
From: shoyer at gmail.com (Stephan Hoyer)
Date: Mon, 11 Jan 2016 11:55:24 -0800
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++?
/ Roadmap
In-Reply-To: References: Message-ID:

> I don't see alternative ways for pandas to have a truly healthy
> relationship with more general purpose array / scientific computing
> libraries without being able to add new pandas functionality in a
> clean way, and without requiring us to get patches accepted (and
> released) in NumPy or DyND.

Indeed, I think my disagreement is mostly about the order in which we approach these problems.

> Can you clarify what aspects of this plan are disagreeable /
> contentious?

See my comments below.

> Are you arguing for pandas becoming more of a companion
> tool / user interface layer for NumPy or DyND?

Not quite. Pandas has some fantastic and highly usable data structures (Series, DataFrame, Index). These certainly don't belong in NumPy or DyND.

However, the array-based ecosystem certainly could use improvements to dtypes (e.g., datetime and categorical) and dtype-specific methods (e.g., for strings) just as much as pandas. I do firmly believe that pushing these types of improvements upstream, rather than implementing them independently for pandas, would yield benefits for the broader ecosystem. With the right infrastructure, generalizing things to arrays is not much more work.

I'd like to see pandas itself focus more on the data structures and less on the data types. This would let us share more work with the "general purpose array / scientific computing libraries".

> 1) Introduce a proper (from a software engineering perspective)
> logical data type abstraction that models the way that pandas already
> works, but cleaning up all the mess (implicit upcasts, lack of a real
> "NA" scalar value, making pandas-specific methods like unique,
> factorize, match, etc. true "array methods")

New abstractions have a cost. A new logical data type abstraction is better than no proper abstraction at all, but (in principle), one data type abstraction should be enough to share.

A proper logical data type abstraction would be an improvement over the current situation, but if there's a way we could introduce one less abstraction (by improving things upstream in a general purpose array library) that would help even more.

For example, we could imagine pushing to make DyND the new core for pandas. This could be enough of a push to make DyND generally useful -- I know it still has a few kinks to work out.

> 4) Give pandas objects a real C API so that users can manipulate and
> create pandas objects with their own native (C/C++/Cython) code.

> 5) Yes, absolutely improve NumPy and DyND and transition to improved
> NumPy and DyND facilities as soon as they are available and shipped

I like the sound of both of these.

From jeffreback at gmail.com Mon Jan 11 18:04:58 2016
From: jeffreback at gmail.com (Jeff Reback)
Date: Mon, 11 Jan 2016 18:04:58 -0500
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: References: Message-ID:

I am in favor of the Wes refactoring, but for some slightly different reasons. I am including some in-line comments.

On Mon, Jan 11, 2016 at 2:55 PM, Stephan Hoyer wrote:

>> I don't see alternative ways for pandas to have a truly healthy
>> > > Indeed, I think my disagreement is mostly about the order in which we > approach these problems. > I agree here. I had started on *some* of this to enable swappable numpy to DyND to support IntNA (all in python, but the fundamental change was to provide an API layer to the back-end). > > >> Can you clarify what aspects of this plan are disagreeable / >> contentious? > > > See my comments below. > > >> Are you arguing for pandas becoming more of a companion >> tool / user interface layer for NumPy or DyND? >> > > Not quite. Pandas has some fantastic and highly useable data (Series, > DataFrame, Index). These certainly don't belong in NumPy or DyND. > > However, the array-based ecosystem certainly could use improvements to > dtypes (e.g., datetime and categorical) and dtype specific methods (e.g., > for strings) just as much as pandas. I do firmly believe that pushing these > types of improvements upstream, rather than implementing them independently > for pandas, would yield benefits for the broader ecosystem. With the right > infrastructure, generalizing things to arrays is not much more work. > I dont' think Wes nor I disagree here at all. The problem was (and is), the pace of change in the underlying libraries. It is simply too slow for pandas development efforts. I think the pandas efforts (and other libraries) can result in more powerful fundamental libraries that get pushed upstream. However, it would not benefit ANYONE to slow down downstream efforts. I am not sure why you suggest that we WAIT for the upstream libraries to change? We have been waiting forever for that. Now we have a concrete implementation of certain data types that are useful. They (upstream) can take this and build on (or throw it away and make a better one or whatever). But I don't think it benefits anyone to WAIT for someone to change numpy first. Look at how long it took them to (partially) fix datetimes. xarray in particular has done the same thing to pandas, e.g. you have added additional selection operators and syntax (e.g. passing dicts of named axes). These changes are in fact propogating to pandas. This has taken time (but much much less that this took for any of pandas changes to numpy). Further look at how long you have advocated (correctly) for labeled arrays in numpy (which we are still waiting). > > I'd like to see pandas itself focus more on the data-structures and less > on the data types. This would let us share more work with the "general > purpose array / scientific computing libraries". > > Pandas IS about specifying the correct data types. It is simply incorrect to decouple this problem from the data-structures. A lot of effort over the years has gone into making all dtypes playing nice with each other and within pandas. > 1) Introduce a proper (from a software engineering perspective) >> logical data type abstraction that models the way that pandas already >> works, but cleaning up all the mess (implicit upcasts, lack of a real >> "NA" scalar value, making pandas-specific methods like unique, >> factorize, match, etc. true "array methods") >> > > New abstractions have a cost. A new logical data type abstraction is > better than no proper abstraction at all, but (in principle), one data type > abstraction should be enough to share. > > > A proper logical data type abstraction would be an improvement over the > current situation, but if there's a way we could introduce one less > abstraction (by improving things upstream in a general purpose array > library) that would help even more. 
> > This is just pushing a problem upstream, which ultimately, given numpy's > track record, won't be solved at all. We will be here 1 year from now > with the exact same discussion. Why are we waiting on upstream for anything? > As I said above, if something is created which upstream finds useful on a > general level, great. The great cost here is time. > >> >> For example, we could imagine pushing to make DyND the new core for >> pandas. This could be enough of a push to make DyND generally useful -- I >> know it still has a few kinks to work out. >> >> > > Maybe, but DyND has to have full compatibility with what is currently out there > (soonish). Then I agree this could be possible. But wouldn't it be even > better > for pandas to be able to swap back-ends? Why limit ourselves to a > particular backend if it's not that difficult? > >> 4) Give pandas objects a real C API so that users can manipulate and >> create pandas objects with their own native (C/C++/Cython) code. >> > >> 5) Yes, absolutely improve NumPy and DyND and transition to improved >> NumPy and DyND facilities as soon as they are available and shipped >> > >> I like the sound of both of these. > Further, you made a point above: You are right that pandas has started to supplant numpy as a high level API > for data analysis, but of course the robust (and often numpy based) Python > ecosystem is part of what has made pandas so successful. In practice, > ecosystem projects often want to work with more primitive objects than > series/dataframes in their internal data structures and without numpy this > becomes more difficult. For example, how do you concatenate a list of > categoricals? If these were numpy arrays, we could use np.concatenate, but > the current implementation of categorical would require a custom solution. > First class compatibility with pandas is harder when pandas data cannot be > used with a full ndarray API. I disagree entirely here. I think that Series/DataFrame ARE becoming primitive objects. Look at seaborn, statsmodels, and xarray. These are first-class users of these structures, which need the additional meta-data attached. Yes, categoricals are useful in numpy, and it should support them. But lots of libraries can simply use pandas and do lots of really useful stuff. However, why reinvent the wheel and use numpy when you have DataFrames? From a user point of view, I don't think they even care about numpy (or whatever drives pandas). It solves a very general problem of working with labeled data. Jeff -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Mon Jan 11 18:35:44 2016 From: shoyer at gmail.com (Stephan Hoyer) Date: Mon, 11 Jan 2016 15:35:44 -0800 Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap In-Reply-To: References: Message-ID: On Mon, Jan 11, 2016 at 3:04 PM, Jeff Reback wrote: > I think the pandas efforts (and other libraries) can result in more > powerful fundamental libraries > that get pushed upstream. However, it would not benefit ANYONE to slow > down downstream efforts. I am not sure why you suggest that we WAIT for the > upstream libraries to change? We have been waiting forever for that. Now we > have a concrete implementation of certain data types that are useful. They > (upstream) can take > this and build on it (or throw it away and make a better one or whatever). > But I don't think it benefits anyone to WAIT for someone to change numpy > first. > Look at how long it took them to (partially) fix datetimes.
> I agree, it is insane to wait on upstream improvements to spontaneously happen on their own. We (interested downstream developers) would need to push them through. I started on this recently for making datetime64 timezone naive (https://github.com/numpy/numpy/pull/6453) -- though of course, this is one of the easier issues. Of course, this being open source, my suggestions require someone interested in doing all the hard work. And given that that is not me, perhaps I should just shut up :). If the best we think we can realistically do is Wes writing our own data type system, then I'll be a little sad, but it would still be a win. > xarray in particular has done the same thing to pandas, e.g. you have > added additional selection operators and syntax (e.g. passing dicts of > named axes). These changes are in fact propagating to pandas. This has > taken time (but much, much less than it took for any of pandas's changes to > numpy). Further, look at how long you have advocated (correctly) for labeled > arrays in numpy (for which we are still waiting). > I'm actually not convinced NumPy needs labeled arrays. In my mind, libraries like pandas and xarray solve the labeled array problem very well downstream of NumPy. There are costs to making the basic libraries label-aware. > I'd like to see pandas itself focus more on the data-structures and less >> on the data types. This would let us share more work with the "general >> purpose array / scientific computing libraries". >> >> Pandas IS about specifying the correct data types. It is simply incorrect > to decouple this problem from the data-structures. A lot of effort over the > years has gone into > making all dtypes play nicely with each other and within pandas. > Yes, a lot of effort has gone into dtypes in pandas. This is great! But wouldn't it be even better if we had a viable path for pushing this stuff upstream? ;) > Maybe, but DyND has to have full compatibility with what is currently out there > (soonish). Then I agree this could be possible. But wouldn't it be even > better > for pandas to be able to swap back-ends? Why limit ourselves to a > particular backend if it's not that difficult? > Well, Irwin, what do you say? :) I'm just saying that in my ideal world, we would not invent a new dtype standard for pandas (insert obligatory xkcd reference here). I disagree entirely here. I think that Series/DataFrame ARE becoming > primitive objects.
This presents issues for new types > with metadata like categorical. care to elaborate on the xarray decision to keep data as numpy arrays, rather than Series in DataArray? (as you do keep the Index objects intact). On Mon, Jan 11, 2016 at 6:35 PM, Stephan Hoyer wrote: > On Mon, Jan 11, 2016 at 3:04 PM, Jeff Reback wrote: > >> I think the pandas efforts (and other libraries) can result in more >> powerful fundamental libraries >> that get pushed upstream. However, it would not benefit ANYONE to slow >> down downstream efforts. I am not sure why you suggest that we WAIT for the >> upstream libraries to change? We have been waiting forever for that. Now we >> have a concrete implementation of certain data types that are useful. They >> (upstream) can take >> this and build on (or throw it away and make a better one or whatever). >> But I don't think it benefits anyone to WAIT for someone to change numpy >> first. >> Look at how long it took them to (partially) fix datetimes. >> > > I agree, it is insane to wait on upstream improvements to spontaneously > happen on their own. We (interested downstream developers) would need to > push them through. I started on this recently for making datetime64 > timezone naive (https://github.com/numpy/numpy/pull/6453) -- though of > course, this is one of the easier issue. > > Of course, this being open source, my suggestions require someone > interested in doing all the hard work. And given that that is not me, > perhaps I should just shut up :). > > If the best we think we can realistically do is Wes writing our own data > type system, then I'll be a little sad, but it would still be a win. > > >> xarray in particular has done the same thing to pandas, e.g. you have >> added additional selection operators and syntax (e.g. passing dicts of >> named axes). These changes are in fact propogating to pandas. This has >> taken time (but much much less that this took for any of pandas changes to >> numpy). Further look at how long you have advocated (correctly) for labeled >> arrays in numpy (which we are still waiting). >> > > I'm actually not convinced NumPy needs labeled arrays. In my mind, > libraries like pandas and xarray solve the labeled array problem very well > downstream of NumPy. There are costs to making the basic libraries label > aware. > > >> I'd like to see pandas itself focus more on the data-structures and less >>> on the data types. This would let us share more work with the "general >>> purpose array / scientific computing libraries". >>> >>> Pandas IS about specifying the correct data types. It is simply >> incorrect to decouple this problem from the data-structures. A lot of >> effort over the years has gone into >> making all dtypes playing nice with each other and within pandas. >> > > Yes, a lot of effort has gone into dtypes in pandas. This is great! But > wouldn't it be even better if we had a viable path for pushing this stuff > upstream? ;) > > >> maybe, but DyND has to have full compat with what currently is out there >> (soonish). Then I agree this could be possible. But wouldn't it be even >> better >> for pandas to be able to swap back-ends. Why limit ourselves to a >> particular backend if its not that difficult. >> > > Well, Irwin, what do you say? :) > > I'm just saying that in my ideal world, we would not invent a new dtype > standard for pandas (insert obligatory xkcd reference here). > > I disagree entirely here. I think that Series/DataFrame ARE becoming >> primitive objects. 
Look at seaborn, statsmodels, and xarray These are first >> class users of these structures, whom need the additional meta-data >> attached. >> > > Seaborn does use Series/DataFrame internally as first class data > structures. But for xarray and statsmodels it is the other way around -- > pandas objects are accepted as input, but coerced into NumPy arrays > internally for storage and manipulation. This presents issues for new types > with metadata like categorical. > > Best, > Stephan > >> > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wesmckinn at gmail.com Mon Jan 11 19:23:51 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Mon, 11 Jan 2016 16:23:51 -0800 Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap In-Reply-To: References: Message-ID: On Mon, Jan 11, 2016 at 3:04 PM, Jeff Reback wrote: > I am in favor of the Wes refactoring, but for some slightly different > reasons. > > I am including some in-line comments. > > On Mon, Jan 11, 2016 at 2:55 PM, Stephan Hoyer wrote: >>> >>> I don't see alternative ways for pandas to have a truly healthy >>> relationship with more general purpose array / scientific computing >>> libraries without being able to add new pandas functionality in a >>> clean way, and without requiring us to get patches accepted (and >>> released) in NumPy or DyND. >> >> >> Indeed, I think my disagreement is mostly about the order in which we >> approach these problems. > > > I agree here. I had started on *some* of this to enable swappable numpy to > DyND to support IntNA (all in python, > but the fundamental change was to provide an API layer to the back-end). > >> >> >>> >>> Can you clarify what aspects of this plan are disagreeable / >>> contentious? >> >> >> See my comments below. >> >>> >>> Are you arguing for pandas becoming more of a companion >>> tool / user interface layer for NumPy or DyND? >> >> >> Not quite. Pandas has some fantastic and highly useable data (Series, >> DataFrame, Index). These certainly don't belong in NumPy or DyND. >> >> However, the array-based ecosystem certainly could use improvements to >> dtypes (e.g., datetime and categorical) and dtype specific methods (e.g., >> for strings) just as much as pandas. I do firmly believe that pushing these >> types of improvements upstream, rather than implementing them independently >> for pandas, would yield benefits for the broader ecosystem. With the right >> infrastructure, generalizing things to arrays is not much more work. > > > I dont' think Wes nor I disagree here at all. The problem was (and is), the > pace of change in the underlying libraries. It is simply too slow > for pandas development efforts. > > I think the pandas efforts (and other libraries) can result in more powerful > fundamental libraries > that get pushed upstream. However, it would not benefit ANYONE to slow down > downstream efforts. I am not sure why you suggest that we WAIT for the > upstream libraries to change? We have been waiting forever for that. Now we > have a concrete implementation of certain data types that are useful. They > (upstream) can take > this and build on (or throw it away and make a better one or whatever). But > I don't think it benefits anyone to WAIT for someone to change numpy first. > Look at how long it took them to (partially) fix datetimes. 
> > xarray in particular has done the same thing to pandas, e.g. you have added > additional selection operators and syntax (e.g. passing dicts of named > axes). These changes are in fact propagating to pandas. This has taken time > (but much, much less than it took for any of pandas's changes to numpy). > Further, look at how long you have advocated (correctly) for labeled arrays > in numpy (for which we are still waiting). > >> >> >> I'd like to see pandas itself focus more on the data-structures and >> less >> on the data types. This would let us share more work with the "general >> purpose array / scientific computing libraries". >> >> > Pandas IS about specifying the correct data types. It is simply > incorrect to > decouple this problem from the data-structures. A lot of effort over the > years has gone into > making all dtypes play nicely with each other and within pandas. > >>> >> >>> 1) Introduce a proper (from a software engineering perspective) >> >>> logical data type abstraction that models the way that pandas already >> >>> works, but cleaning up all the mess (implicit upcasts, lack of a real >> >>> "NA" scalar value, making pandas-specific methods like unique, >> >>> factorize, match, etc. true "array methods") >> >> >> >> New abstractions have a cost. A new logical data type abstraction is >> >> better than no proper abstraction at all, but (in principle), one data >> >> type >> >> abstraction should be enough to share. >> >> >> > >> >> >> >> A proper logical data type abstraction would be an improvement over the >> >> current situation, but if there's a way we could introduce one less >> >> abstraction (by improving things upstream in a general purpose array >> >> library) that would help even more. >> >> >> > >> > This is just pushing a problem upstream, which ultimately, given numpy's >> > track record, won't be solved at all. We will be here 1 year from >> > now >> > with the exact same discussion. Why are we waiting on upstream for >> > anything? >> > As I said above, if something is created which upstream finds useful on a >> > general level, great. The great cost here is time. >> > >> >> >> >> For example, we could imagine pushing to make DyND the new core for >> >> pandas. This could be enough of a push to make DyND generally useful -- >> I >> >> know it still has a few kinks to work out. >> >> >> > >> > Maybe, but DyND has to have full compatibility with what is currently out there >> > (soonish). Then I agree this could be possible. But wouldn't it be even >> > better >> > for pandas to be able to swap back-ends? Why limit ourselves to a >> particular >> > backend if it's not that difficult? >> > >> I think Jeff and I are on the same page here. 5 years ago we were having the *exact same* discussions around NumPy and adding new data type functionality. 5 years is a staggering amount of time in open source. It was less than 5 years between pandas not existing and being a super popular project with 2/3 of a best-selling O'Reilly book written about it. To wit, DyND exists in large part because of the difficulty in making progress within NumPy. Now, as 5 years ago, I think we should be acting in the best interests of pandas users, and what I've been describing is intended as a straightforward (though definitely labor-intensive) and relatively low-risk plan that will "future-proof" the pandas user API for at least the next few years, and probably much longer. If we find that enabling some internals to use DyND is the right choice, we can do that in a non-invasive way while carefully minding data interoperability. Meaningful performance benefits would be a clear motivation.
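To make "carefully minding data interoperability" a little more concrete, here is a toy Python sketch (the wrapper class is made up for illustration, not proposed pandas code) of the invariant I would want to hold no matter which library owns the memory: handing data to the NumPy world and back should not copy.

import numpy as np

class WrappedArray(object):
    """Toy stand-in for a backend-owned array (hypothetical name)."""
    def __init__(self, values):
        self.values = values

    def __array__(self):
        # NumPy interop hook: hand back the same memory, no copy
        return self.values

data = np.arange(10, dtype='float64')
roundtrip = np.asarray(WrappedArray(data))
assert np.may_share_memory(roundtrip, data)  # no defensive copy was made

If that property holds, swapping the owner of the bytes is an implementation detail rather than a user-visible change.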
To be 100% open and transparent (in the spirit of pandas's new governance docs): Before committing to using DyND in any binding way (i.e. required, as opposed to opt-in) in pandas, I'd really like to see more evidence from 3rd parties without direct financial interest (i.e. employment or equity from Continuum) that DyND is "the future of Python array computing"; in the absence of significant user and community code contribution, it still feels like a political quagmire leftover from the Continuum-Enthought rift in 2011. - Wes >>> >>> 4) Give pandas objects a real C API so that users can manipulate and >>> create pandas objects with their own native (C/C++/Cython) code. >> >> >>> 5) Yes, absolutely improve NumPy and DyND and transition to improved >>> NumPy and DyND facilities as soon as they are available and shipped >> >> >> I like the sound of both of these. > > > > Further you made a point above > >> You are right that pandas has started to supplant numpy as a high level >> API for data analysis, but of course the robust (and often numpy based) >> Python ecosystem is part of what has made pandas so successful. In practice, >> ecosystem projects often want to work with more primitive objects than >> series/dataframes in their internal data structures and without numpy this >> becomes more difficult. For example, how do you concatenate a list of >> categoricals? If these were numpy arrays, we could use np.concatenate, but >> the current implementation of categorical would require a custom solution. >> First class compatibility with pandas is harder when pandas data cannotbe >> used with a full ndarray API. > > > I disagree entirely here. I think that Series/DataFrame ARE becoming > primitive objects. Look at seaborn, statsmodels, and xarray These are first > class users of these structures, whom need the additional meta-data > attached. > > Yes categorical are useful in numpy, and they should support them. But lots > of libraries can simply use pandas and do lots of really useful stuff. > However, why reinvent the wheel and use numpy, when you have DataFrames. > > From a user point of view, I don't think they even care about numpy (or > whatever drives pandas). It solves a very general problem of working with > labeled data. > > Jeff From shoyer at gmail.com Mon Jan 11 19:34:00 2016 From: shoyer at gmail.com (Stephan Hoyer) Date: Mon, 11 Jan 2016 16:34:00 -0800 Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap In-Reply-To: References: Message-ID: On Mon, Jan 11, 2016 at 4:19 PM, Jeff Reback wrote: > Seaborn does use Series/DataFrame internally as first class data >> structures. But for xarray and statsmodels it is the other way around -- >> pandas objects are accepted as input, but coerced into NumPy arrays >> internally for storage and manipulation. This presents issues for new types >> with metadata like categorical. > > > > care to elaborate on the xarray decision to keep data as numpy arrays, > rather than Series in DataArray? (as you do keep the Index objects intact). > Sure -- the main point of xarray is that we need N-dimensional data structures, so we definitely need to support NumPy as a backend. Xarray operations are defined in terms of NumPy (or dask) arrays. In principle, we could store data as a Series, but for the sake of sanity we would need to convert to NumPy arrays before doing any operations. Duck typing compatibility is nice in theory, but in practice lots of subtle issues tend to come up. 
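A toy illustration of the kind of subtlety I mean -- nothing xarray-specific here, just what the coercion itself does to a categorical:

import numpy as np
import pandas as pd

s = pd.Series(pd.Categorical(['a', 'b', 'a', 'c']))
print(s.dtype)              # category
print(np.asarray(s).dtype)  # object -- the codes and categories are gone

Once you are holding the object array, recovering the categorical means re-inferring everything on the way back; that is the cost of coercing at the boundary.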
The alternative is to write our own ndarray abstraction internally to xarray that could handle special types like Categorical, but I'm pretty reluctant to do that. It seems like a lot of work, and numpy is "good enough" in most cases. And, of course, I'd rather solve those problems upstream :). Stephan -------------- next part -------------- An HTML attachment was scrubbed... URL: From izaid at continuum.io Tue Jan 12 16:32:23 2016 From: izaid at continuum.io (Irwin Zaid) Date: Tue, 12 Jan 2016 15:32:23 -0600 Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap In-Reply-To: References: Message-ID: Hi all, Stephan Hoyer asked me to comment on DyND and it's relation to the changes in Pandas that we're discussing here, so I'd like to do that. But, before I do, I want to clear up some misconceptions about DyND's history from Wes' most recent email. To be 100% open and transparent (in the spirit of pandas's new > governance docs): Before committing to using DyND in any binding way > (i.e. required, as opposed to opt-in) in pandas, I'd really like to > see more evidence from 3rd parties without direct financial interest > (i.e. employment or equity from Continuum) that DyND is "the future of > Python array computing"; in the absence of significant user and > community code contribution, it still feels like a political quagmire > leftover from the Continuum-Enthought rift in 2011. > Let's be very clear about the history (and present) of DyND -- and I think Travis Oliphant captured it well in his email to the NumPy list some months ago: https://mail.scipy.org/pipermail/numpy-discussion/2015-August/073412.html DyND was started as a personal project of Mark Wiebe in September 2011, and you can see the first commit at https://github.com/libdynd/libdynd/commit/768ac9a30cdb4619d09f4656bfd895ab2b91185d. At the time, Mark was at the University of British Columbia. He joined Continuum part-time when it was founded in January 2012, and later became full-time in the spring of 2012. DyND, therefore, predates Continuum and never had any relationship with Enthought. As Travis said in his email to the NumPy list (link above), after that "Continuum supported DyND with some fraction of Mark's time". Mark can speak more about this if he wishes, but the point is that DyND's origins are not "a political quagmire leftover from the Continuum-Enthought rift in 2011". Also, Mark left Continuum in December 2014, so everything contributed after that had nothing to do with Continuum. Now let's move to the other main DyND developers, me and Ian Henriksen. Until June 29, 2015, I had no relationship with Continuum, Enthought, or even the people we're speaking about in this thread. I knew Mark and that was it. I started working on DyND in January 2014, meaning I contributed to it just by choice for 1.5 years. And, if you look at my commit contributions at https://github.com/libdynd/libdynd/graphs/contributors, you'll see that represents about 50% of all of my contributions. And I've contributed a lot. Ian was originally a Google Summer of Code student that DyND applied for as an open-source project, through NumFOCUS, in the summer of 2015. He started on May 25, 2015 and went until the end of August. Anything he contributed in this time had nothing to do with Continuum. He formally joined Continuum on September 1, 2015. So, basically, a majority of DyND's commits were given freely by Mark, myself, and Ian. Now, at present, both Ian and I are sponsored by Continuum. 
And, yes, they are very graciously supporting us to work on DyND, like they did in the past with Mark. While I understand that, in theory, that could potentially be a conflict of interest, let me be very clear about one thing: Continuum has always approached DyND in a very balanced way, letting it grow as it needs while encouraging interaction with Pandas and other open-source projects in the ecosystem. The decisions we make for DyND have been decisions we've taken for the good of the project. And, yes, the eventual goal of DyND is to move from incubation at Continuum to a NumFOCUS-sponsored project. And we'll do that as soon as we can. Irwin -------------- next part -------------- An HTML attachment was scrubbed... URL: From wesmckinn at gmail.com Tue Jan 12 17:57:28 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Tue, 12 Jan 2016 14:57:28 -0800 Subject: [Pandas-dev] DyND and pandas [was Rewriting some of internals of pandas in C/C++? / Roadmap] Message-ID: hi, This discussion doesn't belong on this mailing list, but a couple of brief points. On Tue, Jan 12, 2016 at 1:32 PM, Irwin Zaid wrote: > Hi all, > > Stephan Hoyer asked me to comment on DyND and it's relation to the changes > in Pandas that we're discussing here, so I'd like to do that. But, before I > do, I want to clear up some misconceptions about DyND's history from Wes' > most recent email. > >> To be 100% open and transparent (in the spirit of pandas's new >> governance docs): Before committing to using DyND in any binding way >> (i.e. required, as opposed to opt-in) in pandas, I'd really like to >> see more evidence from 3rd parties without direct financial interest >> (i.e. employment or equity from Continuum) that DyND is "the future of >> Python array computing"; in the absence of significant user and >> community code contribution, it still feels like a political quagmire >> leftover from the Continuum-Enthought rift in 2011. > > > Let's be very clear about the history (and present) of DyND -- and I think > Travis Oliphant captured it well in his email to the NumPy list some months > ago: > https://mail.scipy.org/pipermail/numpy-discussion/2015-August/073412.html > > DyND was started as a personal project of Mark Wiebe in September 2011, and > you can see the first commit at > https://github.com/libdynd/libdynd/commit/768ac9a30cdb4619d09f4656bfd895ab2b91185d. > At the time, Mark was at the University of British Columbia. He joined > Continuum part-time when it was founded in January 2012, and later became > full-time in the spring of 2012. DyND, therefore, predates Continuum and > never had any relationship with Enthought. As Travis said in his email to > the NumPy list (link above), after that "Continuum supported DyND with some > fraction of Mark's time". Mark can speak more about this if he wishes, but > the point is that DyND's origins are not "a political quagmire leftover from > the Continuum-Enthought rift in 2011". Also, Mark left Continuum in December > 2014, so everything contributed after that had nothing to do with Continuum. > I was approached by Travis and Peter about being a part of Continuum Analytics in late 2011. According to my e-mail records we were having these discussions at least as early as October 2011. The phrase "NumPy 2.0" was spoken in this epoch (referring to -the-project-now-known-as-DyND). 
So, I have quite a bit of first- and second-hand information from this time period, including many of the details of Mark's Enthought-sponsored NumPy development and the problems that occurred online and offline. > Now let's move to the other main DyND developers, me and Ian Henriksen. > > Until June 29, 2015, I had no relationship with Continuum, Enthought, or > even the people we're speaking about in this thread. I knew Mark and that > was it. I started working on DyND in January 2014, meaning I contributed to > it just by choice for 1.5 years. And, if you look at my commit contributions > at https://github.com/libdynd/libdynd/graphs/contributors, you'll see that > represents about 50% of all of my contributions. And I've contributed a lot. > > Ian was originally a Google Summer of Code student that DyND applied for as > an open-source project, through NumFOCUS, in the summer of 2015. He started > on May 25, 2015 and went until the end of August. Anything he contributed in > this time had nothing to do with Continuum. He formally joined Continuum on > September 1, 2015. > > So, basically, a majority of DyND's commits were given freely by Mark, > myself, and Ian. > > Now, at present, both Ian and I are sponsored by Continuum. And, yes, they > are very graciously supporting us to work on DyND, like they did in the past > with Mark. While I understand that, in theory, that could potentially be a > conflict of interest, let me be very clear about one thing: Continuum has > always approached DyND in a very balanced way, letting it grow as it needs > while encouraging interaction with Pandas and other open-source projects in > the ecosystem. The decisions we make for DyND have been decisions we've > taken for the good of the project. > > And, yes, the eventual goal of DyND is to move from incubation at Continuum > to a NumFOCUS-sponsored project. And we'll do that as soon as we can. > I applaud Continuum for using R&D budget to build something new and forward thinking that is also permissively licensed open source software. However, it is well known that open source projects driven by for-profit organizations can run into governance problems that place them in conflict with the community. Since DyND is a large project that I would not be comfortable forking (if that were required in the future), building an outside developer and user community is essential if pandas is to consider using it as a hard dependency in the future. The Apache Software Foundation exists for this reason and others, and if you wish to place a community-oriented and merit-based governance structure around DyND to assist with its incubation, the ASF may be worth pursuing. NumFOCUS provides a fiscal sponsorship apparatus but does not really address the governance questions. Whether or not the governance issues are real doesn't really matter; it's about setting people's minds at ease. Thanks, Wes > Irwin From izaid at continuum.io Tue Jan 12 18:20:13 2016 From: izaid at continuum.io (Irwin Zaid) Date: Tue, 12 Jan 2016 17:20:13 -0600 Subject: [Pandas-dev] DyND and pandas [was Rewriting some of internals of pandas in C/C++? / Roadmap] In-Reply-To: References: Message-ID: > This discussion doesn't belong on this mailing list, but a couple of > brief points. > Wes, if you don't want this discussion on this mailing list then don't say things like: "it still feels like a political quagmirie leftover from the Continuum-Enthought rift in 2011". My email reply to that was simply a statement of facts, as this one will also be. 
I was approached by Travis and Peter about being a part of Continuum > Analytics in late 2011. According to my e-mail records we were having > these discussions at least as early as October 2011. The phrase "NumPy > 2.0" was spoken in this epoch (referring to > -the-project-now-known-as-DyND). So, I have quite a bit of first- and > second-hand information from this time period, including many of the > details of Mark's Enthought-sponsored NumPy development and the > problems that occurred online and offline. > The phrase "NumPy 2.0" means a number of things, and DyND was not one of them. Yes, you have some first-hand knowledge, but it's not relevant. Even IF it was, a lot of modern DyND also came from my massive contribution before I joined Continuum. Mark will speak up here as well. > I applaud Continuum for using R&D budget to build something new and > forward thinking that is also permissively licensed open source > software. However, it is well known that open source projects driven > by for-profit organizations can run into governance problems that > place them in conflict with the community. Since DyND is a large > project that I would not be comfortable forking (if that were required > in the future), building an outside developer and user community is > essential if pandas is to consider using it as a hard dependency in > the future. > > The Apache Software Foundation exists for this reason and others, and > if you wish to place a community-oriented and merit-based governance > structure around DyND to assist with its incubation, the ASF may be > worth pursuing. NumFOCUS provides a fiscal sponsorship apparatus but > does not really address the governance questions. Whether or not the > governance issues are real doesn't really matter; it's about setting > people's minds at ease. > Okay, let me state again: The majority of DyND's contributions (as net from Mark, myself, and Ian) came without Continuum funding. Just because Continuum is funding DyND now does not make it a "Continuum project", whatever this means. Some of your other points are valid, and we'll address them as best we can as time goes on. DyND clearly needs a community, but it's a chicken-and-egg problem. If you try and build something hard, it takes time and users come when things work. The issue of refactoring Pandas is a different one that I'll add comments to in another email. Irwin -------------- next part -------------- An HTML attachment was scrubbed... URL: From jeffreback at gmail.com Tue Jan 12 18:41:45 2016 From: jeffreback at gmail.com (Jeff Reback) Date: Tue, 12 Jan 2016 18:41:45 -0500 Subject: [Pandas-dev] DyND and pandas [was Rewriting some of internals of pandas in C/C++? / Roadmap] In-Reply-To: References: Message-ID: So this thread is off-topic, but I believe the gist of what Wes is proposing, from a technical point of view, for libpandas is:

- the user-facing pandas API will not change (except better perf / copy-on-write etc)
- the back-end API should not change much either
- a C API for the back-end
- allows swappable / agnostic numpy-like back-ends
- ideally libpandas won't rewrite a completely new dtype system; maybe it could co-opt datashape / pluribus for extensible dtypes

If the above are met by a back-end, e.g. numpy, potentially DyND, then a back-end should be allowed (certainly as an optional dep, whether it's required or not can be a choice made down the road).
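To illustrate the swappable back-end point above in python terms (all names here are hypothetical -- this is a sketch of the shape of the layer, not a design):

import numpy as np

class ArrayBackend(object):
    """The small surface libpandas would require of a back-end."""
    def isnull(self, values):
        raise NotImplementedError
    def take(self, values, indexer):
        raise NotImplementedError
    def to_numpy(self, values):
        raise NotImplementedError

class NumPyBackend(ArrayBackend):
    """Sentinel-based NA, as pandas does today for float64."""
    def isnull(self, values):
        return np.isnan(values)
    def take(self, values, indexer):
        return values.take(indexer)
    def to_numpy(self, values):
        return values

backend = NumPyBackend()
backend.isnull(np.array([1.0, np.nan]))  # -> array([False,  True])

# a DyND-backed (or bitmask-NA) implementation would provide the same
# methods, and Series/DataFrame would only ever talk to this interface

Whether a given back-end is then required or optional really is just a packaging decision.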
I think that during implementation, Wes will be cognizant of these points, and leave things as wide open as possible w/o going down the road we are currently in (where lots of different APIs are intermixed). Jeff On Tue, Jan 12, 2016 at 6:20 PM, Irwin Zaid wrote: > > This discussion doesn't belong on this mailing list, but a couple of >> brief points. >> > > Wes, if you don't want this discussion on this mailing list then don't say > things like: "it still feels like a political quagmirie leftover from the > Continuum-Enthought rift in 2011". My email reply to that was simply a > statement of facts, as this one will also be. > > I was approached by Travis and Peter about being a part of Continuum >> Analytics in late 2011. According to my e-mail records we were having >> these discussions at least as early as October 2011. The phrase "NumPy >> 2.0" was spoken in this epoch (referring to >> -the-project-now-known-as-DyND). So, I have quite a bit of first- and >> second-hand information from this time period, including many of the >> details of Mark's Enthought-sponsored NumPy development and the >> problems that occurred online and offline. >> > > The phrase "NumPy 2.0" means a number of things, and DyND was not one of > them. Yes, you have some first-hand knowledge, > but it's not relevant. Even IF it was, a lot of modern DyND also came from > my massive contribution before I joined Continuum. > > Mark will speak up here as well. > > >> I applaud Continuum for using R&D budget to build something new and >> forward thinking that is also permissively licensed open source >> software. However, it is well known that open source projects driven >> by for-profit organizations can run into governance problems that >> place them in conflict with the community. Since DyND is a large >> project that I would not be comfortable forking (if that were required >> in the future), building an outside developer and user community is >> essential if pandas is to consider using it as a hard dependency in >> the future. >> >> The Apache Software Foundation exists for this reason and others, and >> if you wish to place a community-oriented and merit-based governance >> structure around DyND to assist with its incubation, the ASF may be >> worth pursuing. NumFOCUS provides a fiscal sponsorship apparatus but >> does not really address the governance questions. Whether or not the >> governance issues are real doesn't really matter; it's about setting >> people's minds at ease. >> > > Okay, let me state again: The majority of DyND's contributions (as net > from Mark, myself, and Ian) came without Continuum funding. Just because > Continuum is funding DyND now does not make it a "Continuum project", > whatever this means. > > Some of your other points are valid, and we'll address them as best we can > as time goes on. DyND clearly needs a community, but it's a chicken-and-egg > problem. If you try and build something hard, it takes time and users come > when things work. > > The issue of refactoring Pandas is a different one that I'll add comments > to in another email. > > Irwin > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From wesmckinn at gmail.com Tue Jan 12 18:50:33 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Tue, 12 Jan 2016 15:50:33 -0800 Subject: [Pandas-dev] DyND and pandas [was Rewriting some of internals of pandas in C/C++? / Roadmap] In-Reply-To: References: Message-ID: On Tue, Jan 12, 2016 at 3:41 PM, Jeff Reback wrote: > So this thread is off-topic, but I believe the gist of what wes is proposing > from a technical point of view for libpandas: > > - the user facing pandas API will not change (except better perf / > copy-on-write etc) > - the back-end API should not change much either > - c-API for the back-end. > - allows swappable / agnostic numpy-like back-ends. > - ideally libpandas won't rewrite a completely new dtype system, maybe could > co-op datashape / pluribus for extensible dtypes > > If the above are met by a back-end, e.g. numpy, potentially DyND, then it a > back-end should be allowed > (certainly as an optional dep, whether its required or not can be a choice > made down the road). > > I think during implementation, that wes will be congnizant of these points, > and leave things as wide open as > possible w/o going down the road we are currently in (where lots of > different API's are intermixed). > Yep, you nailed it. > Jeff > > > On Tue, Jan 12, 2016 at 6:20 PM, Irwin Zaid wrote: >> >> >>> This discussion doesn't belong on this mailing list, but a couple of >>> brief points. >> >> >> Wes, if you don't want this discussion on this mailing list then don't say >> things like: "it still feels like a political quagmirie leftover from the >> Continuum-Enthought rift in 2011". My email reply to that was simply a >> statement of facts, as this one will also be. >> >>> I was approached by Travis and Peter about being a part of Continuum >>> Analytics in late 2011. According to my e-mail records we were having >>> these discussions at least as early as October 2011. The phrase "NumPy >>> 2.0" was spoken in this epoch (referring to >>> -the-project-now-known-as-DyND). So, I have quite a bit of first- and >>> second-hand information from this time period, including many of the >>> details of Mark's Enthought-sponsored NumPy development and the >>> problems that occurred online and offline. >> >> >> The phrase "NumPy 2.0" means a number of things, and DyND was not one of >> them. Yes, you have some first-hand knowledge, >> but it's not relevant. Even IF it was, a lot of modern DyND also came from >> my massive contribution before I joined Continuum. >> >> Mark will speak up here as well. >> >>> >>> I applaud Continuum for using R&D budget to build something new and >>> forward thinking that is also permissively licensed open source >>> software. However, it is well known that open source projects driven >>> by for-profit organizations can run into governance problems that >>> place them in conflict with the community. Since DyND is a large >>> project that I would not be comfortable forking (if that were required >>> in the future), building an outside developer and user community is >>> essential if pandas is to consider using it as a hard dependency in >>> the future. >>> >>> The Apache Software Foundation exists for this reason and others, and >>> if you wish to place a community-oriented and merit-based governance >>> structure around DyND to assist with its incubation, the ASF may be >>> worth pursuing. NumFOCUS provides a fiscal sponsorship apparatus but >>> does not really address the governance questions. 
Whether or not the >>> governance issues are real doesn't really matter; it's about setting >>> people's minds at ease. >> >> >> Okay, let me state again: The majority of DyND's contributions (as net >> from Mark, myself, and Ian) came without Continuum funding. Just because >> Continuum is funding DyND now does not make it a "Continuum project", >> whatever this means. >> >> Some of your other points are valid, and we'll address them as best we can >> as time goes on. DyND clearly needs a community, but it's a chicken-and-egg >> problem. If you try and build something hard, it takes time and users come >> when things work. >> >> The issue of refactoring Pandas is a different one that I'll add comments >> to in another email. >> >> Irwin >> >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > From izaid at continuum.io Tue Jan 12 18:54:06 2016 From: izaid at continuum.io (Irwin Zaid) Date: Tue, 12 Jan 2016 17:54:06 -0600 Subject: [Pandas-dev] DyND and pandas [was Rewriting some of internals of pandas in C/C++? / Roadmap] In-Reply-To: References: Message-ID: Thanks, Jeff. Let's talk about this. > So this thread is off-topic, but I believe the gist of what wes is > proposing from a technical point of view for libpandas: > > - the user facing pandas API will not change (except better perf / > copy-on-write etc) > - the back-end API should not change much either > - c-API for the back-end. > - allows swappable / agnostic numpy-like back-ends. > - ideally libpandas won't rewrite a completely new dtype system, maybe > could co-op datashape / pluribus for extensible dtypes > For the most part, I think these are good ideas, but I share many of Stephan's concerns. I'd much rather we improve the array ecosystem in general and, very specifically, I don't think new dtypes should be added to pandas via libpandas. What I'd really like to see is for Wes and I to collaborate on *something* that solves the dtype problem and can be shared across libraries. I think Wes and I working together could result in potentially phenomenal things, both for pandas and other projects. I believe that the DyND type system is pretty close to a solution here, I think it could be spun out as an independent data description system. If for some reason the DyND type system is not sufficient, I'd *still* be happy to work together on a solution that has nothing to do with DyND. Of course, I'm not a pandas developer. But, at the same time, I'm offering to do free work here to help pandas. If the above are met by a back-end, e.g. numpy, potentially DyND, then it a > back-end should be allowed > (certainly as an optional dep, whether its required or not can be a choice > made down the road). > > I think during implementation, that wes will be congnizant of these > points, and leave things as wide open as > possible w/o going down the road we are currently in (where lots of > different API's are intermixed). > If the above is true, that sounds great. Wes, I'd appreciate it if you left opinions about Continuum funding DyND out of it -- we've both had our say now. Irwin -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Tue Jan 12 19:06:55 2016 From: shoyer at gmail.com (Stephan Hoyer) Date: Tue, 12 Jan 2016 16:06:55 -0800 Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? 
/ Roadmap In-Reply-To: References: Message-ID: I think I'm mostly on the same page as well. Five years has certainly been too long. I agree that it would be premature to commit to using DyND in a binding way in pandas. A lot seems to be up in the air with regards to dtypes in Python right now (yes, particularly from projects sponsored by Continuum). So I would advocate for proceeding with the refactor for now (which will have numerous other benefits), and see how the situation evolves. If it seems like we are in a plausible position to unify the dtype system with a tool like DyND, then let's seriously consider that down the road. Either way, explicit interfaces (e.g., to_numpy(), to_dynd()) will help. On Mon, Jan 11, 2016 at 4:23 PM, Wes McKinney wrote: > On Mon, Jan 11, 2016 at 3:04 PM, Jeff Reback wrote: > > I am in favor of the Wes refactoring, but for some slightly different > > reasons. > > > > I am including some in-line comments. > > > > On Mon, Jan 11, 2016 at 2:55 PM, Stephan Hoyer wrote: > >>> > >>> I don't see alternative ways for pandas to have a truly healthy > >>> relationship with more general purpose array / scientific computing > >>> libraries without being able to add new pandas functionality in a > >>> clean way, and without requiring us to get patches accepted (and > >>> released) in NumPy or DyND. > >> > >> > >> Indeed, I think my disagreement is mostly about the order in which we > >> approach these problems. > > > > > > I agree here. I had started on *some* of this to enable swappable numpy > to > > DyND to support IntNA (all in python, > > but the fundamental change was to provide an API layer to the back-end). > > > >> > >> > >>> > >>> Can you clarify what aspects of this plan are disagreeable / > >>> contentious? > >> > >> > >> See my comments below. > >> > >>> > >>> Are you arguing for pandas becoming more of a companion > >>> tool / user interface layer for NumPy or DyND? > >> > >> > >> Not quite. Pandas has some fantastic and highly useable data (Series, > >> DataFrame, Index). These certainly don't belong in NumPy or DyND. > >> > >> However, the array-based ecosystem certainly could use improvements to > >> dtypes (e.g., datetime and categorical) and dtype specific methods > (e.g., > >> for strings) just as much as pandas. I do firmly believe that pushing > these > >> types of improvements upstream, rather than implementing them > independently > >> for pandas, would yield benefits for the broader ecosystem. With the > right > >> infrastructure, generalizing things to arrays is not much more work. > > > > > > I dont' think Wes nor I disagree here at all. The problem was (and is), > the > > pace of change in the underlying libraries. It is simply too slow > > for pandas development efforts. > > > > I think the pandas efforts (and other libraries) can result in more > powerful > > fundamental libraries > > that get pushed upstream. However, it would not benefit ANYONE to slow > down > > downstream efforts. I am not sure why you suggest that we WAIT for the > > upstream libraries to change? We have been waiting forever for that. Now > we > > have a concrete implementation of certain data types that are useful. > They > > (upstream) can take > > this and build on (or throw it away and make a better one or whatever). > But > > I don't think it benefits anyone to WAIT for someone to change numpy > first. > > Look at how long it took them to (partially) fix datetimes. > > > > xarray in particular has done the same thing to pandas, e.g. 
you have > added > > additional selection operators and syntax (e.g. passing dicts of named > > axes). These changes are in fact propogating to pandas. This has taken > time > > (but much much less that this took for any of pandas changes to numpy). > > Further look at how long you have advocated (correctly) for labeled > arrays > > in numpy (which we are still waiting). > > > >> > >> > >> I'd like to see pandas itself focus more on the data-structures and less > >> on the data types. This would let us share more work with the "general > >> purpose array / scientific computing libraries". > >> > > Pandas IS about specifying the correct data types. It is simply > incorrect to > > decouple this problem from the data-structures. A lot of effort over the > > years has gone into > > making all dtypes playing nice with each other and within pandas. > > > >>> > >>> 1) Introduce a proper (from a software engineering perspective) > >>> logical data type abstraction that models the way that pandas already > >>> works, but cleaning up all the mess (implicit upcasts, lack of a real > >>> "NA" scalar value, making pandas-specific methods like unique, > >>> factorize, match, etc. true "array methods") > >> > >> > >> New abstractions have a cost. A new logical data type abstraction is > >> better than no proper abstraction at all, but (in principle), one data > type > >> abstraction should be enough to share. > >> > > > >> > >> A proper logical data type abstraction would be an improvement over the > >> current situation, but if there's a way we could introduce one less > >> abstraction (by improving things upstream in a general purpose array > >> library) that would help even more. > >> > > > > This is just pushing a problem upstream, which ultimately, given the > track > > history of numpy, won't be solved at all. We will be here 1 year from now > > with the exact same discussion. Why are we waiting on upstream for > anything? > > As I said above, if something is created which upstream finds useful on a > > general level. great. The great cost here is time. > > > >> > >> For example, we could imagine pushing to make DyND the new core for > >> pandas. This could be enough of a push to make DyND generally useful -- > I > >> know it still has a few kinks to work out. > >> > > > > maybe, but DyND has to have full compat with what currently is out there > > (soonish). Then I agree this could be possible. But wouldn't it be even > > better > > for pandas to be able to swap back-ends. Why limit ourselves to a > particular > > backend if its not that difficult. > > > > I think Jeff and I are on the same page here. 5 years ago we were > having the *exact same* discussions around NumPy and adding new data > type functionality. 5 years is a staggering amount of time in open > source. It was less than 5 years between pandas not existing and being > a super popular project with 2/3 of a best-selling O'Reilly book > written about it. To whit, DyND exists in large part because of the > difficulty in making progress within NumPy. > > Now, as 5 years ago, I think we should be acting in the best interests > of pandas users, and what I've been describing is intended as a > straightforward (though definitely labor intensive) and relatively > low-risk plan that will "future-proof" the pandas user API for at > least the next few years, and probably much longer. If we find that > enabling some internals to use DyND is the right choice, we can do > that in a non-invasive way while carefully minding data > interoperability. 
Meaningful performance benefits would be a clear > motivation. > > To be 100% open and transparent (in the spirit of pandas's new > governance docs): Before committing to using DyND in any binding way > (i.e. required, as opposed to opt-in) in pandas, I'd really like to > see more evidence from 3rd parties without direct financial interest > (i.e. employment or equity from Continuum) that DyND is "the future of > Python array computing"; in the absence of significant user and > community code contribution, it still feels like a political quagmire > leftover from the Continuum-Enthought rift in 2011. > > - Wes > > >>> > >>> 4) Give pandas objects a real C API so that users can manipulate and > >>> create pandas objects with their own native (C/C++/Cython) code. > >> > >> > >>> 5) Yes, absolutely improve NumPy and DyND and transition to improved > >>> NumPy and DyND facilities as soon as they are available and shipped > >> > >> > >> I like the sound of both of these. > > > > > > > > Further you made a point above > > > >> You are right that pandas has started to supplant numpy as a high level > >> API for data analysis, but of course the robust (and often numpy based) > >> Python ecosystem is part of what has made pandas so successful. In > practice, > >> ecosystem projects often want to work with more primitive objects than > >> series/dataframes in their internal data structures and without numpy > this > >> becomes more difficult. For example, how do you concatenate a list of > >> categoricals? If these were numpy arrays, we could use np.concatenate, > but > >> the current implementation of categorical would require a custom > solution. > >> First class compatibility with pandas is harder when pandas data > cannotbe > >> used with a full ndarray API. > > > > > > I disagree entirely here. I think that Series/DataFrame ARE becoming > > primitive objects. Look at seaborn, statsmodels, and xarray These are > first > > class users of these structures, whom need the additional meta-data > > attached. > > > > Yes categorical are useful in numpy, and they should support them. But > lots > > of libraries can simply use pandas and do lots of really useful stuff. > > However, why reinvent the wheel and use numpy, when you have DataFrames. > > > > From a user point of view, I don't think they even care about numpy (or > > whatever drives pandas). It solves a very general problem of working with > > labeled data. > > > > Jeff > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wesmckinn at gmail.com Tue Jan 12 19:49:33 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Tue, 12 Jan 2016 16:49:33 -0800 Subject: [Pandas-dev] DyND and pandas [was Rewriting some of internals of pandas in C/C++? / Roadmap] In-Reply-To: References: Message-ID: On Tue, Jan 12, 2016 at 3:54 PM, Irwin Zaid wrote: > Thanks, Jeff. Let's talk about this. > >> >> So this thread is off-topic, but I believe the gist of what wes is >> proposing from a technical point of view for libpandas: >> >> - the user facing pandas API will not change (except better perf / >> copy-on-write etc) >> - the back-end API should not change much either >> - c-API for the back-end. >> - allows swappable / agnostic numpy-like back-ends. >> - ideally libpandas won't rewrite a completely new dtype system, maybe >> could co-op datashape / pluribus for extensible dtypes > > > For the most part, I think these are good ideas, but I share many of > Stephan's concerns. 
I'd much rather we improve the array ecosystem in > general and, very specifically, I don't think new dtypes should be added to > pandas via libpandas. > > What I'd really like to see is for Wes and I to collaborate on *something* > that solves the dtype problem and can be shared across libraries. I think > Wes and I working together could result in potentially phenomenal things, > both for pandas and other projects. I believe that the DyND type system is > pretty close to a solution here, I think it could be spun out as an > independent data description system. If for some reason the DyND type system > is not sufficient, I'd *still* be happy to work together on a solution that > has nothing to do with DyND. > I am happy to collaborate and propagate requirements and ideas upstream. I absolutely think we should be doing the work necessary to make DyND a suitable optional backend for pandas right now. The libpandas refactoring effort will provide a TODO list of array backend requirements that should help with that. But: I'm not comfortable with pandas and DyND getting married, so to speak, right now. Once DyND gains more broad mindshare as a NumPy replacement, let's re-evaluate as a team and decide whether maintaining pandas's NumPy-based array backend is worth our time. That leaves us at a slight impasse about how to fix pandas's data type woes with NumPy as the internal data container. A lightweight "pass-through" logical type apparatus (which dispatches to NumPy or DyND or native pandas code, as needed) is the simplest way to do that. This is already the way that pandas works (with a hodgepodge of NumPy data type objects and pandas data type objects weakly proxying for logical types), but it will be much cleaner / better abstracted. It also has the benefit of both: - making array backends "swappable" and - hiding level level details of the array backend from the pandas user I see both of these points as justifications for the implementation approach. It will also help DyND "cut its teeth" on the pandas unit test suite and fill in feature gaps (and build a performance test suite, too), and when it's ready we can "flip the switch". The logical type abstraction and the choice of array backend are orthogonal issues for me. The details of NumPy that have "leaked" through to pandas have harmed its users, so independent of the DyND-backend discussion I feel that the cleaner abstraction will improve the library's accessibility and make its users more productive. To summarize this: it should be enough to "just learn pandas". I wish I'd done this originally, but early on it seemed better to cut a few corners and get the software shipped rather than taking more time to build abstractions. At that time I was "funding" the project out of my savings account. - Wes > Of course, I'm not a pandas developer. But, at the same time, I'm offering > to do free work here to help pandas. > >> If the above are met by a back-end, e.g. numpy, potentially DyND, then it >> a back-end should be allowed >> (certainly as an optional dep, whether its required or not can be a choice >> made down the road). >> >> I think during implementation, that wes will be congnizant of these >> points, and leave things as wide open as >> possible w/o going down the road we are currently in (where lots of >> different API's are intermixed). > > > If the above is true, that sounds great. Wes, I'd appreciate it if you left > opinions about Continuum funding DyND out of it -- we've both had our say > now. 
> > Irwin > From wesmckinn at gmail.com Tue Jan 12 20:42:07 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Tue, 12 Jan 2016 17:42:07 -0800 Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap In-Reply-To: References: Message-ID: On Tue, Jan 12, 2016 at 4:06 PM, Stephan Hoyer wrote: > I think I'm mostly on the same page as well. Five years has certainly been > too long. > > I agree that it would be premature to commit to using DyND in a binding way > in pandas. A lot seems to be up in the air with regard to dtypes in Python > right now (yes, particularly from projects sponsored by Continuum). > > So I would advocate for proceeding with the refactor for now (which will > have numerous other benefits), and see how the situation evolves. If it > seems like we are in a plausible position to unify the dtype system with a > tool like DyND, then let's seriously consider that down the road. Either > way, explicit interfaces (e.g., to_numpy(), to_dynd()) will help. > +1 -- I think our long term goal should be to have a common physical memory representation. If pandas internally stays slightly malleable (in a non-user-visible way) we can conform to a standard (presuming one develops) with less user-land disruption. If a standard does not develop we can just shrug our shoulders and do what's best for pandas. We'll have to think about how this will affect pandas's future C API (zero-copy interop guarantees): we might make the C API in the first release more clearly not-for-production use. Aside: There doesn't even seem to be consensus at the moment on missing data representation. Sentinels, for example, cause interoperability problems with ODBC / databases, and Apache ecosystem projects (e.g. HDFS file formats, Thrift, Spark, Kafka, etc.). If we build a C interface to Avro or Parquet in pandas right now we'll have to convert bitmasks to pandas's bespoke sentinels. To be clear, R has this problem too. I see good arguments for even nixing NaN in floating point arrays, as heretical as that might sound. Ironically I used to be in favor of sentinels but I realized it was an isolationist view. -W > On Mon, Jan 11, 2016 at 4:23 PM, Wes McKinney wrote: >> >> On Mon, Jan 11, 2016 at 3:04 PM, Jeff Reback wrote: >> > I am in favor of the Wes refactoring, but for some slightly different >> > reasons. >> > >> > I am including some in-line comments. >> > >> > On Mon, Jan 11, 2016 at 2:55 PM, Stephan Hoyer wrote: >> >>> >> >>> I don't see alternative ways for pandas to have a truly healthy >> >>> relationship with more general purpose array / scientific computing >> >>> libraries without being able to add new pandas functionality in a >> >>> clean way, and without requiring us to get patches accepted (and >> >>> released) in NumPy or DyND. >> >> >> >> >> >> Indeed, I think my disagreement is mostly about the order in which we >> >> approach these problems. >> > >> > >> > I agree here. I had started on *some* of this to enable swappable numpy >> > to >> > DyND to support IntNA (all in python, >> > but the fundamental change was to provide an API layer to the back-end). >> > >> >> >> >> >> >>> >> >>> Can you clarify what aspects of this plan are disagreeable / >> >>> contentious? >> >> >> >> >> >> See my comments below. >> >> >> >>> >> >>> Are you arguing for pandas becoming more of a companion >> >>> tool / user interface layer for NumPy or DyND? >> >> >> >> >> >> Not quite. Pandas has some fantastic and highly usable data structures (Series, >> >> DataFrame, Index).
These certainly don't belong in NumPy or DyND. >> >> >> >> However, the array-based ecosystem certainly could use improvements to >> >> dtypes (e.g., datetime and categorical) and dtype specific methods >> >> (e.g., >> >> for strings) just as much as pandas. I do firmly believe that pushing >> >> these >> >> types of improvements upstream, rather than implementing them >> >> independently >> >> for pandas, would yield benefits for the broader ecosystem. With the >> >> right >> >> infrastructure, generalizing things to arrays is not much more work. >> > >> > >> > I don't think Wes nor I disagree here at all. The problem was (and is), >> > the >> > pace of change in the underlying libraries. It is simply too slow >> > for pandas development efforts. >> > >> > I think the pandas efforts (and other libraries) can result in more >> > powerful >> > fundamental libraries >> > that get pushed upstream. However, it would not benefit ANYONE to slow >> > down >> > downstream efforts. I am not sure why you suggest that we WAIT for the >> > upstream libraries to change? We have been waiting forever for that. Now >> > we >> > have a concrete implementation of certain data types that are useful. >> > They >> > (upstream) can take >> > this and build on (or throw it away and make a better one or whatever). >> > But >> > I don't think it benefits anyone to WAIT for someone to change numpy >> > first. >> > Look at how long it took them to (partially) fix datetimes. >> > >> > xarray in particular has done the same thing to pandas, e.g. you have >> > added >> > additional selection operators and syntax (e.g. passing dicts of named >> > axes). These changes are in fact propagating to pandas. This has taken >> > time >> > (but much, much less than this took for any of pandas changes to numpy). >> > Further, look at how long you have advocated (correctly) for labeled >> > arrays >> > in numpy (for which we are still waiting). >> > >> >> >> >> >> >> I'd like to see pandas itself focus more on the data-structures and >> >> less >> >> on the data types. This would let us share more work with the "general >> >> purpose array / scientific computing libraries". >> >> >> > Pandas IS about specifying the correct data types. It is simply >> > incorrect to >> > decouple this problem from the data-structures. A lot of effort over the >> > years has gone into >> > making all dtypes play nice with each other and within pandas. >> > >> >>> >> >>> 1) Introduce a proper (from a software engineering perspective) >> >>> logical data type abstraction that models the way that pandas already >> >>> works, but cleaning up all the mess (implicit upcasts, lack of a real >> >>> "NA" scalar value, making pandas-specific methods like unique, >> >>> factorize, match, etc. true "array methods") >> >> >> >> >> >> New abstractions have a cost. A new logical data type abstraction is >> >> better than no proper abstraction at all, but (in principle), one data >> >> type >> >> abstraction should be enough to share. >> >> >> > >> >> >> >> A proper logical data type abstraction would be an improvement over the >> >> current situation, but if there's a way we could introduce one less >> >> abstraction (by improving things upstream in a general purpose array >> >> library) that would help even more. >> >> >> > >> > This is just pushing a problem upstream, which ultimately, given the >> > track >> > record of numpy, won't be solved at all. We will be here 1 year from >> > now >> > with the exact same discussion.
Why are we waiting on upstream for >> > anything? >> > As I said above, if something is created which upstream finds useful on >> > a >> > general level, great. The great cost here is time. >> > >> >> >> >> For example, we could imagine pushing to make DyND the new core for >> >> pandas. This could be enough of a push to make DyND generally useful -- >> >> I >> >> know it still has a few kinks to work out. >> >> >> > >> > maybe, but DyND has to have full compat with what currently is out there >> > (soonish). Then I agree this could be possible. But wouldn't it be even >> > better >> > for pandas to be able to swap back-ends? Why limit ourselves to a >> > particular >> > backend if it's not that difficult? >> > >> >> I think Jeff and I are on the same page here. 5 years ago we were >> having the *exact same* discussions around NumPy and adding new data >> type functionality. 5 years is a staggering amount of time in open >> source. It was less than 5 years between pandas not existing and being >> a super popular project with 2/3 of a best-selling O'Reilly book >> written about it. To wit, DyND exists in large part because of the >> difficulty in making progress within NumPy. >> >> Now, as 5 years ago, I think we should be acting in the best interests >> of pandas users, and what I've been describing is intended as a >> straightforward (though definitely labor intensive) and relatively >> low-risk plan that will "future-proof" the pandas user API for at >> least the next few years, and probably much longer. If we find that >> enabling some internals to use DyND is the right choice, we can do >> that in a non-invasive way while carefully minding data >> interoperability. Meaningful performance benefits would be a clear >> motivation. >> >> To be 100% open and transparent (in the spirit of pandas's new >> governance docs): Before committing to using DyND in any binding way >> (i.e. required, as opposed to opt-in) in pandas, I'd really like to >> see more evidence from 3rd parties without direct financial interest >> (i.e. employment or equity from Continuum) that DyND is "the future of >> Python array computing"; in the absence of significant user and >> community code contribution, it still feels like a political quagmire >> leftover from the Continuum-Enthought rift in 2011. >> >> - Wes >> >> >>> >> >>> 4) Give pandas objects a real C API so that users can manipulate and >> >>> create pandas objects with their own native (C/C++/Cython) code. >> >> >> >> >> >>> 5) Yes, absolutely improve NumPy and DyND and transition to improved >> >>> NumPy and DyND facilities as soon as they are available and shipped >> >> >> >> >> >> I like the sound of both of these. >> > >> > >> > >> > Further you made a point above >> > >> >> You are right that pandas has started to supplant numpy as a high level >> >> API for data analysis, but of course the robust (and often numpy based) >> >> Python ecosystem is part of what has made pandas so successful. In >> >> practice, >> >> ecosystem projects often want to work with more primitive objects than >> >> series/dataframes in their internal data structures and without numpy >> >> this >> >> becomes more difficult. For example, how do you concatenate a list of >> >> categoricals? If these were numpy arrays, we could use np.concatenate, >> >> but >> >> the current implementation of categorical would require a custom >> >> solution. >> >> First class compatibility with pandas is harder when pandas data >> >> cannot be >> >> used with a full ndarray API.
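The "custom solution" for concatenating categoricals mentioned above boils down to unioning the categories and recoding against the result -- a simplified sketch, not pandas's actual implementation:

import numpy as np
import pandas as pd

def concat_categoricals(pieces):
    # Union all of the categories, then recode each piece's values
    # against the combined set -- np.concatenate alone can't do this.
    categories = pd.Index(pd.unique(np.concatenate(
        [np.asarray(p.categories) for p in pieces])))
    codes = np.concatenate(
        [categories.get_indexer(np.asarray(p)) for p in pieces])
    return pd.Categorical.from_codes(codes, categories)

a = pd.Categorical(['x', 'y'])
b = pd.Categorical(['y', 'z'])
print(concat_categoricals([a, b]))  # [x, y, y, z], categories [x, y, z]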
>> > >> > I disagree entirely here. I think that Series/DataFrame ARE becoming >> > primitive objects. Look at seaborn, statsmodels, and xarray. These are >> > first >> > class users of these structures, who need the additional meta-data >> > attached. >> > >> > Yes, categoricals are useful in numpy, and they should support them. But >> > lots >> > of libraries can simply use pandas and do lots of really useful stuff. >> > However, why reinvent the wheel with numpy when you have DataFrames? >> > >> > From a user point of view, I don't think they even care about numpy (or >> > whatever drives pandas). It solves a very general problem of working >> > with >> > labeled data. >> > >> > Jeff > > From wesmckinn at gmail.com Wed Jan 13 16:16:07 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Wed, 13 Jan 2016 13:16:07 -0800 Subject: [Pandas-dev] Unit test reorganization In-Reply-To: References: Message-ID: OK, I got started with the biggest offender: https://github.com/pydata/pandas/pull/12032 It would be great to take the same approach with the other large test modules, with a special eye for quarantining "leaky" internals code and segregating NumPy interoperability contracts. I didn't completely do this with test_frame.py but it's a good start. There's definitely plenty of code in the other top level test modules which may nest under tests/frame or tests/series - Wes On Mon, Jan 11, 2016 at 8:47 AM, Wes McKinney wrote: > On Sun, Jan 10, 2016 at 6:06 PM, Stephan Hoyer wrote: >> On Fri, Jan 8, 2016 at 5:34 PM, Wes McKinney wrote: >>> >>> Big #1 question is, how strongly do you feel about *shipping* the test >>> suite in site-packages? Some other libraries with sprawling and >>> complex test suites have chosen not to ship them: >>> https://github.com/zzzeek/sqlalchemy >> >> >> I would prefer to include the test suite if possible, because the ability to >> type "nosetests pandas" makes it easy both for users to verify installations >> are working properly and for downstream distributors to identify and report >> bugs. The complete pandas test suite still runs in 20-30 minutes, so I think >> it's still fairly reasonable to use it for these purposes. >> > > Got it. I wasn't sure if this was something people still wanted to do > in practice with the burgeoning test suite. > >>> >>> Independently, I would support and help with starting a judicious >>> reorganization of the contents of pandas/tests. So I'm thinking like >>> >>> tests/ >>> dataframe/ >>> series/ >>> algorithms/ >>> internals/ >>> tseries/ >>> >>> and so forth. >> >> >> This sounds like a great idea -- these files have really gotten out of >> control! >> > > Sounds good. I've been sorting through points of contact between >> Series/DataFrame's implementation and internal matters (e.g. the >> BlockManager) and figured it would be good to "quarantine" code that >> makes assumptions about what's under the hood. I'll get the first >> couple patches started and it can be a slow burn to break apart these >> large files. >> >>> Cheers, >>> Stephan From wesmckinn at gmail.com Wed Jan 13 20:51:28 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Wed, 13 Jan 2016 17:51:28 -0800 Subject: [Pandas-dev] Unit test reorganization In-Reply-To: References: Message-ID: Another idea here I've been toying with to achieve better logical test organization is to place all tests in the whole project under pandas/tests.
This way we can centralize all the tests relating to some functional aspect of pandas in one place, rather than the status quo where test code tends to be fairly close to its implementation (but not always). A prime example of where I let this get disorganized early on is that time series functionality tests are somewhat scattered across pandas/tests, pandas/tseries, etc. This way we can also collect a single directory of "quarantined" pandas 0.x behavior that we are contemplating changing in a 1.0 release. Thoughts on this + other ideas on how to organize the tests so that refactoring and internal changes are easier to approach mentally? - Wes On Wed, Jan 13, 2016 at 1:16 PM, Wes McKinney wrote: > OK, I got started with the biggest offender: > > https://github.com/pydata/pandas/pull/12032 > > It would be great to take the same approach with the other large test > modules, with a special eye for quarantining "leaky" internals code > and segregating NumPy interoperability contracts. I didn't completely > do this with test_frame.py but it's a good start. > > There's definitely plenty of code in the other top level test modules > which may nest under tests/frame or tests/series > > - Wes > > On Mon, Jan 11, 2016 at 8:47 AM, Wes McKinney wrote: >> On Sun, Jan 10, 2016 at 6:06 PM, Stephan Hoyer wrote: >>> On Fri, Jan 8, 2016 at 5:34 PM, Wes McKinney wrote: >>>> >>>> Big #1 question is, how strongly do you feel about *shipping* the test >>>> suite in site-packages? Some other libraries with sprawling and >>>> complex test suites have chosen not to ship them: >>>> https://github.com/zzzeek/sqlalchemy >>> >>> >>> I would prefer to include the test suite if possible, because the ability to >>> type "nosetests pandas" makes it easy both for users to verify installations >>> are working properly and for downstream distributors to identify and report >>> bugs. The complete pandas test suite still runs in 20-30 minutes, so I think >>> it's still fairly reasonable to use it for these purposes. >>> >> >> Got it. I wasn't sure if this was something people still wanted to do >> in practice with the burgeoning test suite. >> >>>> >>>> Independently, I would support and help with starting a judicious >>>> reorganization of the contents of pandas/tests. So I'm thinking like >>>> >>>> tests/ >>>> dataframe/ >>>> series/ >>>> algorithms/ >>>> internals/ >>>> tseries/ >>>> >>>> and so forth. >>> >>> >>> This sounds like a great idea -- these files have really gotten out of >>> control! >>> >> >> Sounds good. I've been sorting through points of contact between >> Series/DataFrame's implementation and internal matters (e.g. the >> BlockManager) and figured it would be good to "quarantine" code that >> makes assumptions about what's under the hood. I'll get the first >> couple patches started and it can be a slow burn to break apart these >> large files.
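A sketch of what the "quarantine" split discussed above might look like in practice (hypothetical file layout; _data is the attribute that holds the BlockManager):

import numpy as np
import pandas as pd

# tests/frame/test_constructors.py (hypothetical): exercises only the
# public API, so it survives any rewrite of the internals.
def test_constructor_roundtrip():
    df = pd.DataFrame({'a': [1, 2, 3]})
    assert list(df['a']) == [1, 2, 3]

# tests/internals/test_blocks.py (hypothetical): deliberately "leaky" --
# it reaches into the BlockManager -- so it lives with the other tests
# that assume what's under the hood.
def test_homogeneous_frame_is_one_block():
    df = pd.DataFrame({'a': np.arange(3), 'b': np.arange(3)})
    assert len(df._data.blocks) == 1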
>> >>> Cheers, >>> Stephan From jeffreback at gmail.com Wed Jan 13 21:01:50 2016 From: jeffreback at gmail.com (Jeff Reback) Date: Wed, 13 Jan 2016 21:01:50 -0500 Subject: [Pandas-dev] Unit test reorganization In-Reply-To: References: Message-ID: so I agree +1 on moving all to pandas/tests - the indexing tests, which *mostly* are in test_indexing.py, though quite a few are in test_series/test_frame.py, should ideally be merged into a set of tests/indexing - io tests could be left alone I think - stats tests are *mostly* deprecated - since going to deprecate panel + nd soon, I think it makes sense to move these tests & code to pandas/deprecated, to keep separate - test_tslib.py should be integrated into tseries/test_timeseries.py - almost all of the Index tests are now in test_index (with each sub-class being somewhat generically tested), but the time-series ones are in tseries/test_base, so these could be merged as well. Jeff On Wed, Jan 13, 2016 at 8:51 PM, Wes McKinney wrote: > Another idea here I've been toying with to achieve better logical test > organization is to place all tests in the whole project under > pandas/tests. This way we can centralize all the tests relating to > some functional aspect of pandas in one place, rather than the status > quo where test code tends to be fairly close to its implementation > (but not always). A prime example of where I let this get disorganized > early on is that time series functionality tests are somewhat scattered > across pandas/tests, pandas/tseries, etc. This way we can also collect > a single directory of "quarantined" pandas 0.x behavior that we are > contemplating changing in a 1.0 release. > > Thoughts on this + other ideas on how to organize the tests so that > refactoring and internal changes are easier to approach mentally? > > - Wes > > On Wed, Jan 13, 2016 at 1:16 PM, Wes McKinney wrote: > > OK, I got started with the biggest offender: > > > > https://github.com/pydata/pandas/pull/12032 > > > > It would be great to take the same approach with the other large test > > modules, with a special eye for quarantining "leaky" internals code > > and segregating NumPy interoperability contracts. I didn't completely > > do this with test_frame.py but it's a good start. > > > > There's definitely plenty of code in the other top level test modules > > which may nest under tests/frame or tests/series > > > > - Wes > > > > On Mon, Jan 11, 2016 at 8:47 AM, Wes McKinney > wrote: > >> On Sun, Jan 10, 2016 at 6:06 PM, Stephan Hoyer > wrote: > >>> On Fri, Jan 8, 2016 at 5:34 PM, Wes McKinney > wrote: > >>>> > >>>> Big #1 question is, how strongly do you feel about *shipping* the test > >>>> suite in site-packages? Some other libraries with sprawling and > >>>> complex test suites have chosen not to ship them: > >>>> https://github.com/zzzeek/sqlalchemy > >>> > >>> > >>> I would prefer to include the test suite if possible, because the > ability to > >>> type "nosetests pandas" makes it easy both for users to verify > installations > >>> are working properly and for downstream distributors to identify and > report > >>> bugs. The complete pandas test suite still runs in 20-30 minutes, so I > think > >>> it's still fairly reasonable to use it for these purposes. > >>> > >> > >> Got it. I wasn't sure if this was something people still wanted to do > >> in practice with the burgeoning test suite. > >> > >>>> > >>>> Independently, I would support and help with starting a judicious > >>>> reorganization of the contents of pandas/tests.
So I'm thinking like > >>>> > >>>> tests/ > >>>> dataframe/ > >>>> series/ > >>>> algorithms/ > >>>> internals/ > >>>> tseries/ > >>>> > >>>> and so forth. > >>> > >>> > >>> This sounds like a great idea -- these files have really gotten out of > >>> control! > >>> > >> > >> Sounds good. I've been sorting through points of contact between > >> Series/DataFrame's implementation and internal matters (e.g. the > >> BlockManager) and figured it would be good to "quarantine" code that > >> makes assumptions about what's under the hood. I'll get the first > >> couple patches started and it can be a slow burn to break apart these > >> large files. > >> > >>> Cheers, > >>> Stephan > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wesmckinn at gmail.com Wed Jan 13 21:06:57 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Wed, 13 Jan 2016 18:06:57 -0800 Subject: [Pandas-dev] Unit test reorganization In-Reply-To: References: Message-ID: On Wed, Jan 13, 2016 at 6:01 PM, Jeff Reback wrote: > so I agree +1 on moving all to pandas/tests > > - the indexing tests, which *mostly* are in test_indexing.py, though quite a > few are in test_series/test_frame.py, should ideally be > merged into a set of tests/indexing > > - io tests could be left alone I think > Yeah, I think pandas/io/tests is the one definite exception where there isn't much benefit > - stats tests are *mostly* deprecated > > - since going to deprecate panel + nd soon, I think it makes sense to move > these tests & code to pandas/deprecated, to keep separate > > - test_tslib.py should be integrated into tseries/test_timeseries.py > > - almost all of the Index tests are now in test_index (with each sub-class being > somewhat generically tested), but the time-series ones > are in tseries/test_base, so these could be merged as well. > Yep, it specifically would be good to collect 100% of the index data structure machinery (including Datetime/Timedelta/PeriodIndex) in one place (same for axis indexing as you said, since it got pretty scattered) > > Jeff > > > > > On Wed, Jan 13, 2016 at 8:51 PM, Wes McKinney wrote: >> >> Another idea here I've been toying with to achieve better logical test >> organization is to place all tests in the whole project under >> pandas/tests. This way we can centralize all the tests relating to >> some functional aspect of pandas in one place, rather than the status >> quo where test code tends to be fairly close to its implementation >> (but not always). A prime example of where I let this get disorganized >> early on is that time series functionality tests are somewhat scattered >> across pandas/tests, pandas/tseries, etc. This way we can also collect >> a single directory of "quarantined" pandas 0.x behavior that we are >> contemplating changing in a 1.0 release. >> >> Thoughts on this + other ideas on how to organize the tests so that >> refactoring and internal changes are easier to approach mentally? >> >> - Wes >> >> On Wed, Jan 13, 2016 at 1:16 PM, Wes McKinney wrote: >> > OK, I got started with the biggest offender: >> > >> > https://github.com/pydata/pandas/pull/12032 >> > >> > It would be great to take the same approach with the other large test >> > modules, with a special eye for quarantining "leaky" internals code >> > and segregating NumPy interoperability contracts.
I didn't completely >> > do this with test_frame.py but it's a good start. >> > >> > There's definitely plenty of code in the other top level test modules >> > which may nest under tests/frame or tests/series >> > >> > - Wes >> > >> > On Mon, Jan 11, 2016 at 8:47 AM, Wes McKinney >> > wrote: >> >> On Sun, Jan 10, 2016 at 6:06 PM, Stephan Hoyer >> >> wrote: >> >>> On Fri, Jan 8, 2016 at 5:34 PM, Wes McKinney >> >>> wrote: >> >>>> >> >>>> Big #1 question is, how strongly do you feel about *shipping* the >> >>>> test >> >>>> suite in site-packages? Some other libraries with sprawling and >> >>>> complex test suites have chosen not to ship them: >> >>>> https://github.com/zzzeek/sqlalchemy >> >>> >> >>> >> >>> I would prefer to include the test suite if possible, because the >> >>> ability to >> >>> type "nosetests pandas" makes it easy both for users to verify >> >>> installations >> >>> are working properly and for downstream distributors to identify and >> >>> report >> >>> bugs. The complete pandas test suite still runs in 20-30 minutes, so I >> >>> think >> >>> it's still fairly reasonable to use it for these purposes. >> >>> >> >> >> >> Got it. I wasn't sure if this was something people still wanted to do >> >> in practice with the burgeoning test suite. >> >> >> >>>> >> >>>> Independently, I would support and help with starting a judicious >> >>>> reorganization of the contents of pandas/tests. So I'm thinking like >> >>>> >> >>>> tests/ >> >>>> dataframe/ >> >>>> series/ >> >>>> algorithms/ >> >>>> internals/ >> >>>> tseries/ >> >>>> >> >>>> and so forth. >> >>> >> >>> >> >>> This sounds like a great idea -- these files have really gotten out of >> >>> control! >> >>> >> >> >> >> Sounds good. I've been sorting through points of contact between >> >> Series/DataFrame's implementation and internal matters (e.g. the >> >> BlockManager) and figured it would be good to "quarantine" code that >> >> makes assumptions about what's under the hood. I'll get the first >> >> couple patches started and it can be a slow burn to break apart these >> >> large files. >> >> >> >>> Cheers, >> >>> Stephan >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev > > From yoh at onerussian.com Fri Jan 8 22:35:59 2016 From: yoh at onerussian.com (Yaroslav Halchenko) Date: Fri, 08 Jan 2016 22:35:59 -0500 Subject: [Pandas-dev] Unit test reorganization In-Reply-To: References: <5E86966B-F624-4ECB-AC72-2F9DCEBC7B14@gmail.com> Message-ID: <49059D46-865A-4D9B-85A9-8F0FF826E9B3@onerussian.com> I prefer and do ship all the tests for Debian packages wherever possible and not prohibitive. Do as you see fit; I will adjust for it. FWIW, in my code bases I started to place them under tests directories where tested files reside (as sklearn and others do), not overall top/tests. Much more manageable, and it makes it easier to test the submodules affected by changes. On January 8, 2016 9:04:13 PM EST, Wes McKinney wrote: >It looks like the debian packaging scripts would need to change. + >Yaroslav to see if this would be onerous > >On Fri, Jan 8, 2016 at 5:53 PM, Jeff Reback >wrote: >> no idea >> >>> On Jan 8, 2016, at 8:47 PM, Wes McKinney >wrote: >>> >>> + mailing list >>> >>> Do the distros run them _after_ installation? I'm talking about >>> installing the unit tests during `python setup.py install`, but >still >>> including them in the tarball.
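For reference, "shipping the test suite" comes down to listing the test packages in setup.py so they land in site-packages -- a trimmed sketch, not pandas's actual setup.py:

from setuptools import setup

setup(
    name='pandas',
    version='0.0.0.dev0',   # placeholder
    # Listing the tests/ subpackages here is what makes
    # `nosetests pandas` work against an installed copy; the sdist
    # tarball contents are governed separately by MANIFEST.in.
    packages=['pandas', 'pandas.core', 'pandas.io',
              'pandas.tests', 'pandas.io.tests'],
)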
>>> >>>> On Fri, Jan 8, 2016 at 5:43 PM, Jeff Reback >wrote: >>>> all for reorging into subdirs as these have grown pretty big >>>> >>>> what's the big deal with shipping the tests? >>>> >>>> I suspect some of the Linux distros do run them >>>> >>>> and just merged https://github.com/pydata/pandas/pull/11913 >>>> though we could configure a subset that ships, I suppose >>>> >>>> >>>>> On Jan 8, 2016, at 8:34 PM, Wes McKinney >wrote: >>>>> >>>>> hi folks, >>>>> >>>>> I have a few questions about the test suite. As context, I note >that >>>>> test_series.py is now 8200 lines and test_frame.py 17000 lines. >>>>> >>>>> Big #1 question is, how strongly do you feel about *shipping* the >test >>>>> suite in site-packages? Some other libraries with sprawling and >>>>> complex test suites have chosen not to ship them: >>>>> https://github.com/zzzeek/sqlalchemy >>>>> >>>>> Independently, I would support and help with starting a judicious >>>>> reorganization of the contents of pandas/tests. So I'm thinking >like >>>>> >>>>> tests/ >>>>> dataframe/ >>>>> series/ >>>>> algorithms/ >>>>> internals/ >>>>> tseries/ >>>>> >>>>> and so forth. >>>>> >>>>> Thoughts? >>>>> >>>>> - Wes >>>>> _______________________________________________ >>>>> Pandas-dev mailing list >>>>> Pandas-dev at python.org >>>>> https://mail.python.org/mailman/listinfo/pandas-dev -- Sent from a phone which beats iPhone. From mwwiebe at gmail.com Tue Jan 12 19:20:15 2016 From: mwwiebe at gmail.com (Mark Wiebe) Date: Tue, 12 Jan 2016 16:20:15 -0800 Subject: [Pandas-dev] DyND and pandas [was Rewriting some of internals of pandas in C/C++? / Roadmap] In-Reply-To: References: Message-ID: On Tue, Jan 12, 2016 at 3:20 PM, Irwin Zaid wrote: > > This discussion doesn't belong on this mailing list, but a couple of >> brief points. >> > > Wes, if you don't want this discussion on this mailing list then don't say > things like: "it still feels like a political quagmire leftover from the > Continuum-Enthought rift in 2011". My email reply to that was simply a > statement of facts, as this one will also be. > > I was approached by Travis and Peter about being a part of Continuum >> Analytics in late 2011. According to my e-mail records we were having >> these discussions at least as early as October 2011. The phrase "NumPy >> 2.0" was spoken in this epoch (referring to >> -the-project-now-known-as-DyND). So, I have quite a bit of first- and >> second-hand information from this time period, including many of the >> details of Mark's Enthought-sponsored NumPy development and the >> problems that occurred online and offline. >> > > The phrase "NumPy 2.0" means a number of things, and DyND was not one of > them. Yes, you have some first-hand knowledge, > but it's not relevant. Even IF it was, a lot of modern DyND also came from > my massive contribution before I joined Continuum. > > Mark will speak up here as well. > It's certainly true that the phrase "NumPy 2.0" was spoken a lot during the formation and early days of Continuum, but that's a term that was used commonly even before the NumPy 1.6 release. It has long been the vehicle for discussions about doing big refactoring and breaking changes in NumPy. The discussions you're referring to were about a mixture of two things: a NumPy 2.0 developed within the NumPy development process, and re-conceptualizing NumPy at a higher level towards abstractions that could be out of core, distributed, etc.
The former is represented by emails like https://mail.scipy.org/pipermail/numpy-discussion/2012-February/060623.html and work that Continuum sponsored within NumPy. The latter is what became branded as Blaze. DyND itself began life as "dynamicndarray," and was a place to experiment with some of the ideas I had about how the dtypes could be structured, how things could work as a C++ library. It was started after all my involvement with Enthought was completed and before Continuum began. It was completely independent of either company. It was not adopted as part of development at Continuum immediately, I did my best to present a solid case about how such a thing would fit into Blaze, and the decision to open source the code and include it as a component of the Blaze development was later made in one swoop. My hope during that time frame was that NumPy's internals could be refactored in a way that isolated them more from its interface, and then could begin a faster evolution without breaking that interface. I wanted NumPy to transition ever so slowly into C++. Even if all of that occurred, NumPy's evolution would have still been slow, and I knew that, so I saw DyND as a place to boldly try things, to really experiment with how a dynamic array programming library could look. We were particularly sensitive to avoiding a recreation of the numeric vs numarray schism, and DyND's Python bindings are separate from NumPy but interact naturally where we found a way to do it. The idea that DyND should have broad support from multiple companies is something I strongly agree with, and I think specifically that should extend to multiple industries. I believe the current development push led by Irwin is bringing it close to a threshold where it's possible for that to start happening, and developing it in close co-operation with Pandas would be amazing for both DyND and Pandas. I'm reading this thread mostly with hope that this possibility has a good chance of working, and a desire that any decisions are made with an accurate picture of what DyND is and aims to become. -Mark > > >> I applaud Continuum for using R&D budget to build something new and >> forward thinking that is also permissively licensed open source >> software. However, it is well known that open source projects driven >> by for-profit organizations can run into governance problems that >> place them in conflict with the community. Since DyND is a large >> project that I would not be comfortable forking (if that were required >> in the future), building an outside developer and user community is >> essential if pandas is to consider using it as a hard dependency in >> the future. >> >> The Apache Software Foundation exists for this reason and others, and >> if you wish to place a community-oriented and merit-based governance >> structure around DyND to assist with its incubation, the ASF may be >> worth pursuing. NumFOCUS provides a fiscal sponsorship apparatus but >> does not really address the governance questions. Whether or not the >> governance issues are real doesn't really matter; it's about setting >> people's minds at ease. >> > > Okay, let me state again: The majority of DyND's contributions (as net > from Mark, myself, and Ian) came without Continuum funding. Just because > Continuum is funding DyND now does not make it a "Continuum project", > whatever this means. > > Some of your other points are valid, and we'll address them as best we can > as time goes on. DyND clearly needs a community, but it's a chicken-and-egg > problem. 
If you try and build something hard, it takes time and users come > when things work. > > The issue of refactoring Pandas is a different one that I'll add comments > to in another email. > > Irwin > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wesmckinn at gmail.com Fri Jan 15 10:22:19 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Fri, 15 Jan 2016 07:22:19 -0800 Subject: [Pandas-dev] Reorganizing the megamodules Message-ID: As part of improving our code organization, I'd like to look at splitting up modules exceeding 3000 lines into subpackages. Obvious targets are core/frame.py core/generic.py core/index.py core/series.py For the "big" classes like Series and DataFrame, this amounts mainly to having a common pattern for adding new instance methods that aren't nested under the main class: header (or in one of their subclasses). Thoughts? - Wes From wesmckinn at gmail.com Fri Jan 15 10:52:57 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Fri, 15 Jan 2016 07:52:57 -0800 Subject: [Pandas-dev] Reorganizing the megamodules In-Reply-To: References: Message-ID: I was thinking we could promote all of the index-related code to pandas/index/ and same for pandas/core/frame.py -> pandas/frame/ and so forth. We'd have to keep around the old files for pickles, but perhaps we can do a "pickle cleanup" (remove all sources of pickle backward compatibility) with 1.0. On Fri, Jan 15, 2016 at 7:32 AM, Jeff Reback wrote: > Index is also fairly straightforward to do this > > eg. we already have sub-class based modules for > ``DatetimeIndex,TimedeltaIndex,PeriodIndex``. > > only caveat is have to have a ``Base`` type so imports are crazy. But sure > > for dir: ``pandas/core/index`` > ``categorical``, ``numeric``, ``multi`` are prob candidates. > > On Fri, Jan 15, 2016 at 10:22 AM, Wes McKinney wrote: >> >> As part of improving our code organization, I'd like to look at >> splitting up modules exceeding 3000 lines into subpackages. Obvious >> targets are >> >> core/frame.py >> core/generic.py >> core/index.py >> core/series.py >> >> For the "big" classes like Series and DataFrame, this amounts mainly >> to having a common pattern for adding new instance methods that aren't >> nested under the main class: header (or in one of their subclasses). >> >> Thoughts? >> >> - Wes >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev > > From jorisvandenbossche at gmail.com Fri Jan 15 10:58:43 2016 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Fri, 15 Jan 2016 16:58:43 +0100 Subject: [Pandas-dev] Reorganizing the megamodules In-Reply-To: References: Message-ID: 2016-01-15 16:22 GMT+01:00 Wes McKinney : > As part of improving our code organization, I'd like to look at > splitting up modules exceeding 3000 lines into subpackages. Obvious > targets are > > core/frame.py > core/generic.py > core/index.py > core/series.py > > For the "big" classes like Series and DataFrame, this amounts mainly > to having a common pattern for adding new instance methods that aren't > nested under the main class: header (or in one of their subclasses). > > Thoughts? > How would you like to split up eg frame.py? As the majority of that file consists of the DataFrame class definition. 
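One pattern that makes this kind of split workable -- and which Wes describes in his reply below -- relies on methods being plain attributes of the class object, so each functional group can live in its own module and be attached afterwards (hypothetical layout, not pandas's actual structure):

# In a hypothetical pandas/frame/io.py -- one functional group of
# methods, defined as plain functions taking `self`:
def frame_to_csv(self, path_or_buf, **kwargs):
    """Write the frame to CSV (implementation elided)."""
    raise NotImplementedError

# In a hypothetical pandas/frame/core.py -- the much smaller class
# definition, which attaches the groups at import time:
class DataFrame(object):
    pass

DataFrame.to_csv = frame_to_csv   # methods are just class attributes

df = DataFrame()
assert callable(df.to_csv)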
> > - Wes > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wesmckinn at gmail.com Fri Jan 15 11:26:01 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Fri, 15 Jan 2016 08:26:01 -0800 Subject: [Pandas-dev] Reorganizing the megamodules In-Reply-To: References: Message-ID: On Fri, Jan 15, 2016 at 7:58 AM, Joris Van den Bossche wrote: > 2016-01-15 16:22 GMT+01:00 Wes McKinney : >> >> As part of improving our code organization, I'd like to look at >> splitting up modules exceeding 3000 lines into subpackages. Obvious >> targets are >> >> core/frame.py >> core/generic.py >> core/index.py >> core/series.py >> >> For the "big" classes like Series and DataFrame, this amounts mainly >> to having a common pattern for adding new instance methods that aren't >> nested under the main class: header (or in one of their subclasses). >> >> Thoughts? > > > How would you like to split up eg frame.py? As the majority of that file > consists of the DataFrame class definition. Into modules containing groups of functionally-related methods (for example: all IO methods together). Class methods are just attributes of the class object (which can be assigned elsewhere), so they don't need to be in the same module as the class definition. > > >> >> >> - Wes >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev > > > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > From wesmckinn at gmail.com Sat Jan 16 18:20:17 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Sat, 16 Jan 2016 15:20:17 -0800 Subject: [Pandas-dev] Thoughts on adopting a 1-PR-1-commit policy? Message-ID: I've grown very fond of the PR cherry-picking style used in many Apache projects. Here's an example of a very large commit to Apache Spark that was performed in this fashion: https://github.com/apache/spark/commit/2fe0a1aaeebbf7f60bd4130847d738c29f1e3d53#diff-e1e1d3d40573127e9ee0480caf1283d6 If you compare pandas's commit history with a project like this, you'll see it is much easier to follow because there is one commit for each patch to the project, rather than a merge commit plus 1 or more merged commits (depending on whether the person merging the PR did an interactive rebase). The script to do this is not too complex, and is even less complex for pandas because we do not use JIRA: https://github.com/apache/spark/blob/master/dev/merge_spark_pr.py I've been using a pared down version of the script in Ibis: https://github.com/cloudera/ibis/blob/master/dev/merge-pr.py Here is an example of what a merge commit with multiple subcommits looks like using this tool: https://github.com/cloudera/ibis/commit/eafabe060dcaaea0a6076342eaa374929b91cf47 It's pretty easy to use: run the script and enter the PR # you are merging. It automatically squashes and closes the merged PR. Let me know if this is something that would interest the team. I know there are varying opinions on the GitHub Green Button =) - Wes From wesmckinn at gmail.com Sat Jan 16 18:46:51 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Sat, 16 Jan 2016 15:46:51 -0800 Subject: [Pandas-dev] Thoughts on adopting a 1-PR-1-commit policy? 
In-Reply-To: References: Message-ID: Copying the mailing list. Indeed makes rebasing unnecessary if there are no cherry-pick conflicts. On Sat, Jan 16, 2016 at 3:34 PM, Tom Augspurger wrote: > Rebasing can be tough for new contributors, so for that alone I'd say let's try it. > > -Tom > >> On Jan 16, 2016, at 5:20 PM, Wes McKinney wrote: >> >> I've grown very fond of the PR cherry-picking style used in many >> Apache projects. >> >> Here's an example of a very large commit to Apache Spark that was >> performed in this fashion: >> >> https://github.com/apache/spark/commit/2fe0a1aaeebbf7f60bd4130847d738c29f1e3d53#diff-e1e1d3d40573127e9ee0480caf1283d6 >> >> If you compare pandas's commit history with a project like this, >> you'll see it is much easier to follow because there is one commit for >> each patch to the project, rather than a merge commit plus 1 or more >> merged commits (depending on whether the person merging the PR did an >> interactive rebase). >> >> The script to do this is not too complex, and is even less complex for >> pandas because we do not use JIRA: >> >> https://github.com/apache/spark/blob/master/dev/merge_spark_pr.py >> >> I've been using a pared down version of the script in Ibis: >> >> https://github.com/cloudera/ibis/blob/master/dev/merge-pr.py >> >> Here is an example of what a merge commit with multiple subcommits >> looks like using this tool: >> >> https://github.com/cloudera/ibis/commit/eafabe060dcaaea0a6076342eaa374929b91cf47 >> >> It's pretty easy to use: run the script and enter the PR # you are >> merging. It automatically squashes and closes the merged PR. >> >> Let me know if this is something that would interest the team. I know >> there are varying opinions on the GitHub Green Button =) >> >> - Wes >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev From jeffreback at gmail.com Sat Jan 16 23:11:55 2016 From: jeffreback at gmail.com (Jeff Reback) Date: Sat, 16 Jan 2016 23:11:55 -0500 Subject: [Pandas-dev] Thoughts on adopting a 1-PR-1-commit policy? In-Reply-To: References: Message-ID: <6D657DD5-1A6C-4D74-9B42-E603AE27B675@gmail.com> seems find to use the cherry picking script though I don't think should relax users from squashing > On Jan 16, 2016, at 6:46 PM, Wes McKinney wrote: > > Copying the mailing list. Indeed makes rebasing unnecessary if there > are no cherry-pick conflicts. > > On Sat, Jan 16, 2016 at 3:34 PM, Tom Augspurger > wrote: >> Rebasing can be tough for new contributors, so for that alone I'd say let's try it. >> >> -Tom >> >>> On Jan 16, 2016, at 5:20 PM, Wes McKinney wrote: >>> >>> I've grown very fond of the PR cherry-picking style used in many >>> Apache projects. >>> >>> Here's an example of a very large commit to Apache Spark that was >>> performed in this fashion: >>> >>> https://github.com/apache/spark/commit/2fe0a1aaeebbf7f60bd4130847d738c29f1e3d53#diff-e1e1d3d40573127e9ee0480caf1283d6 >>> >>> If you compare pandas's commit history with a project like this, >>> you'll see it is much easier to follow because there is one commit for >>> each patch to the project, rather than a merge commit plus 1 or more >>> merged commits (depending on whether the person merging the PR did an >>> interactive rebase). 
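The mechanics of such a merge script reduce to a handful of git operations. A pared-down sketch (illustration only; the linked Spark and Ibis scripts are the real tools and also handle author attribution, commit message assembly, and error recovery):

import subprocess

def run(*cmd):
    # Run a git command, raising on failure.
    subprocess.check_call(cmd)

def merge_pr(pr_num, upstream='upstream', target='master'):
    branch = 'PR_TOOL_MERGE_PR_%d' % pr_num
    # GitHub exposes every pull request as a fetchable ref.
    run('git', 'fetch', upstream, 'pull/%d/head:%s' % (pr_num, branch))
    run('git', 'checkout', target)
    # Squash the PR into a single staged change on the target branch,
    # then commit it with a message assembled from the PR title,
    # description, and original commit hashes.
    run('git', 'merge', '--squash', branch)
    run('git', 'commit', '-m', 'Squashed commit for PR #%d' % pr_num)
    run('git', 'branch', '-D', branch)

if __name__ == '__main__':
    merge_pr(int(input('Which pull request would you like to merge? ')))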
>> >> The script to do this is not too complex, and is even less complex for >> pandas because we do not use JIRA: >> >> https://github.com/apache/spark/blob/master/dev/merge_spark_pr.py >> >> I've been using a pared down version of the script in Ibis: >> >> https://github.com/cloudera/ibis/blob/master/dev/merge-pr.py >> >> Here is an example of what a merge commit with multiple subcommits >> looks like using this tool: >> >> https://github.com/cloudera/ibis/commit/eafabe060dcaaea0a6076342eaa374929b91cf47 >> >> It's pretty easy to use: run the script and enter the PR # you are >> merging. It automatically squashes and closes the merged PR. >> >> Let me know if this is something that would interest the team. I know >> there are varying opinions on the GitHub Green Button =) >> >> - Wes >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev From jeffreback at gmail.com Sat Jan 16 23:11:55 2016 From: jeffreback at gmail.com (Jeff Reback) Date: Sat, 16 Jan 2016 23:11:55 -0500 Subject: [Pandas-dev] Thoughts on adopting a 1-PR-1-commit policy? In-Reply-To: References: Message-ID: <6D657DD5-1A6C-4D74-9B42-E603AE27B675@gmail.com> seems fine to use the cherry picking script though I don't think it should relax users from squashing > On Jan 16, 2016, at 6:46 PM, Wes McKinney wrote: > > Copying the mailing list. Indeed makes rebasing unnecessary if there > are no cherry-pick conflicts. > > On Sat, Jan 16, 2016 at 3:34 PM, Tom Augspurger > wrote: >> Rebasing can be tough for new contributors, so for that alone I'd say let's try it. >> >> -Tom >> >>> On Jan 16, 2016, at 5:20 PM, Wes McKinney wrote: >>> >>> I've grown very fond of the PR cherry-picking style used in many >>> Apache projects. >>> >>> Here's an example of a very large commit to Apache Spark that was >>> performed in this fashion: >>> >>> https://github.com/apache/spark/commit/2fe0a1aaeebbf7f60bd4130847d738c29f1e3d53#diff-e1e1d3d40573127e9ee0480caf1283d6 >>> >>> If you compare pandas's commit history with a project like this, >>> you'll see it is much easier to follow because there is one commit for >>> each patch to the project, rather than a merge commit plus 1 or more >>> merged commits (depending on whether the person merging the PR did an >>> interactive rebase).
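For reference, the squashed commit message such a tool produces looks roughly like the following -- a made-up example; the Ibis commit linked above is a real one:

    ENH: add widget frobnication

    Longer description carried over from the pull request body.

    Author: A. Contributor <contributor@example.com>

    Closes #NNNN from a-contributor/frobnication and squashes the
    following commits:

      abc1234 [A. Contributor] address review comments
      def5678 [A. Contributor] add frobnication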
I know >>>> there are varying opinions on the GitHub Green Button =) >>>> >>>> - Wes >>>> _______________________________________________ >>>> Pandas-dev mailing list >>>> Pandas-dev at python.org >>>> https://mail.python.org/mailman/listinfo/pandas-dev >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev From wesmckinn at gmail.com Sun Jan 17 08:34:07 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Sun, 17 Jan 2016 05:34:07 -0800 Subject: [Pandas-dev] Thoughts on adopting a 1-PR-1-commit policy? In-Reply-To: References: <6D657DD5-1A6C-4D74-9B42-E603AE27B675@gmail.com> Message-ID: hey Jan, I'm adding the mailing list back. Several comments inline On Sun, Jan 17, 2016 at 1:09 AM, Jan Schulz wrote: > Hi, > > Just a different opinion: I like having commits do one logical thing > and not squash multiple "logical complete"things together (this means > that commit is a logical step, not the PR and that the commits should > be clean and not contain "fixup", "typo" style commits). During the > categorical work, I found that a few times I regretted that I couldn't > go back to look up the specific change in a commit and look up what > and why that commit was done because it was all mixed up with the rest > of the squashed commits in that PR. > I'm not suggesting that you should have to squash your commits inside the PR. This only concerns how *patches are applied to pandas's master branch". Ideally, no squashing occurs inside the developer branch (so the "story", so to speak, about the patch is preserved), but what the Apache patch tool does is - Turns a multi-commit PR into a single-commit patch - Puts the individual commit hashes in the commit message; so you can always visit the original commits - Puts the description from the PR into the commit message - Cherry-picks instead of merging, so you can observe evolution of pandas/master in a clear and linear way > My feeling was always that squashes are performed because the rebases > are so hard because of multiple PRs fixing stuff in the same files at > the same time. If this refactorings come through (both better > separation of backend-frontend specific code and the suggested split > up of frame.py, etc), I think this is not so much a problem anymore. > > I'm not sure where the commit history with merges is a problem: during > balme (in github, never used git itself), I don't see any merges?! > Here is something that would be very hard: create release notes given the commit history. > So my suggestion would go in a different direction: better commits in > the PRs with proper commit messages (not only the headline but also > explanations in the message. > > Jan > -- > Jan Schulz > mail: jasc at gmx.net > web: http://www.katzien.de > > > On 17 January 2016 at 07:29, Wes McKinney wrote: >> The script performs the squashing automatically, so users can squash >> manually if they wish or let us do it. >> >> On Sat, Jan 16, 2016 at 8:11 PM, Jeff Reback wrote: >>> seems find to use the cherry picking script >>> >>> though I don't think should relax users from squashing >>> >>>> On Jan 16, 2016, at 6:46 PM, Wes McKinney wrote: >>>> >>>> Copying the mailing list. Indeed makes rebasing unnecessary if there >>>> are no cherry-pick conflicts. >>>> >>>> On Sat, Jan 16, 2016 at 3:34 PM, Tom Augspurger >>>> wrote: >>>>> Rebasing can be tough for new contributors, so for that alone I'd say let's try it. 
>>>>> >>>>> -Tom >>>>> >>>>>> On Jan 16, 2016, at 5:20 PM, Wes McKinney wrote: >>>>>> >>>>>> I've grown very fond of the PR cherry-picking style used in many >>>>>> Apache projects. >>>>>> >>>>>> Here's an example of a very large commit to Apache Spark that was >>>>>> performed in this fashion: >>>>>> >>>>>> https://github.com/apache/spark/commit/2fe0a1aaeebbf7f60bd4130847d738c29f1e3d53#diff-e1e1d3d40573127e9ee0480caf1283d6 >>>>>> >>>>>> If you compare pandas's commit history with a project like this, >>>>>> you'll see it is much easier to follow because there is one commit for >>>>>> each patch to the project, rather than a merge commit plus 1 or more >>>>>> merged commits (depending on whether the person merging the PR did an >>>>>> interactive rebase). >>>>>> >>>>>> The script to do this is not too complex, and is even less complex for >>>>>> pandas because we do not use JIRA: >>>>>> >>>>>> https://github.com/apache/spark/blob/master/dev/merge_spark_pr.py >>>>>> >>>>>> I've been using a pared down version of the script in Ibis: >>>>>> >>>>>> https://github.com/cloudera/ibis/blob/master/dev/merge-pr.py >>>>>> >>>>>> Here is an example of what a merge commit with multiple subcommits >>>>>> looks like using this tool: >>>>>> >>>>>> https://github.com/cloudera/ibis/commit/eafabe060dcaaea0a6076342eaa374929b91cf47 >>>>>> >>>>>> It's pretty easy to use: run the script and enter the PR # you are >>>>>> merging. It automatically squashes and closes the merged PR. >>>>>> >>>>>> Let me know if this is something that would interest the team. I know >>>>>> there are varying opinions on the GitHub Green Button =) >>>>>> >>>>>> - Wes >>>>>> _______________________________________________ >>>>>> Pandas-dev mailing list >>>>>> Pandas-dev at python.org >>>>>> https://mail.python.org/mailman/listinfo/pandas-dev >>>> _______________________________________________ >>>> Pandas-dev mailing list >>>> Pandas-dev at python.org >>>> https://mail.python.org/mailman/listinfo/pandas-dev >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev From jeffreback at gmail.com Sun Jan 17 11:56:48 2016 From: jeffreback at gmail.com (Jeff Reback) Date: Sun, 17 Jan 2016 11:56:48 -0500 Subject: [Pandas-dev] Thoughts on adopting a 1-PR-1-commit policy? In-Reply-To: References: <6D657DD5-1A6C-4D74-9B42-E603AE27B675@gmail.com> Message-ID: ok this is now implemented, ./merge-pr.py will merge things via cherry-picking. Though still think that most users should have a clean PR going in. This will be more useful for bigger patches where you do want to preserve the history. On Sun, Jan 17, 2016 at 8:34 AM, Wes McKinney wrote: > hey Jan, > > I'm adding the mailing list back. Several comments inline > > On Sun, Jan 17, 2016 at 1:09 AM, Jan Schulz wrote: > > Hi, > > > > Just a different opinion: I like having commits do one logical thing > > and not squash multiple "logical complete"things together (this means > > that commit is a logical step, not the PR and that the commits should > > be clean and not contain "fixup", "typo" style commits). During the > > categorical work, I found that a few times I regretted that I couldn't > > go back to look up the specific change in a commit and look up what > > and why that commit was done because it was all mixed up with the rest > > of the squashed commits in that PR. > > > > I'm not suggesting that you should have to squash your commits inside > the PR. 
This only concerns how *patches are applied to pandas's master > branch". Ideally, no squashing occurs inside the developer branch (so > the "story", so to speak, about the patch is preserved), but what the > Apache patch tool does is > > - Turns a multi-commit PR into a single-commit patch > - Puts the individual commit hashes in the commit message; so you can > always visit the original commits > - Puts the description from the PR into the commit message > - Cherry-picks instead of merging, so you can observe evolution of > pandas/master in a clear and linear way > > > My feeling was always that squashes are performed because the rebases > > are so hard because of multiple PRs fixing stuff in the same files at > > the same time. If this refactorings come through (both better > > separation of backend-frontend specific code and the suggested split > > up of frame.py, etc), I think this is not so much a problem anymore. > > > > I'm not sure where the commit history with merges is a problem: during > > balme (in github, never used git itself), I don't see any merges?! > > > > Here is something that would be very hard: create release notes given > the commit history. > > > So my suggestion would go in a different direction: better commits in > > the PRs with proper commit messages (not only the headline but also > > explanations in the message. > > > > Jan > > -- > > Jan Schulz > > mail: jasc at gmx.net > > web: http://www.katzien.de > > > > > > On 17 January 2016 at 07:29, Wes McKinney wrote: > >> The script performs the squashing automatically, so users can squash > >> manually if they wish or let us do it. > >> > >> On Sat, Jan 16, 2016 at 8:11 PM, Jeff Reback > wrote: > >>> seems find to use the cherry picking script > >>> > >>> though I don't think should relax users from squashing > >>> > >>>> On Jan 16, 2016, at 6:46 PM, Wes McKinney > wrote: > >>>> > >>>> Copying the mailing list. Indeed makes rebasing unnecessary if there > >>>> are no cherry-pick conflicts. > >>>> > >>>> On Sat, Jan 16, 2016 at 3:34 PM, Tom Augspurger > >>>> wrote: > >>>>> Rebasing can be tough for new contributors, so for that alone I'd > say let's try it. > >>>>> > >>>>> -Tom > >>>>> > >>>>>> On Jan 16, 2016, at 5:20 PM, Wes McKinney > wrote: > >>>>>> > >>>>>> I've grown very fond of the PR cherry-picking style used in many > >>>>>> Apache projects. > >>>>>> > >>>>>> Here's an example of a very large commit to Apache Spark that was > >>>>>> performed in this fashion: > >>>>>> > >>>>>> > https://github.com/apache/spark/commit/2fe0a1aaeebbf7f60bd4130847d738c29f1e3d53#diff-e1e1d3d40573127e9ee0480caf1283d6 > >>>>>> > >>>>>> If you compare pandas's commit history with a project like this, > >>>>>> you'll see it is much easier to follow because there is one commit > for > >>>>>> each patch to the project, rather than a merge commit plus 1 or more > >>>>>> merged commits (depending on whether the person merging the PR did > an > >>>>>> interactive rebase). 
From jasc at gmx.net Sun Jan 17 12:04:37 2016
From: jasc at gmx.net (Jan Schulz)
Date: Sun, 17 Jan 2016 18:04:37 +0100
Subject: [Pandas-dev] Thoughts on adopting a 1-PR-1-commit policy?
In-Reply-To:
References: <6D657DD5-1A6C-4D74-9B42-E603AE27B675@gmail.com>
Message-ID:

Hi,

On 17 January 2016 at 14:34, Wes McKinney wrote:
> I'm adding the mailing list back. Several comments inline

Oops, sorry!

> I'm not suggesting that you should have to squash your commits inside the PR. This only concerns how *patches are applied to pandas's master branch*. Ideally, no squashing occurs inside the developer branch (so the "story", so to speak, about the patch is preserved), but what the Apache patch tool does is

I would still argue against this: the master branch is what is used in blame, and figuring out why something was done in that way is much harder if you always have to get back to some commits in obscure branches which might even be removed from the repo.

IMO, all this squashing is an incentive not to write good commit messages, as these are in the end more or less discarded as they are all mixed up :-(

At least that's what happened with me: I tried to write "unit of change" commits (`rebase -i` all "typo" and "fixup" commits + good commit messages), but then these got squashed and I stopped writing such messages because it felt that this was simply wasted and not appreciated.

The result was that some decisions which I took are not explained in the commits and were lost when the topic was revisited half a year later.

> Here is something that would be very hard: create release notes given the commit history.

I don't think this gets any easier as long as some manual things are done. There will still be simple commits which correct a typo and which should not show up in the release notes.
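To make the first option below concrete, such a generator could be as simple as this rough sketch (illustrative only; it assumes one merge commit per PR and that the merge message body carries the PR description):

    import subprocess

    def release_notes(prev_release, head="HEAD"):
        # one NUL-separated message body per merge commit in the range
        raw = subprocess.check_output([
            "git", "log", "--merges", "--format=%b%x00",
            "%s..%s" % (prev_release, head),
        ]).decode()
        bodies = [b.strip() for b in raw.split("\x00") if b.strip()]
        # the first line of each merge body becomes one bullet
        return "\n".join("* " + b.splitlines()[0] for b in bodies)

    print(release_notes("v0.17.1"))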
Whatever technical solution is used for this, some discipline has to be maintained to make that successful.

If release notes generation is what this is all about, then there could be several solutions:

* only generate release notes from merge commits (remove headline, only use message)
* only generate release notes from commits which include a tag
* only generate release notes from merge commits which include a tag

Jan

From shoyer at gmail.com Sun Jan 17 13:34:27 2016
From: shoyer at gmail.com (Stephan Hoyer)
Date: Sun, 17 Jan 2016 10:34:27 -0800 (PST)
Subject: [Pandas-dev] Thoughts on adopting a 1-PR-1-commit policy?
In-Reply-To:
References:
Message-ID: <1453055667593.2ea32751@Nodemailer>

I actually have a soft spot for the Green Button, although I'm rarely the one hitting merge these days. In particular, I like that it preserves the identities of individual patch authors who contributed to a big change, and assures they all get credit on github.

On Saturday, Jan 16, 2016 at 3:21 PM, Wes McKinney wrote:
> Let me know if this is something that would interest the team. I know there are varying opinions on the GitHub Green Button =)

From wesmckinn at gmail.com Sun Jan 17 13:37:02 2016
From: wesmckinn at gmail.com (Wes McKinney)
Date: Sun, 17 Jan 2016 10:37:02 -0800
Subject: [Pandas-dev] Thoughts on adopting a 1-PR-1-commit policy?
In-Reply-To:
References: <6D657DD5-1A6C-4D74-9B42-E603AE27B675@gmail.com>
Message-ID:

On Sun, Jan 17, 2016 at 9:04 AM, Jan Schulz wrote:
> Hi,
>
> On 17 January 2016 at 14:34, Wes McKinney wrote:
>> I'm adding the mailing list back. Several comments inline
>
> Oops, sorry!
>
>> I'm not suggesting that you should have to squash your commits inside the PR. This only concerns how *patches are applied to pandas's master branch*. Ideally, no squashing occurs inside the developer branch (so the "story", so to speak, about the patch is preserved), but what the Apache patch tool does is
>
> I would still argue against this: the master branch is what is used in blame, and figuring out why something was done in that way is much harder if you always have to get back to some commits in obscure branches which might even be removed from the repo.

These issues should be addressed during the code review process. It is worse, in my opinion, to have a mix of intermediate (possibly broken) and verified commits in master as opposed to atomic, verified commits.

Additionally, lack of "atomicity" with patches has more issues:

- Difficult to revert patches
- Difficult to port patches into maintenance branches

Requiring patches to be atomic is common practice in large software teams because otherwise codebase maintenance is a nightmare with > 5-10 developers working in parallel.

> IMO, all this squashing is an incentive not to write good commit messages, as these are in the end more or less discarded as they are all mixed up :-(

Squash or no squash, the only way to have good commit messages is to expect a certain level of professionalism from pandas contributors. If the commit / PR description is inadequate, this is the responsibility of the code reviewer to address with the developer proposing the patch.

> At least that's what happened with me: I tried to write "unit of change" commits (`rebase -i` all "typo" and "fixup" commits + good commit messages), but then these got squashed and I stopped writing such messages because it felt that this was simply wasted and not appreciated.
> The result was that some decisions which I took are not explained in the commits and were lost when the topic was revisited half a year later.
>
>> Here is something that would be very hard: create release notes given the commit history.
>
> I don't think this gets any easier as long as some manual things are done. There will still be simple commits which correct a typo and which should not show up in the release notes. Whatever technical solution is used for this, some discipline has to be maintained to make that successful.
>
> If release notes generation is what this is all about, then there could be several solutions:
>
> * only generate release notes from merge commits (remove headline, only use message)
> * only generate release notes from commits which include a tag
> * only generate release notes from merge commits which include a tag

No, the release notes aren't a major factor, but one of many issues caused by a non-atomic, non-linear commit history.

> Jan
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev

From wesmckinn at gmail.com Sun Jan 17 13:39:23 2016
From: wesmckinn at gmail.com (Wes McKinney)
Date: Sun, 17 Jan 2016 10:39:23 -0800
Subject: [Pandas-dev] Thoughts on adopting a 1-PR-1-commit policy?
In-Reply-To: <1453055667593.2ea32751@Nodemailer>
References: <1453055667593.2ea32751@Nodemailer>
Message-ID:

On Sun, Jan 17, 2016 at 10:34 AM, Stephan Hoyer wrote:
> I actually have a soft spot for the Green Button, although I'm rarely the one hitting merge these days.
> In particular, I like that it preserves the identities of individual patch authors who contributed to a big change, and assures they all get credit on github.

The patch tool does not occlude the patch author identity on GitHub (i.e. the patches will show up on the user profile just the same).

> On Saturday, Jan 16, 2016 at 3:21 PM, Wes McKinney wrote:
>> Let me know if this is something that would interest the team. I know there are varying opinions on the GitHub Green Button =)

From shoyer at gmail.com Sun Jan 17 13:46:12 2016
From: shoyer at gmail.com (Stephan Hoyer)
Date: Sun, 17 Jan 2016 10:46:12 -0800 (PST)
Subject: [Pandas-dev] Thoughts on adopting a 1-PR-1-commit policy?
In-Reply-To:
References:
Message-ID: <1453056371875.b7c59591@Nodemailer>

On Sunday, Jan 17, 2016 at 10:40 AM, Wes McKinney wrote:
> The patch tool does not occlude the patch author identity on GitHub (i.e. the patches will show up on the user profile just the same).

Yes, but not on the "contributors" page for the github project itself.

That said, I agree that atomic commits are useful for large projects. This is part of why we encourage/require squashing. I'm not entirely sure that pandas has enough synchronous development that this is necessary.

If this helps maintenance branches, it would certainly be a win -- we haven't been very good about maintaining bug-fix-only branches, which would be a healthy thing to do for a mature project.

From wesmckinn at gmail.com Sun Jan 17 13:48:02 2016
From: wesmckinn at gmail.com (Wes McKinney)
Date: Sun, 17 Jan 2016 10:48:02 -0800
Subject: [Pandas-dev] Thoughts on adopting a 1-PR-1-commit policy?
In-Reply-To: <1453056371875.b7c59591@Nodemailer>
References: <1453056371875.b7c59591@Nodemailer>
Message-ID:

On Sun, Jan 17, 2016 at 10:46 AM, Stephan Hoyer wrote:
> On Sunday, Jan 17, 2016 at 10:40 AM, Wes McKinney wrote:
>> The patch tool does not occlude the patch author identity on GitHub (i.e. the patches will show up on the user profile just the same).
>
> Yes, but not on the "contributors" page for the github project itself.

No, this is not true. See https://github.com/apache/spark/graphs/contributors

> That said, I agree that atomic commits are useful for large projects. This is part of why we encourage/require squashing. I'm not entirely sure that pandas has enough synchronous development that this is necessary.
> If this helps maintenance branches, it would certainly be a win -- we haven't been very good about maintaining bug-fix-only branches, which would be a healthy thing to do for a mature project.

From jeffreback at gmail.com Sun Jan 17 13:56:30 2016
From: jeffreback at gmail.com (Jeff Reback)
Date: Sun, 17 Jan 2016 13:56:30 -0500
Subject: [Pandas-dev] Thoughts on adopting a 1-PR-1-commit policy?
In-Reply-To:
References: <1453056371875.b7c59591@Nodemailer>
Message-ID:

if this preserves the individual authors' commits (and it appears that way), then I am all for a single commit, even for large patches.

On Sun, Jan 17, 2016 at 1:48 PM, Wes McKinney wrote:
> No, this is not true. See https://github.com/apache/spark/graphs/contributors
>
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev

From wesmckinn at gmail.com Sun Jan 17 13:58:30 2016
From: wesmckinn at gmail.com (Wes McKinney)
Date: Sun, 17 Jan 2016 10:58:30 -0800
Subject: [Pandas-dev] Thoughts on adopting a 1-PR-1-commit policy?
In-Reply-To:
References: <1453056371875.b7c59591@Nodemailer>
Message-ID:

I just made my first patch to Parquet, which was committed with this method, and you can see it here

https://github.com/apache/parquet-cpp/graphs/contributors

On Sun, Jan 17, 2016 at 10:56 AM, Jeff Reback wrote:
> if this preserves the individual authors' commits (and it appears that way), then I am all for a single commit, even for large patches.
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev

From jorisvandenbossche at gmail.com Sun Jan 17 14:02:05 2016
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Sun, 17 Jan 2016 20:02:05 +0100
Subject: [Pandas-dev] Thoughts on adopting a 1-PR-1-commit policy?
In-Reply-To:
References: <1453056371875.b7c59591@Nodemailer>
Message-ID:

Stephan, the way the script works is that the commit author is still the original contributor from the PR, but the committer is the one from the core team running the script (IIUC)

2016-01-17 19:58 GMT+01:00 Wes McKinney :
> I just made my first patch to Parquet, which was committed with this method, and you can see it here
>
> https://github.com/apache/parquet-cpp/graphs/contributors
>
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev

From shoyer at gmail.com Sun Jan 17 14:04:27 2016
From: shoyer at gmail.com (Stephan Hoyer)
Date: Sun, 17 Jan 2016 11:04:27 -0800 (PST)
Subject: [Pandas-dev] Thoughts on adopting a 1-PR-1-commit policy?
In-Reply-To:
References:
Message-ID: <1453057467070.3bb357c@Nodemailer>

Does it preserve credit for all the authors on a PR (beyond the first), even if they don't end up as the author on the squashed commit? That would surprise me. I do agree this is a bit of an edge case, though it does include big merges like the one in your original email.

On Sun, Jan 17, 2016 at 10:59 AM, Wes McKinney wrote:
> I just made my first patch to Parquet, which was committed with this method, and you can see it here
>
> https://github.com/apache/parquet-cpp/graphs/contributors
>
> On Sun, Jan 17, 2016 at 10:56 AM, Jeff Reback wrote:
>> if this preserves the individual authors' commits (and it appears that way), then I am all for a single commit, even for large patches.
From jasc at gmx.net Sun Jan 17 15:53:33 2016
From: jasc at gmx.net (Jan Schulz)
Date: Sun, 17 Jan 2016 21:53:33 +0100
Subject: [Pandas-dev] Thoughts on adopting a 1-PR-1-commit policy?
In-Reply-To:
References: <6D657DD5-1A6C-4D74-9B42-E603AE27B675@gmail.com>
Message-ID:

Hi,

On 17 January 2016 at 19:37, Wes McKinney wrote:
> These issues should be addressed during the code review process. It is worse, in my opinion, to have a mix of intermediate (possibly broken) and verified commits in master as opposed to atomic, verified commits.
>
> Additionally, lack of "atomicity" with patches has more issues:

I agree, but there is also `rebase -i` to make these commits be atomic...

E.g. https://github.com/pydata/pandas/pull/11582

The 4 commits work on their own and AFAIR each commit tested green.

So: IMO important changes (especially everything which changes behaviour) should get one commit (or PR/cherry-pick as per your proposal) per behaviour change, and this behaviour change should be explained in the commit message.

If the above PR had been squashed, then 4 messages would be appended to each other and one couldn't separate which description would belong to which...

Jan

From wesmckinn at gmail.com Sun Jan 17 18:34:28 2016
From: wesmckinn at gmail.com (Wes McKinney)
Date: Sun, 17 Jan 2016 15:34:28 -0800
Subject: [Pandas-dev] Thoughts on adopting a 1-PR-1-commit policy?
In-Reply-To:
References: <6D657DD5-1A6C-4D74-9B42-E603AE27B675@gmail.com>
Message-ID:

Yeah, the point of this issue is not "there shall be no more than 1 commit per PR" but rather that smaller patches (i.e., most patches) should not degrade the signal-to-noise ratio of our commit history. Further, we should avoid merging commits that don't stand on their own. Lastly, merge commits generally only serve to degrade the SnR.

Let's look at a sample of yesterday's commits:

https://www.dropbox.com/s/mp5yfp76h6h8z3y/commit-log-20160116.png?dl=0

No mistakes were made here, except that our current process (which Jeff has been following diligently) is resulting in a commit history that is less useful than it could be.
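As a crude way to put a number on the noise -- an illustrative sketch, not something we would need to ship:

    import subprocess

    def count(extra=()):
        # total (or merge-only) commit count reachable from master
        out = subprocess.check_output(
            ["git", "rev-list", "--count"] + list(extra) + ["master"])
        return int(out.decode())

    total, merges = count(), count(["--merges"])
    print("%d of %d commits on master are merge commits" % (merges, total))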
My preference is:

- Use the patch tool for smaller patches and large patches that haven't been split out into a series of incremental, standalone patches
- For large patches that make sense as multiple incremental commits, none of which breaks the build, merge with --ff-only (rebasing as needed). I expect this to be rare.

I really like to avoid "edge-case driven development" -- we are bound to have patches where this guidance doesn't feel right, and we definitely don't have to dogmatically follow it.

- Wes

On Sun, Jan 17, 2016 at 12:53 PM, Jan Schulz wrote:
> I agree, but there is also `rebase -i` to make these commits be atomic...

From njs at pobox.com Sun Jan 17 15:02:45 2016
From: njs at pobox.com (Nathaniel Smith)
Date: Sun, 17 Jan 2016 12:02:45 -0800
Subject: [Pandas-dev] Thoughts on adopting a 1-PR-1-commit policy?
In-Reply-To:
References: <6D657DD5-1A6C-4D74-9B42-E603AE27B675@gmail.com>
Message-ID:

On Jan 17, 2016 10:37, "Wes McKinney" wrote:
> [...]
> Requiring patches to be atomic is common practice in large software teams because otherwise codebase maintenance is a nightmare with > 5-10 developers working in parallel.

A few thoughts:

For what it's worth, possibly the largest/most parallel software collaboration in the world is the Linux kernel, and they mandate that complex patches must *not* be squashed, but must instead be broken up into a series of self-contained incremental patches (as Jan is advocating).

BTW I think you'll find that if you consistently merge using --no-ff (which is what the green button does), then "git log --first-parent" will give you *exactly* the same linear squashed history that you are hoping for, as in the diffs will be byte-for-byte identical. This approach keeps all the history in the repository, and discards the distracting parts at access time rather than commit time. In numpy we mandate that all PRs go via the green button for this reason.

I suspect that the projects you're thinking of do what they do because of a combination of (a) not being very large in the grand scheme of things, so that the linearization itself doesn't become a bottleneck the way it would for a project like the kernel, (b) not understanding git terribly well, and (c) having to assume an even lower level of git knowledge in individual contributors. (Versus the kernel, where they have the "luxury" of imposing arbitrarily high standards and then abusing anyone who doesn't meet them until they figure it out or quit.)

Note that it isn't a great idea to assume that the individual commits that you squashed will still be findable later, even given their id.
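To check whether a given hash is still resolvable in a local clone, a quick sketch (illustrative only; `git cat-file -e` exits non-zero once the object is gone):

    import subprocess

    def commit_exists(sha):
        # exit status 0 only if the object exists and is a commit
        rc = subprocess.call(["git", "cat-file", "-e", sha + "^{commit}"])
        return rc == 0

A lookup that 404s on github is the remote-side version of this check failing.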
It's actually a good idea, generally, to delete no-longer-relevant branches from a personal fork, to avoid getting lost among hundreds of similarly named branches. I actually need to get more in the habit of doing this :-). And even if one disagrees about it being a good idea, people do it and there's no way to stop them. But when this happens, if the commit was cherry-picked into the main repo and the branch is gone from the personal fork, then github will eventually garbage collect the original commits, and trying to look up those commit hashes, maybe years later, will give you a 404.

Anyway, both processes can obviously work, and what works for you is what works for you. I'm not an absolutist :-). But I thought it might be helpful to at least be aware of some of these points while making the decision.

Cheers,
-n

From wesmckinn at gmail.com Mon Jan 18 17:59:35 2016
From: wesmckinn at gmail.com (Wes McKinney)
Date: Mon, 18 Jan 2016 14:59:35 -0800
Subject: [Pandas-dev] Thoughts on adopting a 1-PR-1-commit policy?
In-Reply-To:
References: <6D657DD5-1A6C-4D74-9B42-E603AE27B675@gmail.com>
Message-ID:

On Sun, Jan 17, 2016 at 3:34 PM, Wes McKinney wrote:
> I really like to avoid "edge-case driven development" -- we are bound to have patches where this guidance doesn't feel right, and we definitely don't have to dogmatically follow it.

For the record -- I spent some time reviewing the major category dtype pull requests that were merged in 2014, and given the sprawling nature of those changes and the huge amount of collaboration that took place, I agree it would have been preferable to fast-forward merge the incremental commits instead of squashing them into a couple of monolithic commits.

https://github.com/pydata/pandas/commit/0f62d3fc62f317538044ed3d349bfb89fb7ee9de
https://github.com/pydata/pandas/commit/ea0a13c172761348d08285a19ebf731cdabb2db3
From wesmckinn at gmail.com Mon Jan 25 11:47:42 2016
From: wesmckinn at gmail.com (Wes McKinney)
Date: Mon, 25 Jan 2016 08:47:42 -0800
Subject: [Pandas-dev] Unit test reorganization
In-Reply-To:
References:
Message-ID:

hi all,

As part of code cleanup and reorganization, let's start creating a "quarantine" of test code for functionality (like Panel classes) that we are contemplating deprecating and later removing in 1.0, if that sounds like a good idea?

- Wes

On Wed, Jan 13, 2016 at 6:06 PM, Wes McKinney wrote:
> On Wed, Jan 13, 2016 at 6:01 PM, Jeff Reback wrote:
>> so I agree +1 on moving all to pandas/tests
>>
>> - the indexing tests, which *mostly* are in test_indexing.py, though quite a few are in test_series/test_frame.py, should ideally be merged into a set of tests/indexing
>>
>> - io tests could be left alone I think
>
> Yeah, I think pandas/io/tests is the one definite exception where there isn't much benefit
>
>> - stats tests are *mostly* deprecated
>>
>> - since we're going to deprecate panel + nd soon, I think it makes sense to move these tests & code to pandas/deprecated, to keep them separate
>>
>> - test_tslib.py should be integrated into tseries/test_timeseries.py
>>
>> - almost all of the Index tests are now in test_index (with each sub-class being somewhat generically tested), but the time-series ones are in tseries/test_base, so these could be merged as well.
>
> Yep, it specifically would be good to collect 100% of the index data structure machinery (including Datetime/Timedelta/PeriodIndex) in one place (same for axis indexing as you said, since it got pretty scattered)
>
>> Jeff
>>
>> On Wed, Jan 13, 2016 at 8:51 PM, Wes McKinney wrote:
>>> Another idea here I've been toying with to achieve better logical test organization is to place all tests in the whole project under pandas/tests. This way we can centralize all the tests relating to some functional aspect of pandas in one place, rather than the status quo where test code tends to be fairly close to its implementation (but not always). A prime example of where I let this get disorganized early on is time series functionality tests are somewhat scattered across pandas/tests, pandas/tseries, etc. This way we can also collect a single directory of "quarantined" pandas 0.x behavior that we are contemplating changing in a 1.0 release.
>>>
>>> Thoughts on this + other ideas on how to help organize the tests to help mentally in approaching refactoring and internal changes?
>>>
>>> - Wes
>>>
>>> On Wed, Jan 13, 2016 at 1:16 PM, Wes McKinney wrote:
>>> > OK, I got started with the biggest offender:
>>> >
>>> > https://github.com/pydata/pandas/pull/12032
>>> >
>>> > It would be great to take the same approach with the other large test modules, with a special eye for quarantining "leaky" internals code and segregating NumPy interoperability contracts.
>>> > I didn't completely do this with test_frame.py but it's a good start.
>>> >
>>> > There's definitely plenty of code in the other top-level test modules which may nest under tests/frame or tests/series
>>> >
>>> > - Wes
>>> >
>>> > On Mon, Jan 11, 2016 at 8:47 AM, Wes McKinney wrote:
>>> >> On Sun, Jan 10, 2016 at 6:06 PM, Stephan Hoyer wrote:
>>> >>> On Fri, Jan 8, 2016 at 5:34 PM, Wes McKinney wrote:
>>> >>>> Big #1 question is, how strongly do you feel about *shipping* the test suite in site-packages? Some other libraries with sprawling and complex test suites have chosen not to ship them: https://github.com/zzzeek/sqlalchemy
>>> >>>
>>> >>> I would prefer to include the test suite if possible, because the ability to type "nosetests pandas" makes it easy both for users to verify installations are working properly and for downstream distributors to identify and report bugs. The complete pandas test suite still runs in 20-30 minutes, so I think it's still fairly reasonable to use it for these purposes.
>>> >>
>>> >> Got it. I wasn't sure if this was something people still wanted to do in practice with the burgeoning test suite.
>>> >>
>>> >>>> Independently, I would support and help with starting a judicious reorganization of the contents of pandas/tests. So I'm thinking like
>>> >>>>
>>> >>>> tests/
>>> >>>> dataframe/
>>> >>>> series/
>>> >>>> algorithms/
>>> >>>> internals/
>>> >>>> tseries/
>>> >>>>
>>> >>>> and so forth.
>>> >>>
>>> >>> This sounds like a great idea -- these files have really gotten out of control!
>>> >>
>>> >> Sounds good. I've been sorting through points of contact between Series/DataFrame's implementation and internal matters (e.g. the BlockManager) and figured it would be good to "quarantine" code that makes assumptions about what's under the hood. I'll get the first couple patches started and it can be a slow burn to break apart these large files.
>>> >>
>>> >>> Cheers,
>>> >>> Stephan
>>> _______________________________________________
>>> Pandas-dev mailing list
>>> Pandas-dev at python.org
>>> https://mail.python.org/mailman/listinfo/pandas-dev

From jeffreback at gmail.com Mon Jan 25 11:59:25 2016
From: jeffreback at gmail.com (Jeff Reback)
Date: Mon, 25 Jan 2016 11:59:25 -0500
Subject: [Pandas-dev] Unit test reorganization
In-Reply-To:
References:
Message-ID:

sounds good to me.

on other notes: planning on doing the 0.18.0 RC in say 2 weeks' time. I think that adding to_xarray to 0.18.0 is realistic, but I think we need to push deprecating Panel to 0.19.0, simply to have some time for this to brew (and for to_xarray to mature). (*could* simply delay 0.18.0 for say a month otherwise). any objections?

On Mon, Jan 25, 2016 at 11:47 AM, Wes McKinney wrote:
> hi all,
>
> As part of code cleanup and reorganization, let's start creating a "quarantine" of test code for functionality (like Panel classes) that we are contemplating deprecating and later removing in 1.0, if that sounds like a good idea?
>
> - Wes
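fwiw, tagging the quarantined tests so they can be excluded (or run on their own) is straightforward with nose's attrib plugin -- a rough sketch, illustrative only and not committed code:

    # mark tests for functionality we may deprecate (e.g. Panel)
    from nose.plugins.attrib import attr

    @attr('quarantine')
    def test_panel_transpose():
        import numpy as np
        import pandas as pd

        p = pd.Panel(np.arange(24).reshape(2, 3, 4))
        assert p.transpose(2, 0, 1).shape == (4, 2, 3)

    # run the suite without quarantined tests:
    #   nosetests -a '!quarantine' pandas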
_______________________________________________
Pandas-dev mailing list
Pandas-dev at python.org
https://mail.python.org/mailman/listinfo/pandas-dev