From wesmckinn at gmail.com Fri Jan 1 20:13:58 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Fri, 1 Jan 2016 17:13:58 -0800 Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap In-Reply-To: References: Message-ID: Jeff -- can you require log-in for editing on this document? https://docs.google.com/document/d/151ct8jcZWwh7XStptjbLsda6h2b3C0IuiH_hfZnUA58/edit# There are a number of anonymous edits. On Wed, Dec 30, 2015 at 6:04 PM, Wes McKinney wrote: > I cobbled together an ugly start of a c++->cython->pandas toolchain here > > https://github.com/wesm/pandas/tree/libpandas-native-core > > I used a mix of Kudu, Impala, and dynd-python cmake sources, so it's a > bit messy at the moment but it should be sufficient to run some real > experiments with a little more work. I reckon it's like a 6 month > project to tear out the insides of Series and DataFrame and replace it > with a new "native core", but we should be able to get enough info to > see whether it's a viable plan within a month or so. > > The end goal is to create "private" extension types in Cython that can > be the new base classes for Series and NDFrame; these will hold a > reference to a C++ object that contains wrappered NumPy arrays and > other metadata (like pandas-only dtypes). > > It might be too hard to try to replace a single usage of block manager > as a first experiment, so I'll try to create a minimal "SeriesLite" > that supports 3 dtypes > > 1) float64 with nans > 2) int64 with a bitmask for NAs > 3) category type for one of these > > Just want to get a feel for the extensibility and offer an NA > singleton Python object (a la None) for getting and setting NAs across > these 3 dtypes. > > If we end up going down this route, any way to place a moratorium on > invasive work on pandas internals (outside bug fixes)? > > Pedantic aside: I'd rather avoid shipping thirdparty C/C++ libraries > like googletest and friends in pandas if we can. Cloudera folks have > been working on a portable C++ library toolchain for Impala and other > projects at https://github.com/cloudera/native-toolchain, but it is > only being tested on Linux and OS X. Most google libraries should > build out of the box on MSVC but it'll be something to keep an eye on. > > BTW thanks to the libdynd developers for pioneering the c++ lib <-> > python-c++ lib <-> cython toolchain; being able to build Cython > extensions directly from cmake is a godsend > > HNY all > Wes > > On Tue, Dec 29, 2015 at 4:17 PM, Irwin Zaid wrote: >> Yeah, that seems reasonable and I totally agree a Pandas wrapper layer would >> be necessary. >> >> I'll keep an eye on this and I'd like to help if I can. >> >> Irwin >> >> >> On Tue, Dec 29, 2015 at 6:01 PM, Wes McKinney wrote: >>> >>> I'm not suggesting a rewrite of NumPy functionality but rather pandas >>> functionality that is currently written in a mishmash of Cython and Python. >>> Happy to experiment with changing the internal compute infrastructure and >>> data representation to DyND after this first stage of cleanup is done. Even >>> if we use DyND a pretty extensive pandas wrapper layer will be necessary. >>> >>> >>> On Tuesday, December 29, 2015, Irwin Zaid wrote: >>>> >>>> Hi Wes (and others), >>>> >>>> I've been following this conversation with interest. I do think it would >>>> be worth exploring DyND, rather than setting up yet another rewrite of >>>> NumPy-functionality. Especially because DyND is already an optional >>>> dependency of Pandas. 
>>>>
>>>> For things like Integer NA and new dtypes, DyND is there and ready to do
>>>> this.
>>>>
>>>> Irwin
>>>>
>>>> On Tue, Dec 29, 2015 at 5:18 PM, Wes McKinney
>>>> wrote:
>>>>>
>>>>> Can you link to the PR you're talking about?
>>>>>
>>>>> I will see about spending a few hours setting up a libpandas.so as a C++
>>>>> shared library where we can run some experiments and validate whether it can
>>>>> solve the integer-NA problem and be a place to put new data types
>>>>> (categorical and friends). I'm +1 on targeting
>>>>>
>>>>> Would it also be worth making a wish list of APIs we might consider
>>>>> breaking in a pandas 1.0 release that also features this new "native core"?
>>>>> Might as well right some wrongs while we're doing some invasive work on the
>>>>> internals; some breakage might be unavoidable. We can always maintain a
>>>>> pandas legacy 0.x.x maintenance branch (providing a conda binary build) for
>>>>> legacy users where showstopper bugs can get fixed.
>>>>>
>>>>> On Tue, Dec 29, 2015 at 1:20 PM, Jeff Reback
>>>>> wrote:
>>>>> > Wes your last is noted as well. I *think* we can actually do this now
>>>>> > (well there is a PR out there).
>>>>> >
>>>>> > On Tue, Dec 29, 2015 at 4:12 PM, Wes McKinney
>>>>> > wrote:
>>>>> >>
>>>>> >> The other huge thing this will enable is copy-on-write for
>>>>> >> various kinds of views, which should cut down on some of the defensive
>>>>> >> copying in the library and reduce memory usage.
>>>>> >>
>>>>> >> On Tue, Dec 29, 2015 at 1:02 PM, Wes McKinney
>>>>> >> wrote:
>>>>> >> > Basically the approach is
>>>>> >> >
>>>>> >> > 1) Base dtype type
>>>>> >> > 2) Base array type with K >= 1 dimensions
>>>>> >> > 3) Base scalar type
>>>>> >> > 4) Base index type
>>>>> >> > 5) "Wrapper" subclasses for all NumPy types fitting into categories
>>>>> >> > #1, #2, #3, #4
>>>>> >> > 6) Subclasses for pandas-specific types like category, datetimeTZ, etc.
>>>>> >> > 7) NDFrame as cpcloud wrote is just a list of these
>>>>> >> >
>>>>> >> > Indexes and axis labels / column names can get layered on top.
>>>>> >> >
>>>>> >> > After we do all this we can look at adding nested types (arrays, maps,
>>>>> >> > structs) to better support JSON.
>>>>> >> >
>>>>> >> > - Wes
>>>>> >> >
>>>>> >> > On Tue, Dec 29, 2015 at 12:14 PM, Phillip Cloud
>>>>> >> > wrote:
>>>>> >> >> Maybe this is saying the same thing as Wes, but how far would
>>>>> >> >> something like this get us?
>>>>> >> >>
>>>>> >> >> // warning: things are probably not this simple
>>>>> >> >>
>>>>> >> >> struct data_array_t {
>>>>> >> >>   void *primitive;               // scalar data
>>>>> >> >>   data_array_t *nested;          // nested data
>>>>> >> >>   boost::dynamic_bitset isnull;  // might have to create our own to avoid boost
>>>>> >> >>   schema_t schema;               // not sure exactly what this looks like
>>>>> >> >> };
>>>>> >> >>
>>>>> >> >> typedef std::map<std::string, data_array_t> data_frame_t;  // probably not this simple
>>>>> >> >>
>>>>> >> >> To answer Jeff's use-case question: I think that the use cases are
>>>>> >> >> 1) freedom from numpy (mostly) 2) no more block manager which frees us
>>>>> >> >> from the limitations of the block memory layout. In particular, the ability
>>>>> >> >> to take advantage of memory-mapped IO would be a big win IMO.
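To make the array-plus-bitmask idea above concrete, here is a minimal,
illustrative C++ sketch of an int32 array with NA support, where null-ness
lives in a separate validity bitmap that is AND-ed together during
arithmetic. All names here (Int32Array, ValidityBitmap, Add) are
hypothetical and not taken from any actual pandas branch; the sketch
assumes equal-length inputs and elides error handling.

// Sketch only: a minimal int32 array with a validity bitmask, so that
// arithmetic propagates NA values. All names are hypothetical.
#include <cstddef>
#include <cstdint>
#include <vector>

class ValidityBitmap {
 public:
  // All values start out valid (every bit set).
  explicit ValidityBitmap(std::size_t n) : bits_((n + 7) / 8, 0xFF) {}
  bool IsValid(std::size_t i) const { return bits_[i / 8] & (1 << (i % 8)); }
  void SetNull(std::size_t i) { bits_[i / 8] &= ~(1 << (i % 8)); }
 private:
  std::vector<std::uint8_t> bits_;  // 1 bit per value; 1 = valid, 0 = NA
};

class Int32Array {
 public:
  explicit Int32Array(std::vector<std::int32_t> values)
      : values_(std::move(values)), valid_(values_.size()) {}

  void SetNull(std::size_t i) { valid_.SetNull(i); }

  // Element-wise add, assuming equal lengths. The data loop operates on a
  // plain contiguous buffer; NA propagation is just an AND of the bitmaps.
  Int32Array Add(const Int32Array& other) const {
    Int32Array out(values_);
    for (std::size_t i = 0; i < values_.size(); ++i) {
      out.values_[i] = values_[i] + other.values_[i];
      if (!valid_.IsValid(i) || !other.valid_.IsValid(i)) {
        out.valid_.SetNull(i);
      }
    }
    return out;
  }

 private:
  std::vector<std::int32_t> values_;
  ValidityBitmap valid_;
};

The design point is that the value buffer stays a plain contiguous array
(so the non-NA arithmetic path could defer to NumPy or a vectorized loop),
while NA handling reduces to cheap bitmap operations; this matches the
division of labor sketched for Int32Array->add versus Float32Array->add in
the reply that follows.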
>>>>> >> >>
>>>>> >> >>
>>>>> >> >> On Tue, Dec 29, 2015 at 2:50 PM Wes McKinney
>>>>> >> >> wrote:
>>>>> >> >>>
>>>>> >> >>> I will write a more detailed response to some of these things after
>>>>> >> >>> the new year, but, in particular, re: missing values, can you or
>>>>> >> >>> someone tell me why creating an object that contains a NumPy array and
>>>>> >> >>> a bitmap is not sufficient? If we can add a lightweight C/C++ class
>>>>> >> >>> layer between NumPy function calls (e.g. arithmetic) and pandas
>>>>> >> >>> function calls, then I see no reason why we cannot have
>>>>> >> >>>
>>>>> >> >>> Int32Array->add
>>>>> >> >>>
>>>>> >> >>> and
>>>>> >> >>>
>>>>> >> >>> Float32Array->add
>>>>> >> >>>
>>>>> >> >>> do the right thing (the former would be responsible for bitmasking to
>>>>> >> >>> propagate NA values; the latter would defer to NumPy). If we can put
>>>>> >> >>> all the internals of pandas objects inside a black box, we can add
>>>>> >> >>> layers of virtual function indirection without a performance penalty
>>>>> >> >>> (whereas adding more interpreter overhead with more abstraction layers
>>>>> >> >>> does add up to a perf penalty).
>>>>> >> >>>
>>>>> >> >>> I don't think this is too scary -- I would be willing to create a
>>>>> >> >>> small POC C++ library to prototype something like what I'm talking
>>>>> >> >>> about.
>>>>> >> >>>
>>>>> >> >>> Since pandas has limited points of contact with NumPy I don't think
>>>>> >> >>> this would end up being too onerous.
>>>>> >> >>>
>>>>> >> >>> For the record, I'm pretty allergic to "advanced C++"; I think it is a
>>>>> >> >>> useful tool; if you pick a sane 20% subset of the C++11 spec and follow
>>>>> >> >>> Google C++ style, it's not very inaccessible to intermediate
>>>>> >> >>> developers. More or less "C plus OOP and easier object lifetime
>>>>> >> >>> management (shared/unique_ptr, etc.)". As soon as you add a lot of
>>>>> >> >>> template metaprogramming, C++ library development quickly becomes
>>>>> >> >>> inaccessible except to the C++-Jedi.
>>>>> >> >>>
>>>>> >> >>> Maybe let's start a Google document on "pandas roadmap" where we can
>>>>> >> >>> break down the 1-2 year goals and some of these infrastructure issues
>>>>> >> >>> and have our discussion there? (obviously publish this someplace once
>>>>> >> >>> we're done)
>>>>> >> >>>
>>>>> >> >>> - Wes
>>>>> >> >>>
>>>>> >> >>> On Fri, Dec 25, 2015 at 2:14 PM, Jeff Reback
>>>>> >> >>> wrote:
>>>>> >> >>> > Here are some of my thoughts about pandas Roadmap / status and some
>>>>> >> >>> > responses to Wes's thoughts.
>>>>> >> >>> >
>>>>> >> >>> > In the last few (and upcoming) major releases we have made the
>>>>> >> >>> > following changes:
>>>>> >> >>> >
>>>>> >> >>> > - dtype enhancements (Categorical, Timedelta, Datetime w/tz) & making
>>>>> >> >>> > these first class objects
>>>>> >> >>> > - code refactoring to remove subclassing of ndarrays for Series & Index
>>>>> >> >>> > - carving out / deprecating non-core parts of pandas
>>>>> >> >>> >   - datareader
>>>>> >> >>> >   - SparsePanel, WidePanel & other aliases (TimeSeries)
>>>>> >> >>> >   - rpy, rplot, irow et al.
>>>>> >> >>> >   - google-analytics
>>>>> >> >>> > - API changes to make things more consistent
>>>>> >> >>> >   - pd.rolling/expanding * -> .rolling/expanding (this is in master now)
>>>>> >> >>> >   - .resample becoming fully deferred, like groupby
>>>>> >> >>> >   - multi-index slicing along any level (obviates need for .xs) and
>>>>> >> >>> >     allows assignment
>>>>> >> >>> >   - .loc/.iloc - for the most part obviates use of .ix
>>>>> >> >>> >   - .pipe & .assign
>>>>> >> >>> >   - plotting accessors
>>>>> >> >>> >   - fixing of the sorting API
>>>>> >> >>> > - many performance enhancements both micro & macro (e.g. release GIL)
>>>>> >> >>> >
>>>>> >> >>> > Some on-deck enhancements are (meaning these are basically ready to go in):
>>>>> >> >>> > - IntervalIndex (and eventually make PeriodIndex just a sub-class of this)
>>>>> >> >>> > - RangeIndex
>>>>> >> >>> >
>>>>> >> >>> > so lots of changes, though nothing really earth-shaking, just more
>>>>> >> >>> > convenience, reducing magicness somewhat and providing flexibility.
>>>>> >> >>> >
>>>>> >> >>> > Of course we are getting more and more issues, mostly bug reports (and
>>>>> >> >>> > lots of dupes), some edge-case enhancements which add to the existing
>>>>> >> >>> > APIs, and of course requests to expand the (already) large codebase to
>>>>> >> >>> > other use cases.
>>>>> >> >>> > Balancing this are a good many pull-requests from many different users,
>>>>> >> >>> > some even deep into the internals.
>>>>> >> >>> >
>>>>> >> >>> > Here are some things that I have talked about and could be considered
>>>>> >> >>> > for the roadmap. Disclaimer: I do work for Continuum but these views
>>>>> >> >>> > are of course my own; furthermore I am obviously a bit more familiar
>>>>> >> >>> > with some of the 'sponsored' open-source libraries, but I am always
>>>>> >> >>> > open to new things.
>>>>> >> >>> >
>>>>> >> >>> > - integration / automatic deferral to numba for JIT (this would be
>>>>> >> >>> > thru .apply)
>>>>> >> >>> > - automatic deferral to dask from groupby where appropriate / maybe a
>>>>> >> >>> > .to_parallel (to simply return a dask.DataFrame object)
>>>>> >> >>> > - incorporation of quantities / units (as part of the dtype)
>>>>> >> >>> > - use of DyND to allow missing values for int dtypes
>>>>> >> >>> > - make Period a first class dtype.
>>>>> >> >>> > - provide some copy-on-write semantics to alleviate the chained-indexing
>>>>> >> >>> > issues which occasionally come up with misuse of the indexing API
>>>>> >> >>> > - allow a 'policy' to automatically provide column blocks for dict-like
>>>>> >> >>> > input (e.g. each column would be a block); this would allow a pass-thru
>>>>> >> >>> > API where you could put in numpy arrays where you have views and have
>>>>> >> >>> > them preserved rather than copied automatically.
Note that this would also allow what I call 'split',
>>>>> >> >>> > where a passed-in multi-dim numpy array could be split up into
>>>>> >> >>> > individual blocks (which actually gives a nice perf boost after the
>>>>> >> >>> > splitting costs).
>>>>> >> >>> >
>>>>> >> >>> > In working towards some of these goals, I have come to the opinion that
>>>>> >> >>> > it would make sense to have a neutral API protocol layer that would
>>>>> >> >>> > allow us to swap out different engines as needed, for particular
>>>>> >> >>> > dtypes, or *maybe* out-of-core type computations. E.g.
>>>>> >> >>> > imagine that we replaced the in-memory block structure with a bcolz /
>>>>> >> >>> > memmap type; in theory this should be 'easy' and just work.
>>>>> >> >>> > I could also see us adopting *some* of the SFrame code to allow easier
>>>>> >> >>> > interop with this API layer.
>>>>> >> >>> >
>>>>> >> >>> > In practice, I think a nice API layer would need to be created to make
>>>>> >> >>> > this clean / nice.
>>>>> >> >>> >
>>>>> >> >>> > So this comes around to Wes's point about creating a c++ library for
>>>>> >> >>> > the internals (and possibly even some of the indexing routines).
>>>>> >> >>> > In an ideal world, of course this would be desirable. Getting there is
>>>>> >> >>> > a bit non-trivial I think, and IMHO might not be worth the effort. I
>>>>> >> >>> > don't really see big performance bottlenecks. We *already* defer much
>>>>> >> >>> > of the computation to libraries like numexpr & bottleneck (where
>>>>> >> >>> > appropriate). Adding numba / dask to the list would be helpful.
>>>>> >> >>> >
>>>>> >> >>> > I think that almost all performance issues are the result of:
>>>>> >> >>> >
>>>>> >> >>> > a) gross misuse of the pandas API. How much code have you seen that
>>>>> >> >>> > does df.apply(lambda x: x.sum())
>>>>> >> >>> > b) routines which operate column-by-column rather than block-by-block
>>>>> >> >>> > and are in python space (e.g. we have an issue right now about .quantile)
>>>>> >> >>> >
>>>>> >> >>> > So I am glossing over a big goal of having a c++ library that
>>>>> >> >>> > represents the pandas internals. This would by definition have a c-API
>>>>> >> >>> > so that you *could* use pandas-like semantics in c/c++ and just have it
>>>>> >> >>> > work (and then pandas would be a thin wrapper around this library).
>>>>> >> >>> >
>>>>> >> >>> > I am not averse to this, but I think it would be quite a big effort,
>>>>> >> >>> > and not a huge perf boost IMHO. Further there are a number of API
>>>>> >> >>> > issues w.r.t. indexing which need to be clarified / worked out (e.g.
>>>>> >> >>> > should we simply deprecate []) that are much easier to test / figure
>>>>> >> >>> > out in python space.
>>>>> >> >>> >
>>>>> >> >>> > I also think that we have quite a large number of contributors.
>>>>> >> >>> > Moving to c++ might make the internals a bit more impenetrable than
>>>>> >> >>> > the current internals (though this would allow c++ people to
>>>>> >> >>> > contribute, so that might balance out).
>>>>> >> >>> >
>>>>> >> >>> > We have a limited core of devs who right now are familiar with things.
>>>>> >> >>> > If someone happened to have a starting base for a c++ library, then I
>>>>> >> >>> > might change my opinion here.
>>>>> >> >>> >
>>>>> >> >>> >
>>>>> >> >>> > my 4c.
>>>>> >> >>> >
>>>>> >> >>> > Jeff
>>>>> >> >>> >
>>>>> >> >>> >
>>>>> >> >>> >
>>>>> >> >>> >
>>>>> >> >>> > On Thu, Dec 24, 2015 at 7:18 PM, Wes McKinney
>>>>> >> >>> > wrote:
>>>>> >> >>> >>
>>>>> >> >>> >> Deep thoughts during the holidays.
>>>>> >> >>> >>
>>>>> >> >>> >> I might be out of line here, but the interpreter-heaviness of the
>>>>> >> >>> >> inside of pandas objects is likely to be a long-term liability and
>>>>> >> >>> >> source of performance problems and technical debt.
>>>>> >> >>> >>
>>>>> >> >>> >> Has anyone put any thought into planning and beginning to execute on a
>>>>> >> >>> >> rewrite that moves as much as possible of the internals into native /
>>>>> >> >>> >> compiled code? I'm talking about:
>>>>> >> >>> >>
>>>>> >> >>> >> - pandas/core/internals
>>>>> >> >>> >> - indexing and assignment
>>>>> >> >>> >> - much of pandas/core/common
>>>>> >> >>> >> - categorical and custom dtypes
>>>>> >> >>> >> - all indexing mechanisms
>>>>> >> >>> >>
>>>>> >> >>> >> I'm concerned we've already exposed too much of the internals to
>>>>> >> >>> >> users, so this might lead to a lot of API breakage, but it might be
>>>>> >> >>> >> for the Greater Good. As a first step, beginning a partial migration
>>>>> >> >>> >> of internals into some C++ classes that encapsulate the insides of
>>>>> >> >>> >> DataFrame objects and implement indexing and block-level manipulations
>>>>> >> >>> >> would be a good place to start. I think you could do this without too
>>>>> >> >>> >> much disruption.
>>>>> >> >>> >>
>>>>> >> >>> >> As part of this internal retooling we might give consideration to
>>>>> >> >>> >> alternative data structures for representing data internal to pandas
>>>>> >> >>> >> objects. Now in 2015/2016, continuing to be hamstrung by NumPy's
>>>>> >> >>> >> limitations feels somewhat anachronistic. User code is riddled with
>>>>> >> >>> >> workarounds for data type fidelity issues and the like. Like, really,
>>>>> >> >>> >> why not add a bitndarray (similar to ilanschnell/bitarray) for storing
>>>>> >> >>> >> nullness for problematic types and hide this from the user? =)
>>>>> >> >>> >>
>>>>> >> >>> >> Since we are now a NumFOCUS-sponsored project, I feel like we might
>>>>> >> >>> >> consider establishing some formal governance over pandas and
>>>>> >> >>> >> publishing roadmap documents describing plans for the project and
>>>>> >> >>> >> meeting notes from committers.
There's no real
>>>>> >> >>> >> "committer culture" for NumFOCUS projects like there is with the
>>>>> >> >>> >> Apache Software Foundation, but we might try leading by example!
>>>>> >> >>> >>
>>>>> >> >>> >> Also, I believe pandas as a project has reached a level of importance
>>>>> >> >>> >> where we ought to consider planning and execution on larger scale
>>>>> >> >>> >> undertakings such as this for safeguarding the future.
>>>>> >> >>> >>
>>>>> >> >>> >> As for myself, well, I have my hands full in Big Data-land. I wish I
>>>>> >> >>> >> could be helping more with pandas, but there are quite a few
>>>>> >> >>> >> fundamental issues (like data interoperability, nested data handling,
>>>>> >> >>> >> and file format support, e.g. Parquet; see
>>>>> >> >>> >>
>>>>> >> >>> >> http://wesmckinney.com/blog/the-problem-with-the-data-science-language-wars/)
>>>>> >> >>> >> preventing Python from being more useful in industry analytics
>>>>> >> >>> >> applications.
>>>>> >> >>> >>
>>>>> >> >>> >> Aside: one of the bigger mistakes I made with pandas's API design was
>>>>> >> >>> >> making it acceptable to call class constructors, like
>>>>> >> >>> >> pandas.DataFrame, directly (versus factory functions). Sorry about
>>>>> >> >>> >> that! If we could convince everyone to start writing pandas.data_frame
>>>>> >> >>> >> or dataframe instead of using the class reference, it would help a lot
>>>>> >> >>> >> with code cleanup. It's hard to plan for these things; NumPy
>>>>> >> >>> >> interoperability seemed a lot more important in 2008 than it does now,
>>>>> >> >>> >> so I forgive myself.
>>>>> >> >>> >>
>>>>> >> >>> >> cheers and best wishes for 2016,
>>>>> >> >>> >> Wes
>>>>> >> >>> >> _______________________________________________
>>>>> >> >>> >> Pandas-dev mailing list
>>>>> >> >>> >> Pandas-dev at python.org
>>>>> >> >>> >> https://mail.python.org/mailman/listinfo/pandas-dev
>>>>> >> >>> >
>>>>> >> >>> >
>>>>> >> >>> _______________________________________________
>>>>> >> >>> Pandas-dev mailing list
>>>>> >> >>> Pandas-dev at python.org
>>>>> >> >>> https://mail.python.org/mailman/listinfo/pandas-dev
>>>>> >> _______________________________________________
>>>>> >> Pandas-dev mailing list
>>>>> >> Pandas-dev at python.org
>>>>> >> https://mail.python.org/mailman/listinfo/pandas-dev
>>>>> >
>>>>> >
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Pandas-dev mailing list
>>>>> Pandas-dev at python.org
>>>>> https://mail.python.org/mailman/listinfo/pandas-dev
>>>>>
>>>>
>>

From jeffreback at gmail.com Fri Jan 1 20:23:02 2016
From: jeffreback at gmail.com (Jeff Reback)
Date: Fri, 1 Jan 2016 20:23:02 -0500
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: References: Message-ID:

I changed the doc so that the core dev people can edit. I *think* that
everyone should be able to view/comment though.

On Fri, Jan 1, 2016 at 8:13 PM, Wes McKinney wrote:
> Jeff -- can you require log-in for editing on this document?
>
> https://docs.google.com/document/d/151ct8jcZWwh7XStptjbLsda6h2b3C0IuiH_hfZnUA58/edit#
>
> There are a number of anonymous edits.
> > On Wed, Dec 30, 2015 at 6:04 PM, Wes McKinney wrote: > > I cobbled together an ugly start of a c++->cython->pandas toolchain here > > > > https://github.com/wesm/pandas/tree/libpandas-native-core > > > > I used a mix of Kudu, Impala, and dynd-python cmake sources, so it's a > > bit messy at the moment but it should be sufficient to run some real > > experiments with a little more work. I reckon it's like a 6 month > > project to tear out the insides of Series and DataFrame and replace it > > with a new "native core", but we should be able to get enough info to > > see whether it's a viable plan within a month or so. > > > > The end goal is to create "private" extension types in Cython that can > > be the new base classes for Series and NDFrame; these will hold a > > reference to a C++ object that contains wrappered NumPy arrays and > > other metadata (like pandas-only dtypes). > > > > It might be too hard to try to replace a single usage of block manager > > as a first experiment, so I'll try to create a minimal "SeriesLite" > > that supports 3 dtypes > > > > 1) float64 with nans > > 2) int64 with a bitmask for NAs > > 3) category type for one of these > > > > Just want to get a feel for the extensibility and offer an NA > > singleton Python object (a la None) for getting and setting NAs across > > these 3 dtypes. > > > > If we end up going down this route, any way to place a moratorium on > > invasive work on pandas internals (outside bug fixes)? > > > > Pedantic aside: I'd rather avoid shipping thirdparty C/C++ libraries > > like googletest and friends in pandas if we can. Cloudera folks have > > been working on a portable C++ library toolchain for Impala and other > > projects at https://github.com/cloudera/native-toolchain, but it is > > only being tested on Linux and OS X. Most google libraries should > > build out of the box on MSVC but it'll be something to keep an eye on. > > > > BTW thanks to the libdynd developers for pioneering the c++ lib <-> > > python-c++ lib <-> cython toolchain; being able to build Cython > > extensions directly from cmake is a godsend > > > > HNY all > > Wes > > > > On Tue, Dec 29, 2015 at 4:17 PM, Irwin Zaid wrote: > >> Yeah, that seems reasonable and I totally agree a Pandas wrapper layer > would > >> be necessary. > >> > >> I'll keep an eye on this and I'd like to help if I can. > >> > >> Irwin > >> > >> > >> On Tue, Dec 29, 2015 at 6:01 PM, Wes McKinney > wrote: > >>> > >>> I'm not suggesting a rewrite of NumPy functionality but rather pandas > >>> functionality that is currently written in a mishmash of Cython and > Python. > >>> Happy to experiment with changing the internal compute infrastructure > and > >>> data representation to DyND after this first stage of cleanup is done. > Even > >>> if we use DyND a pretty extensive pandas wrapper layer will be > necessary. > >>> > >>> > >>> On Tuesday, December 29, 2015, Irwin Zaid wrote: > >>>> > >>>> Hi Wes (and others), > >>>> > >>>> I've been following this conversation with interest. I do think it > would > >>>> be worth exploring DyND, rather than setting up yet another rewrite of > >>>> NumPy-functionality. Especially because DyND is already an optional > >>>> dependency of Pandas. > >>>> > >>>> For things like Integer NA and new dtypes, DyND is there and ready to > do > >>>> this. > >>>> > >>>> Irwin > >>>> > >>>> On Tue, Dec 29, 2015 at 5:18 PM, Wes McKinney > >>>> wrote: > >>>>> > >>>>> Can you link to the PR you're talking about? 
> >>>>>
> >>>>> I will see about spending a few hours setting up a libpandas.so as a C++
> >>>>> shared library where we can run some experiments and validate whether
> >>>>> it can solve the integer-NA problem and be a place to put new data types
> >>>>> (categorical and friends). I'm +1 on targeting
> >>>>>
> >>>>> Would it also be worth making a wish list of APIs we might consider
> >>>>> breaking in a pandas 1.0 release that also features this new "native core"?
> >>>>> Might as well right some wrongs while we're doing some invasive work
> >>>>> on the internals; some breakage might be unavoidable. We can always
> >>>>> maintain a pandas legacy 0.x.x maintenance branch (providing a conda
> >>>>> binary build) for legacy users where showstopper bugs can get fixed.
> >>>>>
> >>>>> On Tue, Dec 29, 2015 at 1:20 PM, Jeff Reback
> >>>>> wrote:
> >>>>> > Wes your last is noted as well. I *think* we can actually do this now
> >>>>> > (well there is a PR out there).
> >>>>> >
> >>>>> > On Tue, Dec 29, 2015 at 4:12 PM, Wes McKinney
> >>>>> > wrote:
> >>>>> >>
> >>>>> >> The other huge thing this will enable is copy-on-write for
> >>>>> >> various kinds of views, which should cut down on some of the
> >>>>> >> defensive copying in the library and reduce memory usage.
> >>>>> >>
> >>>>> >> On Tue, Dec 29, 2015 at 1:02 PM, Wes McKinney <wesmckinn at gmail.com>
> >>>>> >> wrote:
> >>>>> >> > Basically the approach is
> >>>>> >> >
> >>>>> >> > 1) Base dtype type
> >>>>> >> > 2) Base array type with K >= 1 dimensions
> >>>>> >> > 3) Base scalar type
> >>>>> >> > 4) Base index type
> >>>>> >> > 5) "Wrapper" subclasses for all NumPy types fitting into categories
> >>>>> >> > #1, #2, #3, #4
> >>>>> >> > 6) Subclasses for pandas-specific types like category, datetimeTZ, etc.
> >>>>> >> > 7) NDFrame as cpcloud wrote is just a list of these
> >>>>> >> >
> >>>>> >> > Indexes and axis labels / column names can get layered on top.
> >>>>> >> >
> >>>>> >> > After we do all this we can look at adding nested types (arrays, maps,
> >>>>> >> > structs) to better support JSON.
> >>>>> >> >
> >>>>> >> > - Wes
> >>>>> >> >
> >>>>> >> > On Tue, Dec 29, 2015 at 12:14 PM, Phillip Cloud <cpcloud at gmail.com>
> >>>>> >> > wrote:
> >>>>> >> >> Maybe this is saying the same thing as Wes, but how far would
> >>>>> >> >> something like this get us?
> >>>>> >> >>
> >>>>> >> >> // warning: things are probably not this simple
> >>>>> >> >>
> >>>>> >> >> struct data_array_t {
> >>>>> >> >>   void *primitive;               // scalar data
> >>>>> >> >>   data_array_t *nested;          // nested data
> >>>>> >> >>   boost::dynamic_bitset isnull;  // might have to create our own to avoid boost
> >>>>> >> >>   schema_t schema;               // not sure exactly what this looks like
> >>>>> >> >> };
> >>>>> >> >>
> >>>>> >> >> typedef std::map<std::string, data_array_t> data_frame_t;  // probably not this simple
> >>>>> >> >>
> >>>>> >> >> To answer Jeff's use-case question: I think that the use cases are
> >>>>> >> >> 1) freedom from numpy (mostly) 2) no more block manager which frees
> >>>>> >> >> us from the limitations of the block memory layout. In particular,
> >>>>> >> >> the ability to take advantage of memory-mapped IO would be a big win IMO.
> >>>>> >> >>
> >>>>> >> >>
> >>>>> >> >> On Tue, Dec 29, 2015 at 2:50 PM Wes McKinney <wesmckinn at gmail.com>
> >>>>> >> >> wrote:
> >>>>> >> >>>
> >>>>> >> >>> I will write a more detailed response to some of these things after
> >>>>> >> >>> the new year, but, in particular, re: missing values, can you or
> >>>>> >> >>> someone tell me why creating an object that contains a NumPy
> >>>>> >> >>> array and a bitmap is not sufficient? If we can add a lightweight
> >>>>> >> >>> C/C++ class layer between NumPy function calls (e.g. arithmetic)
> >>>>> >> >>> and pandas function calls, then I see no reason why we cannot have
> >>>>> >> >>>
> >>>>> >> >>> Int32Array->add
> >>>>> >> >>>
> >>>>> >> >>> and
> >>>>> >> >>>
> >>>>> >> >>> Float32Array->add
> >>>>> >> >>>
> >>>>> >> >>> do the right thing (the former would be responsible for bitmasking
> >>>>> >> >>> to propagate NA values; the latter would defer to NumPy). If we can
> >>>>> >> >>> put all the internals of pandas objects inside a black box, we can
> >>>>> >> >>> add layers of virtual function indirection without a performance
> >>>>> >> >>> penalty (whereas adding more interpreter overhead with more
> >>>>> >> >>> abstraction layers does add up to a perf penalty).
> >>>>> >> >>>
> >>>>> >> >>> I don't think this is too scary -- I would be willing to create a
> >>>>> >> >>> small POC C++ library to prototype something like what I'm talking
> >>>>> >> >>> about.
> >>>>> >> >>>
> >>>>> >> >>> Since pandas has limited points of contact with NumPy I don't think
> >>>>> >> >>> this would end up being too onerous.
> >>>>> >> >>>
> >>>>> >> >>> For the record, I'm pretty allergic to "advanced C++"; I think it is
> >>>>> >> >>> a useful tool; if you pick a sane 20% subset of the C++11 spec and
> >>>>> >> >>> follow Google C++ style, it's not very inaccessible to intermediate
> >>>>> >> >>> developers. More or less "C plus OOP and easier object lifetime
> >>>>> >> >>> management (shared/unique_ptr, etc.)". As soon as you add a lot of
> >>>>> >> >>> template metaprogramming, C++ library development quickly becomes
> >>>>> >> >>> inaccessible except to the C++-Jedi.
> >>>>> >> >>>
> >>>>> >> >>> Maybe let's start a Google document on "pandas roadmap" where we can
> >>>>> >> >>> break down the 1-2 year goals and some of these infrastructure
> >>>>> >> >>> issues and have our discussion there? (obviously publish this
> >>>>> >> >>> someplace once we're done)
> >>>>> >> >>>
> >>>>> >> >>> - Wes
> >>>>> >> >>>
> >>>>> >> >>> On Fri, Dec 25, 2015 at 2:14 PM, Jeff Reback
> >>>>> >> >>> wrote:
> >>>>> >> >>> > Here are some of my thoughts about pandas Roadmap / status and some
> >>>>> >> >>> > responses to Wes's thoughts.
> >>>>> >> >>> >
> >>>>> >> >>> > In the last few (and upcoming) major releases we have made the
> >>>>> >> >>> > following changes:
> >>>>> >> >>> >
> >>>>> >> >>> > - dtype enhancements (Categorical, Timedelta, Datetime w/tz) &
> >>>>> >> >>> > making these first class objects
> >>>>> >> >>> > - code refactoring to remove subclassing of ndarrays for Series & Index
> >>>>> >> >>> > - carving out / deprecating non-core parts of pandas
> >>>>> >> >>> >   - datareader
> >>>>> >> >>> >   - SparsePanel, WidePanel & other aliases (TimeSeries)
> >>>>> >> >>> >   - rpy, rplot, irow et al.
> >>>>> >> >>> >   - google-analytics
> >>>>> >> >>> > - API changes to make things more consistent
> >>>>> >> >>> >   - pd.rolling/expanding * -> .rolling/expanding (this is in master now)
> >>>>> >> >>> >   - .resample becoming fully deferred, like groupby
> >>>>> >> >>> >   - multi-index slicing along any level (obviates need for .xs) and
> >>>>> >> >>> >     allows assignment
> >>>>> >> >>> >   - .loc/.iloc - for the most part obviates use of .ix
> >>>>> >> >>> >   - .pipe & .assign
> >>>>> >> >>> >   - plotting accessors
> >>>>> >> >>> >   - fixing of the sorting API
> >>>>> >> >>> > - many performance enhancements both micro & macro (e.g. release GIL)
> >>>>> >> >>> >
> >>>>> >> >>> > Some on-deck enhancements are (meaning these are basically ready
> >>>>> >> >>> > to go in):
> >>>>> >> >>> > - IntervalIndex (and eventually make PeriodIndex just a sub-class
> >>>>> >> >>> > of this)
> >>>>> >> >>> > - RangeIndex
> >>>>> >> >>> >
> >>>>> >> >>> > so lots of changes, though nothing really earth-shaking, just more
> >>>>> >> >>> > convenience, reducing magicness somewhat and providing flexibility.
> >>>>> >> >>> >
> >>>>> >> >>> > Of course we are getting more and more issues, mostly bug reports
> >>>>> >> >>> > (and lots of dupes), some edge-case enhancements which add to the
> >>>>> >> >>> > existing APIs, and of course requests to expand the (already) large
> >>>>> >> >>> > codebase to other use cases.
> >>>>> >> >>> > Balancing this are a good many pull-requests from many different
> >>>>> >> >>> > users, some even deep into the internals.
> >>>>> >> >>> >
> >>>>> >> >>> > Here are some things that I have talked about and could be
> >>>>> >> >>> > considered for the roadmap. Disclaimer: I do work for Continuum
> >>>>> >> >>> > but these views are of course my own; furthermore I am obviously a
> >>>>> >> >>> > bit more familiar with some of the 'sponsored' open-source
> >>>>> >> >>> > libraries, but I am always open to new things.
> >>>>> >> >>> >
> >>>>> >> >>> > - integration / automatic deferral to numba for JIT (this would be
> >>>>> >> >>> > thru .apply)
> >>>>> >> >>> > - automatic deferral to dask from groupby where appropriate / maybe
> >>>>> >> >>> > a .to_parallel (to simply return a dask.DataFrame object)
> >>>>> >> >>> > - incorporation of quantities / units (as part of the dtype)
> >>>>> >> >>> > - use of DyND to allow missing values for int dtypes
> >>>>> >> >>> > - make Period a first class dtype.
> >>>>> >> >>> > - provide some copy-on-write semantics to alleviate the
> >>>>> >> >>> > chained-indexing issues which occasionally come up with misuse of
> >>>>> >> >>> > the indexing API
> >>>>> >> >>> > - allow a 'policy' to automatically provide column blocks for
> >>>>> >> >>> > dict-like input (e.g. each column would be a block); this would
> >>>>> >> >>> > allow a pass-thru API where you could put in numpy arrays where you
> >>>>> >> >>> > have views and have them preserved rather than copied automatically.
> >>>>> >> >>> > Note that this would also allow what I call 'split', where a passed-in
> >>>>> >> >>> > multi-dim numpy array could be split up into individual blocks (which
> >>>>> >> >>> > actually gives a nice perf boost after the splitting costs).
> >>>>> >> >>> >
> >>>>> >> >>> > In working towards some of these goals, I have come to the opinion
> >>>>> >> >>> > that it would make sense to have a neutral API protocol layer
> >>>>> >> >>> > that would allow us to swap out different engines as needed, for
> >>>>> >> >>> > particular dtypes, or *maybe* out-of-core type computations. E.g.
> >>>>> >> >>> > imagine that we replaced the in-memory block structure with a
> >>>>> >> >>> > bcolz / memmap type; in theory this should be 'easy' and just work.
> >>>>> >> >>> > I could also see us adopting *some* of the SFrame code to allow
> >>>>> >> >>> > easier interop with this API layer.
> >>>>> >> >>> >
> >>>>> >> >>> > In practice, I think a nice API layer would need to be created to
> >>>>> >> >>> > make this clean / nice.
> >>>>> >> >>> >
> >>>>> >> >>> > So this comes around to Wes's point about creating a c++ library
> >>>>> >> >>> > for the internals (and possibly even some of the indexing routines).
> >>>>> >> >>> > In an ideal world, of course this would be desirable. Getting there
> >>>>> >> >>> > is a bit non-trivial I think, and IMHO might not be worth the
> >>>>> >> >>> > effort. I don't really see big performance bottlenecks. We *already*
> >>>>> >> >>> > defer much of the computation to libraries like numexpr & bottleneck
> >>>>> >> >>> > (where appropriate). Adding numba / dask to the list would be helpful.
> >>>>> >> >>> >
> >>>>> >> >>> > I think that almost all performance issues are the result of:
> >>>>> >> >>> >
> >>>>> >> >>> > a) gross misuse of the pandas API.
How much code have you seen
> >>>>> >> >>> > that does df.apply(lambda x: x.sum())
> >>>>> >> >>> > b) routines which operate column-by-column rather than
> >>>>> >> >>> > block-by-block and are in python space (e.g. we have an issue right
> >>>>> >> >>> > now about .quantile)
> >>>>> >> >>> >
> >>>>> >> >>> > So I am glossing over a big goal of having a c++ library that
> >>>>> >> >>> > represents the pandas internals. This would by definition have a
> >>>>> >> >>> > c-API so that you *could* use pandas-like semantics in c/c++ and
> >>>>> >> >>> > just have it work (and then pandas would be a thin wrapper around
> >>>>> >> >>> > this library).
> >>>>> >> >>> >
> >>>>> >> >>> > I am not averse to this, but I think it would be quite a big
> >>>>> >> >>> > effort, and not a huge perf boost IMHO. Further there are a number
> >>>>> >> >>> > of API issues w.r.t. indexing which need to be clarified / worked
> >>>>> >> >>> > out (e.g. should we simply deprecate []) that are much easier to
> >>>>> >> >>> > test / figure out in python space.
> >>>>> >> >>> >
> >>>>> >> >>> > I also think that we have quite a large number of contributors.
> >>>>> >> >>> > Moving to c++ might make the internals a bit more impenetrable than
> >>>>> >> >>> > the current internals (though this would allow c++ people to
> >>>>> >> >>> > contribute, so that might balance out).
> >>>>> >> >>> >
> >>>>> >> >>> > We have a limited core of devs who right now are familiar with
> >>>>> >> >>> > things. If someone happened to have a starting base for a c++
> >>>>> >> >>> > library, then I might change my opinion here.
> >>>>> >> >>> >
> >>>>> >> >>> >
> >>>>> >> >>> > my 4c.
> >>>>> >> >>> >
> >>>>> >> >>> > Jeff
> >>>>> >> >>> >
> >>>>> >> >>> >
> >>>>> >> >>> >
> >>>>> >> >>> >
> >>>>> >> >>> > On Thu, Dec 24, 2015 at 7:18 PM, Wes McKinney
> >>>>> >> >>> > wrote:
> >>>>> >> >>> >>
> >>>>> >> >>> >> Deep thoughts during the holidays.
> >>>>> >> >>> >>
> >>>>> >> >>> >> I might be out of line here, but the interpreter-heaviness of the
> >>>>> >> >>> >> inside of pandas objects is likely to be a long-term liability and
> >>>>> >> >>> >> source of performance problems and technical debt.
> >>>>> >> >>> >>
> >>>>> >> >>> >> Has anyone put any thought into planning and beginning to execute
> >>>>> >> >>> >> on a rewrite that moves as much as possible of the internals into
> >>>>> >> >>> >> native / compiled code? I'm talking about:
> >>>>> >> >>> >>
> >>>>> >> >>> >> - pandas/core/internals
> >>>>> >> >>> >> - indexing and assignment
> >>>>> >> >>> >> - much of pandas/core/common
> >>>>> >> >>> >> - categorical and custom dtypes
> >>>>> >> >>> >> - all indexing mechanisms
> >>>>> >> >>> >>
> >>>>> >> >>> >> I'm concerned we've already exposed too much of the internals to
> >>>>> >> >>> >> users, so this might lead to a lot of API breakage, but it might
> >>>>> >> >>> >> be for the Greater Good.
As a first step, beginning a partial migration of
> >>>>> >> >>> >> internals into some C++ classes that encapsulate the insides of
> >>>>> >> >>> >> DataFrame objects and implement indexing and block-level
> >>>>> >> >>> >> manipulations would be a good place to start. I think you could do
> >>>>> >> >>> >> this without too much disruption.
> >>>>> >> >>> >>
> >>>>> >> >>> >> As part of this internal retooling we might give consideration to
> >>>>> >> >>> >> alternative data structures for representing data internal to
> >>>>> >> >>> >> pandas objects. Now in 2015/2016, continuing to be hamstrung by
> >>>>> >> >>> >> NumPy's limitations feels somewhat anachronistic. User code is
> >>>>> >> >>> >> riddled with workarounds for data type fidelity issues and the
> >>>>> >> >>> >> like. Like, really, why not add a bitndarray (similar to
> >>>>> >> >>> >> ilanschnell/bitarray) for storing nullness for problematic types
> >>>>> >> >>> >> and hide this from the user? =)
> >>>>> >> >>> >>
> >>>>> >> >>> >> Since we are now a NumFOCUS-sponsored project, I feel like we might
> >>>>> >> >>> >> consider establishing some formal governance over pandas and
> >>>>> >> >>> >> publishing roadmap documents describing plans for the project and
> >>>>> >> >>> >> meeting notes from committers. There's no real "committer culture"
> >>>>> >> >>> >> for NumFOCUS projects like there is with the Apache Software
> >>>>> >> >>> >> Foundation, but we might try leading by example!
> >>>>> >> >>> >>
> >>>>> >> >>> >> Also, I believe pandas as a project has reached a level of
> >>>>> >> >>> >> importance where we ought to consider planning and execution on
> >>>>> >> >>> >> larger scale undertakings such as this for safeguarding the future.
> >>>>> >> >>> >>
> >>>>> >> >>> >> As for myself, well, I have my hands full in Big Data-land. I wish
> >>>>> >> >>> >> I could be helping more with pandas, but there are quite a few
> >>>>> >> >>> >> fundamental issues (like data interoperability, nested data
> >>>>> >> >>> >> handling, and file format support, e.g. Parquet; see
> >>>>> >> >>> >>
> >>>>> >> >>> >> http://wesmckinney.com/blog/the-problem-with-the-data-science-language-wars/)
> >>>>> >> >>> >> preventing Python from being more useful in industry analytics
> >>>>> >> >>> >> applications.
> >>>>> >> >>> >>
> >>>>> >> >>> >> Aside: one of the bigger mistakes I made with pandas's API design
> >>>>> >> >>> >> was making it acceptable to call class constructors, like
> >>>>> >> >>> >> pandas.DataFrame, directly (versus factory functions). Sorry about
> >>>>> >> >>> >> that! If we could convince everyone to start writing
> >>>>> >> >>> >> pandas.data_frame or dataframe instead of using the class
> >>>>> >> >>> >> reference, it would help a lot with code cleanup. It's hard to
> >>>>> >> >>> >> plan for these things;
> NumPy > >>>>> >> >>> >> interoperability seemed a lot more important in 2008 than > it > >>>>> >> >>> >> does > >>>>> >> >>> >> now, > >>>>> >> >>> >> so I forgive myself. > >>>>> >> >>> >> > >>>>> >> >>> >> cheers and best wishes for 2016, > >>>>> >> >>> >> Wes > >>>>> >> >>> >> _______________________________________________ > >>>>> >> >>> >> Pandas-dev mailing list > >>>>> >> >>> >> Pandas-dev at python.org > >>>>> >> >>> >> https://mail.python.org/mailman/listinfo/pandas-dev > >>>>> >> >>> > > >>>>> >> >>> > > >>>>> >> >>> _______________________________________________ > >>>>> >> >>> Pandas-dev mailing list > >>>>> >> >>> Pandas-dev at python.org > >>>>> >> >>> https://mail.python.org/mailman/listinfo/pandas-dev > >>>>> >> _______________________________________________ > >>>>> >> Pandas-dev mailing list > >>>>> >> Pandas-dev at python.org > >>>>> >> https://mail.python.org/mailman/listinfo/pandas-dev > >>>>> > > >>>>> > > >>>>> > >>>>> > >>>>> _______________________________________________ > >>>>> Pandas-dev mailing list > >>>>> Pandas-dev at python.org > >>>>> https://mail.python.org/mailman/listinfo/pandas-dev > >>>>> > >>>> > >> > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wesmckinn at gmail.com Fri Jan 1 20:48:18 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Fri, 1 Jan 2016 17:48:18 -0800 Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap In-Reply-To: References: Message-ID: Thanks Jeff. Can you create and share a shared Drive folder containing this where I can put other auxiliary / follow up documents? On Fri, Jan 1, 2016 at 5:23 PM, Jeff Reback wrote: > I changed the doc so that the core dev people can edit. I *think* that > everyone should be able to view/comment though. > > On Fri, Jan 1, 2016 at 8:13 PM, Wes McKinney wrote: >> >> Jeff -- can you require log-in for editing on this document? >> >> https://docs.google.com/document/d/151ct8jcZWwh7XStptjbLsda6h2b3C0IuiH_hfZnUA58/edit# >> >> There are a number of anonymous edits. >> >> On Wed, Dec 30, 2015 at 6:04 PM, Wes McKinney wrote: >> > I cobbled together an ugly start of a c++->cython->pandas toolchain here >> > >> > https://github.com/wesm/pandas/tree/libpandas-native-core >> > >> > I used a mix of Kudu, Impala, and dynd-python cmake sources, so it's a >> > bit messy at the moment but it should be sufficient to run some real >> > experiments with a little more work. I reckon it's like a 6 month >> > project to tear out the insides of Series and DataFrame and replace it >> > with a new "native core", but we should be able to get enough info to >> > see whether it's a viable plan within a month or so. >> > >> > The end goal is to create "private" extension types in Cython that can >> > be the new base classes for Series and NDFrame; these will hold a >> > reference to a C++ object that contains wrappered NumPy arrays and >> > other metadata (like pandas-only dtypes). 
>> > >> > It might be too hard to try to replace a single usage of block manager >> > as a first experiment, so I'll try to create a minimal "SeriesLite" >> > that supports 3 dtypes >> > >> > 1) float64 with nans >> > 2) int64 with a bitmask for NAs >> > 3) category type for one of these >> > >> > Just want to get a feel for the extensibility and offer an NA >> > singleton Python object (a la None) for getting and setting NAs across >> > these 3 dtypes. >> > >> > If we end up going down this route, any way to place a moratorium on >> > invasive work on pandas internals (outside bug fixes)? >> > >> > Pedantic aside: I'd rather avoid shipping thirdparty C/C++ libraries >> > like googletest and friends in pandas if we can. Cloudera folks have >> > been working on a portable C++ library toolchain for Impala and other >> > projects at https://github.com/cloudera/native-toolchain, but it is >> > only being tested on Linux and OS X. Most google libraries should >> > build out of the box on MSVC but it'll be something to keep an eye on. >> > >> > BTW thanks to the libdynd developers for pioneering the c++ lib <-> >> > python-c++ lib <-> cython toolchain; being able to build Cython >> > extensions directly from cmake is a godsend >> > >> > HNY all >> > Wes >> > >> > On Tue, Dec 29, 2015 at 4:17 PM, Irwin Zaid wrote: >> >> Yeah, that seems reasonable and I totally agree a Pandas wrapper layer >> >> would >> >> be necessary. >> >> >> >> I'll keep an eye on this and I'd like to help if I can. >> >> >> >> Irwin >> >> >> >> >> >> On Tue, Dec 29, 2015 at 6:01 PM, Wes McKinney >> >> wrote: >> >>> >> >>> I'm not suggesting a rewrite of NumPy functionality but rather pandas >> >>> functionality that is currently written in a mishmash of Cython and >> >>> Python. >> >>> Happy to experiment with changing the internal compute infrastructure >> >>> and >> >>> data representation to DyND after this first stage of cleanup is done. >> >>> Even >> >>> if we use DyND a pretty extensive pandas wrapper layer will be >> >>> necessary. >> >>> >> >>> >> >>> On Tuesday, December 29, 2015, Irwin Zaid wrote: >> >>>> >> >>>> Hi Wes (and others), >> >>>> >> >>>> I've been following this conversation with interest. I do think it >> >>>> would >> >>>> be worth exploring DyND, rather than setting up yet another rewrite >> >>>> of >> >>>> NumPy-functionality. Especially because DyND is already an optional >> >>>> dependency of Pandas. >> >>>> >> >>>> For things like Integer NA and new dtypes, DyND is there and ready to >> >>>> do >> >>>> this. >> >>>> >> >>>> Irwin >> >>>> >> >>>> On Tue, Dec 29, 2015 at 5:18 PM, Wes McKinney >> >>>> wrote: >> >>>>> >> >>>>> Can you link to the PR you're talking about? >> >>>>> >> >>>>> I will see about spending a few hours setting up a libpandas.so as a >> >>>>> C++ >> >>>>> shared library where we can run some experiments and validate >> >>>>> whether it can >> >>>>> solve the integer-NA problem and be a place to put new data types >> >>>>> (categorical and friends). I'm +1 on targeting >> >>>>> >> >>>>> Would it also be worth making a wish list of APIs we might consider >> >>>>> breaking in a pandas 1.0 release that also features this new "native >> >>>>> core"? >> >>>>> Might as well right some wrongs while we're doing some invasive work >> >>>>> on the >> >>>>> internals; some breakage might be unavoidable. We can always >> >>>>> maintain a >> >>>>> pandas legacy 0.x.x maintenance branch (providing a conda binary >> >>>>> build) for >> >>>>> legacy users where showstopper bugs can get fixed. 
>> >>>>> On Tue, Dec 29, 2015 at 1:20 PM, Jeff Reback
>> >>>>> wrote:
>> >>>>> > Wes your last is noted as well. I *think* we can actually do this
>> >>>>> > now (well there is a PR out there).
>> >>>>> >
>> >>>>> > On Tue, Dec 29, 2015 at 4:12 PM, Wes McKinney
>> >>>>> > wrote:
>> >>>>> >>
>> >>>>> >> The other huge thing this will enable is copy-on-write for
>> >>>>> >> various kinds of views, which should cut down on some of the
>> >>>>> >> defensive copying in the library and reduce memory usage.
>> >>>>> >>
>> >>>>> >> On Tue, Dec 29, 2015 at 1:02 PM, Wes McKinney
>> >>>>> >> wrote:
>> >>>>> >> > Basically the approach is
>> >>>>> >> >
>> >>>>> >> > 1) Base dtype type
>> >>>>> >> > 2) Base array type with K >= 1 dimensions
>> >>>>> >> > 3) Base scalar type
>> >>>>> >> > 4) Base index type
>> >>>>> >> > 5) "Wrapper" subclasses for all NumPy types fitting into categories
>> >>>>> >> > #1, #2, #3, #4
>> >>>>> >> > 6) Subclasses for pandas-specific types like category, datetimeTZ, etc.
>> >>>>> >> > 7) NDFrame as cpcloud wrote is just a list of these
>> >>>>> >> >
>> >>>>> >> > Indexes and axis labels / column names can get layered on top.
>> >>>>> >> >
>> >>>>> >> > After we do all this we can look at adding nested types (arrays,
>> >>>>> >> > maps, structs) to better support JSON.
>> >>>>> >> >
>> >>>>> >> > - Wes
>> >>>>> >> >
>> >>>>> >> > On Tue, Dec 29, 2015 at 12:14 PM, Phillip Cloud
>> >>>>> >> > wrote:
>> >>>>> >> >> Maybe this is saying the same thing as Wes, but how far would
>> >>>>> >> >> something like this get us?
>> >>>>> >> >>
>> >>>>> >> >> // warning: things are probably not this simple
>> >>>>> >> >>
>> >>>>> >> >> struct data_array_t {
>> >>>>> >> >>   void *primitive;               // scalar data
>> >>>>> >> >>   data_array_t *nested;          // nested data
>> >>>>> >> >>   boost::dynamic_bitset isnull;  // might have to create our own to avoid boost
>> >>>>> >> >>   schema_t schema;               // not sure exactly what this looks like
>> >>>>> >> >> };
>> >>>>> >> >>
>> >>>>> >> >> typedef std::map<std::string, data_array_t> data_frame_t;  // probably not this simple
>> >>>>> >> >>
>> >>>>> >> >> To answer Jeff's use-case question: I think that the use cases are
>> >>>>> >> >> 1) freedom from numpy (mostly) 2) no more block manager which
>> >>>>> >> >> frees us from the limitations of the block memory layout. In
>> >>>>> >> >> particular, the ability to take advantage of memory-mapped IO
>> >>>>> >> >> would be a big win IMO.
>> >>>>> >> >>
>> >>>>> >> >>
>> >>>>> >> >> On Tue, Dec 29, 2015 at 2:50 PM Wes McKinney
>> >>>>> >> >> wrote:
>> >>>>> >> >>>
>> >>>>> >> >>> I will write a more detailed response to some of these things
>> >>>>> >> >>> after the new year, but, in particular, re: missing values, can
>> >>>>> >> >>> you or someone tell me why creating an object that contains a
>> >>>>> >> >>> NumPy array and a bitmap is not sufficient? If we can add a
>> >>>>> >> >>> lightweight C/C++ class layer between NumPy function calls (e.g.
arithmetic) and >> >>>>> >> >>> pandas >> >>>>> >> >>> function calls, then I see no reason why we cannot have >> >>>>> >> >>> >> >>>>> >> >>> Int32Array->add >> >>>>> >> >>> >> >>>>> >> >>> and >> >>>>> >> >>> >> >>>>> >> >>> Float32Array->add >> >>>>> >> >>> >> >>>>> >> >>> do the right thing (the former would be responsible for >> >>>>> >> >>> bitmasking to >> >>>>> >> >>> propagate NA values; the latter would defer to NumPy). If we >> >>>>> >> >>> can >> >>>>> >> >>> put >> >>>>> >> >>> all the internals of pandas objects inside a black box, we >> >>>>> >> >>> can >> >>>>> >> >>> add >> >>>>> >> >>> layers of virtual function indirection without a performance >> >>>>> >> >>> penalty >> >>>>> >> >>> (e.g. adding more interpreter overhead with more abstraction >> >>>>> >> >>> layers >> >>>>> >> >>> does add up to a perf penalty). >> >>>>> >> >>> >> >>>>> >> >>> I don't think this is too scary -- I would be willing to >> >>>>> >> >>> create a >> >>>>> >> >>> small POC C++ library to prototype something like what I'm >> >>>>> >> >>> talking >> >>>>> >> >>> about. >> >>>>> >> >>> >> >>>>> >> >>> Since pandas has limited points of contact with NumPy I don't >> >>>>> >> >>> think >> >>>>> >> >>> this would end up being too onerous. >> >>>>> >> >>> >> >>>>> >> >>> For the record, I'm pretty allergic to "advanced C++"; I >> >>>>> >> >>> think it >> >>>>> >> >>> is a >> >>>>> >> >>> useful tool if you pick a sane 20% subset of the C++11 spec >> >>>>> >> >>> and >> >>>>> >> >>> follow >> >>>>> >> >>> Google C++ style it's not very inaccessible to intermediate >> >>>>> >> >>> developers. More or less "C plus OOP and easier object >> >>>>> >> >>> lifetime >> >>>>> >> >>> management (shared/unique_ptr, etc.)". As soon as you add a >> >>>>> >> >>> lot >> >>>>> >> >>> of >> >>>>> >> >>> template metaprogramming C++ library development quickly >> >>>>> >> >>> becomes >> >>>>> >> >>> inaccessible except to the C++-Jedi. >> >>>>> >> >>> >> >>>>> >> >>> Maybe let's start a Google document on "pandas roadmap" where >> >>>>> >> >>> we >> >>>>> >> >>> can >> >>>>> >> >>> break down the 1-2 year goals and some of these >> >>>>> >> >>> infrastructure >> >>>>> >> >>> issues >> >>>>> >> >>> and have our discussion there? (obviously publish this >> >>>>> >> >>> someplace >> >>>>> >> >>> once >> >>>>> >> >>> we're done) >> >>>>> >> >>> >> >>>>> >> >>> - Wes >> >>>>> >> >>> >> >>>>> >> >>> On Fri, Dec 25, 2015 at 2:14 PM, Jeff Reback >> >>>>> >> >>> >> >>>>> >> >>> wrote: >> >>>>> >> >>> > Here are some of my thoughts about pandas Roadmap / status >> >>>>> >> >>> > and >> >>>>> >> >>> > some >> >>>>> >> >>> > responses to Wes's thoughts. >> >>>>> >> >>> > >> >>>>> >> >>> > In the last few (and upcoming) major releases we have been >> >>>>> >> >>> > made >> >>>>> >> >>> > the >> >>>>> >> >>> > following changes: >> >>>>> >> >>> > >> >>>>> >> >>> > - dtype enhancements (Categorical, Timedelta, Datetime >> >>>>> >> >>> > w/tz) & >> >>>>> >> >>> > making >> >>>>> >> >>> > these >> >>>>> >> >>> > first class objects >> >>>>> >> >>> > - code refactoring to remove subclassing of ndarrays for >> >>>>> >> >>> > Series >> >>>>> >> >>> > & >> >>>>> >> >>> > Index >> >>>>> >> >>> > - carving out / deprecating non-core parts of pandas >> >>>>> >> >>> > - datareader >> >>>>> >> >>> > - SparsePanel, WidePanel & other aliases (TImeSeries) >> >>>>> >> >>> > - rpy, rplot, irow et al. 
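To make the bitmask idea above concrete, here is a rough Python mock-up of the design being described (all names here -- IntArray, isnull -- are hypothetical and for illustration only; the real thing would be a C++ class behind a Cython wrapper):

    import numpy as np

    class IntArray:
        """Sketch: int64 values plus a validity mask (True = missing)."""
        def __init__(self, values, isnull=None):
            self.values = np.asarray(values, dtype=np.int64)
            if isnull is None:
                isnull = np.zeros(len(self.values), dtype=bool)
            self.isnull = isnull

        def add(self, other):
            # defer the arithmetic itself to NumPy, but OR the masks
            # together so NA propagates through the operation
            return IntArray(self.values + other.values,
                            self.isnull | other.isnull)

    a = IntArray([1, 2, 3], np.array([False, True, False]))
    b = IntArray([10, 20, 30])
    c = a.add(b)
    # c.values -> [11, 22, 33]; c.isnull -> [False, True, False]
    # i.e. integer NA without upcasting to float64

A float64 counterpart could skip the mask entirely and let NaN flow through NumPy unchanged, which is the "Float32Array->add defers to NumPy" case above.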
On Fri, Dec 25, 2015 at 2:14 PM, Jeff Reback wrote:

Here are some of my thoughts about the pandas Roadmap / status and some responses to Wes's thoughts.

In the last few (and upcoming) major releases we have made the following changes:

- dtype enhancements (Categorical, Timedelta, Datetime w/tz) & making these first class objects
- code refactoring to remove subclassing of ndarrays for Series & Index
- carving out / deprecating non-core parts of pandas
  - datareader
  - SparsePanel, WidePanel & other aliases (TimeSeries)
  - rpy, rplot, irow et al.
  - google-analytics
- API changes to make things more consistent
  - pd.rolling/expanding * -> .rolling/expanding (this is in master now)
  - .resample becoming a fully deferred operation, like groupby
  - multi-index slicing along any level (obviates need for .xs) and allows assignment
  - .loc/.iloc - for the most part obviates use of .ix (a short example follows after this list)
  - .pipe & .assign
  - plotting accessors
  - fixing of the sorting API
- many performance enhancements, both micro & macro (e.g. releasing the GIL)

Some on-deck enhancements (meaning these are basically ready to go in):

- IntervalIndex (and eventually make PeriodIndex just a sub-class of this)
- RangeIndex

So lots of changes, though nothing really earth-shaking -- just more convenience, reducing magicness somewhat, and providing flexibility.
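A short, hedged illustration of the .ix ambiguity that .loc/.iloc resolve (behavior as of the 0.17-era releases):

    import pandas as pd

    s = pd.Series(['a', 'b', 'c'], index=[2, 3, 5])

    s.loc[2]    # 'a' -- always label-based
    s.iloc[2]   # 'c' -- always position-based
    s.ix[2]     # 'a' here, because .ix falls back to labels for integer
                # indexes, but is position-based for non-integer indexes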
Of course we are getting an increasing number of issues, mostly bug reports (and lots of dupes), some edge-case enhancements which can add to the existing APIs, and of course requests to expand the (already) large code base to other use cases. Balancing this are a good many pull requests from many different users, some even deep into the internals.

Here are some things that I have talked about and could be considered for the roadmap. Disclaimer: I do work for Continuum, but these views are of course my own; furthermore, obviously I am a bit more familiar with some of the 'sponsored' open-source libraries, but always open to new things.

- integration / automatic deferral to numba for JIT (this would be thru .apply)
- automatic deferral to dask from groupby where appropriate / maybe a .to_parallel (to simply return a dask.DataFrame object)
- incorporation of quantities / units (as part of the dtype)
- use of DyND to allow missing values for int dtypes
- make Period a first class dtype
- provide some copy-on-write semantics to alleviate the chained-indexing issues which occasionally come up with the misuse of the indexing API
- allow a 'policy' to automatically provide column blocks for dict-like input (e.g. each column would be a block); this would allow a pass-thru API where you could put in numpy arrays where you have views and have them preserved rather than copied automatically. Note that this would also allow what I call 'split', where a passed-in multi-dim numpy array could be split up into individual blocks (which actually gives a nice perf boost after the splitting costs).

In working towards some of these goals, I have come to the opinion that it would make sense to have a neutral API protocol layer that would allow us to swap out different engines as needed, for particular dtypes, or *maybe* out-of-core type computations. E.g. imagine that we replaced the in-memory block structure with a bcolz / memmap type; in theory this should be 'easy' and just work. I could also see us adopting *some* of the SFrame code to allow easier interop with this API layer.

In practice, I think a nice API layer would need to be created to make this clean / nice.

So this comes around to Wes's point about creating a C++ library for the internals (and possibly even some of the indexing routines). In an ideal world, of course this would be desirable. Getting there is a bit non-trivial I think, and IMHO might not be worth the effort. I don't really see big performance bottlenecks. We *already* defer much of the computation to libraries like numexpr & bottleneck (where appropriate). Adding numba / dask to the list would be helpful.

I think that almost all performance issues are the result of:

a) gross misuse of the pandas API. How much code have you seen that does df.apply(lambda x: x.sum())? (a short example follows after this list)
b) routines which operate column-by-column rather than block-by-block and are in python space (e.g. we have an issue right now about .quantile)
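A concrete instance of the misuse in (a) -- both lines below compute the same per-column sums, but the first pushes every column through a Python-level function call while the second stays in compiled code:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.randn(100000, 4), columns=list('abcd'))

    df.apply(lambda x: x.sum())   # python-space loop over the columns
    df.sum()                      # the idiomatic, vectorized equivalent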
So I am glossing over a big goal of having a C++ library that represents the pandas internals. This would by definition have a C-API, so that you *could* use pandas-like semantics in C/C++ and just have it work (and then pandas would be a thin wrapper around this library).

I am not averse to this, but I think it would be quite a big effort, and not a huge perf boost IMHO. Further, there are a number of API issues w.r.t. indexing which need to be clarified / worked out (e.g. should we simply deprecate []?) that are much easier to test / figure out in python space.

I also think that we have quite a large number of contributors. Moving to C++ might make the internals a bit more impenetrable than the current internals (though this would allow C++ people to contribute, so that might balance out).

We have a limited core of devs who right now are familiar with things. If someone happened to have a starting base for a C++ library, then I might change opinions here.

my 4c.

Jeff

On Thu, Dec 24, 2015 at 7:18 PM, Wes McKinney wrote:

Deep thoughts during the holidays.

I might be out of line here, but the interpreter-heaviness of the inside of pandas objects is likely to be a long-term liability and source of performance problems and technical debt.

Has anyone put any thought into planning and beginning to execute on a rewrite that moves as much as possible of the internals into native / compiled code? I'm talking about:

- pandas/core/internals
- indexing and assignment
- much of pandas/core/common
- categorical and custom dtypes
- all indexing mechanisms

I'm concerned we've already exposed too much internals to users, so this might lead to a lot of API breakage, but it might be for the Greater Good.
As a first step, beginning a partial migration of internals into some C++ classes that encapsulate the insides of DataFrame objects and implement indexing and block-level manipulations would be a good place to start. I think you could do this without too much disruption.

As part of this internal retooling we might give consideration to alternative data structures for representing data internal to pandas objects. Now in 2015/2016, continuing to be hamstrung by NumPy's limitations feels somewhat anachronistic. User code is riddled with workarounds for data type fidelity issues and the like. Like, really, why not add a bitndarray (similar to ilanschnell/bitarray) for storing nullness for problematic types and hide this from the user? =)

Since we are now a NumFOCUS-sponsored project, I feel like we might consider establishing some formal governance over pandas and publishing roadmap documents describing plans for the project and meeting notes from committers. There's no real "committer culture" for NumFOCUS projects like there is with the Apache Software Foundation, but we might try leading by example!

Also, I believe pandas as a project has reached a level of importance where we ought to consider planning and executing larger-scale undertakings such as this for safeguarding the future.

As for myself, well, I have my hands full in Big Data-land. I wish I could be helping more with pandas, but there are quite a few fundamental issues (like data interoperability, nested data handling, and file format support -- e.g. Parquet, see http://wesmckinney.com/blog/the-problem-with-the-data-science-language-wars/) preventing Python from being more useful in industry analytics applications.

Aside: one of the bigger mistakes I made with pandas's API design was making it acceptable to call class constructors -- like pandas.DataFrame -- directly (versus factory functions). Sorry about that!
If we could convince everyone to start writing pandas.data_frame or dataframe instead of using the class reference, it would help a lot with code cleanup. It's hard to plan for these things -- NumPy interoperability seemed a lot more important in 2008 than it does now, so I forgive myself.

cheers and best wishes for 2016,
Wes

From jeffreback at gmail.com Fri Jan 1 21:06:35 2016
From: jeffreback at gmail.com (Jeff Reback)
Date: Fri, 1 Jan 2016 21:06:35 -0500
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: References: Message-ID:

ok I moved the document to the Pandas folder, where the same group should be able to edit/upload/etc.

lmk if any issues

On Fri, Jan 1, 2016 at 8:48 PM, Wes McKinney wrote:
> Thanks Jeff. Can you create and share a shared Drive folder containing
> this where I can put other auxiliary / follow up documents?
>
> On Fri, Jan 1, 2016 at 5:23 PM, Jeff Reback wrote:
> > I changed the doc so that the core dev people can edit. I *think* that
> > everyone should be able to view/comment though.
From wesmckinn at gmail.com Sun Jan 3 14:41:17 2016
From: wesmckinn at gmail.com (Wes McKinney)
Date: Sun, 3 Jan 2016 11:41:17 -0800
Subject: [Pandas-dev] pandas 0.18.x and pandas 1.0 roadmap
Message-ID:

Per discussions we've been having here

https://docs.google.com/document/d/151ct8jcZWwh7XStptjbLsda6h2b3C0IuiH_hfZnUA58/edit?ts=568725eb#heading=h.qm48l6dargmd

I started this document to solicit a high-level plan for the last 0.x release, and a place where we can develop a plan for what will become pandas 1.0:

https://docs.google.com/document/d/1K3uVluD9qNn9nLp6oRjIwP7qillysw820wfulJY3BiU/edit#

Let me know what you think of this idea -- I'll have more bandwidth this year to be involved, and I'm starting to look at what a 2nd edition of Python for Data Analysis should look like.

Relatedly: I'm assembling enough basic plumbing so that I can give you all a demo of how the libpandas C/C++ native core will help us better hide implementation details and fix problems like integer/boolean missing data in a clean and extensible way. It will also help establish a pattern for adding new data types to pandas (which may or may not rely on NumPy). I'll follow up about it when I get a bit more stuff working; it will probably take me a few more days at least.

thanks!
Wes

From jorisvandenbossche at gmail.com Mon Jan 4 18:30:38 2016
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Tue, 5 Jan 2016 00:30:38 +0100
Subject: [Pandas-dev] pandas 0.18.x and pandas 1.0 roadmap
In-Reply-To: References: Message-ID:

Hi all,

Interesting discussions!

My expertise as a pandas contributor is not really in the core part, so I cannot really comment on that. But for me, as we think of a pandas 1.0, a possible clean-up of the existing user-facing API is an important aspect to discuss, I think (regardless of a clean-up and rewrite of the internals, as this should not affect too much of the existing API, apart from new features) -- in the light of how to keep (or improve on) pandas easy to learn, clear to understand, consistent, and yet powerful.
There are some points listed in the Pandas Development Roadmap under 'pandas 1.0', coming from https://github.com/pydata/pandas/issues/10000, but possibly other points as well.

Probably the most prominent example is the indexing API, and specifically [] / __getitem__. Some time ago I made an overview of some of the warts that have grown over time: https://github.com/pydata/pandas/issues/9595. I don't say we have to change something about this (because it will break a lot of existing code), but we should at least discuss it a bit more thoroughly, I think.
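As a small illustration of how overloaded [] currently is (all of this is current, documented behavior):

    import pandas as pd

    df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

    df['a']          # a string selects a column
    df[['a', 'b']]   # a list of strings selects several columns
    df[0:2]          # but a slice selects *rows*
    df[df['a'] > 1]  # and a boolean Series filters rows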
As for the timeline, I like the idea of limiting the number of releases for the 0.x line. Maybe we will want to do a 0.19.x as well (e.g. to introduce some features to ease the transition to 1.0), or depending on how long it takes to shape up 1.0, but that is something that can be discussed later if it comes up, I think.

Regards,
Joris

From jeffreback at gmail.com Mon Jan 4 19:36:45 2016
From: jeffreback at gmail.com (Jeff Reback)
Date: Mon, 4 Jan 2016 19:36:45 -0500
Subject: [Pandas-dev] pandas 0.18.x and pandas 1.0 roadmap
In-Reply-To: References: Message-ID:

I agree with Joris on the schedule a bit. We have been putting out majors every 3-4 months and then a minor, so I would expect 0.18.0 in, say, February, then 0.18.1 in March. Could see 0.19.0 in the summer, then 1.0 in the fall (and we can use 0.19.x to road-test some things).

I also believe any internals changes can be achieved with limited compat breaks. I don't think anyone is proposing a big break / incompatible 1.0, which IMHO would just cause fragmentation and generally not be a good thing.

Certainly we can make major changes, but we have been pretty liberal about deprecations (though not so about removing prior deprecations), so this would also be a good time for that.

my 3c

Jeff

From wesmckinn at gmail.com Mon Jan 4 20:31:58 2016
From: wesmckinn at gmail.com (Wes McKinney)
Date: Mon, 4 Jan 2016 17:31:58 -0800
Subject: [Pandas-dev] pandas 0.18.x and pandas 1.0 roadmap
In-Reply-To: References: Message-ID:

This all makes sense. I guess there are two major areas for pandas 1.0:

- User API cleanup
- Internal cleanup

In both cases, we'll want to make sure we can maintain a pandas-1.0 branch that is regularly rebased on master in a way that is not too painful to keep up.

How about, to keep ourselves sane, we make separate roadmaps for the user API and the internals, and we can loudly mark places where there is crossover (for example: data type improvements that are user-visible, or changes in data copying semantics / copy-on-write).

As Jeff said, if we're doing it right, then the internals revamp shouldn't affect the user API work all that much. Since the idea is that it would fix various "warts" (like reindexing integers or booleans causing upcasts to occur), we'll want to collect all the affected test cases in one place, partly as a record of which APIs are effectively broken (e.g. I'm sure some users have a lot of code that assumes that reindexing an integer series results in floating-point output).
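For instance, the integer-reindexing wart just mentioned, as it behaves today:

    import pandas as pd

    s = pd.Series([1, 2, 3])     # dtype: int64
    s2 = s.reindex([0, 1, 4])    # label 4 does not exist, so a hole appears
    s2.dtype                     # float64 -- upcast so the hole can be NaN
    s2.values                    # array([ 1.,  2., nan])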
How about, to keep ourselves sane, we make separate roadmaps for the user API and the internals, and we can loudly mark places where there is crossover (for example: data type improvements that are user visible, or changes in data copying semantics / copy-on-write). As Jeff said, if we're doing it right, then the internals revamp shouldn't affect the user API work all that much. Since the idea is that it would fix various "warts" (like reindexing integers or booleans causing upcasts to occur), we'll want to collect all the affected test cases in one place partly as a record of what APIs are effectively broken (e.g. I'm sure some users have a lot of code that assumes that reindexing an integer series results in floating point output). Within the next couple weeks I'll try to make a compelling case for decommissioning the current BlockManager internals of Series and DataFrame in favor of much simpler Array and Table data structures implemented as C++ classes (with Cython wrappers, where Python glue and conveniences can live). A major part of this is inserting a "wrapper layer" in between NumPy and pandas that makes pandas less dependent on NumPy-specific implementation details. While this might seem scary, we already have an extensive NumPy wrapper layer between pandas.core.common and pandas.core.internals. So functions like common._maybe_promote will go away. This will also be a good time to review and clean up a lot of the existing Cython code. It will be really nice for Series and DataFrame to have a C API -- at some point we can figure out how to enable outside projects to access the C API. I presume Jeff and I will take responsibility for the internals overhaul -- anyone else been hacking around in there want to get down in the trenches? Joris, do you want to take point on the user API roadmapping / planning? cheers, Wes On Mon, Jan 4, 2016 at 4:36 PM, Jeff Reback wrote: > I agree with joris on schedule a bit. We have been putting out majors every > 3-4 months and then a minor. So I would expect 0.18.0 say in februrary, then > 0.18.1 march. Could see 0.19.0 in the summer, Then 1.0 in the fall (and can > use 0.19. to road test some things). > > I also believe any internals changes can be achieved with limited compat > breaks. I don't think anyone is proposing a big-break / incompat for 1.0, > which IMHO would > just cause fragmentation and generally not be a good thing. > > Certainly we can make major changes, but we have been pretty liberal about > deprecations (though not so about removing prior deprecations). So this > would also be a good time for that. > > my 3c > > Jeff > > On Mon, Jan 4, 2016 at 6:30 PM, Joris Van den Bossche > wrote: >> >> Hi all, >> >> Interesting discussions! >> >> My expertise as pandas contributor is not really in the core part, so I >> cannot really comment on that. But for me, as we think of a pandas 1.0, a >> possible clean-up of the existing user facing API is an important aspect to >> discuss I think (regardless of a clean-up and rewrite of the internals, as >> this should not affect too much of the existing API? (apart from new >> features)). >> In the light of how to keep (or improve on) pandas easy to learn, clear to >> understand, consistent and yet powerful. >> >> There are some points listed in the Pandas Development Roadmap under >> 'pandas 1.0', coming from https://github.com/pydata/pandas/issues/10000, but >> possibly other points as well. >> >> Probably the most prominent example is the indexing API, and specifically >> [] / __getitem__.
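(To make the Array / Table idea above a little more concrete: here is a minimal, purely illustrative Python sketch of a block-manager-free data model. The names PandasArray and Table are hypothetical stand-ins, not the actual design -- the real core would be C++ classes behind Cython wrappers.)

    import numpy as np
    from collections import OrderedDict

    class PandasArray(object):
        # hypothetical 1-D typed array: a NumPy array plus room for
        # pandas-specific metadata (type objects, null masks, ...)
        def __init__(self, values):
            self.values = np.asarray(values)
            if self.values.ndim != 1:
                raise ValueError("PandasArray is strictly one-dimensional")

        def __len__(self):
            return len(self.values)

    class Table(object):
        # hypothetical 2-D container: an ordered mapping of named 1-D
        # arrays, with no 2-D blocks and no dtype-based consolidation
        def __init__(self, columns):
            self.columns = OrderedDict(columns)

        def insert(self, name, arr):
            # O(1) column addition: no consolidation, no copying of the
            # other columns
            self.columns[name] = arr

        def drop(self, name):
            # O(1) column removal, again without copying anything
            del self.columns[name]

        def reorder(self, names):
            # reordering only permutes references, never data
            self.columns = OrderedDict((n, self.columns[n]) for n in names)

    t = Table([('a', PandasArray([1, 2, 3])),
               ('b', PandasArray([4.0, 5.0, 6.0]))])
    t.insert('c', PandasArray(['x', 'y', 'z']))
    t.drop('a')
    t.reorder(['c', 'b'])
    print(list(t.columns))  # ['c', 'b']

(The only point of the sketch is that column-level operations never touch the values of other columns, which is where much of the defensive copying in the current internals comes from.)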
Some time ago I made an overview of some of its warts that >> have grown over time: https://github.com/pydata/pandas/issues/9595 >> I don't say we have to change something about this (because it will break >> a lot of existing code), but we should at least discuss it a bit more >> thoroughly I think. >> >> >> As for the timeline, I like the idea of limiting the number of releases >> for the 0.x line. Maybe we will like to do a 0.19.x as well (eg to introduce >> some features to improve the transition to 1.0), or depending on how long it >> takes to shape up 1.0, but that is something that can be discussed later if >> that comes up I think. >> >> Regards, >> Joris >> >> >> 2016-01-03 20:41 GMT+01:00 Wes McKinney : >>> >>> Per discussions we've been having here >>> >>> >>> https://docs.google.com/document/d/151ct8jcZWwh7XStptjbLsda6h2b3C0IuiH_hfZnUA58/edit?ts=568725eb#heading=h.qm48l6dargmd >>> >>> I started this document to solicit a high level plan for the last 0.x >>> release and where we can develop a plan for what will become pandas >>> 1.0 >>> >>> >>> https://docs.google.com/document/d/1K3uVluD9qNn9nLp6oRjIwP7qillysw820wfulJY3BiU/edit# >>> >>> Let me know what you think of this idea -- I'll have more bandwidth >>> this year to be involved and I'm starting to look at what a 2nd ed of >>> Python for Data Analysis should look like. >>> >>> Relatedly: I'm assembling enough basic plumbing so that I can give you >>> all a demo of how the libpandas / C/C++ native core will help us >>> better hide implementation details and fix problems like >>> integer/boolean missing data in a clean and extensible way. It will >>> also help establish a pattern for adding new data types to pandas >>> (which may or may not rely on NumPy). I'll follow up about it when I >>> get a bit more stuff working; probably take me a few more days at >>> least. >>> >>> thanks! >>> Wes >>> _______________________________________________ >>> Pandas-dev mailing list >>> Pandas-dev at python.org >>> https://mail.python.org/mailman/listinfo/pandas-dev >> >> >> >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> >
From jeffreback at gmail.com Mon Jan 4 21:21:34 2016 From: jeffreback at gmail.com (Jeff Reback) Date: Mon, 4 Jan 2016 21:21:34 -0500 Subject: [Pandas-dev] GitHub/pandas Message-ID: any thoughts on claiming the pandas org in GitHub? (it's an inactive username so I think we could claim it) iow have the main repo be: pandas/pandas. could make sense for spinoffs, e.g. pandas-datareader, as well. xarray just moved to: PyData/xarray (so somewhat unified now). PyData isn't really used by anything other than pandas (except numexpr) and a number of older / much less active repos. the con on this is that pandas has existed for quite a long time and is known well as PyData/pandas; furthermore I don't think pandas.org is available. pro is that the future is much longer than the past! (same rationale as in making API breaks!) Jeff I can be reached on my cell 917-971-6387
From wesmckinn at gmail.com Mon Jan 4 21:25:51 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Mon, 4 Jan 2016 18:25:51 -0800 Subject: [Pandas-dev] GitHub/pandas In-Reply-To: References: Message-ID: I actually just contacted GitHub about this today. It's not inactive but I'm going to write a plea to the owner to see if they will transfer it to us. I'll let you know.
On Mon, Jan 4, 2016 at 6:21 PM, Jeff Reback wrote: > any thoughts on claiming the > > pandas org in GitHub (it's an inactive username so I think we could claim it) > > iow have the main repo be: pandas/pandas > > could make sense for spinoffs > eg pandas-datareader as well > > xarray just moved to: PyData/xarray > (so somewhat unified now) > > PyData isn't really used by others that pandas (except numexpr) and a number of older / much less active repos > > the con on this is that pandas has existed for quite a long time and is known well as PyData/pandas. furthermore I don't think pandas.org is available > > pro is that the future is much longer than the past! (same rationale as in making API breaks!) > > Jeff > > > > I can be reached on my cell 917-971-6387 > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev
From wesmckinn at gmail.com Tue Jan 5 13:15:51 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Tue, 5 Jan 2016 10:15:51 -0800 Subject: [Pandas-dev] pandas governance Message-ID: hi folks, I'm sorry I didn't do this 2 or 3 years ago when I first handed over release management responsibilities to Jeff, y-p and others, but it would be good for us to formalize the project governance like most other major open source projects. See IPython / Jupyter for an example set of governance documents https://github.com/jupyter/governance I don't have particular concerns over the project's direction and decision making procedure, but as I've had several people raise private concerns with me over the last few years, I think it would be good for the community to have a set of public documents on GitHub that lists people and process in simple terms. This is especially important now that we can receive financial sponsorship through NumFOCUS, so that sponsored contributions are subject to the same community process as volunteer contributions. A basic summary of how we've been informally operating is: Project committers (as will be defined and listed in the governance documents) make decisions based on consensus; in the absence of consensus (which has rarely occurred) I will reserve tie-breaking / BDFL privileges. I don't recall ever having to put on the BDFL hat but it's the "just in case" should we reach some impasse down the road. I can take a crack at assembling something based on the IPython governance docs if that sounds good. At the end of the day, an OSS project is only as strong as the individuals committing code and reviewing patches. As pandas will be 8 years old in April, with 6 years as open source, I think we have a good track record of consensus-, common-sense-, and fact/evidence-driven decision making. best, Wes
From jorisvandenbossche at gmail.com Tue Jan 5 19:29:23 2016 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Wed, 6 Jan 2016 01:29:23 +0100 Subject: [Pandas-dev] pandas governance In-Reply-To: References: Message-ID: Sounds very good! Certainly now we are a NumFOCUS supported project (and have to deal with financial things), I think this is important to do. 2016-01-05 19:15 GMT+01:00 Wes McKinney : > hi folks, > > I'm sorry I didn't do this 2 or 3 years ago when I first handed over > release management responsibilities to Jeff, y-p and others, but it > would be good for us to formalize the project governance like most > other major open source projects.
See IPython / Jupyter for an example > set of governance documents > > https://github.com/jupyter/governance > > NumPy also recently adopted a governance document, based on the Jupyter one: http://docs.scipy.org/doc/numpy-dev/dev/governance/governance.html and https://github.com/numpy/numpy/pull/6352. Maybe also worth a look (although I don't know exactly what they changed from the Jupyter one). > I don't have particular concerns over the project's direction and > decision making procedure, but as I've had several people raise > private concerns with me over the last few years, I think it would be > good for the community to have a set of public documents on GitHub > that lists people and process in simple terms. This is especially > important now that we can receive financial sponsorship through > NumFOCUS, so that sponsored contributions are subject to the same > community process as volunteer contributions. > > A basic summary of how we've been informally operating is: Project > committers (as will be defined and listed in the governance documents) > make decisions based on consensus; in the absence of consensus (which > has rarely occurred) I will reserve tie-breaking / BDFL privileges. I > don't recall having ever having to put on the BDFL hat but it's the > "just in case" should we reach some impasse down the road. > > Sounds good! > I can take a crack at assembling something based on the IPython > governance docs if that sounds good. > > At the end of the day, an OSS project is only as strong as the > individuals committing code and reviewing patches. As pandas will be 8 > years old in April, with 6 years as open source, I think we have a > good track record of consensus-, common-sense-, and > fact/evidence-driven decision making. > > best, > Wes > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev >
From jeffreback at gmail.com Wed Jan 6 08:50:49 2016 From: jeffreback at gmail.com (Jeff Reback) Date: Wed, 6 Jan 2016 08:50:49 -0500 Subject: [Pandas-dev] pandas governance In-Reply-To: References: Message-ID: yes on board with this as well. We do have a fiscal governance document w.r.t. NumFOCUS. That should at the very least be referenced by the governance docs. Certainly starting with the Jupyter docs is a good thing. I don't think we will have the long-long-long discussion that numpy had about the steering committee representation :) Jeff On Tue, Jan 5, 2016 at 7:29 PM, Joris Van den Bossche < jorisvandenbossche at gmail.com> wrote: > Sounds very good! > > Certainly now we are a NumFOCUS supported project (and have to deal with > financial things), I think this is important to do. > > 2016-01-05 19:15 GMT+01:00 Wes McKinney : > >> hi folks, >> >> I'm sorry I didn't do this 2 or 3 years ago when I first handed over >> release management responsibilities to Jeff, y-p and others, but it >> would be good for us to formalize the project governance like most >> other major open source projects. See IPython / Jupyter for an example >> set of governance documents >> >> https://github.com/jupyter/governance >> >> Numpy also recently adopted a goverance document, based on the Jupyter >> one: http://docs.scipy.org/doc/numpy-dev/dev/governance/governance.html >> and https://github.com/numpy/numpy/pull/6352. >> Maybe also worth a look (although I don't know what they exactly changed >> from the Jupyter one).
> > >> I don't have particular concerns over the project's direction and >> decision making procedure, but as I've had several people raise >> private concerns with me over the last few years, I think it would be >> good for the community to have a set of public documents on GitHub >> that lists people and process in simple terms. This is especially >> important now that we can receive financial sponsorship through >> NumFOCUS, so that sponsored contributions are subject to the same >> community process as volunteer contributions. >> >> A basic summary of how we've been informally operating is: Project >> committers (as will be defined and listed in the governance documents) >> make decisions based on consensus; in the absence of consensus (which >> has rarely occurred) I will reserve tie-breaking / BDFL privileges. I >> don't recall having ever having to put on the BDFL hat but it's the >> "just in case" should we reach some impasse down the road. >> >> Sounds good! >> >> >> I can take a crack at assembling something based on the IPython >> governance docs if that sounds good. >> >> At the end of the day, an OSS project is only as strong as the >> individuals committing code and reviewing patches. As pandas will be 8 >> years old in April, with 6 years as open source, I think we have a >> good track record of consensus-, common-sense-, and >> fact/evidence-driven decision making. >> >> best, >> Wes >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > >
From shoyer at gmail.com Wed Jan 6 13:11:55 2016 From: shoyer at gmail.com (Stephan Hoyer) Date: Wed, 6 Jan 2016 10:11:55 -0800 Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap In-Reply-To: References: Message-ID: I was asked about this off list, so I'll belatedly share my thoughts. First of all, I am really excited by Wes's renewed engagement in the project and his interest in rewriting pandas internals. This is quite an ambitious plan and nobody is better positioned to tackle it than Wes. I have mixed feelings about the details of the rewrite itself. +1 on the simpler internal data model. The block manager is confusing and leads to hard-to-predict performance issues related to copying data. If we can do all column additions/removals/re-orderings without a copy it will be a clear win. +0 on moving internals to C++. I do like the performance benefits, but it seems like a lot of work, and it may make pandas less friendly to new contributors. -0 on writing a brand new dtype system just for pandas -- this stuff really belongs in NumPy (or another array library like DyND), and I am skeptical that pandas can do a complete enough job to be useful without replicating all that functionality. More broadly, I am concerned that this rewrite may improve the tabular computation ecosystem at the cost of inter-operability with the array-based ecosystem (numpy, scipy, sklearn, xarray, etc.). The latter has been one of the strengths of pandas and it would be a shame to see that go away. We're already starting to struggle with inter-operability with the new pandas dtypes and a further rewrite would make this even harder.
For example, see categoricals and scikit-learn in Tom's recent post [1], or the fact that .values no longer always returns a numpy array. This has also been a challenge for xarray, which can't handle these new dtypes because we lack a suitable array backend for them. Personally, I would much rather leverage a full featured library like an improved NumPy or DyND for new dtypes, because that could also be used by the array-based ecosystem. At the very least, it would be good to think about zero-copy inter-operability with array-based tools. On the other hand, I wonder if maybe it would be better to write a native in-memory backend for Ibis instead of rewriting pandas. Ibis does seem to have improved/simplified API which resolves many of pandas's warts. That said, it's a pretty big change from the "DataFrame as matrix" model, and pandas won't be going away anytime soon. I do like that it would force users to be more explicit about converting between tables and arrays, which might also make distinctions between the tabular and array oriented ecosystems easier to swallow. Just my two cents, from someone who has lots of opinions but who will likely stay on the sidelines for most of this work. Cheers, Stephan [1] http://tomaugspurger.github.io/categorical-pipelines.html On Fri, Jan 1, 2016 at 6:06 PM, Jeff Reback wrote: > ok I moved the document to the Pandas folder, where the same group should > be able to edit/upload/etc. lmk if any issues > > On Fri, Jan 1, 2016 at 8:48 PM, Wes McKinney wrote: > >> Thanks Jeff. Can you create and share a shared Drive folder containing >> this where I can put other auxiliary / follow up documents? >> >> On Fri, Jan 1, 2016 at 5:23 PM, Jeff Reback wrote: >> > I changed the doc so that the core dev people can edit. I *think* that >> > everyone should be able to view/comment though. >> > >> > On Fri, Jan 1, 2016 at 8:13 PM, Wes McKinney >> wrote: >> >> >> >> Jeff -- can you require log-in for editing on this document? >> >> >> >> >> https://docs.google.com/document/d/151ct8jcZWwh7XStptjbLsda6h2b3C0IuiH_hfZnUA58/edit# >> >> >> >> There are a number of anonymous edits. >> >> >> >> On Wed, Dec 30, 2015 at 6:04 PM, Wes McKinney >> wrote: >> >> > I cobbled together an ugly start of a c++->cython->pandas toolchain >> here >> >> > >> >> > https://github.com/wesm/pandas/tree/libpandas-native-core >> >> > >> >> > I used a mix of Kudu, Impala, and dynd-python cmake sources, so it's >> a >> >> > bit messy at the moment but it should be sufficient to run some real >> >> > experiments with a little more work. I reckon it's like a 6 month >> >> > project to tear out the insides of Series and DataFrame and replace >> it >> >> > with a new "native core", but we should be able to get enough info to >> >> > see whether it's a viable plan within a month or so. >> >> > >> >> > The end goal is to create "private" extension types in Cython that >> can >> >> > be the new base classes for Series and NDFrame; these will hold a >> >> > reference to a C++ object that contains wrappered NumPy arrays and >> >> > other metadata (like pandas-only dtypes). 
>> >> > >> >> > It might be too hard to try to replace a single usage of block >> manager >> >> > as a first experiment, so I'll try to create a minimal "SeriesLite" >> >> > that supports 3 dtypes >> >> > >> >> > 1) float64 with nans >> >> > 2) int64 with a bitmask for NAs >> >> > 3) category type for one of these >> >> > >> >> > Just want to get a feel for the extensibility and offer an NA >> >> > singleton Python object (a la None) for getting and setting NAs >> across >> >> > these 3 dtypes. >> >> > >> >> > If we end up going down this route, any way to place a moratorium on >> >> > invasive work on pandas internals (outside bug fixes)? >> >> > >> >> > Pedantic aside: I'd rather avoid shipping thirdparty C/C++ libraries >> >> > like googletest and friends in pandas if we can. Cloudera folks have >> >> > been working on a portable C++ library toolchain for Impala and other >> >> > projects at https://github.com/cloudera/native-toolchain, but it is >> >> > only being tested on Linux and OS X. Most google libraries should >> >> > build out of the box on MSVC but it'll be something to keep an eye >> on. >> >> > >> >> > BTW thanks to the libdynd developers for pioneering the c++ lib <-> >> >> > python-c++ lib <-> cython toolchain; being able to build Cython >> >> > extensions directly from cmake is a godsend >> >> > >> >> > HNY all >> >> > Wes >> >> > >> >> > On Tue, Dec 29, 2015 at 4:17 PM, Irwin Zaid >> wrote: >> >> >> Yeah, that seems reasonable and I totally agree a Pandas wrapper >> layer >> >> >> would >> >> >> be necessary. >> >> >> >> >> >> I'll keep an eye on this and I'd like to help if I can. >> >> >> >> >> >> Irwin >> >> >> >> >> >> >> >> >> On Tue, Dec 29, 2015 at 6:01 PM, Wes McKinney >> >> >> wrote: >> >> >>> >> >> >>> I'm not suggesting a rewrite of NumPy functionality but rather >> pandas >> >> >>> functionality that is currently written in a mishmash of Cython and >> >> >>> Python. >> >> >>> Happy to experiment with changing the internal compute >> infrastructure >> >> >>> and >> >> >>> data representation to DyND after this first stage of cleanup is >> done. >> >> >>> Even >> >> >>> if we use DyND a pretty extensive pandas wrapper layer will be >> >> >>> necessary. >> >> >>> >> >> >>> >> >> >>> On Tuesday, December 29, 2015, Irwin Zaid >> wrote: >> >> >>>> >> >> >>>> Hi Wes (and others), >> >> >>>> >> >> >>>> I've been following this conversation with interest. I do think it >> >> >>>> would >> >> >>>> be worth exploring DyND, rather than setting up yet another >> rewrite >> >> >>>> of >> >> >>>> NumPy-functionality. Especially because DyND is already an >> optional >> >> >>>> dependency of Pandas. >> >> >>>> >> >> >>>> For things like Integer NA and new dtypes, DyND is there and >> ready to >> >> >>>> do >> >> >>>> this. >> >> >>>> >> >> >>>> Irwin >> >> >>>> >> >> >>>> On Tue, Dec 29, 2015 at 5:18 PM, Wes McKinney < >> wesmckinn at gmail.com> >> >> >>>> wrote: >> >> >>>>> >> >> >>>>> Can you link to the PR you're talking about? >> >> >>>>> >> >> >>>>> I will see about spending a few hours setting up a libpandas.so >> as a >> >> >>>>> C++ >> >> >>>>> shared library where we can run some experiments and validate >> >> >>>>> whether it can >> >> >>>>> solve the integer-NA problem and be a place to put new data types >> >> >>>>> (categorical and friends). I'm +1 on targeting >> >> >>>>> >> >> >>>>> Would it also be worth making a wish list of APIs we might >> consider >> >> >>>>> breaking in a pandas 1.0 release that also features this new >> "native >> >> >>>>> core"? 
>> >> >>>>> Might as well right some wrongs while we're doing some invasive >> work >> >> >>>>> on the >> >> >>>>> internals; some breakage might be unavoidable. We can always >> >> >>>>> maintain a >> >> >>>>> pandas legacy 0.x.x maintenance branch (providing a conda binary >> >> >>>>> build) for >> >> >>>>> legacy users where showstopper bugs can get fixed. >> >> >>>>> >> >> >>>>> On Tue, Dec 29, 2015 at 1:20 PM, Jeff Reback < >> jeffreback at gmail.com> >> >> >>>>> wrote: >> >> >>>>> > Wes your last is noted as well. I *think* we can actually do >> this >> >> >>>>> > now >> >> >>>>> > (well >> >> >>>>> > there is a PR out there). >> >> >>>>> > >> >> >>>>> > On Tue, Dec 29, 2015 at 4:12 PM, Wes McKinney >> >> >>>>> > >> >> >>>>> > wrote: >> >> >>>>> >> >> >> >>>>> >> The other huge thing this will enable is to do is >> copy-on-write >> >> >>>>> >> for >> >> >>>>> >> various kinds of views, which should cut down on some of the >> >> >>>>> >> defensive >> >> >>>>> >> copying in the library and reduce memory usage. >> >> >>>>> >> >> >> >>>>> >> On Tue, Dec 29, 2015 at 1:02 PM, Wes McKinney >> >> >>>>> >> >> >> >>>>> >> wrote: >> >> >>>>> >> > Basically the approach is >> >> >>>>> >> > >> >> >>>>> >> > 1) Base dtype type >> >> >>>>> >> > 2) Base array type with K >= 1 dimensions >> >> >>>>> >> > 3) Base scalar type >> >> >>>>> >> > 4) Base index type >> >> >>>>> >> > 5) "Wrapper" subclasses for all NumPy types fitting into >> >> >>>>> >> > categories >> >> >>>>> >> > #1, #2, #3, #4 >> >> >>>>> >> > 6) Subclasses for pandas-specific types like category, >> >> >>>>> >> > datetimeTZ, >> >> >>>>> >> > etc. >> >> >>>>> >> > 7) NDFrame as cpcloud wrote is just a list of these >> >> >>>>> >> > >> >> >>>>> >> > Indexes and axis labels / column names can get layered on >> top. >> >> >>>>> >> > >> >> >>>>> >> > After we do all this we can look at adding nested types >> >> >>>>> >> > (arrays, >> >> >>>>> >> > maps, >> >> >>>>> >> > structs) to better support JSON. >> >> >>>>> >> > >> >> >>>>> >> > - Wes >> >> >>>>> >> > >> >> >>>>> >> > On Tue, Dec 29, 2015 at 12:14 PM, Phillip Cloud >> >> >>>>> >> > >> >> >>>>> >> > wrote: >> >> >>>>> >> >> Maybe this is saying the same thing as Wes, but how far >> would >> >> >>>>> >> >> something >> >> >>>>> >> >> like >> >> >>>>> >> >> this get us? >> >> >>>>> >> >> >> >> >>>>> >> >> // warning: things are probably not this simple >> >> >>>>> >> >> >> >> >>>>> >> >> struct data_array_t { >> >> >>>>> >> >> void *primitive; // scalar data >> >> >>>>> >> >> data_array_t *nested; // nested data >> >> >>>>> >> >> boost::dynamic_bitset isnull; // might have to create >> our >> >> >>>>> >> >> own >> >> >>>>> >> >> to >> >> >>>>> >> >> avoid >> >> >>>>> >> >> boost >> >> >>>>> >> >> schema_t schema; // not sure exactly what this looks >> like >> >> >>>>> >> >> }; >> >> >>>>> >> >> >> >> >>>>> >> >> typedef std::map data_frame_t; // >> >> >>>>> >> >> probably >> >> >>>>> >> >> not >> >> >>>>> >> >> this >> >> >>>>> >> >> simple >> >> >>>>> >> >> >> >> >>>>> >> >> To answer Jeff?s use-case question: I think that the use >> cases >> >> >>>>> >> >> are >> >> >>>>> >> >> 1) >> >> >>>>> >> >> freedom from numpy (mostly) 2) no more block manager which >> >> >>>>> >> >> frees >> >> >>>>> >> >> us >> >> >>>>> >> >> from the >> >> >>>>> >> >> limitations of the block memory layout. In particular, the >> >> >>>>> >> >> ability >> >> >>>>> >> >> to >> >> >>>>> >> >> take >> >> >>>>> >> >> advantage of memory mapped IO would be a big win IMO. 
>> >> >>>>> >> >> >> >> >>>>> >> >> >> >> >>>>> >> >> On Tue, Dec 29, 2015 at 2:50 PM Wes McKinney >> >> >>>>> >> >> >> >> >>>>> >> >> wrote: >> >> >>>>> >> >>> >> >> >>>>> >> >>> I will write a more detailed response to some of these >> things >> >> >>>>> >> >>> after >> >> >>>>> >> >>> the new year, but, in particular, re: missing values, can >> you >> >> >>>>> >> >>> or >> >> >>>>> >> >>> someone tell me why creating an object that contains a >> NumPy >> >> >>>>> >> >>> array and >> >> >>>>> >> >>> a bitmap is not sufficient? If we we can add a lightweight >> >> >>>>> >> >>> C/C++ >> >> >>>>> >> >>> class >> >> >>>>> >> >>> layer between NumPy function calls (e.g. arithmetic) and >> >> >>>>> >> >>> pandas >> >> >>>>> >> >>> function calls, then I see no reason why we cannot have >> >> >>>>> >> >>> >> >> >>>>> >> >>> Int32Array->add >> >> >>>>> >> >>> >> >> >>>>> >> >>> and >> >> >>>>> >> >>> >> >> >>>>> >> >>> Float32Array->add >> >> >>>>> >> >>> >> >> >>>>> >> >>> do the right thing (the former would be responsible for >> >> >>>>> >> >>> bitmasking to >> >> >>>>> >> >>> propagate NA values; the latter would defer to NumPy). If >> we >> >> >>>>> >> >>> can >> >> >>>>> >> >>> put >> >> >>>>> >> >>> all the internals of pandas objects inside a black box, we >> >> >>>>> >> >>> can >> >> >>>>> >> >>> add >> >> >>>>> >> >>> layers of virtual function indirection without a >> performance >> >> >>>>> >> >>> penalty >> >> >>>>> >> >>> (e.g. adding more interpreter overhead with more >> abstraction >> >> >>>>> >> >>> layers >> >> >>>>> >> >>> does add up to a perf penalty). >> >> >>>>> >> >>> >> >> >>>>> >> >>> I don't think this is too scary -- I would be willing to >> >> >>>>> >> >>> create a >> >> >>>>> >> >>> small POC C++ library to prototype something like what I'm >> >> >>>>> >> >>> talking >> >> >>>>> >> >>> about. >> >> >>>>> >> >>> >> >> >>>>> >> >>> Since pandas has limited points of contact with NumPy I >> don't >> >> >>>>> >> >>> think >> >> >>>>> >> >>> this would end up being too onerous. >> >> >>>>> >> >>> >> >> >>>>> >> >>> For the record, I'm pretty allergic to "advanced C++"; I >> >> >>>>> >> >>> think it >> >> >>>>> >> >>> is a >> >> >>>>> >> >>> useful tool if you pick a sane 20% subset of the C++11 >> spec >> >> >>>>> >> >>> and >> >> >>>>> >> >>> follow >> >> >>>>> >> >>> Google C++ style it's not very inaccessible to >> intermediate >> >> >>>>> >> >>> developers. More or less "C plus OOP and easier object >> >> >>>>> >> >>> lifetime >> >> >>>>> >> >>> management (shared/unique_ptr, etc.)". As soon as you add >> a >> >> >>>>> >> >>> lot >> >> >>>>> >> >>> of >> >> >>>>> >> >>> template metaprogramming C++ library development quickly >> >> >>>>> >> >>> becomes >> >> >>>>> >> >>> inaccessible except to the C++-Jedi. >> >> >>>>> >> >>> >> >> >>>>> >> >>> Maybe let's start a Google document on "pandas roadmap" >> where >> >> >>>>> >> >>> we >> >> >>>>> >> >>> can >> >> >>>>> >> >>> break down the 1-2 year goals and some of these >> >> >>>>> >> >>> infrastructure >> >> >>>>> >> >>> issues >> >> >>>>> >> >>> and have our discussion there? 
(obviously publish this >> >> >>>>> >> >>> someplace >> >> >>>>> >> >>> once >> >> >>>>> >> >>> we're done) >> >> >>>>> >> >>> >> >> >>>>> >> >>> - Wes >> >> >>>>> >> >>> >> >> >>>>> >> >>> On Fri, Dec 25, 2015 at 2:14 PM, Jeff Reback >> >> >>>>> >> >>> >> >> >>>>> >> >>> wrote: >> >> >>>>> >> >>> > Here are some of my thoughts about pandas Roadmap / >> status >> >> >>>>> >> >>> > and >> >> >>>>> >> >>> > some >> >> >>>>> >> >>> > responses to Wes's thoughts. >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > In the last few (and upcoming) major releases we have >> been >> >> >>>>> >> >>> > made >> >> >>>>> >> >>> > the >> >> >>>>> >> >>> > following changes: >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > - dtype enhancements (Categorical, Timedelta, Datetime >> >> >>>>> >> >>> > w/tz) & >> >> >>>>> >> >>> > making >> >> >>>>> >> >>> > these >> >> >>>>> >> >>> > first class objects >> >> >>>>> >> >>> > - code refactoring to remove subclassing of ndarrays for >> >> >>>>> >> >>> > Series >> >> >>>>> >> >>> > & >> >> >>>>> >> >>> > Index >> >> >>>>> >> >>> > - carving out / deprecating non-core parts of pandas >> >> >>>>> >> >>> > - datareader >> >> >>>>> >> >>> > - SparsePanel, WidePanel & other aliases (TImeSeries) >> >> >>>>> >> >>> > - rpy, rplot, irow et al. >> >> >>>>> >> >>> > - google-analytics >> >> >>>>> >> >>> > - API changes to make things more consistent >> >> >>>>> >> >>> > - pd.rolling/expanding * -> .rolling/expanding (this >> is >> >> >>>>> >> >>> > in >> >> >>>>> >> >>> > master >> >> >>>>> >> >>> > now) >> >> >>>>> >> >>> > - .resample becoming a full defered like groupby. >> >> >>>>> >> >>> > - multi-index slicing along any level (obviates need >> for >> >> >>>>> >> >>> > .xs) >> >> >>>>> >> >>> > and >> >> >>>>> >> >>> > allows >> >> >>>>> >> >>> > assignment >> >> >>>>> >> >>> > - .loc/.iloc - for the most part obviates use of .ix >> >> >>>>> >> >>> > - .pipe & .assign >> >> >>>>> >> >>> > - plotting accessors >> >> >>>>> >> >>> > - fixing of the sorting API >> >> >>>>> >> >>> > - many performance enhancements both micro & macro (e.g. >> >> >>>>> >> >>> > release >> >> >>>>> >> >>> > GIL) >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > Some on-deck enhancements are (meaning these are >> basically >> >> >>>>> >> >>> > ready to >> >> >>>>> >> >>> > go >> >> >>>>> >> >>> > in): >> >> >>>>> >> >>> > - IntervalIndex (and eventually make PeriodIndex just >> a >> >> >>>>> >> >>> > sub-class >> >> >>>>> >> >>> > of >> >> >>>>> >> >>> > this) >> >> >>>>> >> >>> > - RangeIndex >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > so lots of changes, though nothing really earth shaking, >> >> >>>>> >> >>> > just >> >> >>>>> >> >>> > more >> >> >>>>> >> >>> > convenience, reducing magicness somewhat >> >> >>>>> >> >>> > and providing flexibility. >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > Of course we are getting increasing issues, mostly bug >> >> >>>>> >> >>> > reports >> >> >>>>> >> >>> > (and >> >> >>>>> >> >>> > lots >> >> >>>>> >> >>> > of >> >> >>>>> >> >>> > dupes), some edge case enhancements >> >> >>>>> >> >>> > which can add to the existing API's and of course, >> requests >> >> >>>>> >> >>> > to >> >> >>>>> >> >>> > expand >> >> >>>>> >> >>> > the >> >> >>>>> >> >>> > (already) large code to other usecases. >> >> >>>>> >> >>> > Balancing this are a good many pull-requests from many >> >> >>>>> >> >>> > different >> >> >>>>> >> >>> > users, >> >> >>>>> >> >>> > some >> >> >>>>> >> >>> > even deep into the internals. 
>> >> >>>>> >> >>> > >> >> >>>>> >> >>> > Here are some things that I have talked about and could >> be >> >> >>>>> >> >>> > considered >> >> >>>>> >> >>> > for >> >> >>>>> >> >>> > the roadmap. Disclaimer: I do work for Continuum >> >> >>>>> >> >>> > but these views are of course my own; furthermore >> obviously >> >> >>>>> >> >>> > I >> >> >>>>> >> >>> > am a >> >> >>>>> >> >>> > bit >> >> >>>>> >> >>> > more >> >> >>>>> >> >>> > familiar with some of the 'sponsored' open-source >> >> >>>>> >> >>> > libraries, but always open to new things. >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > - integration / automatic deferral to numba for JIT >> (this >> >> >>>>> >> >>> > would >> >> >>>>> >> >>> > be >> >> >>>>> >> >>> > thru >> >> >>>>> >> >>> > .apply) >> >> >>>>> >> >>> > - automatic deferal to dask from groubpy where >> appropriate >> >> >>>>> >> >>> > / >> >> >>>>> >> >>> > maybe a >> >> >>>>> >> >>> > .to_parallel (to simply return a dask.DataFrame object) >> >> >>>>> >> >>> > - incorporation of quantities / units (as part of the >> >> >>>>> >> >>> > dtype) >> >> >>>>> >> >>> > - use of DyND to allow missing values for int dtypes >> >> >>>>> >> >>> > - make Period a first class dtype. >> >> >>>>> >> >>> > - provide some copy-on-write semantics to alleviate the >> >> >>>>> >> >>> > chained-indexing >> >> >>>>> >> >>> > issues which occasionaly come up with the mis-use of the >> >> >>>>> >> >>> > indexing >> >> >>>>> >> >>> > API >> >> >>>>> >> >>> > - allow a 'policy' to automatically provide column >> blocks >> >> >>>>> >> >>> > for >> >> >>>>> >> >>> > dict-like >> >> >>>>> >> >>> > input (e.g. each column would be a block), this would >> allow >> >> >>>>> >> >>> > a >> >> >>>>> >> >>> > pass-thru >> >> >>>>> >> >>> > API >> >> >>>>> >> >>> > where you could >> >> >>>>> >> >>> > put in numpy arrays where you have views and have them >> >> >>>>> >> >>> > preserved >> >> >>>>> >> >>> > rather >> >> >>>>> >> >>> > than >> >> >>>>> >> >>> > copied automatically. Note that this would also allow >> what >> >> >>>>> >> >>> > I >> >> >>>>> >> >>> > call >> >> >>>>> >> >>> > 'split' >> >> >>>>> >> >>> > where a passed in >> >> >>>>> >> >>> > multi-dim numpy array could be split up to individual >> >> >>>>> >> >>> > blocks >> >> >>>>> >> >>> > (which >> >> >>>>> >> >>> > actually >> >> >>>>> >> >>> > gives a nice perf boost after the splitting costs). >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > In working towards some of these goals. I have come to >> the >> >> >>>>> >> >>> > opinion >> >> >>>>> >> >>> > that >> >> >>>>> >> >>> > it >> >> >>>>> >> >>> > would make sense to have a neutral API protocol layer >> >> >>>>> >> >>> > that would allow us to swap out different engines as >> >> >>>>> >> >>> > needed, >> >> >>>>> >> >>> > for >> >> >>>>> >> >>> > particular >> >> >>>>> >> >>> > dtypes, or *maybe* out-of-core type computations. E.g. >> >> >>>>> >> >>> > imagine that we replaced the in-memory block structure >> with >> >> >>>>> >> >>> > a >> >> >>>>> >> >>> > bclolz >> >> >>>>> >> >>> > / >> >> >>>>> >> >>> > memap >> >> >>>>> >> >>> > type; in theory this should be 'easy' and just work. >> >> >>>>> >> >>> > I could also see us adopting *some* of the SFrame code >> to >> >> >>>>> >> >>> > allow >> >> >>>>> >> >>> > easier >> >> >>>>> >> >>> > interop with this API layer. 
>> >> >>>>> >> >>> > >> >> >>>>> >> >>> > In practice, I think a nice API layer would need to be >> >> >>>>> >> >>> > created >> >> >>>>> >> >>> > to >> >> >>>>> >> >>> > make >> >> >>>>> >> >>> > this >> >> >>>>> >> >>> > clean / nice. >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > So this comes around to Wes's point about creating a c++ >> >> >>>>> >> >>> > library for >> >> >>>>> >> >>> > the >> >> >>>>> >> >>> > internals (and possibly even some of the indexing >> >> >>>>> >> >>> > routines). >> >> >>>>> >> >>> > In an ideal world, or course this would be desirable. >> >> >>>>> >> >>> > Getting >> >> >>>>> >> >>> > there >> >> >>>>> >> >>> > is a >> >> >>>>> >> >>> > bit >> >> >>>>> >> >>> > non-trivial I think, and IMHO might not be worth the >> >> >>>>> >> >>> > effort. I >> >> >>>>> >> >>> > don't >> >> >>>>> >> >>> > really see big performance bottlenecks. We *already* >> defer >> >> >>>>> >> >>> > much >> >> >>>>> >> >>> > of >> >> >>>>> >> >>> > the >> >> >>>>> >> >>> > computation to libraries like numexpr & bottleneck >> (where >> >> >>>>> >> >>> > appropriate). >> >> >>>>> >> >>> > Adding numba / dask to the list would be helpful. >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > I think that almost all performance issues are the >> result >> >> >>>>> >> >>> > of: >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > a) gross misuse of the pandas API. How much code have >> you >> >> >>>>> >> >>> > seen >> >> >>>>> >> >>> > that >> >> >>>>> >> >>> > does >> >> >>>>> >> >>> > df.apply(lambda x: x.sum()) >> >> >>>>> >> >>> > b) routines which operate column-by-column rather >> >> >>>>> >> >>> > block-by-block and >> >> >>>>> >> >>> > are >> >> >>>>> >> >>> > in >> >> >>>>> >> >>> > python space (e.g. we have an issue right now about >> >> >>>>> >> >>> > .quantile) >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > So I am glossing over a big goal of having a c++ library >> >> >>>>> >> >>> > that >> >> >>>>> >> >>> > represents >> >> >>>>> >> >>> > the >> >> >>>>> >> >>> > pandas internals. This would by definition have a c-API >> >> >>>>> >> >>> > that so >> >> >>>>> >> >>> > you *could* use pandas like semantics in c/c++ and just >> >> >>>>> >> >>> > have it >> >> >>>>> >> >>> > work >> >> >>>>> >> >>> > (and >> >> >>>>> >> >>> > then pandas would be a thin wrapper around this >> library). >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > I am not averse to this, but I think would be quite a >> big >> >> >>>>> >> >>> > effort, >> >> >>>>> >> >>> > and >> >> >>>>> >> >>> > not a >> >> >>>>> >> >>> > huge perf boost IMHO. Further there are a number of API >> >> >>>>> >> >>> > issues >> >> >>>>> >> >>> > w.r.t. >> >> >>>>> >> >>> > indexing >> >> >>>>> >> >>> > which need to be clarified / worked out (e.g. should we >> >> >>>>> >> >>> > simply >> >> >>>>> >> >>> > deprecate >> >> >>>>> >> >>> > []) >> >> >>>>> >> >>> > that are much easier to test / figure out in python >> space. >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > I also thing that we have quite a large number of >> >> >>>>> >> >>> > contributors. >> >> >>>>> >> >>> > Moving >> >> >>>>> >> >>> > to >> >> >>>>> >> >>> > c++ might make the internals a bit more impenetrable >> that >> >> >>>>> >> >>> > the >> >> >>>>> >> >>> > current >> >> >>>>> >> >>> > internals. >> >> >>>>> >> >>> > (though this would allow c++ people to contribute, so >> that >> >> >>>>> >> >>> > might >> >> >>>>> >> >>> > balance >> >> >>>>> >> >>> > out). 
>> >> >>>>> >> >>> > >> >> >>>>> >> >>> > We have a limited core of devs whom right now are >> familar >> >> >>>>> >> >>> > with >> >> >>>>> >> >>> > things. >> >> >>>>> >> >>> > If >> >> >>>>> >> >>> > someone happened to have a starting base for a c++ >> library, >> >> >>>>> >> >>> > then I >> >> >>>>> >> >>> > might >> >> >>>>> >> >>> > change >> >> >>>>> >> >>> > opinions here. >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > my 4c. >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > Jeff >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > On Thu, Dec 24, 2015 at 7:18 PM, Wes McKinney >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > wrote: >> >> >>>>> >> >>> >> >> >> >>>>> >> >>> >> Deep thoughts during the holidays. >> >> >>>>> >> >>> >> >> >> >>>>> >> >>> >> I might be out of line here, but the >> interpreter-heaviness >> >> >>>>> >> >>> >> of >> >> >>>>> >> >>> >> the >> >> >>>>> >> >>> >> inside of pandas objects is likely to be a long-term >> >> >>>>> >> >>> >> liability >> >> >>>>> >> >>> >> and >> >> >>>>> >> >>> >> source of performance problems and technical debt. >> >> >>>>> >> >>> >> >> >> >>>>> >> >>> >> Has anyone put any thought into planning and beginning >> to >> >> >>>>> >> >>> >> execute >> >> >>>>> >> >>> >> on a >> >> >>>>> >> >>> >> rewrite that moves as much as possible of the internals >> >> >>>>> >> >>> >> into >> >> >>>>> >> >>> >> native >> >> >>>>> >> >>> >> / >> >> >>>>> >> >>> >> compiled code? I'm talking about: >> >> >>>>> >> >>> >> >> >> >>>>> >> >>> >> - pandas/core/internals >> >> >>>>> >> >>> >> - indexing and assignment >> >> >>>>> >> >>> >> - much of pandas/core/common >> >> >>>>> >> >>> >> - categorical and custom dtypes >> >> >>>>> >> >>> >> - all indexing mechanisms >> >> >>>>> >> >>> >> >> >> >>>>> >> >>> >> I'm concerned we've already exposed too much internals >> to >> >> >>>>> >> >>> >> users, so >> >> >>>>> >> >>> >> this might lead to a lot of API breakage, but it might >> be >> >> >>>>> >> >>> >> for >> >> >>>>> >> >>> >> the >> >> >>>>> >> >>> >> Greater Good. As a first step, beginning a partial >> >> >>>>> >> >>> >> migration >> >> >>>>> >> >>> >> of >> >> >>>>> >> >>> >> internals into some C++ classes that encapsulate the >> >> >>>>> >> >>> >> insides >> >> >>>>> >> >>> >> of >> >> >>>>> >> >>> >> DataFrame objects and implement indexing and >> block-level >> >> >>>>> >> >>> >> manipulations >> >> >>>>> >> >>> >> would be a good place to start. I think you could do >> this >> >> >>>>> >> >>> >> wouldn't >> >> >>>>> >> >>> >> too >> >> >>>>> >> >>> >> much disruption. >> >> >>>>> >> >>> >> >> >> >>>>> >> >>> >> As part of this internal retooling we might give >> >> >>>>> >> >>> >> consideration >> >> >>>>> >> >>> >> to >> >> >>>>> >> >>> >> alternative data structures for representing data >> internal >> >> >>>>> >> >>> >> to >> >> >>>>> >> >>> >> pandas >> >> >>>>> >> >>> >> objects. Now in 2015/2016, continuing to be hamstrung >> by >> >> >>>>> >> >>> >> NumPy's >> >> >>>>> >> >>> >> limitations feels somewhat anachronistic. User code is >> >> >>>>> >> >>> >> riddled >> >> >>>>> >> >>> >> with >> >> >>>>> >> >>> >> workarounds for data type fidelity issues and the like. >> >> >>>>> >> >>> >> Like, >> >> >>>>> >> >>> >> really, >> >> >>>>> >> >>> >> why not add a bitndarray (similar to >> ilanschnell/bitarray) >> >> >>>>> >> >>> >> for >> >> >>>>> >> >>> >> storing >> >> >>>>> >> >>> >> nullness for problematic types and hide this from the >> >> >>>>> >> >>> >> user? 
=) >> >> >>>>> >> >>> >> >> >> >>>>> >> >>> >> Since we are now a NumFOCUS-sponsored project, I feel >> like >> >> >>>>> >> >>> >> we >> >> >>>>> >> >>> >> might >> >> >>>>> >> >>> >> consider establishing some formal governance over >> pandas >> >> >>>>> >> >>> >> and >> >> >>>>> >> >>> >> publishing meetings notes and roadmap documents >> describing >> >> >>>>> >> >>> >> plans >> >> >>>>> >> >>> >> for >> >> >>>>> >> >>> >> the project and meetings notes from committers. >> There's no >> >> >>>>> >> >>> >> real >> >> >>>>> >> >>> >> "committer culture" for NumFOCUS projects like there is >> >> >>>>> >> >>> >> with >> >> >>>>> >> >>> >> the >> >> >>>>> >> >>> >> Apache Software Foundation, but we might try leading by >> >> >>>>> >> >>> >> example! >> >> >>>>> >> >>> >> >> >> >>>>> >> >>> >> Also, I believe pandas as a project has reached a >> level of >> >> >>>>> >> >>> >> importance >> >> >>>>> >> >>> >> where we ought to consider planning and execution on >> >> >>>>> >> >>> >> larger >> >> >>>>> >> >>> >> scale >> >> >>>>> >> >>> >> undertakings such as this for safeguarding the future. >> >> >>>>> >> >>> >> >> >> >>>>> >> >>> >> As for myself, well, I have my hands full in Big >> >> >>>>> >> >>> >> Data-land. I >> >> >>>>> >> >>> >> wish >> >> >>>>> >> >>> >> I >> >> >>>>> >> >>> >> could be helping more with pandas, but there a quite a >> few >> >> >>>>> >> >>> >> fundamental >> >> >>>>> >> >>> >> issues (like data interoperability nested data handling >> >> >>>>> >> >>> >> and >> >> >>>>> >> >>> >> file >> >> >>>>> >> >>> >> format support ? e.g. Parquet, see >> >> >>>>> >> >>> >> >> >> >>>>> >> >>> >> >> >> >>>>> >> >>> >> >> >> >>>>> >> >>> >> >> >> >>>>> >> >>> >> >> >> >>>>> >> >>> >> >> http://wesmckinney.com/blog/the-problem-with-the-data-science-language-wars/ >> ) >> >> >>>>> >> >>> >> preventing Python from being more useful in industry >> >> >>>>> >> >>> >> analytics >> >> >>>>> >> >>> >> applications. >> >> >>>>> >> >>> >> >> >> >>>>> >> >>> >> Aside: one of the bigger mistakes I made with pandas's >> API >> >> >>>>> >> >>> >> design >> >> >>>>> >> >>> >> was >> >> >>>>> >> >>> >> making it acceptable to call class constructors ? like >> >> >>>>> >> >>> >> pandas.DataFrame ? directly (versus factory functions). >> >> >>>>> >> >>> >> Sorry >> >> >>>>> >> >>> >> about >> >> >>>>> >> >>> >> that! If we could convince everyone to start writing >> >> >>>>> >> >>> >> pandas.data_frame >> >> >>>>> >> >>> >> or dataframe instead of using the class reference it >> would >> >> >>>>> >> >>> >> help a >> >> >>>>> >> >>> >> lot >> >> >>>>> >> >>> >> with code cleanup. It's hard to plan for these things ? >> >> >>>>> >> >>> >> NumPy >> >> >>>>> >> >>> >> interoperability seemed a lot more important in 2008 >> than >> >> >>>>> >> >>> >> it >> >> >>>>> >> >>> >> does >> >> >>>>> >> >>> >> now, >> >> >>>>> >> >>> >> so I forgive myself. 
>> >> >>>>> >> >>> >> >> >> >>>>> >> >>> >> cheers and best wishes for 2016, >> >> >>>>> >> >>> >> Wes >> >> >>>>> >> >>> >> _______________________________________________ >> >> >>>>> >> >>> >> Pandas-dev mailing list >> >> >>>>> >> >>> >> Pandas-dev at python.org >> >> >>>>> >> >>> >> https://mail.python.org/mailman/listinfo/pandas-dev >> >> >>>>> >> >>> > >> >> >>>>> >> >>> > >> >> >>>>> >> >>> _______________________________________________ >> >> >>>>> >> >>> Pandas-dev mailing list >> >> >>>>> >> >>> Pandas-dev at python.org >> >> >>>>> >> >>> https://mail.python.org/mailman/listinfo/pandas-dev >> >> >>>>> >> _______________________________________________ >> >> >>>>> >> Pandas-dev mailing list >> >> >>>>> >> Pandas-dev at python.org >> >> >>>>> >> https://mail.python.org/mailman/listinfo/pandas-dev >> >> >>>>> > >> >> >>>>> > >> >> >>>>> >> >> >>>>> >> >> >>>>> _______________________________________________ >> >> >>>>> Pandas-dev mailing list >> >> >>>>> Pandas-dev at python.org >> >> >>>>> https://mail.python.org/mailman/listinfo/pandas-dev >> >> >>>>> >> >> >>>> >> >> >> >> >> _______________________________________________ >> >> Pandas-dev mailing list >> >> Pandas-dev at python.org >> >> https://mail.python.org/mailman/listinfo/pandas-dev >> > >> > >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > >
From shoyer at gmail.com Wed Jan 6 13:30:46 2016 From: shoyer at gmail.com (Stephan Hoyer) Date: Wed, 6 Jan 2016 10:30:46 -0800 Subject: [Pandas-dev] pandas governance In-Reply-To: References: Message-ID: I'm also supportive of formalizing pandas governance like this. It's definitely the right call for a mature project. I agree that we can probably just use the Jupyter governance docs with minor adjustments. Cheers, Stephan On Wed, Jan 6, 2016 at 5:50 AM, Jeff Reback wrote: > yes on board with this as well. We do have a fiscal governance document > w.r.t. NUMFocus. That should at the very least be reference by > the governance docs. Certainly starting with the jupyter docs is a good > think. > > I don't think we will have the long-long-long discussion that numpy had > about the steering committee representation :) > > Jeff > > On Tue, Jan 5, 2016 at 7:29 PM, Joris Van den Bossche < > jorisvandenbossche at gmail.com> wrote: > >> Sounds very good! >> >> Certainly now we are a NumFOCUS supported project (and have to deal with >> financial things), I think this is important to do. >> >> 2016-01-05 19:15 GMT+01:00 Wes McKinney : >> >>> hi folks, >>> >>> I'm sorry I didn't do this 2 or 3 years ago when I first handed over >>> release management responsibilities to Jeff, y-p and others, but it >>> would be good for us to formalize the project governance like most >>> other major open source projects. See IPython / Jupyter for an example >>> set of governance documents >>> >>> https://github.com/jupyter/governance >>> >>> Numpy also recently adopted a goverance document, based on the Jupyter >> one: http://docs.scipy.org/doc/numpy-dev/dev/governance/governance.html >> and https://github.com/numpy/numpy/pull/6352. >> Maybe also worth a look (although I don't know what they exactly changed >> from the Jupyter one).
>> >> >>> I don't have particular concerns over the project's direction and >>> decision making procedure, but as I've had several people raise >>> private concerns with me over the last few years, I think it would be >>> good for the community to have a set of public documents on GitHub >>> that lists people and process in simple terms. This is especially >>> important now that we can receive financial sponsorship through >>> NumFOCUS, so that sponsored contributions are subject to the same >>> community process as volunteer contributions. >>> >>> A basic summary of how we've been informally operating is: Project >>> committers (as will be defined and listed in the governance documents) >>> make decisions based on consensus; in the absence of consensus (which >>> has rarely occurred) I will reserve tie-breaking / BDFL privileges. I >>> don't recall having ever having to put on the BDFL hat but it's the >>> "just in case" should we reach some impasse down the road. >>> >>> Sounds good! >> >> >>> I can take a crack at assembling something based on the IPython >>> governance docs if that sounds good. >>> >>> At the end of the day, an OSS project is only as strong as the >>> individuals committing code and reviewing patches. As pandas will be 8 >>> years old in April, with 6 years as open source, I think we have a >>> good track record of consensus-, common-sense-, and >>> fact/evidence-driven decision making. >>> >>> best, >>> Wes >>> _______________________________________________ >>> Pandas-dev mailing list >>> Pandas-dev at python.org >>> https://mail.python.org/mailman/listinfo/pandas-dev >>> >> >> >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> >> > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wesmckinn at gmail.com Wed Jan 6 14:26:49 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Wed, 6 Jan 2016 11:26:49 -0800 Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap In-Reply-To: References: Message-ID: hey Stephan, Thanks for all the thoughts. Let me make a few off-the-cuff comments. On Wed, Jan 6, 2016 at 10:11 AM, Stephan Hoyer wrote: > I was asked about this off list, so I'll belatedly share my thoughts. > > First of all, I am really excited by Wes's renewed engagement in the project > and his interest in rewriting pandas internals. This is quite an ambitious > plan and nobody is better positioned to tackle it than Wes. > > I have mixed feelings about the details of the rewrite itself. > > +1 on the simpler internal data model. The block manager is confusing and > leads to hard to predict performance issues related to copying data. If we > can do all column additions/removals/re-orderings without a copy it will be > a clear win. > > +0 on moving internals to C++. I do like the performance benefits, but it > seems like a lot of work, and it may make pandas less friendly to new > contributors. > It really goes beyond performance benefits. If you go back to my 2013 talk http://www.slideshare.net/wesm/practical-medium-data-analytics-with-python there's a long list of architectural problems that now in 2016 haven't found solutions. 
The only way (that I can fully reason through -- I am happy to look at alternate proposals) to move the internals of pandas closer to the metal is to give Series and DataFrame a C/C++ API -- this is the "libpandas native core" as I've been describing. > -0 on writing a brand new dtype system just for pandas -- this stuff really > belongs in NumPy (or another array library like DyND), and I am skeptical > that pandas can do a complete enough job to be useful without replicating > all that functionality. > I'm curious what "a brand new dtype system" means to you. pandas already has its own data type system, but it's a potpourri of inconsistencies and rough edges with self-evident problems for both users and developers. Some indicators:

- Some pandas types use NaN for missing data, others None (or both), others nothing at all. We lose data (integers) or bloat memory (booleans) by upcasting to float-NaN or object-None.
- Internal code full of is_XXX_dtype functions: pandas.core.common, pandas.core.algorithms, etc.
- Series.values on synthetic dtypes like Categorical
- We use arrays of Python objects for string data

The biggest cause IMHO is that pandas is too tightly coupled to NumPy, and it's coupled in a way that makes development and extensibility difficult. We've already allowed NumPy-specific details to taint the pandas user API in many unpleasant ways. This isn't to say "NumPy is bad" but rather "pandas tries to layer domain-specific functionality [that NumPy was not designed for] on top". Some things I'm advocating with the internals refactor:

1) First class "pandas type" objects. This is not the same as a NumPy dtype, which has some pretty loaded implications -- in particular, NumPy dtypes are implicitly coupled to an array computing framework (see the function table that is attached to the PyArray_Descr object)
2) Pandas array container types that map user-land API calls to implementation-land API calls (in NumPy, DyND, or pandas-native code like pandas.core.algorithms etc.). This will make it much easier to leverage innovations in NumPy and DyND without those implementation details spilling over into the pandas user API
3) Adding a single pandas.NA singleton to have one library-wide notion of a scalar null value (obviously, we can automatically map NaN and None to NA for backwards compatibility)
4) Layering a bitmask internally on NumPy arrays (especially integer and boolean) to add null-ness to types that need it. Note that this does not prevent us from switching to DyND arrays with option dtype in the future. If the details of how we are implementing NULL are visible to the user, we have failed.
5) Removing the block manager in favor of simpler pandas Array (1D) and Table (2D -- vector of Array) data structures

I believe you can do all this without harming interoperability with the ecosystem of projects that people currently use in conjunction with pandas. > More broadly, I am concerned that this rewrite may improve the tabular > computation ecosystem at the cost of inter-operability with the array-based > ecosystem (numpy, scipy, sklearn, xarray, etc.). The later has been one of > the strengths of pandas and it would be a shame to see that go away. > I have no intention of letting this happen. What I am asking from you (and others reading) is to help define what constitutes interoperability. What guarantees do we make the user?
For example, we should have very strict guidelines for the output of:

np.asarray(pandas_obj)

For example

In [3]: s = pd.Series([1,2,3]*10).astype('category')

In [4]: np.asarray(s)
Out[4]:
array([1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2,
       3, 1, 2, 3, 1, 2, 3])

I see no reason why this should necessarily behave any differently.
The problem will come in when there is pandas data that is not
precisely representable in a NumPy array. Example:

In [5]: s = pd.Series([1,2,3, 4])

In [6]: s.dtype
Out[6]: dtype('int64')

In [7]: s2 = s.reindex(np.arange(10))

In [8]: s2.dtype
Out[8]: dtype('float64')

In [9]: np.asarray(s2)
Out[9]: array([ 1.,  2.,  3.,  4., nan, nan, nan, nan, nan, nan])

With the "new internals", s2 will still be int64 type, but we may
decide that np.asarray(s2) should raise an exception rather than
implicitly make a decision about how to perform a "lossy" conversion
to a NumPy array. If you are using DyND with pandas, then the
equivalent function would be able to implicitly convert without data
loss.

> We're already starting to struggle with inter-operability with the new
> pandas dtypes and a further rewrite would make this even harder.
> For example, see categoricals and scikit-learn in Tom's recent post [1], or the
> fact that .values no longer always returns a numpy array. This has also been
> a challenge for xarray, which can't handle these new dtypes because we lack
> a suitable array backend for them.

I'm definitely motivated in this initiative by these challenges. The
idea here is that with the new internals, Series.values will always
return the same type of object, and there will be one consistent code
path for getting a NumPy array out. For example, rather than:

    if isinstance(s.values, Categorical):
        # pandas
        ...
    else:
        # NumPy
        ...

We could have (just an idea)

    s.values.to_numpy()

Or simply

    np.asarray(s.values)

> Personally, I would much rather leverage a full-featured library like an
> improved NumPy or DyND for new dtypes, because that could also be used by
> the array-based ecosystem. At the very least, it would be good to think
> about zero-copy inter-operability with array-based tools.
>

I'm all for zero-copy interoperability when possible, but my gut
feeling is that exposing the data type system of an array library (the
choice of which is an implementation detail) to pandas users is an
inherently leaky abstraction that will continue to cause problems if
we plan to keep innovating inside pandas. By better hiding NumPy
details and types from the user we will make it much easier to swap
out new low-level array data structures and compute components (e.g.
DyND), or add custom data structures or out-of-core tools (memory
maps, bcolz, etc.)

I'm additionally offering to do nearly all of this replumbing of
pandas internals myself, and completely in my free time. What I will
expect in return from you all is to help enumerate our contracts with
the pandas user (i.e. interoperability) and to hold me accountable to
not break them. I know I haven't been committing code on pandas since
mid-2013 (after a 5-year marathon), but these architectural problems
have been on my mind almost constantly since then; I just haven't had
the bandwidth to start tackling them.

cheers,
Wes

> On the other hand, I wonder if maybe it would be better to write a native
> in-memory backend for Ibis instead of rewriting pandas. Ibis does seem to
> have an improved/simplified API which resolves many of pandas's warts.
> That said, it's a pretty big change from the "DataFrame as matrix" model,
> and pandas won't be going away anytime soon. I do like that it would force
> users to be more explicit about converting between tables and arrays, which
> might also make distinctions between the tabular and array-oriented
> ecosystems easier to swallow.
>
> Just my two cents, from someone who has lots of opinions but who will likely
> stay on the sidelines for most of this work.
>
> Cheers,
> Stephan
>
> [1] http://tomaugspurger.github.io/categorical-pipelines.html
>
> On Fri, Jan 1, 2016 at 6:06 PM, Jeff Reback wrote:
>>
>> ok I moved the document to the Pandas folder, where the same group should
>> be able to edit/upload/etc. lmk if any issues
>>
>> On Fri, Jan 1, 2016 at 8:48 PM, Wes McKinney wrote:
>>>
>>> Thanks Jeff. Can you create and share a shared Drive folder containing
>>> this where I can put other auxiliary / follow-up documents?
>>>
>>> On Fri, Jan 1, 2016 at 5:23 PM, Jeff Reback wrote:
>>> > I changed the doc so that the core dev people can edit. I *think* that
>>> > everyone should be able to view/comment though.
>>> >> >>>>> >> >> On Tue, Dec 29, 2015 at 2:50 PM Wes McKinney wrote:
>>> >> >>>>> >> >>>
>>> >> >>>>> >> >>> I will write a more detailed response to some of these things after
>>> >> >>>>> >> >>> the new year, but, in particular, re: missing values, can you or
>>> >> >>>>> >> >>> someone tell me why creating an object that contains a NumPy array
>>> >> >>>>> >> >>> and a bitmap is not sufficient? If we can add a lightweight C/C++
>>> >> >>>>> >> >>> class layer between NumPy function calls (e.g. arithmetic) and
>>> >> >>>>> >> >>> pandas function calls, then I see no reason why we cannot have
>>> >> >>>>> >> >>>
>>> >> >>>>> >> >>> Int32Array->add
>>> >> >>>>> >> >>>
>>> >> >>>>> >> >>> and
>>> >> >>>>> >> >>>
>>> >> >>>>> >> >>> Float32Array->add
>>> >> >>>>> >> >>>
>>> >> >>>>> >> >>> do the right thing (the former would be responsible for bitmasking
>>> >> >>>>> >> >>> to propagate NA values; the latter would defer to NumPy). If we can
>>> >> >>>>> >> >>> put all the internals of pandas objects inside a black box, we can
>>> >> >>>>> >> >>> add layers of virtual function indirection without a performance
>>> >> >>>>> >> >>> penalty (whereas in interpreted code, adding more abstraction layers
>>> >> >>>>> >> >>> does add up to a perf penalty from the extra interpreter overhead).
>>> >> >>>>> >> >>>
>>> >> >>>>> >> >>> I don't think this is too scary -- I would be willing to create a
>>> >> >>>>> >> >>> small POC C++ library to prototype something like what I'm talking
>>> >> >>>>> >> >>> about.
>>> >> >>>>> >> >>>
>>> >> >>>>> >> >>> Since pandas has limited points of contact with NumPy I don't think
>>> >> >>>>> >> >>> this would end up being too onerous.
>>> >> >>>>> >> >>>
>>> >> >>>>> >> >>> For the record, I'm pretty allergic to "advanced C++"; I think it is
>>> >> >>>>> >> >>> a useful tool. If you pick a sane 20% subset of the C++11 spec and
>>> >> >>>>> >> >>> follow Google C++ style, it's not inaccessible to intermediate
>>> >> >>>>> >> >>> developers. More or less "C plus OOP and easier object lifetime
>>> >> >>>>> >> >>> management (shared/unique_ptr, etc.)". As soon as you add a lot of
>>> >> >>>>> >> >>> template metaprogramming, C++ library development quickly becomes
>>> >> >>>>> >> >>> inaccessible except to the C++ Jedi.
>>> >> >>>>> >> >>>
>>> >> >>>>> >> >>> Maybe let's start a Google document on "pandas roadmap" where we can
>>> >> >>>>> >> >>> break down the 1-2 year goals and some of these infrastructure
>>> >> >>>>> >> >>> issues and have our discussion there? (obviously publish this
>>> >> >>>>> >> >>> someplace once we're done)
>>> >> >>>>> >> >>>
>>> >> >>>>> >> >>> - Wes
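A toy Python model of the kind of dispatch described above -- the real
layer would be C++, and these class names are illustrative, not actual
pandas API:

    import numpy as np

    class Int64Array(object):
        # integers need a mask: add() is responsible for propagating NAs
        def __init__(self, values, valid):
            self.values = np.asarray(values, dtype=np.int64)
            self.valid = np.asarray(valid, dtype=bool)

        def add(self, other):
            return Int64Array(self.values + other.values,
                              self.valid & other.valid)

    class Float64Array(object):
        # floats can defer wholesale to NumPy: NaN already propagates
        def __init__(self, values):
            self.values = np.asarray(values, dtype=np.float64)

        def add(self, other):
            return Float64Array(self.values + other.values)

    a = Int64Array([1, 2, 3], [True, False, True])
    b = Int64Array([10, 20, 30], [True, True, True])
    c = a.add(b)   # values [11, 22, 33], valid [True, False, True]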
>>> >> >>>>> >> >>> On Fri, Dec 25, 2015 at 2:14 PM, Jeff Reback wrote:
>>> >> >>>>> >> >>> > Here are some of my thoughts about pandas Roadmap / status and
>>> >> >>>>> >> >>> > some responses to Wes's thoughts.
>>> >> >>>>> >> >>> >
>>> >> >>>>> >> >>> > In the last few (and upcoming) major releases we have made the
>>> >> >>>>> >> >>> > following changes:
>>> >> >>>>> >> >>> >
>>> >> >>>>> >> >>> > - dtype enhancements (Categorical, Timedelta, Datetime w/tz) &
>>> >> >>>>> >> >>> > making these first-class objects
>>> >> >>>>> >> >>> > - code refactoring to remove subclassing of ndarrays for Series
>>> >> >>>>> >> >>> > & Index
>>> >> >>>>> >> >>> > - carving out / deprecating non-core parts of pandas
>>> >> >>>>> >> >>> >   - datareader
>>> >> >>>>> >> >>> >   - SparsePanel, WidePanel & other aliases (TimeSeries)
>>> >> >>>>> >> >>> >   - rpy, rplot, irow et al.
>>> >> >>>>> >> >>> >   - google-analytics
>>> >> >>>>> >> >>> > - API changes to make things more consistent
>>> >> >>>>> >> >>> >   - pd.rolling_*/expanding_* -> .rolling/.expanding (this is in
>>> >> >>>>> >> >>> >   master now)
>>> >> >>>>> >> >>> >   - .resample becoming fully deferred, like groupby
>>> >> >>>>> >> >>> > - multi-index slicing along any level (obviates need for .xs)
>>> >> >>>>> >> >>> > and allows assignment
>>> >> >>>>> >> >>> > - .loc/.iloc - for the most part obviates use of .ix
>>> >> >>>>> >> >>> > - .pipe & .assign
>>> >> >>>>> >> >>> > - plotting accessors
>>> >> >>>>> >> >>> > - fixing of the sorting API
>>> >> >>>>> >> >>> > - many performance enhancements both micro & macro (e.g.
>>> >> >>>>> >> >>> > releasing the GIL)
>>> >> >>>>> >> >>> >
>>> >> >>>>> >> >>> > Some on-deck enhancements are (meaning these are basically ready
>>> >> >>>>> >> >>> > to go in):
>>> >> >>>>> >> >>> > - IntervalIndex (and eventually make PeriodIndex just a
>>> >> >>>>> >> >>> > sub-class of this)
>>> >> >>>>> >> >>> > - RangeIndex
>>> >> >>>>> >> >>> >
>>> >> >>>>> >> >>> > so lots of changes, though nothing really earth-shaking, just
>>> >> >>>>> >> >>> > more convenience, reducing magicness somewhat and providing
>>> >> >>>>> >> >>> > flexibility.
>>> >> >>>>> >> >>> >
>>> >> >>>>> >> >>> > Of course we are getting increasing issues, mostly bug reports
>>> >> >>>>> >> >>> > (and lots of dupes), some edge-case enhancements which can add
>>> >> >>>>> >> >>> > to the existing APIs and, of course, requests to expand the
>>> >> >>>>> >> >>> > (already) large code to other use cases.
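(For reference, that rolling/expanding/resample change looks roughly like
this -- method chaining replacing the module-level functions; a sketch
against the 0.18-era API, not a spec:)

    import numpy as np
    import pandas as pd

    s = pd.Series(np.random.randn(100),
                  index=pd.date_range('2015-01-01', periods=100))

    # pd.rolling_mean(s, window=5) becomes:
    s.rolling(window=5).mean()

    # pd.expanding_sum(s) becomes:
    s.expanding().sum()

    # s.resample('M', how='mean') becomes (deferred, like groupby):
    s.resample('M').mean()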
>>> >> >>>>> >> >>> > Balancing this are a good many pull-requests from many different
>>> >> >>>>> >> >>> > users, some even deep into the internals.
>>> >> >>>>> >> >>> >
>>> >> >>>>> >> >>> > Here are some things that I have talked about and could be
>>> >> >>>>> >> >>> > considered for the roadmap. Disclaimer: I do work for Continuum
>>> >> >>>>> >> >>> > but these views are of course my own; furthermore obviously I am
>>> >> >>>>> >> >>> > a bit more familiar with some of the 'sponsored' open-source
>>> >> >>>>> >> >>> > libraries, but always open to new things.
>>> >> >>>>> >> >>> >
>>> >> >>>>> >> >>> > - integration / automatic deferral to numba for JIT (this would
>>> >> >>>>> >> >>> > be thru .apply)
>>> >> >>>>> >> >>> > - automatic deferral to dask from groupby where appropriate /
>>> >> >>>>> >> >>> > maybe a .to_parallel (to simply return a dask.DataFrame object)
>>> >> >>>>> >> >>> > - incorporation of quantities / units (as part of the dtype)
>>> >> >>>>> >> >>> > - use of DyND to allow missing values for int dtypes
>>> >> >>>>> >> >>> > - make Period a first-class dtype.
>>> >> >>>>> >> >>> > - provide some copy-on-write semantics to alleviate the
>>> >> >>>>> >> >>> > chained-indexing issues which occasionally come up with the
>>> >> >>>>> >> >>> > misuse of the indexing API
>>> >> >>>>> >> >>> > - allow a 'policy' to automatically provide column blocks for
>>> >> >>>>> >> >>> > dict-like input (e.g. each column would be a block); this would
>>> >> >>>>> >> >>> > allow a pass-thru API where you could put in numpy arrays where
>>> >> >>>>> >> >>> > you have views and have them preserved rather than copied
>>> >> >>>>> >> >>> > automatically. Note that this would also allow what I call
>>> >> >>>>> >> >>> > 'split', where a passed-in multi-dim numpy array could be split
>>> >> >>>>> >> >>> > up into individual blocks (which actually gives a nice perf
>>> >> >>>>> >> >>> > boost after the splitting costs).
>>> >> >>>>> >> >>> >
>>> >> >>>>> >> >>> > In working towards some of these goals, I have come to the
>>> >> >>>>> >> >>> > opinion that it would make sense to have a neutral API protocol
>>> >> >>>>> >> >>> > layer that would allow us to swap out different engines as
>>> >> >>>>> >> >>> > needed, for particular dtypes, or *maybe* out-of-core type
>>> >> >>>>> >> >>> > computations. E.g. imagine that we replaced the in-memory block
>>> >> >>>>> >> >>> > structure with a bcolz / memmap type; in theory this should be
>>> >> >>>>> >> >>> > 'easy' and just work. I could also see us adopting *some* of the
>>> >> >>>>> >> >>> > SFrame code to allow easier interop with this API layer.
>>> >> >>>>> >> >>> >
>>> >> >>>>> >> >>> > In practice, I think a nice API layer would need to be created
>>> >> >>>>> >> >>> > to make this clean / nice.
>>> >> >>>>> >> >>> >
>>> >> >>>>> >> >>> > So this comes around to Wes's point about creating a c++ library
>>> >> >>>>> >> >>> > for the internals (and possibly even some of the indexing
>>> >> >>>>> >> >>> > routines). In an ideal world, of course this would be desirable.
>>> >> >>>>> >> >>> > Getting there is a bit non-trivial I think, and IMHO might not
>>> >> >>>>> >> >>> > be worth the effort. I don't really see big performance
>>> >> >>>>> >> >>> > bottlenecks. We *already* defer much of the computation to
>>> >> >>>>> >> >>> > libraries like numexpr & bottleneck (where appropriate). Adding
>>> >> >>>>> >> >>> > numba / dask to the list would be helpful.
>>> >> >>>>> >> >>> >
>>> >> >>>>> >> >>> > I think that almost all performance issues are the result of:
>>> >> >>>>> >> >>> >
>>> >> >>>>> >> >>> > a) gross misuse of the pandas API. How much code have you seen
>>> >> >>>>> >> >>> > that does df.apply(lambda x: x.sum())
>>> >> >>>>> >> >>> > b) routines which operate column-by-column rather than
>>> >> >>>>> >> >>> > block-by-block and are in python space (e.g. we have an issue
>>> >> >>>>> >> >>> > right now about .quantile)
>>> >> >>>>> >> >>> >
>>> >> >>>>> >> >>> > So I am glossing over a big goal of having a c++ library that
>>> >> >>>>> >> >>> > represents the pandas internals. This would by definition have a
>>> >> >>>>> >> >>> > C API, so you *could* use pandas-like semantics in c/c++ and
>>> >> >>>>> >> >>> > just have it work (and then pandas would be a thin wrapper
>>> >> >>>>> >> >>> > around this library).
>>> >> >>>>> >> >>> >
>>> >> >>>>> >> >>> > I am not averse to this, but I think it would be quite a big
>>> >> >>>>> >> >>> > effort, and not a huge perf boost IMHO. Further there are a
>>> >> >>>>> >> >>> > number of API issues w.r.t. indexing which need to be clarified
>>> >> >>>>> >> >>> > / worked out
should we >>> >> >>>>> >> >>> > simply >>> >> >>>>> >> >>> > deprecate >>> >> >>>>> >> >>> > []) >>> >> >>>>> >> >>> > that are much easier to test / figure out in python >>> >> >>>>> >> >>> > space. >>> >> >>>>> >> >>> > >>> >> >>>>> >> >>> > I also thing that we have quite a large number of >>> >> >>>>> >> >>> > contributors. >>> >> >>>>> >> >>> > Moving >>> >> >>>>> >> >>> > to >>> >> >>>>> >> >>> > c++ might make the internals a bit more impenetrable >>> >> >>>>> >> >>> > that >>> >> >>>>> >> >>> > the >>> >> >>>>> >> >>> > current >>> >> >>>>> >> >>> > internals. >>> >> >>>>> >> >>> > (though this would allow c++ people to contribute, so >>> >> >>>>> >> >>> > that >>> >> >>>>> >> >>> > might >>> >> >>>>> >> >>> > balance >>> >> >>>>> >> >>> > out). >>> >> >>>>> >> >>> > >>> >> >>>>> >> >>> > We have a limited core of devs whom right now are >>> >> >>>>> >> >>> > familar >>> >> >>>>> >> >>> > with >>> >> >>>>> >> >>> > things. >>> >> >>>>> >> >>> > If >>> >> >>>>> >> >>> > someone happened to have a starting base for a c++ >>> >> >>>>> >> >>> > library, >>> >> >>>>> >> >>> > then I >>> >> >>>>> >> >>> > might >>> >> >>>>> >> >>> > change >>> >> >>>>> >> >>> > opinions here. >>> >> >>>>> >> >>> > >>> >> >>>>> >> >>> > >>> >> >>>>> >> >>> > my 4c. >>> >> >>>>> >> >>> > >>> >> >>>>> >> >>> > Jeff >>> >> >>>>> >> >>> > >>> >> >>>>> >> >>> > >>> >> >>>>> >> >>> > >>> >> >>>>> >> >>> > >>> >> >>>>> >> >>> > On Thu, Dec 24, 2015 at 7:18 PM, Wes McKinney >>> >> >>>>> >> >>> > >>> >> >>>>> >> >>> > wrote: >>> >> >>>>> >> >>> >> >>> >> >>>>> >> >>> >> Deep thoughts during the holidays. >>> >> >>>>> >> >>> >> >>> >> >>>>> >> >>> >> I might be out of line here, but the >>> >> >>>>> >> >>> >> interpreter-heaviness >>> >> >>>>> >> >>> >> of >>> >> >>>>> >> >>> >> the >>> >> >>>>> >> >>> >> inside of pandas objects is likely to be a long-term >>> >> >>>>> >> >>> >> liability >>> >> >>>>> >> >>> >> and >>> >> >>>>> >> >>> >> source of performance problems and technical debt. >>> >> >>>>> >> >>> >> >>> >> >>>>> >> >>> >> Has anyone put any thought into planning and beginning >>> >> >>>>> >> >>> >> to >>> >> >>>>> >> >>> >> execute >>> >> >>>>> >> >>> >> on a >>> >> >>>>> >> >>> >> rewrite that moves as much as possible of the >>> >> >>>>> >> >>> >> internals >>> >> >>>>> >> >>> >> into >>> >> >>>>> >> >>> >> native >>> >> >>>>> >> >>> >> / >>> >> >>>>> >> >>> >> compiled code? I'm talking about: >>> >> >>>>> >> >>> >> >>> >> >>>>> >> >>> >> - pandas/core/internals >>> >> >>>>> >> >>> >> - indexing and assignment >>> >> >>>>> >> >>> >> - much of pandas/core/common >>> >> >>>>> >> >>> >> - categorical and custom dtypes >>> >> >>>>> >> >>> >> - all indexing mechanisms >>> >> >>>>> >> >>> >> >>> >> >>>>> >> >>> >> I'm concerned we've already exposed too much internals >>> >> >>>>> >> >>> >> to >>> >> >>>>> >> >>> >> users, so >>> >> >>>>> >> >>> >> this might lead to a lot of API breakage, but it might >>> >> >>>>> >> >>> >> be >>> >> >>>>> >> >>> >> for >>> >> >>>>> >> >>> >> the >>> >> >>>>> >> >>> >> Greater Good. As a first step, beginning a partial >>> >> >>>>> >> >>> >> migration >>> >> >>>>> >> >>> >> of >>> >> >>>>> >> >>> >> internals into some C++ classes that encapsulate the >>> >> >>>>> >> >>> >> insides >>> >> >>>>> >> >>> >> of >>> >> >>>>> >> >>> >> DataFrame objects and implement indexing and >>> >> >>>>> >> >>> >> block-level >>> >> >>>>> >> >>> >> manipulations >>> >> >>>>> >> >>> >> would be a good place to start. 
I think you could do >>> >> >>>>> >> >>> >> this >>> >> >>>>> >> >>> >> wouldn't >>> >> >>>>> >> >>> >> too >>> >> >>>>> >> >>> >> much disruption. >>> >> >>>>> >> >>> >> >>> >> >>>>> >> >>> >> As part of this internal retooling we might give >>> >> >>>>> >> >>> >> consideration >>> >> >>>>> >> >>> >> to >>> >> >>>>> >> >>> >> alternative data structures for representing data >>> >> >>>>> >> >>> >> internal >>> >> >>>>> >> >>> >> to >>> >> >>>>> >> >>> >> pandas >>> >> >>>>> >> >>> >> objects. Now in 2015/2016, continuing to be hamstrung >>> >> >>>>> >> >>> >> by >>> >> >>>>> >> >>> >> NumPy's >>> >> >>>>> >> >>> >> limitations feels somewhat anachronistic. User code is >>> >> >>>>> >> >>> >> riddled >>> >> >>>>> >> >>> >> with >>> >> >>>>> >> >>> >> workarounds for data type fidelity issues and the >>> >> >>>>> >> >>> >> like. >>> >> >>>>> >> >>> >> Like, >>> >> >>>>> >> >>> >> really, >>> >> >>>>> >> >>> >> why not add a bitndarray (similar to >>> >> >>>>> >> >>> >> ilanschnell/bitarray) >>> >> >>>>> >> >>> >> for >>> >> >>>>> >> >>> >> storing >>> >> >>>>> >> >>> >> nullness for problematic types and hide this from the >>> >> >>>>> >> >>> >> user? =) >>> >> >>>>> >> >>> >> >>> >> >>>>> >> >>> >> Since we are now a NumFOCUS-sponsored project, I feel >>> >> >>>>> >> >>> >> like >>> >> >>>>> >> >>> >> we >>> >> >>>>> >> >>> >> might >>> >> >>>>> >> >>> >> consider establishing some formal governance over >>> >> >>>>> >> >>> >> pandas >>> >> >>>>> >> >>> >> and >>> >> >>>>> >> >>> >> publishing meetings notes and roadmap documents >>> >> >>>>> >> >>> >> describing >>> >> >>>>> >> >>> >> plans >>> >> >>>>> >> >>> >> for >>> >> >>>>> >> >>> >> the project and meetings notes from committers. >>> >> >>>>> >> >>> >> There's no >>> >> >>>>> >> >>> >> real >>> >> >>>>> >> >>> >> "committer culture" for NumFOCUS projects like there >>> >> >>>>> >> >>> >> is >>> >> >>>>> >> >>> >> with >>> >> >>>>> >> >>> >> the >>> >> >>>>> >> >>> >> Apache Software Foundation, but we might try leading >>> >> >>>>> >> >>> >> by >>> >> >>>>> >> >>> >> example! >>> >> >>>>> >> >>> >> >>> >> >>>>> >> >>> >> Also, I believe pandas as a project has reached a >>> >> >>>>> >> >>> >> level of >>> >> >>>>> >> >>> >> importance >>> >> >>>>> >> >>> >> where we ought to consider planning and execution on >>> >> >>>>> >> >>> >> larger >>> >> >>>>> >> >>> >> scale >>> >> >>>>> >> >>> >> undertakings such as this for safeguarding the future. >>> >> >>>>> >> >>> >> >>> >> >>>>> >> >>> >> As for myself, well, I have my hands full in Big >>> >> >>>>> >> >>> >> Data-land. I >>> >> >>>>> >> >>> >> wish >>> >> >>>>> >> >>> >> I >>> >> >>>>> >> >>> >> could be helping more with pandas, but there a quite a >>> >> >>>>> >> >>> >> few >>> >> >>>>> >> >>> >> fundamental >>> >> >>>>> >> >>> >> issues (like data interoperability nested data >>> >> >>>>> >> >>> >> handling >>> >> >>>>> >> >>> >> and >>> >> >>>>> >> >>> >> file >>> >> >>>>> >> >>> >> format support ? e.g. Parquet, see >>> >> >>>>> >> >>> >> >>> >> >>>>> >> >>> >> >>> >> >>>>> >> >>> >> >>> >> >>>>> >> >>> >> >>> >> >>>>> >> >>> >> >>> >> >>>>> >> >>> >> >>> >> >>>>> >> >>> >> http://wesmckinney.com/blog/the-problem-with-the-data-science-language-wars/) >>> >> >>>>> >> >>> >> preventing Python from being more useful in industry >>> >> >>>>> >> >>> >> analytics >>> >> >>>>> >> >>> >> applications. 
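(A quick illustration of the storage math behind that bitndarray idea,
using plain NumPy -- np.packbits -- rather than a dedicated type:)

    import numpy as np

    n = 1000
    valid = np.ones(n, dtype=bool)
    valid[::7] = False               # mark every 7th value as null

    bits = np.packbits(valid)        # 1 bit per value, padded to bytes
    bits.nbytes                      # 125 bytes, vs. 1000 for a bool mask

    # round-trip: recover the boolean mask, trimming byte padding
    mask = np.unpackbits(bits)[:n].astype(bool)
    assert (mask == valid).all()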
>>> >> >>>>> >> >>> >> Aside: one of the bigger mistakes I made with pandas's API
>>> >> >>>>> >> >>> >> design was making it acceptable to call class constructors --
>>> >> >>>>> >> >>> >> like pandas.DataFrame -- directly (versus factory functions).
>>> >> >>>>> >> >>> >> Sorry about that! If we could convince everyone to start
>>> >> >>>>> >> >>> >> writing pandas.data_frame or dataframe instead of using the
>>> >> >>>>> >> >>> >> class reference it would help a lot with code cleanup. It's
>>> >> >>>>> >> >>> >> hard to plan for these things -- NumPy interoperability seemed
>>> >> >>>>> >> >>> >> a lot more important in 2008 than it does now, so I forgive
>>> >> >>>>> >> >>> >> myself.
>>> >> >>>>> >> >>> >>
>>> >> >>>>> >> >>> >> cheers and best wishes for 2016,
>>> >> >>>>> >> >>> >> Wes

From wesmckinn at gmail.com  Wed Jan  6 14:37:11 2016
From: wesmckinn at gmail.com (Wes McKinney)
Date: Wed, 6 Jan 2016 11:37:11 -0800
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: References: Message-ID:

On Wed, Jan 6, 2016 at 11:26 AM, Wes McKinney wrote:
> hey Stephan,
>
> Thanks for all the thoughts. Let me make a few off-the-cuff comments.
>
> On Wed, Jan 6, 2016 at 10:11 AM, Stephan Hoyer wrote:
>> I was asked about this off list, so I'll belatedly share my thoughts.
>>
>> First of all, I am really excited by Wes's renewed engagement in the
>> project and his interest in rewriting pandas internals. This is quite an
>> ambitious plan and nobody is better positioned to tackle it than Wes.
>>
>> I have mixed feelings about the details of the rewrite itself.
>>
>> +1 on the simpler internal data model. The block manager is confusing and
>> leads to hard-to-predict performance issues related to copying data. If we
>> can do all column additions/removals/re-orderings without a copy it will be
>> a clear win.
>>
>> +0 on moving internals to C++. I do like the performance benefits, but it
>> seems like a lot of work, and it may make pandas less friendly to new
>> contributors.
>>
>
> It really goes beyond performance benefits. If you go back to my 2013
> talk http://www.slideshare.net/wesm/practical-medium-data-analytics-with-python
> there's a long list of architectural problems that now, in 2016, still
> haven't found solutions. The only way (that I can fully reason through --
> I am happy to look at alternate proposals) to move the internals of
> pandas closer to the metal is to give Series and DataFrame a C/C++ API --
> this is the "libpandas native core" as I've been describing.

I should point out that the main thing that's changed since that preso
is "synthetic" data types like Categorical. But seeing what it took for
Jeff et al. to build that is a prime motivation for this internals
refactoring plan.
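To make that concrete: what building a synthetic dtype fights against
today is the lack of a single dispatch point. A toy Python sketch of the
alternative -- all names here (PandasType, take) are hypothetical, an
illustration rather than a design -- where ops route through one table
keyed on a pandas type object instead of is_XXX_dtype branches scattered
through the codebase:

    import numpy as np

    class PandasType(object):
        # a pandas-level type object, deliberately not a NumPy dtype
        def __init__(self, name):
            self.name = name

    INT64 = PandasType('int64')
    CATEGORY = PandasType('category')

    # user-land 'take' routed to per-type implementations
    _TAKE_IMPL = {
        INT64: lambda data, idx: data.take(idx),
        # categorical data is (codes, categories); take touches codes only
        CATEGORY: lambda data, idx: (data[0].take(idx), data[1]),
    }

    def take(pandas_type, data, indexer):
        return _TAKE_IMPL[pandas_type](data, np.asarray(indexer))

    codes = np.array([0, 1, 0, 2], dtype=np.int8)
    categories = np.array(['a', 'b', 'c'])
    take(CATEGORY, (codes, categories), [3, 0])  # codes [2, 0], same categories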
Note that this > does not prevent us from switching to DyND arrays with option dtype in > the future. If the details of how we are implementing NULL are visible > to the user, we have failed. > > 5) Removing the block manager in favor of simpler pandas Array (1D) > and Table (2D -- vector of Array) data structures > > I believe you can do all this without harming interoperability with > the ecosystem of projects that people currently use in conjunction > with pandas. > >> More broadly, I am concerned that this rewrite may improve the tabular >> computation ecosystem at the cost of inter-operability with the array-based >> ecosystem (numpy, scipy, sklearn, xarray, etc.). The later has been one of >> the strengths of pandas and it would be a shame to see that go away. >> > > I have no intention of letting this happen. What I've am asking from > you (and others reading) is to help define what constitutes > interoperability. What guarantees do we make the user? > > For example, we should have very strict guidelines for the output of: > > np.asarray(pandas_obj) > > For example > > In [3]: s = pd.Series([1,2,3]*10).astype('category') > > In [4]: np.asarray(s) > Out[4]: > array([1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, > 3, 1, 2, 3, 1, 2, 3]) > > I see no reason why this should necessarily behave any differently. > The problem will come in when there is pandas data that is not > precisely representable in a NumPy array. Example: > > In [5]: s = pd.Series([1,2,3, 4]) > > In [6]: s.dtype > Out[6]: dtype('int64') > > In [7]: s2 = s.reindex(np.arange(10)) > > In [8]: s2.dtype > Out[8]: dtype('float64') > > In [9]: np.asarray(s2) > Out[9]: array([ 1., 2., 3., 4., nan, nan, nan, nan, nan, nan]) > > With the "new internals", s2 will still be int64 type, but we may > decide that np.asarray(s2) should raise an exception rather than > implicitly make a decision about how to perform a "lossy" conversion > to a NumPy array. If you are using DyND with pandas, then the > equivalent function would be able to implicitly convert without data > loss. > >> We're already starting to struggle with inter-operability with the new >> pandas dtypes and a further rewrite would make this even harder. >> For example, see categoricals and scikit-learn in Tom's recent post [1], or the >> fact that .values no longer always returns a numpy array. This has also been >> a challenge for xarray, which can't handle these new dtypes because we lack >> a suitable array backend for them. > > I'm definitely motivated in this initiative by these challenges. The > idea here is that with the new internals, Series.values will always > return the same type of object, and there will be one consistent code > path for getting a NumPy array out. For example, rather than: > > if isinstance(s.values, Categorical): > # pandas > ... > else: > # NumPy > ... > > We could have (just an idea) > > s.values.to_numpy() > > Or simply > > np.asarray(s.values) > >> >> Personally, I would much rather leverage a full featured library like an >> improved NumPy or DyND for new dtypes, because that could also be used by >> the array-based ecosystem. At the very least, it would be good to think >> about zero-copy inter-operability with array-based tools. 
>> > > I'm all for zero-copy interoperability when possible, but my gut > feeling is that exposing the data type system of an array library (the > choice of which is an implementation detail) to pandas users is an > inherent leaky abstraction that will continue to cause problems if we > plan to keep innovating inside pandas. By better hiding NumPy details > and types from the user we will make it much easier to swap out new > low level array data structures and compute components (e.g. DyND), or > add custom data structures or out-of-core tools (memory maps, bcolz, > etc.) > > I'm additionally offering to do nearly all of this replumbing of > pandas internals myself, and completely in my free time. What I will > expect in return from you all is to help enumerate our contracts with > the pandas user (i.e. interoperability) and to hold me accountable to > not break them. I know I haven't been committing code on pandas since > mid-2013 (after a 5 year marathon), but these architectural problems > have been on my mind almost constantly since then, I just haven't had > the bandwidth to start tackling them. > > cheers, > Wes > >> On the other hand, I wonder if maybe it would be better to write a native >> in-memory backend for Ibis instead of rewriting pandas. Ibis does seem to >> have improved/simplified API which resolves many of pandas's warts. That >> said, it's a pretty big change from the "DataFrame as matrix" model, and >> pandas won't be going away anytime soon. I do like that it would force users >> to be more explicit about converting between tables and arrays, which might >> also make distinctions between the tabular and array oriented ecosystems >> easier to swallow. >> >> Just my two cents, from someone who has lots of opinions but who will likely >> stay on the sidelines for most of this work. >> >> Cheers, >> Stephan >> >> [1] http://tomaugspurger.github.io/categorical-pipelines.html >> >> On Fri, Jan 1, 2016 at 6:06 PM, Jeff Reback wrote: >>> >>> ok I moved the document to the Pandas folder, where the same group should >>> be able to edit/upload/etc. lmk if any issues >>> >>> On Fri, Jan 1, 2016 at 8:48 PM, Wes McKinney wrote: >>>> >>>> Thanks Jeff. Can you create and share a shared Drive folder containing >>>> this where I can put other auxiliary / follow up documents? >>>> >>>> On Fri, Jan 1, 2016 at 5:23 PM, Jeff Reback wrote: >>>> > I changed the doc so that the core dev people can edit. I *think* that >>>> > everyone should be able to view/comment though. >>>> > >>>> > On Fri, Jan 1, 2016 at 8:13 PM, Wes McKinney >>>> > wrote: >>>> >> >>>> >> Jeff -- can you require log-in for editing on this document? >>>> >> >>>> >> >>>> >> https://docs.google.com/document/d/151ct8jcZWwh7XStptjbLsda6h2b3C0IuiH_hfZnUA58/edit# >>>> >> >>>> >> There are a number of anonymous edits. >>>> >> >>>> >> On Wed, Dec 30, 2015 at 6:04 PM, Wes McKinney >>>> >> wrote: >>>> >> > I cobbled together an ugly start of a c++->cython->pandas toolchain >>>> >> > here >>>> >> > >>>> >> > https://github.com/wesm/pandas/tree/libpandas-native-core >>>> >> > >>>> >> > I used a mix of Kudu, Impala, and dynd-python cmake sources, so it's >>>> >> > a >>>> >> > bit messy at the moment but it should be sufficient to run some real >>>> >> > experiments with a little more work. 
I reckon it's like a 6 month >>>> >> > project to tear out the insides of Series and DataFrame and replace >>>> >> > it >>>> >> > with a new "native core", but we should be able to get enough info >>>> >> > to >>>> >> > see whether it's a viable plan within a month or so. >>>> >> > >>>> >> > The end goal is to create "private" extension types in Cython that >>>> >> > can >>>> >> > be the new base classes for Series and NDFrame; these will hold a >>>> >> > reference to a C++ object that contains wrappered NumPy arrays and >>>> >> > other metadata (like pandas-only dtypes). >>>> >> > >>>> >> > It might be too hard to try to replace a single usage of block >>>> >> > manager >>>> >> > as a first experiment, so I'll try to create a minimal "SeriesLite" >>>> >> > that supports 3 dtypes >>>> >> > >>>> >> > 1) float64 with nans >>>> >> > 2) int64 with a bitmask for NAs >>>> >> > 3) category type for one of these >>>> >> > >>>> >> > Just want to get a feel for the extensibility and offer an NA >>>> >> > singleton Python object (a la None) for getting and setting NAs >>>> >> > across >>>> >> > these 3 dtypes. >>>> >> > >>>> >> > If we end up going down this route, any way to place a moratorium on >>>> >> > invasive work on pandas internals (outside bug fixes)? >>>> >> > >>>> >> > Pedantic aside: I'd rather avoid shipping thirdparty C/C++ libraries >>>> >> > like googletest and friends in pandas if we can. Cloudera folks have >>>> >> > been working on a portable C++ library toolchain for Impala and >>>> >> > other >>>> >> > projects at https://github.com/cloudera/native-toolchain, but it is >>>> >> > only being tested on Linux and OS X. Most google libraries should >>>> >> > build out of the box on MSVC but it'll be something to keep an eye >>>> >> > on. >>>> >> > >>>> >> > BTW thanks to the libdynd developers for pioneering the c++ lib <-> >>>> >> > python-c++ lib <-> cython toolchain; being able to build Cython >>>> >> > extensions directly from cmake is a godsend >>>> >> > >>>> >> > HNY all >>>> >> > Wes >>>> >> > >>>> >> > On Tue, Dec 29, 2015 at 4:17 PM, Irwin Zaid >>>> >> > wrote: >>>> >> >> Yeah, that seems reasonable and I totally agree a Pandas wrapper >>>> >> >> layer >>>> >> >> would >>>> >> >> be necessary. >>>> >> >> >>>> >> >> I'll keep an eye on this and I'd like to help if I can. >>>> >> >> >>>> >> >> Irwin >>>> >> >> >>>> >> >> >>>> >> >> On Tue, Dec 29, 2015 at 6:01 PM, Wes McKinney >>>> >> >> wrote: >>>> >> >>> >>>> >> >>> I'm not suggesting a rewrite of NumPy functionality but rather >>>> >> >>> pandas >>>> >> >>> functionality that is currently written in a mishmash of Cython >>>> >> >>> and >>>> >> >>> Python. >>>> >> >>> Happy to experiment with changing the internal compute >>>> >> >>> infrastructure >>>> >> >>> and >>>> >> >>> data representation to DyND after this first stage of cleanup is >>>> >> >>> done. >>>> >> >>> Even >>>> >> >>> if we use DyND a pretty extensive pandas wrapper layer will be >>>> >> >>> necessary. >>>> >> >>> >>>> >> >>> >>>> >> >>> On Tuesday, December 29, 2015, Irwin Zaid >>>> >> >>> wrote: >>>> >> >>>> >>>> >> >>>> Hi Wes (and others), >>>> >> >>>> >>>> >> >>>> I've been following this conversation with interest. I do think >>>> >> >>>> it >>>> >> >>>> would >>>> >> >>>> be worth exploring DyND, rather than setting up yet another >>>> >> >>>> rewrite >>>> >> >>>> of >>>> >> >>>> NumPy-functionality. Especially because DyND is already an >>>> >> >>>> optional >>>> >> >>>> dependency of Pandas. 
>>>> >> >>>> >>>> >> >>>> For things like Integer NA and new dtypes, DyND is there and >>>> >> >>>> ready to >>>> >> >>>> do >>>> >> >>>> this. >>>> >> >>>> >>>> >> >>>> Irwin >>>> >> >>>> >>>> >> >>>> On Tue, Dec 29, 2015 at 5:18 PM, Wes McKinney >>>> >> >>>> >>>> >> >>>> wrote: >>>> >> >>>>> >>>> >> >>>>> Can you link to the PR you're talking about? >>>> >> >>>>> >>>> >> >>>>> I will see about spending a few hours setting up a libpandas.so >>>> >> >>>>> as a >>>> >> >>>>> C++ >>>> >> >>>>> shared library where we can run some experiments and validate >>>> >> >>>>> whether it can >>>> >> >>>>> solve the integer-NA problem and be a place to put new data >>>> >> >>>>> types >>>> >> >>>>> (categorical and friends). I'm +1 on targeting >>>> >> >>>>> >>>> >> >>>>> Would it also be worth making a wish list of APIs we might >>>> >> >>>>> consider >>>> >> >>>>> breaking in a pandas 1.0 release that also features this new >>>> >> >>>>> "native >>>> >> >>>>> core"? >>>> >> >>>>> Might as well right some wrongs while we're doing some invasive >>>> >> >>>>> work >>>> >> >>>>> on the >>>> >> >>>>> internals; some breakage might be unavoidable. We can always >>>> >> >>>>> maintain a >>>> >> >>>>> pandas legacy 0.x.x maintenance branch (providing a conda binary >>>> >> >>>>> build) for >>>> >> >>>>> legacy users where showstopper bugs can get fixed. >>>> >> >>>>> >>>> >> >>>>> On Tue, Dec 29, 2015 at 1:20 PM, Jeff Reback >>>> >> >>>>> >>>> >> >>>>> wrote: >>>> >> >>>>> > Wes your last is noted as well. I *think* we can actually do >>>> >> >>>>> > this >>>> >> >>>>> > now >>>> >> >>>>> > (well >>>> >> >>>>> > there is a PR out there). >>>> >> >>>>> > >>>> >> >>>>> > On Tue, Dec 29, 2015 at 4:12 PM, Wes McKinney >>>> >> >>>>> > >>>> >> >>>>> > wrote: >>>> >> >>>>> >> >>>> >> >>>>> >> The other huge thing this will enable is to do is >>>> >> >>>>> >> copy-on-write >>>> >> >>>>> >> for >>>> >> >>>>> >> various kinds of views, which should cut down on some of the >>>> >> >>>>> >> defensive >>>> >> >>>>> >> copying in the library and reduce memory usage. >>>> >> >>>>> >> >>>> >> >>>>> >> On Tue, Dec 29, 2015 at 1:02 PM, Wes McKinney >>>> >> >>>>> >> >>>> >> >>>>> >> wrote: >>>> >> >>>>> >> > Basically the approach is >>>> >> >>>>> >> > >>>> >> >>>>> >> > 1) Base dtype type >>>> >> >>>>> >> > 2) Base array type with K >= 1 dimensions >>>> >> >>>>> >> > 3) Base scalar type >>>> >> >>>>> >> > 4) Base index type >>>> >> >>>>> >> > 5) "Wrapper" subclasses for all NumPy types fitting into >>>> >> >>>>> >> > categories >>>> >> >>>>> >> > #1, #2, #3, #4 >>>> >> >>>>> >> > 6) Subclasses for pandas-specific types like category, >>>> >> >>>>> >> > datetimeTZ, >>>> >> >>>>> >> > etc. >>>> >> >>>>> >> > 7) NDFrame as cpcloud wrote is just a list of these >>>> >> >>>>> >> > >>>> >> >>>>> >> > Indexes and axis labels / column names can get layered on >>>> >> >>>>> >> > top. >>>> >> >>>>> >> > >>>> >> >>>>> >> > After we do all this we can look at adding nested types >>>> >> >>>>> >> > (arrays, >>>> >> >>>>> >> > maps, >>>> >> >>>>> >> > structs) to better support JSON. >>>> >> >>>>> >> > >>>> >> >>>>> >> > - Wes >>>> >> >>>>> >> > >>>> >> >>>>> >> > On Tue, Dec 29, 2015 at 12:14 PM, Phillip Cloud >>>> >> >>>>> >> > >>>> >> >>>>> >> > wrote: >>>> >> >>>>> >> >> Maybe this is saying the same thing as Wes, but how far >>>> >> >>>>> >> >> would >>>> >> >>>>> >> >> something >>>> >> >>>>> >> >> like >>>> >> >>>>> >> >> this get us? 
>>>> >> >>>>> >> >> >>>> >> >>>>> >> >> // warning: things are probably not this simple >>>> >> >>>>> >> >> >>>> >> >>>>> >> >> struct data_array_t { >>>> >> >>>>> >> >> void *primitive; // scalar data >>>> >> >>>>> >> >> data_array_t *nested; // nested data >>>> >> >>>>> >> >> boost::dynamic_bitset isnull; // might have to create >>>> >> >>>>> >> >> our >>>> >> >>>>> >> >> own >>>> >> >>>>> >> >> to >>>> >> >>>>> >> >> avoid >>>> >> >>>>> >> >> boost >>>> >> >>>>> >> >> schema_t schema; // not sure exactly what this looks >>>> >> >>>>> >> >> like >>>> >> >>>>> >> >> }; >>>> >> >>>>> >> >> >>>> >> >>>>> >> >> typedef std::map data_frame_t; // >>>> >> >>>>> >> >> probably >>>> >> >>>>> >> >> not >>>> >> >>>>> >> >> this >>>> >> >>>>> >> >> simple >>>> >> >>>>> >> >> >>>> >> >>>>> >> >> To answer Jeff?s use-case question: I think that the use >>>> >> >>>>> >> >> cases >>>> >> >>>>> >> >> are >>>> >> >>>>> >> >> 1) >>>> >> >>>>> >> >> freedom from numpy (mostly) 2) no more block manager which >>>> >> >>>>> >> >> frees >>>> >> >>>>> >> >> us >>>> >> >>>>> >> >> from the >>>> >> >>>>> >> >> limitations of the block memory layout. In particular, the >>>> >> >>>>> >> >> ability >>>> >> >>>>> >> >> to >>>> >> >>>>> >> >> take >>>> >> >>>>> >> >> advantage of memory mapped IO would be a big win IMO. >>>> >> >>>>> >> >> >>>> >> >>>>> >> >> >>>> >> >>>>> >> >> On Tue, Dec 29, 2015 at 2:50 PM Wes McKinney >>>> >> >>>>> >> >> >>>> >> >>>>> >> >> wrote: >>>> >> >>>>> >> >>> >>>> >> >>>>> >> >>> I will write a more detailed response to some of these >>>> >> >>>>> >> >>> things >>>> >> >>>>> >> >>> after >>>> >> >>>>> >> >>> the new year, but, in particular, re: missing values, can >>>> >> >>>>> >> >>> you >>>> >> >>>>> >> >>> or >>>> >> >>>>> >> >>> someone tell me why creating an object that contains a >>>> >> >>>>> >> >>> NumPy >>>> >> >>>>> >> >>> array and >>>> >> >>>>> >> >>> a bitmap is not sufficient? If we we can add a >>>> >> >>>>> >> >>> lightweight >>>> >> >>>>> >> >>> C/C++ >>>> >> >>>>> >> >>> class >>>> >> >>>>> >> >>> layer between NumPy function calls (e.g. arithmetic) and >>>> >> >>>>> >> >>> pandas >>>> >> >>>>> >> >>> function calls, then I see no reason why we cannot have >>>> >> >>>>> >> >>> >>>> >> >>>>> >> >>> Int32Array->add >>>> >> >>>>> >> >>> >>>> >> >>>>> >> >>> and >>>> >> >>>>> >> >>> >>>> >> >>>>> >> >>> Float32Array->add >>>> >> >>>>> >> >>> >>>> >> >>>>> >> >>> do the right thing (the former would be responsible for >>>> >> >>>>> >> >>> bitmasking to >>>> >> >>>>> >> >>> propagate NA values; the latter would defer to NumPy). If >>>> >> >>>>> >> >>> we >>>> >> >>>>> >> >>> can >>>> >> >>>>> >> >>> put >>>> >> >>>>> >> >>> all the internals of pandas objects inside a black box, >>>> >> >>>>> >> >>> we >>>> >> >>>>> >> >>> can >>>> >> >>>>> >> >>> add >>>> >> >>>>> >> >>> layers of virtual function indirection without a >>>> >> >>>>> >> >>> performance >>>> >> >>>>> >> >>> penalty >>>> >> >>>>> >> >>> (e.g. adding more interpreter overhead with more >>>> >> >>>>> >> >>> abstraction >>>> >> >>>>> >> >>> layers >>>> >> >>>>> >> >>> does add up to a perf penalty). >>>> >> >>>>> >> >>> >>>> >> >>>>> >> >>> I don't think this is too scary -- I would be willing to >>>> >> >>>>> >> >>> create a >>>> >> >>>>> >> >>> small POC C++ library to prototype something like what >>>> >> >>>>> >> >>> I'm >>>> >> >>>>> >> >>> talking >>>> >> >>>>> >> >>> about. 
>>>> >> >>>>> >> >>>
>>>> >> >>>>> >> >>> Since pandas has limited points of contact with NumPy I don't think
>>>> >> >>>>> >> >>> this would end up being too onerous.
>>>> >> >>>>> >> >>>
>>>> >> >>>>> >> >>> For the record, I'm pretty allergic to "advanced C++"; I think it is
>>>> >> >>>>> >> >>> a useful tool: if you pick a sane 20% subset of the C++11 spec and
>>>> >> >>>>> >> >>> follow Google C++ style, it's not very inaccessible to intermediate
>>>> >> >>>>> >> >>> developers. More or less "C plus OOP and easier object lifetime
>>>> >> >>>>> >> >>> management (shared/unique_ptr, etc.)". As soon as you add a lot of
>>>> >> >>>>> >> >>> template metaprogramming, C++ library development quickly becomes
>>>> >> >>>>> >> >>> inaccessible except to the C++-Jedi.
>>>> >> >>>>> >> >>>
>>>> >> >>>>> >> >>> Maybe let's start a Google document on "pandas roadmap" where we can
>>>> >> >>>>> >> >>> break down the 1-2 year goals and some of these infrastructure issues
>>>> >> >>>>> >> >>> and have our discussion there? (obviously publish this someplace once
>>>> >> >>>>> >> >>> we're done)
>>>> >> >>>>> >> >>>
>>>> >> >>>>> >> >>> - Wes
>>>> >> >>>>> >> >>>
>>>> >> >>>>> >> >>> On Fri, Dec 25, 2015 at 2:14 PM, Jeff Reback wrote:
>>>> >> >>>>> >> >>> > Here are some of my thoughts about the pandas Roadmap / status and
>>>> >> >>>>> >> >>> > some responses to Wes's thoughts.
>>>> >> >>>>> >> >>> >
>>>> >> >>>>> >> >>> > In the last few (and upcoming) major releases we have made the
>>>> >> >>>>> >> >>> > following changes:
>>>> >> >>>>> >> >>> >
>>>> >> >>>>> >> >>> > - dtype enhancements (Categorical, Timedelta, Datetime w/tz) &
>>>> >> >>>>> >> >>> >   making these first class objects
>>>> >> >>>>> >> >>> > - code refactoring to remove subclassing of ndarrays for Series &
>>>> >> >>>>> >> >>> >   Index
>>>> >> >>>>> >> >>> > - carving out / deprecating non-core parts of pandas
>>>> >> >>>>> >> >>> >   - datareader
>>>> >> >>>>> >> >>> >   - SparsePanel, WidePanel & other aliases (TimeSeries)
>>>> >> >>>>> >> >>> >   - rpy, rplot, irow et al.
>>>> >> >>>>> >> >>> >   - google-analytics
>>>> >> >>>>> >> >>> > - API changes to make things more consistent
>>>> >> >>>>> >> >>> >   - pd.rolling/expanding * -> .rolling/expanding (this is in
>>>> >> >>>>> >> >>> >     master now)
>>>> >> >>>>> >> >>> >   - .resample becoming a fully deferred operation, like groupby
>>>> >> >>>>> >> >>> >   - multi-index slicing along any level (obviates the need for
>>>> >> >>>>> >> >>> >     .xs) and allows assignment
>>>> >> >>>>> >> >>> >   - .loc/.iloc - for the most part obviates use of .ix
>>>> >> >>>>> >> >>> >   - .pipe & .assign
>>>> >> >>>>> >> >>> >   - plotting accessors
>>>> >> >>>>> >> >>> >   - fixing of the sorting API
>>>> >> >>>>> >> >>> > - many performance enhancements both micro & macro (e.g. release
>>>> >> >>>>> >> >>> >   GIL)
>>>> >> >>>>> >> >>> >
>>>> >> >>>>> >> >>> > Some on-deck enhancements are (meaning these are basically ready
>>>> >> >>>>> >> >>> > to go in):
>>>> >> >>>>> >> >>> > - IntervalIndex (and eventually make PeriodIndex just a sub-class
>>>> >> >>>>> >> >>> >   of this)
>>>> >> >>>>> >> >>> > - RangeIndex
>>>> >> >>>>> >> >>> >
>>>> >> >>>>> >> >>> > so lots of changes, though nothing really earth-shaking, just more
>>>> >> >>>>> >> >>> > convenience, reducing magicness somewhat and providing
>>>> >> >>>>> >> >>> > flexibility.
>>>> >> >>>>> >> >>> >
>>>> >> >>>>> >> >>> > Of course we are getting increasing issues, mostly bug reports
>>>> >> >>>>> >> >>> > (and lots of dupes), some edge case enhancements which can add to
>>>> >> >>>>> >> >>> > the existing APIs and, of course, requests to expand the (already)
>>>> >> >>>>> >> >>> > large code base to other use cases. Balancing this are a good many
>>>> >> >>>>> >> >>> > pull-requests from many different users, some even deep into the
>>>> >> >>>>> >> >>> > internals.
>>>> >> >>>>> >> >>> >
>>>> >> >>>>> >> >>> > Here are some things that I have talked about and that could be
>>>> >> >>>>> >> >>> > considered for the roadmap. Disclaimer: I do work for Continuum,
>>>> >> >>>>> >> >>> > but these views are of course my own; furthermore I am obviously a
>>>> >> >>>>> >> >>> > bit more familiar with some of the 'sponsored' open-source
>>>> >> >>>>> >> >>> > libraries, but always open to new things.
>>>> >> >>>>> >> >>> >
>>>> >> >>>>> >> >>> > - integration / automatic deferral to numba for JIT (this would be
>>>> >> >>>>> >> >>> >   through .apply)
>>>> >> >>>>> >> >>> > - automatic deferral to dask from groupby where appropriate /
>>>> >> >>>>> >> >>> >   maybe a .to_parallel (to simply return a dask.DataFrame object)
>>>> >> >>>>> >> >>> > - incorporation of quantities / units (as part of the dtype)
>>>> >> >>>>> >> >>> > - use of DyND to allow missing values for int dtypes
>>>> >> >>>>> >> >>> > - make Period a first class dtype
>>>> >> >>>>> >> >>> > - provide some copy-on-write semantics to alleviate the
>>>> >> >>>>> >> >>> >   chained-indexing issues which occasionally come up with mis-use
>>>> >> >>>>> >> >>> >   of the indexing API
>>>> >> >>>>> >> >>> > - allow a 'policy' to automatically provide column blocks for
>>>> >> >>>>> >> >>> >   dict-like input (e.g. each column would be a block); this would
>>>> >> >>>>> >> >>> >   allow a pass-thru API where you could put in numpy arrays where
>>>> >> >>>>> >> >>> >   you have views and have them preserved rather than copied
>>>> >> >>>>> >> >>> >   automatically. Note that this would also allow what I call
>>>> >> >>>>> >> >>> >   'split', where a passed-in multi-dim numpy array could be split
>>>> >> >>>>> >> >>> >   up into individual blocks (which actually gives a nice perf
>>>> >> >>>>> >> >>> >   boost after the splitting costs).
>>>> >> >>>>> >> >>> >
>>>> >> >>>>> >> >>> > In working towards some of these goals, I have come to the opinion
>>>> >> >>>>> >> >>> > that it would make sense to have a neutral API protocol layer that
>>>> >> >>>>> >> >>> > would allow us to swap out different engines as needed, for
>>>> >> >>>>> >> >>> > particular dtypes, or *maybe* out-of-core type computations. E.g.
>>>> >> >>>>> >> >>> > imagine that we replaced the in-memory block structure with a
>>>> >> >>>>> >> >>> > bcolz / memmap type; in theory this should be 'easy' and just
>>>> >> >>>>> >> >>> > work. I could also see us adopting *some* of the SFrame code to
>>>> >> >>>>> >> >>> > allow easier interop with this API layer.
>>>> >> >>>>> >> >>> >
>>>> >> >>>>> >> >>> > In practice, I think a nice API layer would need to be created to
>>>> >> >>>>> >> >>> > make this clean / nice.
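
[A rough Python sketch of the sort of neutral, engine-swappable layer Jeff is describing; ColumnEngine and NumPyEngine are hypothetical names, for illustration only:]

    import numpy as np

    class ColumnEngine(object):
        """Minimal protocol a swappable column backend might satisfy."""
        def take(self, indexer):
            raise NotImplementedError
        def reduce(self, op):
            raise NotImplementedError

    class NumPyEngine(ColumnEngine):
        def __init__(self, values):
            self.values = np.asarray(values)
        def take(self, indexer):
            return NumPyEngine(self.values.take(indexer))
        def reduce(self, op):
            # e.g. op='sum' dispatches to np.sum
            return getattr(np, op)(self.values)

    # A bcolz- or memmap-backed engine could implement the same protocol,
    # letting the container swap storage without touching user-facing code.
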
>>>> >> >>>>> >> >>> >
>>>> >> >>>>> >> >>> > So this comes around to Wes's point about creating a c++ library
>>>> >> >>>>> >> >>> > for the internals (and possibly even some of the indexing
>>>> >> >>>>> >> >>> > routines). In an ideal world, of course this would be desirable.
>>>> >> >>>>> >> >>> > Getting there is a bit non-trivial I think, and IMHO might not be
>>>> >> >>>>> >> >>> > worth the effort. I don't really see big performance bottlenecks.
>>>> >> >>>>> >> >>> > We *already* defer much of the computation to libraries like
>>>> >> >>>>> >> >>> > numexpr & bottleneck (where appropriate). Adding numba / dask to
>>>> >> >>>>> >> >>> > the list would be helpful.
>>>> >> >>>>> >> >>> >
>>>> >> >>>>> >> >>> > I think that almost all performance issues are the result of:
>>>> >> >>>>> >> >>> >
>>>> >> >>>>> >> >>> > a) gross misuse of the pandas API. How much code have you seen
>>>> >> >>>>> >> >>> >    that does df.apply(lambda x: x.sum())
>>>> >> >>>>> >> >>> > b) routines which operate column-by-column rather than
>>>> >> >>>>> >> >>> >    block-by-block and are in python space (e.g. we have an issue
>>>> >> >>>>> >> >>> >    right now about .quantile)
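
[To make point (a) concrete, a small illustration; exact timings vary by machine and version, but the vectorized form avoids a Python-level loop:]

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.randn(1000000, 4), columns=list('abcd'))

    slow = df.apply(lambda x: x.sum())  # Python-level loop over columns
    fast = df.sum()                     # dispatches to optimized internals

    assert np.allclose(slow, fast)      # same answer, very different cost
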
>>>> >> >>>>> >> >>> >
>>>> >> >>>>> >> >>> > So I am glossing over a big goal of having a c++ library that
>>>> >> >>>>> >> >>> > represents the pandas internals. This would by definition have a
>>>> >> >>>>> >> >>> > C API, so that you *could* use pandas-like semantics in c/c++ and
>>>> >> >>>>> >> >>> > just have it work (and then pandas would be a thin wrapper around
>>>> >> >>>>> >> >>> > this library).
>>>> >> >>>>> >> >>> >
>>>> >> >>>>> >> >>> > I am not averse to this, but I think it would be quite a big
>>>> >> >>>>> >> >>> > effort, and not a huge perf boost IMHO. Further, there are a
>>>> >> >>>>> >> >>> > number of API issues w.r.t. indexing which need to be clarified /
>>>> >> >>>>> >> >>> > worked out (e.g. should we simply deprecate []) that are much
>>>> >> >>>>> >> >>> > easier to test / figure out in python space.
>>>> >> >>>>> >> >>> >
>>>> >> >>>>> >> >>> > I also think that we have quite a large number of contributors.
>>>> >> >>>>> >> >>> > Moving to c++ might make the internals a bit more impenetrable
>>>> >> >>>>> >> >>> > than the current internals (though this would allow c++ people to
>>>> >> >>>>> >> >>> > contribute, so that might balance out).
>>>> >> >>>>> >> >>> >
>>>> >> >>>>> >> >>> > We have a limited core of devs who right now are familiar with
>>>> >> >>>>> >> >>> > things. If someone happened to have a starting base for a c++
>>>> >> >>>>> >> >>> > library, then I might change opinions here.
>>>> >> >>>>> >> >>> >
>>>> >> >>>>> >> >>> > my 4c.
>>>> >> >>>>> >> >>> >
>>>> >> >>>>> >> >>> > Jeff
>>>> >> >>>>> >> >>> >
>>>> >> >>>>> >> >>> > On Thu, Dec 24, 2015 at 7:18 PM, Wes McKinney wrote:
>>>> >> >>>>> >> >>> >>
>>>> >> >>>>> >> >>> >> Deep thoughts during the holidays.
>>>> >> >>>>> >> >>> >>
>>>> >> >>>>> >> >>> >> I might be out of line here, but the interpreter-heaviness of
>>>> >> >>>>> >> >>> >> the inside of pandas objects is likely to be a long-term
>>>> >> >>>>> >> >>> >> liability and source of performance problems and technical debt.
>>>> >> >>>>> >> >>> >>
>>>> >> >>>>> >> >>> >> Has anyone put any thought into planning and beginning to
>>>> >> >>>>> >> >>> >> execute on a rewrite that moves as much as possible of the
>>>> >> >>>>> >> >>> >> internals into native / compiled code? I'm talking about:
>>>> >> >>>>> >> >>> >>
>>>> >> >>>>> >> >>> >> - pandas/core/internals
>>>> >> >>>>> >> >>> >> - indexing and assignment
>>>> >> >>>>> >> >>> >> - much of pandas/core/common
>>>> >> >>>>> >> >>> >> - categorical and custom dtypes
>>>> >> >>>>> >> >>> >> - all indexing mechanisms
>>>> >> >>>>> >> >>> >>
>>>> >> >>>>> >> >>> >> I'm concerned we've already exposed too much of the internals to
>>>> >> >>>>> >> >>> >> users, so this might lead to a lot of API breakage, but it might
>>>> >> >>>>> >> >>> >> be for the Greater Good. As a first step, beginning a partial
>>>> >> >>>>> >> >>> >> migration of internals into some C++ classes that encapsulate
>>>> >> >>>>> >> >>> >> the insides of DataFrame objects and implement indexing and
>>>> >> >>>>> >> >>> >> block-level manipulations would be a good place to start. I
>>>> >> >>>>> >> >>> >> think you could do this without too much disruption.
>>>> >> >>>>> >> >>> >>
>>>> >> >>>>> >> >>> >> As part of this internal retooling we might give consideration
>>>> >> >>>>> >> >>> >> to alternative data structures for representing data internal to
>>>> >> >>>>> >> >>> >> pandas objects. Now in 2015/2016, continuing to be hamstrung by
>>>> >> >>>>> >> >>> >> NumPy's limitations feels somewhat anachronistic. User code is
>>>> >> >>>>> >> >>> >> riddled with workarounds for data type fidelity issues and the
>>>> >> >>>>> >> >>> >> like. Like, really, why not add a bitndarray (similar to
>>>> >> >>>>> >> >>> >> ilanschnell/bitarray) for storing nullness for problematic types
>>>> >> >>>>> >> >>> >> and hide this from the user? =)
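
[Back-of-the-envelope arithmetic for the packed-bitmask idea, assuming one validity bit per value as in ilanschnell/bitarray:]

    n = 10**7                         # 10 million int64 values
    data_mb    = 8 * n / 2.**20       # ~76.3 MiB of data
    bitmask_mb = n / 8 / 2.**20       # ~1.2 MiB packed mask, ~1.6% overhead
    bytemask_mb = n / 2.**20          # a numpy bool byte-mask: ~9.5 MiB, 12.5%
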
>>>> >> >>>>> >> >>> >>
>>>> >> >>>>> >> >>> >> Since we are now a NumFOCUS-sponsored project, I feel like we
>>>> >> >>>>> >> >>> >> might consider establishing some formal governance over pandas
>>>> >> >>>>> >> >>> >> and publishing meeting notes from committers and roadmap
>>>> >> >>>>> >> >>> >> documents describing plans for the project. There's no real
>>>> >> >>>>> >> >>> >> "committer culture" for NumFOCUS projects like there is with the
>>>> >> >>>>> >> >>> >> Apache Software Foundation, but we might try leading by example!
>>>> >> >>>>> >> >>> >>
>>>> >> >>>>> >> >>> >> Also, I believe pandas as a project has reached a level of
>>>> >> >>>>> >> >>> >> importance where we ought to consider planning and execution on
>>>> >> >>>>> >> >>> >> larger scale undertakings such as this for safeguarding the
>>>> >> >>>>> >> >>> >> future.
>>>> >> >>>>> >> >>> >>
>>>> >> >>>>> >> >>> >> As for myself, well, I have my hands full in Big Data-land. I
>>>> >> >>>>> >> >>> >> wish I could be helping more with pandas, but there are quite a
>>>> >> >>>>> >> >>> >> few fundamental issues (like data interoperability, nested data
>>>> >> >>>>> >> >>> >> handling, and file format support -- e.g. Parquet, see
>>>> >> >>>>> >> >>> >> http://wesmckinney.com/blog/the-problem-with-the-data-science-language-wars/)
>>>> >> >>>>> >> >>> >> preventing Python from being more useful in industry analytics
>>>> >> >>>>> >> >>> >> applications.
>>>> >> >>>>> >> >>> >>
>>>> >> >>>>> >> >>> >> Aside: one of the bigger mistakes I made with pandas's API
>>>> >> >>>>> >> >>> >> design was making it acceptable to call class constructors --
>>>> >> >>>>> >> >>> >> like pandas.DataFrame -- directly (versus factory functions).
>>>> >> >>>>> >> >>> >> Sorry about that! If we could convince everyone to start writing
>>>> >> >>>>> >> >>> >> pandas.data_frame or dataframe instead of using the class
>>>> >> >>>>> >> >>> >> reference it would help a lot with code cleanup. It's hard to
>>>> >> >>>>> >> >>> >> plan for these things -- NumPy interoperability seemed a lot
>>>> >> >>>>> >> >>> >> more important in 2008 than it does now, so I forgive myself.
>>>> >> >>>>> >> >>> >>
>>>> >> >>>>> >> >>> >> cheers and best wishes for 2016,
>>>> >> >>>>> >> >>> >> Wes
_______________________________________________
Pandas-dev mailing list
Pandas-dev at python.org
https://mail.python.org/mailman/listinfo/pandas-dev
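
[The factory-function pattern from Wes's aside, in miniature; pandas.data_frame does not exist -- the point is only that a function can change what it returns later, while a class reference cannot:]

    import pandas as pd

    def data_frame(data=None, index=None, columns=None):
        # A factory keeps construction logic in one place and stays free to
        # return a different concrete class later without breaking callers.
        return pd.DataFrame(data=data, index=index, columns=columns)

    df = data_frame({'a': [1, 2, 3]})
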
From jeffreback at gmail.com  Wed Jan  6 14:45:45 2016
From: jeffreback at gmail.com (Jeff Reback)
Date: Wed, 6 Jan 2016 14:45:45 -0500
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To:
References:
Message-ID:

I'll just apologize right up front! hahah. No, I think I have been
pushing on these extras in pandas to help move it forward. I have
commented a bit on Stephan's issue here about why I didn't push for
these in numpy. numpy is fairly slow moving (though it moves faster
lately; I suspect the pace when Wes was developing pandas was not much
faster). So pandas was essentially 'fixing' lots of bug / compat issues
in numpy.

To the extent that we can keep the current user-facing API the same
(high likelihood I think), I am willing to accept *some* breakage with
the pandas->duck-like array container API in order to provide swappable
containers.

For example, I recall that in doing datetime w/tz we wanted
Series.values to return a numpy array (which it DOES!) but it is
actually lossy (it loses the tz). Same thing with the Categorical
example Wes gave. I don't think these requirements should hold pandas
back!

People are increasingly using pandas as the API for their work. That
makes it very important that we can handle lots of input properly, w/o
the handcuffs of numpy.

All this said, I'll reiterate Wes's (and others') point that
back-compat is extremely important. (I in fact try to bend over
backwards to provide this; sometimes it's too much, of course!) E.g.
take the resample changes to the API: I was originally going to just do
a hard break, but this turns off people when they have to update their
code or else.

my 4c (incrementing!)

Jeff
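
[The lossiness Jeff mentions, demonstrated; output is illustrative of pandas around 0.17 and may differ by version:]

    import pandas as pd

    s = pd.Series(pd.date_range('2016-01-01', periods=3, tz='US/Eastern'))
    s.dtype    # datetime64[ns, US/Eastern]
    s.values   # numpy datetime64[ns] values in UTC -- the tz is gone

    c = pd.Series(list('abca')).astype('category')
    c.values   # a pandas Categorical, not a numpy ndarray
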
On Wed, Jan 6, 2016 at 2:37 PM, Wes McKinney wrote:
> On Wed, Jan 6, 2016 at 11:26 AM, Wes McKinney wrote:
> > hey Stephan,
> >
> > Thanks for all the thoughts. Let me make a few off-the-cuff comments.
> >
> > On Wed, Jan 6, 2016 at 10:11 AM, Stephan Hoyer wrote:
> >> I was asked about this off list, so I'll belatedly share my thoughts.
> >>
> >> First of all, I am really excited by Wes's renewed engagement in the
> >> project and his interest in rewriting pandas internals. This is quite
> >> an ambitious plan and nobody is better positioned to tackle it than
> >> Wes.
> >>
> >> I have mixed feelings about the details of the rewrite itself.
> >>
> >> +1 on the simpler internal data model. The block manager is confusing
> >> and leads to hard-to-predict performance issues related to copying
> >> data. If we can do all column additions/removals/re-orderings without
> >> a copy it will be a clear win.
> >>
> >> +0 on moving internals to C++. I do like the performance benefits,
> >> but it seems like a lot of work, and it may make pandas less friendly
> >> to new contributors.
> >
> > It really goes beyond performance benefits. If you go back to my 2013
> > talk
> > http://www.slideshare.net/wesm/practical-medium-data-analytics-with-python
> > there's a long list of architectural problems that now in 2016 haven't
> > found solutions. The only way (that I can fully reason through -- I am
> > happy to look at alternate proposals) to move the internals of pandas
> > closer to the metal is to give Series and DataFrame a C/C++ API --
> > this is the "libpandas native core" as I've been describing.
>
> I should point out that the main thing that's changed since that preso
> is "synthetic" data types like Categorical. But seeing what it took for
> Jeff et al to build that is a prime motivation for this internals
> refactoring plan.
>
> >> -0 on writing a brand new dtype system just for pandas -- this stuff
> >> really belongs in NumPy (or another array library like DyND), and I
> >> am skeptical that pandas can do a complete enough job to be useful
> >> without replicating all that functionality.
> >
> > I'm curious what "a brand new dtype system" means to you. pandas
> > already has its own data type system, but it's a potpourri of
> > inconsistencies and rough edges with self-evident problems for both
> > users and developers. Some indicators:
> >
> > - Some pandas types use NaN for missing data, others None (or both),
> >   others nothing at all. We lose data (integers) or bloat memory
> >   (booleans) by upcasting to float-NaN or object-None.
> > - Internal code full of is_XXX_dtype functions: pandas.core.common,
> >   pandas.core.algorithms, etc.
> > - Series.values on synthetic dtypes like Categorical
> > - We use arrays of Python objects for string data
> >
> > The biggest cause IMHO is that pandas is too tightly coupled to NumPy,
> > but it's coupled in a way that makes development and extensibility
> > difficult. We've already allowed NumPy-specific details to taint the
> > pandas user API in many unpleasant ways. This isn't to say "NumPy is
> > bad" but rather "pandas tries to layer domain-specific functionality
> > [that NumPy was not designed for] on top".
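
[Two of the indicators above, demonstrated; behavior shown is pandas circa 0.17:]

    import pandas as pd

    pd.Series(['a', 'b', 'c']).dtype
    # dtype('O') -- strings stored as Python objects

    pd.Series([True, False]).reindex([0, 1, 2]).dtype
    # dtype('O') -- booleans upcast to object to accommodate a null
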
> >
> > Some things I'm advocating with the internals refactor:
> >
> > 1) First class "pandas type" objects. This is not the same as a NumPy
> >    dtype, which has some pretty loaded implications -- in particular,
> >    NumPy dtypes are implicitly coupled to an array computing framework
> >    (see the function table that is attached to the PyArray_Descr
> >    object)
> >
> > 2) Pandas array container types that map user-land API calls to
> >    implementation-land API calls (in NumPy, DyND, or pandas-native
> >    code like pandas.core.algorithms etc.). This will make it much
> >    easier to leverage innovations in NumPy and DyND without those
> >    implementation details spilling over into the pandas user API
> >
> > 3) Adding a single pandas.NA singleton to have one library-wide notion
> >    of a scalar null value (obviously, we can automatically map NaN and
> >    None to NA for backwards compatibility).
> >
> > 4) Layering a bitmask internally on NumPy arrays (especially integer
> >    and boolean) to add null-ness to types that need it. Note that this
> >    does not prevent us from switching to DyND arrays with option dtype
> >    in the future. If the details of how we are implementing NULL are
> >    visible to the user, we have failed.
> >
> > 5) Removing the block manager in favor of simpler pandas Array (1D)
> >    and Table (2D -- vector of Array) data structures
> >
> > I believe you can do all this without harming interoperability with
> > the ecosystem of projects that people currently use in conjunction
> > with pandas.
> >
> >> More broadly, I am concerned that this rewrite may improve the
> >> tabular computation ecosystem at the cost of inter-operability with
> >> the array-based ecosystem (numpy, scipy, sklearn, xarray, etc.). The
> >> latter has been one of the strengths of pandas and it would be a
> >> shame to see that go away.
> >
> > I have no intention of letting this happen. What I am asking from you
> > (and others reading) is to help define what constitutes
> > interoperability. What guarantees do we make the user?
> >
> > For example, we should have very strict guidelines for the output of:
> >
> > np.asarray(pandas_obj)
> >
> > For example
> >
> > In [3]: s = pd.Series([1,2,3]*10).astype('category')
> >
> > In [4]: np.asarray(s)
> > Out[4]:
> > array([1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3,
> >        1, 2, 3, 1, 2, 3, 1, 2, 3])
> >
> > I see no reason why this should necessarily behave any differently.
> > The problem will come in when there is pandas data that is not
> > precisely representable in a NumPy array. Example:
> >
> > In [5]: s = pd.Series([1,2,3, 4])
> >
> > In [6]: s.dtype
> > Out[6]: dtype('int64')
> >
> > In [7]: s2 = s.reindex(np.arange(10))
> >
> > In [8]: s2.dtype
> > Out[8]: dtype('float64')
> >
> > In [9]: np.asarray(s2)
> > Out[9]: array([  1.,   2.,   3.,   4.,  nan,  nan,  nan,  nan,  nan,  nan])
> >
> > With the "new internals", s2 will still be int64 type, but we may
> > decide that np.asarray(s2) should raise an exception rather than
> > implicitly make a decision about how to perform a "lossy" conversion
> > to a NumPy array. If you are using DyND with pandas, then the
> > equivalent function would be able to implicitly convert without data
> > loss.
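
[A sketch of what such a strict conversion contract could look like; to_numpy_strict is a hypothetical name, and values/isnull stand in for the proposed array-plus-bitmask representation:]

    import numpy as np

    def to_numpy_strict(values, isnull):
        # Refuse lossy conversions instead of silently upcasting: int64
        # data with NAs stays int64 inside pandas, and asking for a plain
        # ndarray becomes an explicit, checkable step.
        if isnull.any():
            raise ValueError("data contains NA values not representable "
                             "in a NumPy array of dtype %s" % values.dtype)
        return np.asarray(values)
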
> >> We're already starting to struggle with inter-operability with the
> >> new pandas dtypes and a further rewrite would make this even harder.
> >> For example, see categoricals and scikit-learn in Tom's recent post
> >> [1], or the fact that .values no longer always returns a numpy array.
> >> This has also been a challenge for xarray, which can't handle these
> >> new dtypes because we lack a suitable array backend for them.
> >
> > I'm definitely motivated in this initiative by these challenges. The
> > idea here is that with the new internals, Series.values will always
> > return the same type of object, and there will be one consistent code
> > path for getting a NumPy array out. For example, rather than:
> >
> > if isinstance(s.values, Categorical):
> >     # pandas
> >     ...
> > else:
> >     # NumPy
> >     ...
> >
> > We could have (just an idea)
> >
> > s.values.to_numpy()
> >
> > Or simply
> >
> > np.asarray(s.values)
> >
> >> Personally, I would much rather leverage a full-featured library like
> >> an improved NumPy or DyND for new dtypes, because that could also be
> >> used by the array-based ecosystem. At the very least, it would be
> >> good to think about zero-copy inter-operability with array-based
> >> tools.
> >
> > I'm all for zero-copy interoperability when possible, but my gut
> > feeling is that exposing the data type system of an array library (the
> > choice of which is an implementation detail) to pandas users is an
> > inherent leaky abstraction that will continue to cause problems if we
> > plan to keep innovating inside pandas. By better hiding NumPy details
> > and types from the user we will make it much easier to swap out new
> > low-level array data structures and compute components (e.g. DyND), or
> > add custom data structures or out-of-core tools (memory maps, bcolz,
> > etc.)
> >
> > I'm additionally offering to do nearly all of this replumbing of
> > pandas internals myself, and completely in my free time. What I will
> > expect in return from you all is to help enumerate our contracts with
> > the pandas user (i.e. interoperability) and to hold me accountable to
> > not break them. I know I haven't been committing code on pandas since
> > mid-2013 (after a 5 year marathon), but these architectural problems
> > have been on my mind almost constantly since then, I just haven't had
> > the bandwidth to start tackling them.
> >
> > cheers,
> > Wes
> >
> >> On the other hand, I wonder if maybe it would be better to write a
> >> native in-memory backend for Ibis instead of rewriting pandas. Ibis
> >> does seem to have an improved/simplified API which resolves many of
> >> pandas's warts. That said, it's a pretty big change from the
> >> "DataFrame as matrix" model, and pandas won't be going away anytime
> >> soon. I do like that it would force users to be more explicit about
> >> converting between tables and arrays, which might also make
> >> distinctions between the tabular and array-oriented ecosystems easier
> >> to swallow.
> >>
> >> Just my two cents, from someone who has lots of opinions but who will
> >> likely stay on the sidelines for most of this work.
> >>
> >> Cheers,
> >> Stephan
> >>
> >> [1] http://tomaugspurger.github.io/categorical-pipelines.html
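
[A toy version of the "one consistent path out to NumPy" idea from Wes's reply above; CategoricalValues and to_numpy are illustrative names ("just an idea" in his words), not an existing API:]

    import numpy as np

    class CategoricalValues(object):
        """Stand-in for a uniform pandas array type."""
        def __init__(self, codes, categories):
            self.codes = np.asarray(codes)
            self.categories = np.asarray(categories)

        def to_numpy(self):
            # Materialize the codes against the categories.
            return self.categories.take(self.codes)

        def __array__(self):
            # So np.asarray(values) routes through the same conversion.
            return self.to_numpy()

    v = CategoricalValues([0, 1, 2, 0], ['a', 'b', 'c'])
    np.asarray(v)   # -> array(['a', 'b', 'c', 'a'], dtype=...)
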
I *think* > that > >>>> > everyone should be able to view/comment though. > >>>> > > >>>> > On Fri, Jan 1, 2016 at 8:13 PM, Wes McKinney > >>>> > wrote: > >>>> >> > >>>> >> Jeff -- can you require log-in for editing on this document? > >>>> >> > >>>> >> > >>>> >> > https://docs.google.com/document/d/151ct8jcZWwh7XStptjbLsda6h2b3C0IuiH_hfZnUA58/edit# > >>>> >> > >>>> >> There are a number of anonymous edits. > >>>> >> > >>>> >> On Wed, Dec 30, 2015 at 6:04 PM, Wes McKinney > > >>>> >> wrote: > >>>> >> > I cobbled together an ugly start of a c++->cython->pandas > toolchain > >>>> >> > here > >>>> >> > > >>>> >> > https://github.com/wesm/pandas/tree/libpandas-native-core > >>>> >> > > >>>> >> > I used a mix of Kudu, Impala, and dynd-python cmake sources, so > it's > >>>> >> > a > >>>> >> > bit messy at the moment but it should be sufficient to run some > real > >>>> >> > experiments with a little more work. I reckon it's like a 6 month > >>>> >> > project to tear out the insides of Series and DataFrame and > replace > >>>> >> > it > >>>> >> > with a new "native core", but we should be able to get enough > info > >>>> >> > to > >>>> >> > see whether it's a viable plan within a month or so. > >>>> >> > > >>>> >> > The end goal is to create "private" extension types in Cython > that > >>>> >> > can > >>>> >> > be the new base classes for Series and NDFrame; these will hold a > >>>> >> > reference to a C++ object that contains wrappered NumPy arrays > and > >>>> >> > other metadata (like pandas-only dtypes). > >>>> >> > > >>>> >> > It might be too hard to try to replace a single usage of block > >>>> >> > manager > >>>> >> > as a first experiment, so I'll try to create a minimal > "SeriesLite" > >>>> >> > that supports 3 dtypes > >>>> >> > > >>>> >> > 1) float64 with nans > >>>> >> > 2) int64 with a bitmask for NAs > >>>> >> > 3) category type for one of these > >>>> >> > > >>>> >> > Just want to get a feel for the extensibility and offer an NA > >>>> >> > singleton Python object (a la None) for getting and setting NAs > >>>> >> > across > >>>> >> > these 3 dtypes. > >>>> >> > > >>>> >> > If we end up going down this route, any way to place a > moratorium on > >>>> >> > invasive work on pandas internals (outside bug fixes)? > >>>> >> > > >>>> >> > Pedantic aside: I'd rather avoid shipping thirdparty C/C++ > libraries > >>>> >> > like googletest and friends in pandas if we can. Cloudera folks > have > >>>> >> > been working on a portable C++ library toolchain for Impala and > >>>> >> > other > >>>> >> > projects at https://github.com/cloudera/native-toolchain, but > it is > >>>> >> > only being tested on Linux and OS X. Most google libraries should > >>>> >> > build out of the box on MSVC but it'll be something to keep an > eye > >>>> >> > on. > >>>> >> > > >>>> >> > BTW thanks to the libdynd developers for pioneering the c++ lib > <-> > >>>> >> > python-c++ lib <-> cython toolchain; being able to build Cython > >>>> >> > extensions directly from cmake is a godsend > >>>> >> > > >>>> >> > HNY all > >>>> >> > Wes > >>>> >> > > >>>> >> > On Tue, Dec 29, 2015 at 4:17 PM, Irwin Zaid > >>>> >> > wrote: > >>>> >> >> Yeah, that seems reasonable and I totally agree a Pandas wrapper > >>>> >> >> layer > >>>> >> >> would > >>>> >> >> be necessary. > >>>> >> >> > >>>> >> >> I'll keep an eye on this and I'd like to help if I can. 
> >>>> >> >> > >>>> >> >> Irwin > >>>> >> >> > >>>> >> >> > >>>> >> >> On Tue, Dec 29, 2015 at 6:01 PM, Wes McKinney < > wesmckinn at gmail.com> > >>>> >> >> wrote: > >>>> >> >>> > >>>> >> >>> I'm not suggesting a rewrite of NumPy functionality but rather > >>>> >> >>> pandas > >>>> >> >>> functionality that is currently written in a mishmash of Cython > >>>> >> >>> and > >>>> >> >>> Python. > >>>> >> >>> Happy to experiment with changing the internal compute > >>>> >> >>> infrastructure > >>>> >> >>> and > >>>> >> >>> data representation to DyND after this first stage of cleanup > is > >>>> >> >>> done. > >>>> >> >>> Even > >>>> >> >>> if we use DyND a pretty extensive pandas wrapper layer will be > >>>> >> >>> necessary. > >>>> >> >>> > >>>> >> >>> > >>>> >> >>> On Tuesday, December 29, 2015, Irwin Zaid > >>>> >> >>> wrote: > >>>> >> >>>> > >>>> >> >>>> Hi Wes (and others), > >>>> >> >>>> > >>>> >> >>>> I've been following this conversation with interest. I do > think > >>>> >> >>>> it > >>>> >> >>>> would > >>>> >> >>>> be worth exploring DyND, rather than setting up yet another > >>>> >> >>>> rewrite > >>>> >> >>>> of > >>>> >> >>>> NumPy-functionality. Especially because DyND is already an > >>>> >> >>>> optional > >>>> >> >>>> dependency of Pandas. > >>>> >> >>>> > >>>> >> >>>> For things like Integer NA and new dtypes, DyND is there and > >>>> >> >>>> ready to > >>>> >> >>>> do > >>>> >> >>>> this. > >>>> >> >>>> > >>>> >> >>>> Irwin > >>>> >> >>>> > >>>> >> >>>> On Tue, Dec 29, 2015 at 5:18 PM, Wes McKinney > >>>> >> >>>> > >>>> >> >>>> wrote: > >>>> >> >>>>> > >>>> >> >>>>> Can you link to the PR you're talking about? > >>>> >> >>>>> > >>>> >> >>>>> I will see about spending a few hours setting up a > libpandas.so > >>>> >> >>>>> as a > >>>> >> >>>>> C++ > >>>> >> >>>>> shared library where we can run some experiments and validate > >>>> >> >>>>> whether it can > >>>> >> >>>>> solve the integer-NA problem and be a place to put new data > >>>> >> >>>>> types > >>>> >> >>>>> (categorical and friends). I'm +1 on targeting > >>>> >> >>>>> > >>>> >> >>>>> Would it also be worth making a wish list of APIs we might > >>>> >> >>>>> consider > >>>> >> >>>>> breaking in a pandas 1.0 release that also features this new > >>>> >> >>>>> "native > >>>> >> >>>>> core"? > >>>> >> >>>>> Might as well right some wrongs while we're doing some > invasive > >>>> >> >>>>> work > >>>> >> >>>>> on the > >>>> >> >>>>> internals; some breakage might be unavoidable. We can always > >>>> >> >>>>> maintain a > >>>> >> >>>>> pandas legacy 0.x.x maintenance branch (providing a conda > binary > >>>> >> >>>>> build) for > >>>> >> >>>>> legacy users where showstopper bugs can get fixed. > >>>> >> >>>>> > >>>> >> >>>>> On Tue, Dec 29, 2015 at 1:20 PM, Jeff Reback > >>>> >> >>>>> > >>>> >> >>>>> wrote: > >>>> >> >>>>> > Wes your last is noted as well. I *think* we can actually > do > >>>> >> >>>>> > this > >>>> >> >>>>> > now > >>>> >> >>>>> > (well > >>>> >> >>>>> > there is a PR out there). > >>>> >> >>>>> > > >>>> >> >>>>> > On Tue, Dec 29, 2015 at 4:12 PM, Wes McKinney > >>>> >> >>>>> > > >>>> >> >>>>> > wrote: > >>>> >> >>>>> >> > >>>> >> >>>>> >> The other huge thing this will enable is to do is > >>>> >> >>>>> >> copy-on-write > >>>> >> >>>>> >> for > >>>> >> >>>>> >> various kinds of views, which should cut down on some of > the > >>>> >> >>>>> >> defensive > >>>> >> >>>>> >> copying in the library and reduce memory usage. 
> >>>> >> >>>>> >> > >>>> >> >>>>> >> On Tue, Dec 29, 2015 at 1:02 PM, Wes McKinney > >>>> >> >>>>> >> > >>>> >> >>>>> >> wrote: > >>>> >> >>>>> >> > Basically the approach is > >>>> >> >>>>> >> > > >>>> >> >>>>> >> > 1) Base dtype type > >>>> >> >>>>> >> > 2) Base array type with K >= 1 dimensions > >>>> >> >>>>> >> > 3) Base scalar type > >>>> >> >>>>> >> > 4) Base index type > >>>> >> >>>>> >> > 5) "Wrapper" subclasses for all NumPy types fitting into > >>>> >> >>>>> >> > categories > >>>> >> >>>>> >> > #1, #2, #3, #4 > >>>> >> >>>>> >> > 6) Subclasses for pandas-specific types like category, > >>>> >> >>>>> >> > datetimeTZ, > >>>> >> >>>>> >> > etc. > >>>> >> >>>>> >> > 7) NDFrame as cpcloud wrote is just a list of these > >>>> >> >>>>> >> > > >>>> >> >>>>> >> > Indexes and axis labels / column names can get layered > on > >>>> >> >>>>> >> > top. > >>>> >> >>>>> >> > > >>>> >> >>>>> >> > After we do all this we can look at adding nested types > >>>> >> >>>>> >> > (arrays, > >>>> >> >>>>> >> > maps, > >>>> >> >>>>> >> > structs) to better support JSON. > >>>> >> >>>>> >> > > >>>> >> >>>>> >> > - Wes > >>>> >> >>>>> >> > > >>>> >> >>>>> >> > On Tue, Dec 29, 2015 at 12:14 PM, Phillip Cloud > >>>> >> >>>>> >> > > >>>> >> >>>>> >> > wrote: > >>>> >> >>>>> >> >> Maybe this is saying the same thing as Wes, but how far > >>>> >> >>>>> >> >> would > >>>> >> >>>>> >> >> something > >>>> >> >>>>> >> >> like > >>>> >> >>>>> >> >> this get us? > >>>> >> >>>>> >> >> > >>>> >> >>>>> >> >> // warning: things are probably not this simple > >>>> >> >>>>> >> >> > >>>> >> >>>>> >> >> struct data_array_t { > >>>> >> >>>>> >> >> void *primitive; // scalar data > >>>> >> >>>>> >> >> data_array_t *nested; // nested data > >>>> >> >>>>> >> >> boost::dynamic_bitset isnull; // might have to > create > >>>> >> >>>>> >> >> our > >>>> >> >>>>> >> >> own > >>>> >> >>>>> >> >> to > >>>> >> >>>>> >> >> avoid > >>>> >> >>>>> >> >> boost > >>>> >> >>>>> >> >> schema_t schema; // not sure exactly what this > looks > >>>> >> >>>>> >> >> like > >>>> >> >>>>> >> >> }; > >>>> >> >>>>> >> >> > >>>> >> >>>>> >> >> typedef std::map data_frame_t; > // > >>>> >> >>>>> >> >> probably > >>>> >> >>>>> >> >> not > >>>> >> >>>>> >> >> this > >>>> >> >>>>> >> >> simple > >>>> >> >>>>> >> >> > >>>> >> >>>>> >> >> To answer Jeff?s use-case question: I think that the > use > >>>> >> >>>>> >> >> cases > >>>> >> >>>>> >> >> are > >>>> >> >>>>> >> >> 1) > >>>> >> >>>>> >> >> freedom from numpy (mostly) 2) no more block manager > which > >>>> >> >>>>> >> >> frees > >>>> >> >>>>> >> >> us > >>>> >> >>>>> >> >> from the > >>>> >> >>>>> >> >> limitations of the block memory layout. In particular, > the > >>>> >> >>>>> >> >> ability > >>>> >> >>>>> >> >> to > >>>> >> >>>>> >> >> take > >>>> >> >>>>> >> >> advantage of memory mapped IO would be a big win IMO. > >>>> >> >>>>> >> >> > >>>> >> >>>>> >> >> > >>>> >> >>>>> >> >> On Tue, Dec 29, 2015 at 2:50 PM Wes McKinney > >>>> >> >>>>> >> >> > >>>> >> >>>>> >> >> wrote: > >>>> >> >>>>> >> >>> > >>>> >> >>>>> >> >>> I will write a more detailed response to some of these > >>>> >> >>>>> >> >>> things > >>>> >> >>>>> >> >>> after > >>>> >> >>>>> >> >>> the new year, but, in particular, re: missing values, > can > >>>> >> >>>>> >> >>> you > >>>> >> >>>>> >> >>> or > >>>> >> >>>>> >> >>> someone tell me why creating an object that contains a > >>>> >> >>>>> >> >>> NumPy > >>>> >> >>>>> >> >>> array and > >>>> >> >>>>> >> >>> a bitmap is not sufficient? 
If we we can add a > >>>> >> >>>>> >> >>> lightweight > >>>> >> >>>>> >> >>> C/C++ > >>>> >> >>>>> >> >>> class > >>>> >> >>>>> >> >>> layer between NumPy function calls (e.g. arithmetic) > and > >>>> >> >>>>> >> >>> pandas > >>>> >> >>>>> >> >>> function calls, then I see no reason why we cannot > have > >>>> >> >>>>> >> >>> > >>>> >> >>>>> >> >>> Int32Array->add > >>>> >> >>>>> >> >>> > >>>> >> >>>>> >> >>> and > >>>> >> >>>>> >> >>> > >>>> >> >>>>> >> >>> Float32Array->add > >>>> >> >>>>> >> >>> > >>>> >> >>>>> >> >>> do the right thing (the former would be responsible > for > >>>> >> >>>>> >> >>> bitmasking to > >>>> >> >>>>> >> >>> propagate NA values; the latter would defer to > NumPy). If > >>>> >> >>>>> >> >>> we > >>>> >> >>>>> >> >>> can > >>>> >> >>>>> >> >>> put > >>>> >> >>>>> >> >>> all the internals of pandas objects inside a black > box, > >>>> >> >>>>> >> >>> we > >>>> >> >>>>> >> >>> can > >>>> >> >>>>> >> >>> add > >>>> >> >>>>> >> >>> layers of virtual function indirection without a > >>>> >> >>>>> >> >>> performance > >>>> >> >>>>> >> >>> penalty > >>>> >> >>>>> >> >>> (e.g. adding more interpreter overhead with more > >>>> >> >>>>> >> >>> abstraction > >>>> >> >>>>> >> >>> layers > >>>> >> >>>>> >> >>> does add up to a perf penalty). > >>>> >> >>>>> >> >>> > >>>> >> >>>>> >> >>> I don't think this is too scary -- I would be willing > to > >>>> >> >>>>> >> >>> create a > >>>> >> >>>>> >> >>> small POC C++ library to prototype something like what > >>>> >> >>>>> >> >>> I'm > >>>> >> >>>>> >> >>> talking > >>>> >> >>>>> >> >>> about. > >>>> >> >>>>> >> >>> > >>>> >> >>>>> >> >>> Since pandas has limited points of contact with NumPy > I > >>>> >> >>>>> >> >>> don't > >>>> >> >>>>> >> >>> think > >>>> >> >>>>> >> >>> this would end up being too onerous. > >>>> >> >>>>> >> >>> > >>>> >> >>>>> >> >>> For the record, I'm pretty allergic to "advanced > C++"; I > >>>> >> >>>>> >> >>> think it > >>>> >> >>>>> >> >>> is a > >>>> >> >>>>> >> >>> useful tool if you pick a sane 20% subset of the C++11 > >>>> >> >>>>> >> >>> spec > >>>> >> >>>>> >> >>> and > >>>> >> >>>>> >> >>> follow > >>>> >> >>>>> >> >>> Google C++ style it's not very inaccessible to > >>>> >> >>>>> >> >>> intermediate > >>>> >> >>>>> >> >>> developers. More or less "C plus OOP and easier object > >>>> >> >>>>> >> >>> lifetime > >>>> >> >>>>> >> >>> management (shared/unique_ptr, etc.)". As soon as you > add > >>>> >> >>>>> >> >>> a > >>>> >> >>>>> >> >>> lot > >>>> >> >>>>> >> >>> of > >>>> >> >>>>> >> >>> template metaprogramming C++ library development > quickly > >>>> >> >>>>> >> >>> becomes > >>>> >> >>>>> >> >>> inaccessible except to the C++-Jedi. > >>>> >> >>>>> >> >>> > >>>> >> >>>>> >> >>> Maybe let's start a Google document on "pandas > roadmap" > >>>> >> >>>>> >> >>> where > >>>> >> >>>>> >> >>> we > >>>> >> >>>>> >> >>> can > >>>> >> >>>>> >> >>> break down the 1-2 year goals and some of these > >>>> >> >>>>> >> >>> infrastructure > >>>> >> >>>>> >> >>> issues > >>>> >> >>>>> >> >>> and have our discussion there? 
(obviously publish this > >>>> >> >>>>> >> >>> someplace > >>>> >> >>>>> >> >>> once > >>>> >> >>>>> >> >>> we're done) > >>>> >> >>>>> >> >>> > >>>> >> >>>>> >> >>> - Wes > >>>> >> >>>>> >> >>> > >>>> >> >>>>> >> >>> On Fri, Dec 25, 2015 at 2:14 PM, Jeff Reback > >>>> >> >>>>> >> >>> > >>>> >> >>>>> >> >>> wrote: > >>>> >> >>>>> >> >>> > Here are some of my thoughts about pandas Roadmap / > >>>> >> >>>>> >> >>> > status > >>>> >> >>>>> >> >>> > and > >>>> >> >>>>> >> >>> > some > >>>> >> >>>>> >> >>> > responses to Wes's thoughts. > >>>> >> >>>>> >> >>> > > >>>> >> >>>>> >> >>> > In the last few (and upcoming) major releases we > have > >>>> >> >>>>> >> >>> > been > >>>> >> >>>>> >> >>> > made > >>>> >> >>>>> >> >>> > the > >>>> >> >>>>> >> >>> > following changes: > >>>> >> >>>>> >> >>> > > >>>> >> >>>>> >> >>> > - dtype enhancements (Categorical, Timedelta, > Datetime > >>>> >> >>>>> >> >>> > w/tz) & > >>>> >> >>>>> >> >>> > making > >>>> >> >>>>> >> >>> > these > >>>> >> >>>>> >> >>> > first class objects > >>>> >> >>>>> >> >>> > - code refactoring to remove subclassing of ndarrays > >>>> >> >>>>> >> >>> > for > >>>> >> >>>>> >> >>> > Series > >>>> >> >>>>> >> >>> > & > >>>> >> >>>>> >> >>> > Index > >>>> >> >>>>> >> >>> > - carving out / deprecating non-core parts of pandas > >>>> >> >>>>> >> >>> > - datareader > >>>> >> >>>>> >> >>> > - SparsePanel, WidePanel & other aliases > (TImeSeries) > >>>> >> >>>>> >> >>> > - rpy, rplot, irow et al. > >>>> >> >>>>> >> >>> > - google-analytics > >>>> >> >>>>> >> >>> > - API changes to make things more consistent > >>>> >> >>>>> >> >>> > - pd.rolling/expanding * -> .rolling/expanding > (this > >>>> >> >>>>> >> >>> > is > >>>> >> >>>>> >> >>> > in > >>>> >> >>>>> >> >>> > master > >>>> >> >>>>> >> >>> > now) > >>>> >> >>>>> >> >>> > - .resample becoming a full defered like groupby. > >>>> >> >>>>> >> >>> > - multi-index slicing along any level (obviates > need > >>>> >> >>>>> >> >>> > for > >>>> >> >>>>> >> >>> > .xs) > >>>> >> >>>>> >> >>> > and > >>>> >> >>>>> >> >>> > allows > >>>> >> >>>>> >> >>> > assignment > >>>> >> >>>>> >> >>> > - .loc/.iloc - for the most part obviates use of > .ix > >>>> >> >>>>> >> >>> > - .pipe & .assign > >>>> >> >>>>> >> >>> > - plotting accessors > >>>> >> >>>>> >> >>> > - fixing of the sorting API > >>>> >> >>>>> >> >>> > - many performance enhancements both micro & macro > >>>> >> >>>>> >> >>> > (e.g. > >>>> >> >>>>> >> >>> > release > >>>> >> >>>>> >> >>> > GIL) > >>>> >> >>>>> >> >>> > > >>>> >> >>>>> >> >>> > Some on-deck enhancements are (meaning these are > >>>> >> >>>>> >> >>> > basically > >>>> >> >>>>> >> >>> > ready to > >>>> >> >>>>> >> >>> > go > >>>> >> >>>>> >> >>> > in): > >>>> >> >>>>> >> >>> > - IntervalIndex (and eventually make PeriodIndex > just > >>>> >> >>>>> >> >>> > a > >>>> >> >>>>> >> >>> > sub-class > >>>> >> >>>>> >> >>> > of > >>>> >> >>>>> >> >>> > this) > >>>> >> >>>>> >> >>> > - RangeIndex > >>>> >> >>>>> >> >>> > > >>>> >> >>>>> >> >>> > so lots of changes, though nothing really earth > >>>> >> >>>>> >> >>> > shaking, > >>>> >> >>>>> >> >>> > just > >>>> >> >>>>> >> >>> > more > >>>> >> >>>>> >> >>> > convenience, reducing magicness somewhat > >>>> >> >>>>> >> >>> > and providing flexibility. 
> >>>> >> >>>>> >> >>> > > >>>> >> >>>>> >> >>> > Of course we are getting increasing issues, mostly > bug > >>>> >> >>>>> >> >>> > reports > >>>> >> >>>>> >> >>> > (and > >>>> >> >>>>> >> >>> > lots > >>>> >> >>>>> >> >>> > of > >>>> >> >>>>> >> >>> > dupes), some edge case enhancements > >>>> >> >>>>> >> >>> > which can add to the existing API's and of course, > >>>> >> >>>>> >> >>> > requests > >>>> >> >>>>> >> >>> > to > >>>> >> >>>>> >> >>> > expand > >>>> >> >>>>> >> >>> > the > >>>> >> >>>>> >> >>> > (already) large code to other usecases. > >>>> >> >>>>> >> >>> > Balancing this are a good many pull-requests from > many > >>>> >> >>>>> >> >>> > different > >>>> >> >>>>> >> >>> > users, > >>>> >> >>>>> >> >>> > some > >>>> >> >>>>> >> >>> > even deep into the internals. > >>>> >> >>>>> >> >>> > > >>>> >> >>>>> >> >>> > Here are some things that I have talked about and > could > >>>> >> >>>>> >> >>> > be > >>>> >> >>>>> >> >>> > considered > >>>> >> >>>>> >> >>> > for > >>>> >> >>>>> >> >>> > the roadmap. Disclaimer: I do work for Continuum > >>>> >> >>>>> >> >>> > but these views are of course my own; furthermore > >>>> >> >>>>> >> >>> > obviously > >>>> >> >>>>> >> >>> > I > >>>> >> >>>>> >> >>> > am a > >>>> >> >>>>> >> >>> > bit > >>>> >> >>>>> >> >>> > more > >>>> >> >>>>> >> >>> > familiar with some of the 'sponsored' open-source > >>>> >> >>>>> >> >>> > libraries, but always open to new things. > >>>> >> >>>>> >> >>> > > >>>> >> >>>>> >> >>> > - integration / automatic deferral to numba for JIT > >>>> >> >>>>> >> >>> > (this > >>>> >> >>>>> >> >>> > would > >>>> >> >>>>> >> >>> > be > >>>> >> >>>>> >> >>> > thru > >>>> >> >>>>> >> >>> > .apply) > >>>> >> >>>>> >> >>> > - automatic deferal to dask from groubpy where > >>>> >> >>>>> >> >>> > appropriate > >>>> >> >>>>> >> >>> > / > >>>> >> >>>>> >> >>> > maybe a > >>>> >> >>>>> >> >>> > .to_parallel (to simply return a dask.DataFrame > object) > >>>> >> >>>>> >> >>> > - incorporation of quantities / units (as part of > the > >>>> >> >>>>> >> >>> > dtype) > >>>> >> >>>>> >> >>> > - use of DyND to allow missing values for int dtypes > >>>> >> >>>>> >> >>> > - make Period a first class dtype. > >>>> >> >>>>> >> >>> > - provide some copy-on-write semantics to alleviate > the > >>>> >> >>>>> >> >>> > chained-indexing > >>>> >> >>>>> >> >>> > issues which occasionaly come up with the mis-use of > >>>> >> >>>>> >> >>> > the > >>>> >> >>>>> >> >>> > indexing > >>>> >> >>>>> >> >>> > API > >>>> >> >>>>> >> >>> > - allow a 'policy' to automatically provide column > >>>> >> >>>>> >> >>> > blocks > >>>> >> >>>>> >> >>> > for > >>>> >> >>>>> >> >>> > dict-like > >>>> >> >>>>> >> >>> > input (e.g. each column would be a block), this > would > >>>> >> >>>>> >> >>> > allow > >>>> >> >>>>> >> >>> > a > >>>> >> >>>>> >> >>> > pass-thru > >>>> >> >>>>> >> >>> > API > >>>> >> >>>>> >> >>> > where you could > >>>> >> >>>>> >> >>> > put in numpy arrays where you have views and have > them > >>>> >> >>>>> >> >>> > preserved > >>>> >> >>>>> >> >>> > rather > >>>> >> >>>>> >> >>> > than > >>>> >> >>>>> >> >>> > copied automatically. 
Note that this would also > allow > >>>> >> >>>>> >> >>> > what > >>>> >> >>>>> >> >>> > I > >>>> >> >>>>> >> >>> > call > >>>> >> >>>>> >> >>> > 'split' > >>>> >> >>>>> >> >>> > where a passed in > >>>> >> >>>>> >> >>> > multi-dim numpy array could be split up to > individual > >>>> >> >>>>> >> >>> > blocks > >>>> >> >>>>> >> >>> > (which > >>>> >> >>>>> >> >>> > actually > >>>> >> >>>>> >> >>> > gives a nice perf boost after the splitting costs). > >>>> >> >>>>> >> >>> > > >>>> >> >>>>> >> >>> > In working towards some of these goals. I have come > to > >>>> >> >>>>> >> >>> > the > >>>> >> >>>>> >> >>> > opinion > >>>> >> >>>>> >> >>> > that > >>>> >> >>>>> >> >>> > it > >>>> >> >>>>> >> >>> > would make sense to have a neutral API protocol > layer > >>>> >> >>>>> >> >>> > that would allow us to swap out different engines as > >>>> >> >>>>> >> >>> > needed, > >>>> >> >>>>> >> >>> > for > >>>> >> >>>>> >> >>> > particular > >>>> >> >>>>> >> >>> > dtypes, or *maybe* out-of-core type computations. > E.g. > >>>> >> >>>>> >> >>> > imagine that we replaced the in-memory block > structure > >>>> >> >>>>> >> >>> > with > >>>> >> >>>>> >> >>> > a > >>>> >> >>>>> >> >>> > bclolz > >>>> >> >>>>> >> >>> > / > >>>> >> >>>>> >> >>> > memap > >>>> >> >>>>> >> >>> > type; in theory this should be 'easy' and just work. > >>>> >> >>>>> >> >>> > I could also see us adopting *some* of the SFrame > code > >>>> >> >>>>> >> >>> > to > >>>> >> >>>>> >> >>> > allow > >>>> >> >>>>> >> >>> > easier > >>>> >> >>>>> >> >>> > interop with this API layer. > >>>> >> >>>>> >> >>> > > >>>> >> >>>>> >> >>> > In practice, I think a nice API layer would need to > be > >>>> >> >>>>> >> >>> > created > >>>> >> >>>>> >> >>> > to > >>>> >> >>>>> >> >>> > make > >>>> >> >>>>> >> >>> > this > >>>> >> >>>>> >> >>> > clean / nice. > >>>> >> >>>>> >> >>> > > >>>> >> >>>>> >> >>> > So this comes around to Wes's point about creating a > >>>> >> >>>>> >> >>> > c++ > >>>> >> >>>>> >> >>> > library for > >>>> >> >>>>> >> >>> > the > >>>> >> >>>>> >> >>> > internals (and possibly even some of the indexing > >>>> >> >>>>> >> >>> > routines). > >>>> >> >>>>> >> >>> > In an ideal world, or course this would be > desirable. > >>>> >> >>>>> >> >>> > Getting > >>>> >> >>>>> >> >>> > there > >>>> >> >>>>> >> >>> > is a > >>>> >> >>>>> >> >>> > bit > >>>> >> >>>>> >> >>> > non-trivial I think, and IMHO might not be worth the > >>>> >> >>>>> >> >>> > effort. I > >>>> >> >>>>> >> >>> > don't > >>>> >> >>>>> >> >>> > really see big performance bottlenecks. We *already* > >>>> >> >>>>> >> >>> > defer > >>>> >> >>>>> >> >>> > much > >>>> >> >>>>> >> >>> > of > >>>> >> >>>>> >> >>> > the > >>>> >> >>>>> >> >>> > computation to libraries like numexpr & bottleneck > >>>> >> >>>>> >> >>> > (where > >>>> >> >>>>> >> >>> > appropriate). > >>>> >> >>>>> >> >>> > Adding numba / dask to the list would be helpful. > >>>> >> >>>>> >> >>> > > >>>> >> >>>>> >> >>> > I think that almost all performance issues are the > >>>> >> >>>>> >> >>> > result > >>>> >> >>>>> >> >>> > of: > >>>> >> >>>>> >> >>> > > >>>> >> >>>>> >> >>> > a) gross misuse of the pandas API. How much code > have > >>>> >> >>>>> >> >>> > you > >>>> >> >>>>> >> >>> > seen > >>>> >> >>>>> >> >>> > that > >>>> >> >>>>> >> >>> > does > >>>> >> >>>>> >> >>> > df.apply(lambda x: x.sum()) > >>>> >> >>>>> >> >>> > b) routines which operate column-by-column rather > >>>> >> >>>>> >> >>> > block-by-block and > >>>> >> >>>>> >> >>> > are > >>>> >> >>>>> >> >>> > in > >>>> >> >>>>> >> >>> > python space (e.g. 
> So I am glossing over a big goal of having a c++ library that
> represents the pandas internals. This would by definition have a C API,
> so that you *could* use pandas-like semantics in c/c++ and just have it
> work (and then pandas would be a thin wrapper around this library).
>
> I am not averse to this, but I think it would be quite a big effort,
> and not a huge perf boost IMHO. Further there are a number of API
> issues w.r.t. indexing which need to be clarified / worked out (e.g.
> should we simply deprecate []?) that are much easier to test / figure
> out in python space.
>
> I also think that we have quite a large number of contributors. Moving
> to c++ might make the internals a bit more impenetrable than the
> current internals (though this would allow c++ people to contribute, so
> that might balance out).
>
> We have a limited core of devs who right now are familiar with things.
> If someone happened to have a starting base for a c++ library, then I
> might change opinions here.
>
> my 4c.
>
> Jeff
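A rough sketch of the kind of neutral protocol layer Jeff describes above -- all names here are hypothetical; nothing like this existed in pandas at the time:

    import numpy as np

    class ArrayEngine(object):
        # hypothetical interface: containers talk only to this protocol,
        # so the storage backend (numpy blocks, bcolz, memmap, ...) can
        # be swapped out per dtype or for out-of-core computation
        def take(self, indexer):
            raise NotImplementedError

        def reduce(self, op):
            raise NotImplementedError

    class NumpyEngine(ArrayEngine):
        def __init__(self, values):
            self.values = np.asarray(values)

        def take(self, indexer):
            return NumpyEngine(self.values.take(indexer))

        def reduce(self, op):
            return getattr(np, op)(self.values)

    # an out-of-core or memmap engine would implement the same two methods
    engine = NumpyEngine([1.0, 2.0, 3.0])
    engine.reduce('sum')  # 6.0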
> On Thu, Dec 24, 2015 at 7:18 PM, Wes McKinney wrote:
>>
>> Deep thoughts during the holidays.
>>
>> I might be out of line here, but the interpreter-heaviness of the
>> inside of pandas objects is likely to be a long-term liability and
>> source of performance problems and technical debt.
>>
>> Has anyone put any thought into planning and beginning to execute on a
>> rewrite that moves as much as possible of the internals into native /
>> compiled code? I'm talking about:
>>
>> - pandas/core/internals
>> - indexing and assignment
>> - much of pandas/core/common
>> - categorical and custom dtypes
>> - all indexing mechanisms
>>
>> I'm concerned we've already exposed too much internals to users, so
>> this might lead to a lot of API breakage, but it might be for the
>> Greater Good. As a first step, beginning a partial migration of
>> internals into some C++ classes that encapsulate the insides of
>> DataFrame objects and implement indexing and block-level manipulations
>> would be a good place to start. I think you could do this without too
>> much disruption.
>>
>> As part of this internal retooling we might give consideration to
>> alternative data structures for representing data internal to pandas
>> objects. Now in 2015/2016, continuing to be hamstrung by NumPy's
>> limitations feels somewhat anachronistic. User code is riddled with
>> workarounds for data type fidelity issues and the like. Like, really,
>> why not add a bitndarray (similar to ilanschnell/bitarray) for storing
>> nullness for problematic types and hide this from the user? =)
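A sketch of that bitmask idea, hand-rolled on top of numpy's packbits -- a real bitndarray would be a proper array type; this is only to make the concept concrete:

    import numpy as np

    class MaskedInt64(object):
        # hypothetical: int64 data plus a packed validity bitmask, so no
        # sentinel is stolen from the int64 range and no upcast to float64
        def __init__(self, values):
            self.data = np.asarray(values, dtype=np.int64)
            self.valid = np.packbits(np.ones(len(self.data), dtype=np.uint8))

        def set_na(self, i):
            # clear the validity bit for element i (bits are packed MSB-first)
            self.valid[i // 8] &= ~np.uint8(1 << (7 - i % 8))

        def is_na(self, i):
            return not (self.valid[i // 8] >> (7 - i % 8)) & 1

    arr = MaskedInt64([1, 2, 3])
    arr.set_na(1)
    arr.is_na(1)  # True, while arr.data[1] still holds a plain int64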
>> Since we are now a NumFOCUS-sponsored project, I feel like we might
>> consider establishing some formal governance over pandas and
>> publishing meeting notes from committers and roadmap documents
>> describing plans for the project. There's no real "committer culture"
>> for NumFOCUS projects like there is with the Apache Software
>> Foundation, but we might try leading by example!
>>
>> Also, I believe pandas as a project has reached a level of importance
>> where we ought to consider planning and execution on larger scale
>> undertakings such as this for safeguarding the future.
>>
>> As for myself, well, I have my hands full in Big Data-land. I wish I
>> could be helping more with pandas, but there are quite a few
>> fundamental issues (like data interoperability, nested data handling,
>> and file format support -- e.g. Parquet, see
>> http://wesmckinney.com/blog/the-problem-with-the-data-science-language-wars/ )
>> preventing Python from being more useful in industry analytics
>> applications.
>>
>> Aside: one of the bigger mistakes I made with pandas's API design was
>> making it acceptable to call class constructors -- like
>> pandas.DataFrame -- directly (versus factory functions). Sorry about
>> that! If we could convince everyone to start writing pandas.data_frame
>> or dataframe instead of using the class reference it would help a lot
>> with code cleanup. It's hard to plan for these things -- NumPy
>> interoperability seemed a lot more important in 2008 than it does now,
>> so I forgive myself.
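The factory-function indirection would look something like this (pandas.data_frame is hypothetical -- it has never existed):

    import pandas as pd

    def data_frame(data=None, index=None, columns=None):
        # hypothetical factory: callers bind to a function rather than the
        # class, so the concrete return type can evolve without breaking them
        return pd.DataFrame(data, index=index, columns=columns)

    df = data_frame({'a': [1, 2, 3]})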
>> cheers and best wishes for 2016,
>> Wes

_______________________________________________
Pandas-dev mailing list
Pandas-dev at python.org
https://mail.python.org/mailman/listinfo/pandas-dev

From wesmckinn at gmail.com Wed Jan 6 15:15:38 2016
From: wesmckinn at gmail.com (Wes McKinney)
Date: Wed, 6 Jan 2016 12:15:38 -0800
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: References: Message-ID:

I also will add that there is an ideology that has existed in the scientific Python community since 2011 at least which is this: pandas should not have existed; it should be part of NumPy instead.

In my opinion, that misses the point of pandas, both then and now.

There's a large and mostly new class of Python users working on domain-specific industry analytics problems for whom pandas is the most important tool that they use on a daily basis. Their knowledge of NumPy is limited, beyond the aspects of the ndarray API that are the same in pandas. High level APIs and accessibility for them is extremely important. But their skill sets and the problems they are solving are not, on the whole, the same ones you would have heard discussed at SciPy 2010.

Sometime in 2015, "Python for Data Analysis" sold its 100,000th copy. I have 5 foreign translations sitting on my shelf -- this represents a very large group of people that we have all collectively enabled by developing pandas -- for a lot of people, pandas is the main reason they use Python!

So the summary of all this is: pandas is much more important as a project now than it was 5 years ago. Our relationship with our library dependencies like NumPy should reflect that.
Downstream pandas consumers should similarly eventually concern themselves more with pandas compatibility (rather than always assuming that NumPy arrays are the only intermediary). This is a philosophical shift, but one that will ultimately benefit the usability of the stack.

On Wed, Jan 6, 2016 at 11:45 AM, Jeff Reback wrote:
> I'll just apologize right up front! hahah.
>
> No, I think I have been pushing on these extras in pandas to help move
> it forward. I have commented a bit on Stephan's issue here about why I
> didn't push for these in numpy. numpy is fairly slow moving (though it
> moves faster lately; I suspect the pace when Wes was developing pandas
> was not much faster).
>
> So pandas was essentially 'fixing' lots of bug / compat issues in numpy.
>
> To the extent that we can keep the current user-facing API the same
> (high likelihood I think), I am willing to accept *some* breakage with
> the pandas->duck-like array container API in order to provide swappable
> containers.
>
> For example I recall that in doing datetime w/tz, we wanted
> Series.values to return a numpy array (which it DOES!) but it is
> actually lossy (it loses the tz). Same thing with the Categorical
> example Wes gave. I don't think these requirements should hold pandas
> back!
>
> People are increasingly using pandas as the API for their work. That
> makes it very important that we can handle lots of input properly, w/o
> the handcuffs of numpy.
>
> All this said, I'll reiterate Wes's (and others') point: back-compat is
> extremely important. (I in fact try to bend over backwards to provide
> it; sometimes it's too much of course!) E.g. take the resample changes
> to the API -- I was originally going to just do a hard break, but this
> turns off people when they have to update their code or else.
>
> my 4c (incrementing!)
>
> Jeff
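Jeff's datetime-with-tz example, spelled out -- a sketch of the behavior as of the pandas 0.17 era, not from the original message:

    import pandas as pd

    s = pd.Series(pd.date_range('2016-01-01', periods=3, tz='US/Eastern'))

    s.dtype   # datetime64[ns, US/Eastern] -- a pandas-only dtype
    s.values  # a plain numpy datetime64[ns] array in UTC: the tz is dropped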
From wesmckinn at gmail.com Fri Jan 8 20:34:05 2016
From: wesmckinn at gmail.com (Wes McKinney)
Date: Fri, 8 Jan 2016 17:34:05 -0800
Subject: [Pandas-dev] Unit test reorganization
Message-ID:

hi folks,

I have a few questions about the test suite. As context, I note that test_series.py is now 8200 lines and test_frame.py 17000 lines.

Big #1 question is, how strongly do you feel about *shipping* the test suite in site-packages? Some other libraries with sprawling and complex test suites have chosen not to ship them: https://github.com/zzzeek/sqlalchemy

Independently, I would support and help with starting a judicious reorganization of the contents of pandas/tests. So I'm thinking like

tests/
  dataframe/
  series/
  algorithms/
  internals/
  tseries/

and so forth.

Thoughts?

- Wes

From wesmckinn at gmail.com Fri Jan 8 20:47:48 2016
From: wesmckinn at gmail.com (Wes McKinney)
Date: Fri, 8 Jan 2016 17:47:48 -0800
Subject: [Pandas-dev] Unit test reorganization
In-Reply-To: References: Message-ID:

+ mailing list

Do the distros run them _after_ installation? I'm talking about installing the unit tests during `python setup.py install`, but still including them in the tarball.

On Fri, Jan 8, 2016 at 5:43 PM, Jeff Reback wrote:
> all for reorging into subdirs as these have grown pretty big
>
> what's the big deal with shipping the tests?
>
> I suspect some of the Linux distros do run them
>
> and just merged https://github.com/pydata/pandas/pull/11913
> though we could configure a subset that ships I suppose
>
>> On Jan 8, 2016, at 8:34 PM, Wes McKinney wrote:
>> [...]

From jeffreback at gmail.com Fri Jan 8 20:53:51 2016
From: jeffreback at gmail.com (Jeff Reback)
Date: Fri, 8 Jan 2016 20:53:51 -0500
Subject: [Pandas-dev] Unit test reorganization
In-Reply-To: References: Message-ID: <5E86966B-F624-4ECB-AC72-2F9DCEBC7B14@gmail.com>

no idea

> On Jan 8, 2016, at 8:47 PM, Wes McKinney wrote:
> [...]
From wesmckinn at gmail.com Fri Jan 8 21:04:13 2016
From: wesmckinn at gmail.com (Wes McKinney)
Date: Fri, 8 Jan 2016 18:04:13 -0800
Subject: [Pandas-dev] Unit test reorganization
In-Reply-To: <5E86966B-F624-4ECB-AC72-2F9DCEBC7B14@gmail.com> References: <5E86966B-F624-4ECB-AC72-2F9DCEBC7B14@gmail.com> Message-ID:

It looks like the debian packaging scripts would need to change. + Yaroslav to see if this would be onerous

On Fri, Jan 8, 2016 at 5:53 PM, Jeff Reback wrote:
> no idea
> [...]

From shoyer at gmail.com Sun Jan 10 21:06:56 2016
From: shoyer at gmail.com (Stephan Hoyer)
Date: Sun, 10 Jan 2016 18:06:56 -0800
Subject: [Pandas-dev] Unit test reorganization
In-Reply-To: References: Message-ID:

On Fri, Jan 8, 2016 at 5:34 PM, Wes McKinney wrote:
> Big #1 question is, how strongly do you feel about *shipping* the test
> suite in site-packages? Some other libraries with sprawling and
> complex test suites have chosen not to ship them:
> https://github.com/zzzeek/sqlalchemy

I would prefer to include the test suite if possible, because the ability to type "nosetests pandas" makes it easy both for users to verify installations are working properly and for downstream distributors to identify and report bugs. The complete pandas test suite still runs in 20-30 minutes, so I think it's still fairly reasonable to use it for these purposes.

> Independently, I would support and help with starting a judicious
> reorganization of the contents of pandas/tests. So I'm thinking like
>
> tests/
>   dataframe/
>   series/
>   algorithms/
>   internals/
>   tseries/
>
> and so forth.

This sounds like a great idea -- these files have really gotten out of control!

Cheers,
Stephan

From wesmckinn at gmail.com Mon Jan 11 11:47:47 2016
From: wesmckinn at gmail.com (Wes McKinney)
Date: Mon, 11 Jan 2016 08:47:47 -0800
Subject: [Pandas-dev] Unit test reorganization
In-Reply-To: References: Message-ID:

On Sun, Jan 10, 2016 at 6:06 PM, Stephan Hoyer wrote:
> I would prefer to include the test suite if possible, because the
> ability to type "nosetests pandas" makes it easy both for users to
> verify installations are working properly and for downstream
> distributors to identify and report bugs. The complete pandas test
> suite still runs in 20-30 minutes, so I think it's still fairly
> reasonable to use it for these purposes.

Got it. I wasn't sure if this was something people still wanted to do in practice with the burgeoning test suite.
>> Independently, I would support and help with starting a judicious
>> reorganization of the contents of pandas/tests. So I'm thinking like
>>
>> tests/
>>   dataframe/
>>   series/
>>   algorithms/
>>   internals/
>>   tseries/
>>
>> and so forth.

> This sounds like a great idea -- these files have really gotten out of
> control!

Sounds good. I've been sorting through points of contact between Series/DataFrame's implementation and internal matters (e.g. the BlockManager) and figured it would be good to "quarantine" code that makes assumptions about what's under the hood. I'll get the first couple patches started and it can be a slow burn to break apart these large files.

> Cheers,
> Stephan

From shoyer at gmail.com Mon Jan 11 12:36:42 2016
From: shoyer at gmail.com (Stephan Hoyer)
Date: Mon, 11 Jan 2016 09:36:42 -0800
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: References: Message-ID:

Hi Wes,

You raise some important points.

I agree that pandas's patched version of the numpy dtype system is a mess. But despite its issues, its leaky abstraction on top of NumPy provides benefits. In particular, it makes pandas easy to emulate (e.g., xarray), extend (e.g., geopandas) and integrate with other libraries (e.g., patsy, Scikit-Learn, matplotlib).

You are right that pandas has started to supplant numpy as a high level API for data analysis, but of course the robust (and often numpy based) Python ecosystem is part of what has made pandas so successful. In practice, ecosystem projects often want to work with more primitive objects than series/dataframes in their internal data structures, and without numpy this becomes more difficult. For example, how do you concatenate a list of categoricals? If these were numpy arrays, we could use np.concatenate, but the current implementation of categorical would require a custom solution. First class compatibility with pandas is harder when pandas data cannot be used with a full ndarray API.
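The custom solution Stephan alludes to might look roughly like this -- a hypothetical helper; pandas had no built-in union_categoricals at the time:

    import numpy as np
    import pandas as pd

    def concat_categoricals(pieces):
        # hypothetical: union the categories, then concatenate integer codes
        categories = pd.Index(
            sorted(set().union(*(set(p.categories) for p in pieces))))
        codes = np.concatenate([categories.get_indexer(np.asarray(p))
                                for p in pieces])
        return pd.Categorical.from_codes(codes, categories)

    a = pd.Categorical(['x', 'y'])
    b = pd.Categorical(['y', 'z'])
    concat_categoricals([a, b])  # [x, y, y, z], categories [x, y, z]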
Likewise, hiding implementation details retains some flexibility for us (as developers), but in an ideal world, we would know we have the right abstraction, and could then expose the implementation as an advanced API! This is the case for some very mature projects, such as NumPy. Pandas is not really there yet (with the block manager), but it might be something to strive towards in this rewrite.

At this point, I suppose the ship has sailed (e.g., with categorical in .values) on full numpy compatibility. So we absolutely do need explicit interfaces for converting to NumPy, rather than the current implicit guarantees about .values -- which we violated with categorical. Something like your suggested .to_numpy() method would indeed be an improvement over the current state, where we half-pretend that NumPy could be used as an advanced API for pandas, even though it doesn't really work.
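Concretely, a sketch of the difference -- .to_numpy() here is Stephan's suggested method, not an API that existed; the behavior shown is the 0.17-era behavior:

    import numpy as np
    import pandas as pd

    s = pd.Series(pd.Categorical(['a', 'b', 'a']))

    s.values       # a Categorical, not an ndarray -- the implicit guarantee broke
    np.asarray(s)  # object ndarray ['a', 'b', 'a']: the explicit, documented
                   # conversion a hypothetical s.to_numpy() could promise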
I'm sure you would agree that -- at least in theory -- it would be nice to push dtype improvements upstream to numpy, but that is obviously more work (for a variety of reasons) than starting from scratch in pandas. Of course, I think pandas has a need and right to exist as a separate library. But I do think building off of NumPy made it stronger, and pushing improvements upstream would be a better way to go. This has been my approach, and is why I've worked on both pandas and NumPy.

The bottom line is that I don't agree that this is the most productive path forward -- I would opt for improving NumPy or DyND instead, which I believe would cause much less pain downstream -- but given that I'm not going to be the person doing the work, I will defer to your judgment. Pandas is certainly in need of holistic improvements and the maturity of a v1.0 release, and that's not something I'm in a position to push myself.

Best,
Stephan

P.S. apologies for the delay -- it's been a busy week.

On Wed, Jan 6, 2016 at 12:15 PM, Wes McKinney wrote:
> [...]

From wesmckinn at gmail.com Mon Jan 11 13:45:24 2016
From: wesmckinn at gmail.com (Wes McKinney)
Date: Mon, 11 Jan 2016 10:45:24 -0800
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: References: Message-ID:

On Mon, Jan 11, 2016 at 9:36 AM, Stephan Hoyer wrote:
> [...]
This seems like a false dichotomy to me. I'm not arguing for forging a NumPy-free or DyND-free path, but rather making DyND's or NumPy's physical memory representation and array computing infrastructure more clearly implementation details of pandas that have limited user-visibility (except when using NumPy / DyND-based tools is necessary).

The main problems we have faced with NumPy are:

- Much more difficult to extend
- Legacy code makes major changes difficult or impossible
- pandas users likely represent a minority (but perhaps a plurality, at this point) of NumPy users

DyND's scope, as I understand it, is to be used for more use cases than an internal detail of pandas objects. It doesn't have the legacy baggage, but it will face similar challenges around being a general purpose array library versus a more domain-specific analytics and data preparation library.

pandas already has what can be called a "logical type system" (see e.g. https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md for other examples of logical type representations). We use NumPy dtypes for the physical memory representation along with various conventions for pandas-specific behavior like missing data, but they are weakly abstracted in a way that's definitely harmful for users. What I am arguing is:

1) Introduce a proper (from a software engineering perspective) logical data type abstraction that models the way that pandas already works, but cleaning up all the mess (implicit upcasts, lack of a real "NA" scalar value, making pandas-specific methods like unique, factorize, match, etc. true "array methods")

2) Use NumPy physical dtypes (for now) as the primary target physical representation

3) Layer new machinery (like bitmasks) on top of raw NumPy arrays to add new features to pandas

4) Give pandas objects a real C API so that users can manipulate and create pandas objects with their own native (C/C++/Cython) code.

5) Yes, absolutely improve NumPy and DyND and transition to improved NumPy and DyND facilities as soon as they are available and shipped

I don't see alternative ways for pandas to have a truly healthy relationship with more general purpose array / scientific computing libraries without being able to add new pandas functionality in a clean way, and without requiring us to get patches accepted (and released) in NumPy or DyND.

Can you clarify what aspects of this plan are disagreeable / contentious? Are you arguing for pandas becoming more of a companion tool / user interface layer for NumPy or DyND?

cheers,
Wes
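A bare-bones sketch of what points (1)-(3) might look like -- hypothetical classes, just to make the shape of the proposal concrete:

    import numpy as np

    class LogicalType(object):
        # hypothetical: semantics (NA handling, category metadata, ...)
        # live here, decoupled from the physical storage
        def physical_dtype(self):
            raise NotImplementedError  # point 2: storage stays a NumPy dtype

    class Int64NA(LogicalType):
        def physical_dtype(self):
            return np.dtype('int64')
        # point 3: NAs come from a separate bitmask layered on the raw
        # array, not from a sentinel or an implicit upcast to float64

    class Category(LogicalType):
        def __init__(self, categories):
            self.categories = categories  # pandas-only metadata
        def physical_dtype(self):
            return np.dtype('int8')  # the integer codes array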
From wesmckinn at gmail.com Mon Jan 11 14:33:39 2016
From: wesmckinn at gmail.com (Wes McKinney)
Date: Mon, 11 Jan 2016 11:33:39 -0800
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: References: Message-ID:

On Mon, Jan 11, 2016 at 10:45 AM, Wes McKinney wrote:
> [...]
> I don't see alternative ways for pandas to have a truly healthy
> relationship with more general purpose array / scientific computing
> libraries without being able to add new pandas functionality in a
> clean way, and without requiring us to get patches accepted (and
> released) in NumPy or DyND.

Just to be clear on my stance re: pushing more code upstream into array libraries: if we introduce the right level of coupling / abstraction between pandas and NumPy/DyND, it will be much easier for us to use libpandas as a staging area for code that we are proposing to push upstream into one of those libraries. That's not really possible right now because pandas's internals are not easily portable to other C/C++ codebases (being written in a mix of pure Python and Cython).

> [...]
From shoyer at gmail.com Mon Jan 11 14:55:21 2016
From: shoyer at gmail.com (Stephan Hoyer)
Date: Mon, 11 Jan 2016 11:55:21 -0800
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: References: Message-ID:

On Mon, Jan 11, 2016 at 11:33 AM, Wes McKinney wrote:
> Just to be clear on my stance re: pushing more code upstream into
> array libraries: [...]

Yep, also agreed. I think DyND is probably a better target than NumPy here, if only because it's also written in C++. NumPy, of course, has been a beast to extend.

From shoyer at gmail.com Mon Jan 11 14:55:24 2016
From: shoyer at gmail.com (Stephan Hoyer)
Date: Mon, 11 Jan 2016 11:55:24 -0800
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++?
/ Roadmap
In-Reply-To: References: Message-ID:

> I don't see alternative ways for pandas to have a truly healthy
> relationship with more general purpose array / scientific computing
> libraries without being able to add new pandas functionality in a
> clean way, and without requiring us to get patches accepted (and
> released) in NumPy or DyND.

Indeed, I think my disagreement is mostly about the order in which we approach these problems.

> Can you clarify what aspects of this plan are disagreeable /
> contentious?

See my comments below.

> Are you arguing for pandas becoming more of a companion
> tool / user interface layer for NumPy or DyND?

Not quite. Pandas has some fantastic and highly usable data structures (Series, DataFrame, Index). These certainly don't belong in NumPy or DyND.

However, the array-based ecosystem certainly could use improvements to dtypes (e.g., datetime and categorical) and dtype-specific methods (e.g., for strings) just as much as pandas. I do firmly believe that pushing these types of improvements upstream, rather than implementing them independently for pandas, would yield benefits for the broader ecosystem. With the right infrastructure, generalizing things to arrays is not much more work.

I'd like to see pandas itself focus more on the data structures and less on the data types. This would let us share more work with the "general purpose array / scientific computing libraries".

> 1) Introduce a proper (from a software engineering perspective)
> logical data type abstraction that models the way that pandas already
> works, but cleaning up all the mess (implicit upcasts, lack of a real
> "NA" scalar value, making pandas-specific methods like unique,
> factorize, match, etc. true "array methods")

New abstractions have a cost. A new logical data type abstraction is better than no proper abstraction at all, but (in principle), one data type abstraction should be enough to share.

A proper logical data type abstraction would be an improvement over the current situation, but if there's a way we could introduce one less abstraction (by improving things upstream in a general purpose array library) that would help even more.

For example, we could imagine pushing to make DyND the new core for pandas. This could be enough of a push to make DyND generally useful -- I know it still has a few kinks to work out.

> 4) Give pandas objects a real C API so that users can manipulate and
> create pandas objects with their own native (C/C++/Cython) code.

> 5) Yes, absolutely improve NumPy and DyND and transition to improved
> NumPy and DyND facilities as soon as they are available and shipped

I like the sound of both of these.

From jeffreback at gmail.com Mon Jan 11 18:04:58 2016
From: jeffreback at gmail.com (Jeff Reback)
Date: Mon, 11 Jan 2016 18:04:58 -0500
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: References: Message-ID:

I am in favor of the Wes refactoring, but for some slightly different reasons. I am including some in-line comments.

On Mon, Jan 11, 2016 at 2:55 PM, Stephan Hoyer wrote:

>> I don't see alternative ways for pandas to have a truly healthy
>> > > Indeed, I think my disagreement is mostly about the order in which we > approach these problems. > I agree here. I had started on *some* of this to enable swappable numpy to DyND to support IntNA (all in python, but the fundamental change was to provide an API layer to the back-end). > > >> Can you clarify what aspects of this plan are disagreeable / >> contentious? > > > See my comments below. > > >> Are you arguing for pandas becoming more of a companion >> tool / user interface layer for NumPy or DyND? >> > > Not quite. Pandas has some fantastic and highly useable data (Series, > DataFrame, Index). These certainly don't belong in NumPy or DyND. > > However, the array-based ecosystem certainly could use improvements to > dtypes (e.g., datetime and categorical) and dtype specific methods (e.g., > for strings) just as much as pandas. I do firmly believe that pushing these > types of improvements upstream, rather than implementing them independently > for pandas, would yield benefits for the broader ecosystem. With the right > infrastructure, generalizing things to arrays is not much more work. > I dont' think Wes nor I disagree here at all. The problem was (and is), the pace of change in the underlying libraries. It is simply too slow for pandas development efforts. I think the pandas efforts (and other libraries) can result in more powerful fundamental libraries that get pushed upstream. However, it would not benefit ANYONE to slow down downstream efforts. I am not sure why you suggest that we WAIT for the upstream libraries to change? We have been waiting forever for that. Now we have a concrete implementation of certain data types that are useful. They (upstream) can take this and build on (or throw it away and make a better one or whatever). But I don't think it benefits anyone to WAIT for someone to change numpy first. Look at how long it took them to (partially) fix datetimes. xarray in particular has done the same thing to pandas, e.g. you have added additional selection operators and syntax (e.g. passing dicts of named axes). These changes are in fact propogating to pandas. This has taken time (but much much less that this took for any of pandas changes to numpy). Further look at how long you have advocated (correctly) for labeled arrays in numpy (which we are still waiting). > > I'd like to see pandas itself focus more on the data-structures and less > on the data types. This would let us share more work with the "general > purpose array / scientific computing libraries". > > Pandas IS about specifying the correct data types. It is simply incorrect to decouple this problem from the data-structures. A lot of effort over the years has gone into making all dtypes playing nice with each other and within pandas. > 1) Introduce a proper (from a software engineering perspective) >> logical data type abstraction that models the way that pandas already >> works, but cleaning up all the mess (implicit upcasts, lack of a real >> "NA" scalar value, making pandas-specific methods like unique, >> factorize, match, etc. true "array methods") >> > > New abstractions have a cost. A new logical data type abstraction is > better than no proper abstraction at all, but (in principle), one data type > abstraction should be enough to share. > > > A proper logical data type abstraction would be an improvement over the > current situation, but if there's a way we could introduce one less > abstraction (by improving things upstream in a general purpose array > library) that would help even more. 
> > This is just pushing a problem upstream, which ultimately, given numpy's > track record, won't be solved at all. We will be here 1 year from now > with the exact same discussion. Why are we waiting on upstream for anything? > As I said above, if something is created which upstream finds useful on a > general level, great. The great cost here is time. > >> >> For example, we could imagine pushing to make DyND the new core for >> pandas. This could be enough of a push to make DyND generally useful -- I >> know it still has a few kinks to work out. >> >> > > Maybe, but DyND has to have full compatibility with what is currently out there > (soonish). Then I agree this could be possible. But wouldn't it be even > better > for pandas to be able to swap back-ends? Why limit ourselves to a > particular backend if it's not that difficult? > >> 4) Give pandas objects a real C API so that users can manipulate and >> create pandas objects with their own native (C/C++/Cython) code. >> > >> 5) Yes, absolutely improve NumPy and DyND and transition to improved >> NumPy and DyND facilities as soon as they are available and shipped >> > >> I like the sound of both of these. > Further, you made a point above: You are right that pandas has started to supplant numpy as a high level API > for data analysis, but of course the robust (and often numpy based) Python > ecosystem is part of what has made pandas so successful. In practice, > ecosystem projects often want to work with more primitive objects than > series/dataframes in their internal data structures and without numpy this > becomes more difficult. For example, how do you concatenate a list of > categoricals? If these were numpy arrays, we could use np.concatenate, but > the current implementation of categorical would require a custom solution. > First class compatibility with pandas is harder when pandas data cannot be > used with a full ndarray API. I disagree entirely here. I think that Series/DataFrame ARE becoming primitive objects. Look at seaborn, statsmodels, and xarray. These are first-class users of these structures, which need the additional meta-data attached. Yes, categoricals are useful in numpy, and it should support them. But lots of libraries can simply use pandas and do lots of really useful stuff. However, why reinvent the wheel and use numpy when you have DataFrames? From a user point of view, I don't think they even care about numpy (or whatever drives pandas). It solves a very general problem of working with labeled data. Jeff -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Mon Jan 11 18:35:44 2016 From: shoyer at gmail.com (Stephan Hoyer) Date: Mon, 11 Jan 2016 15:35:44 -0800 Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap In-Reply-To: References: Message-ID: On Mon, Jan 11, 2016 at 3:04 PM, Jeff Reback wrote: > I think the pandas efforts (and other libraries) can result in more > powerful fundamental libraries > that get pushed upstream. However, it would not benefit ANYONE to slow > down downstream efforts. I am not sure why you suggest that we WAIT for the > upstream libraries to change? We have been waiting forever for that. Now we > have a concrete implementation of certain data types that are useful. They > (upstream) can take > this and build on it (or throw it away and make a better one or whatever). > But I don't think it benefits anyone to WAIT for someone to change numpy > first. > Look at how long it took them to (partially) fix datetimes.
> I agree, it is insane to wait on upstream improvements to spontaneously happen on their own. We (interested downstream developers) would need to push them through. I started on this recently for making datetime64 timezone naive (https://github.com/numpy/numpy/pull/6453) -- though of course, this is one of the easier issues. Of course, this being open source, my suggestions require someone interested in doing all the hard work. And given that that is not me, perhaps I should just shut up :). If the best we think we can realistically do is Wes writing our own data type system, then I'll be a little sad, but it would still be a win. > xarray in particular has done the same thing to pandas, e.g. you have > added additional selection operators and syntax (e.g. passing dicts of > named axes). These changes are in fact propagating to pandas. This has > taken time (but much, much less than it took for any of pandas's changes to > numpy). Further, look at how long you have advocated (correctly) for labeled > arrays in numpy (for which we are still waiting). > I'm actually not convinced NumPy needs labeled arrays. In my mind, libraries like pandas and xarray solve the labeled array problem very well downstream of NumPy. There are costs to making the basic libraries label-aware. > I'd like to see pandas itself focus more on the data-structures and less >> on the data types. This would let us share more work with the "general >> purpose array / scientific computing libraries". >> >> Pandas IS about specifying the correct data types. It is simply incorrect > to decouple this problem from the data-structures. A lot of effort over the > years has gone into > making all dtypes play nicely with each other and within pandas. > Yes, a lot of effort has gone into dtypes in pandas. This is great! But wouldn't it be even better if we had a viable path for pushing this stuff upstream? ;) > Maybe, but DyND has to have full compatibility with what is currently out there > (soonish). Then I agree this could be possible. But wouldn't it be even > better > for pandas to be able to swap back-ends? Why limit ourselves to a > particular backend if it's not that difficult? > Well, Irwin, what do you say? :) I'm just saying that in my ideal world, we would not invent a new dtype standard for pandas (insert obligatory xkcd reference here). I disagree entirely here. I think that Series/DataFrame ARE becoming > primitive objects.
This presents issues for new types > with metadata like categorical. care to elaborate on the xarray decision to keep data as numpy arrays, rather than Series in DataArray? (as you do keep the Index objects intact). On Mon, Jan 11, 2016 at 6:35 PM, Stephan Hoyer wrote: > On Mon, Jan 11, 2016 at 3:04 PM, Jeff Reback wrote: > >> I think the pandas efforts (and other libraries) can result in more >> powerful fundamental libraries >> that get pushed upstream. However, it would not benefit ANYONE to slow >> down downstream efforts. I am not sure why you suggest that we WAIT for the >> upstream libraries to change? We have been waiting forever for that. Now we >> have a concrete implementation of certain data types that are useful. They >> (upstream) can take >> this and build on (or throw it away and make a better one or whatever). >> But I don't think it benefits anyone to WAIT for someone to change numpy >> first. >> Look at how long it took them to (partially) fix datetimes. >> > > I agree, it is insane to wait on upstream improvements to spontaneously > happen on their own. We (interested downstream developers) would need to > push them through. I started on this recently for making datetime64 > timezone naive (https://github.com/numpy/numpy/pull/6453) -- though of > course, this is one of the easier issue. > > Of course, this being open source, my suggestions require someone > interested in doing all the hard work. And given that that is not me, > perhaps I should just shut up :). > > If the best we think we can realistically do is Wes writing our own data > type system, then I'll be a little sad, but it would still be a win. > > >> xarray in particular has done the same thing to pandas, e.g. you have >> added additional selection operators and syntax (e.g. passing dicts of >> named axes). These changes are in fact propogating to pandas. This has >> taken time (but much much less that this took for any of pandas changes to >> numpy). Further look at how long you have advocated (correctly) for labeled >> arrays in numpy (which we are still waiting). >> > > I'm actually not convinced NumPy needs labeled arrays. In my mind, > libraries like pandas and xarray solve the labeled array problem very well > downstream of NumPy. There are costs to making the basic libraries label > aware. > > >> I'd like to see pandas itself focus more on the data-structures and less >>> on the data types. This would let us share more work with the "general >>> purpose array / scientific computing libraries". >>> >>> Pandas IS about specifying the correct data types. It is simply >> incorrect to decouple this problem from the data-structures. A lot of >> effort over the years has gone into >> making all dtypes playing nice with each other and within pandas. >> > > Yes, a lot of effort has gone into dtypes in pandas. This is great! But > wouldn't it be even better if we had a viable path for pushing this stuff > upstream? ;) > > >> maybe, but DyND has to have full compat with what currently is out there >> (soonish). Then I agree this could be possible. But wouldn't it be even >> better >> for pandas to be able to swap back-ends. Why limit ourselves to a >> particular backend if its not that difficult. >> > > Well, Irwin, what do you say? :) > > I'm just saying that in my ideal world, we would not invent a new dtype > standard for pandas (insert obligatory xkcd reference here). > > I disagree entirely here. I think that Series/DataFrame ARE becoming >> primitive objects. 
Look at seaborn, statsmodels, and xarray These are first >> class users of these structures, whom need the additional meta-data >> attached. >> > > Seaborn does use Series/DataFrame internally as first class data > structures. But for xarray and statsmodels it is the other way around -- > pandas objects are accepted as input, but coerced into NumPy arrays > internally for storage and manipulation. This presents issues for new types > with metadata like categorical. > > Best, > Stephan > >> > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wesmckinn at gmail.com Mon Jan 11 19:23:51 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Mon, 11 Jan 2016 16:23:51 -0800 Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap In-Reply-To: References: Message-ID: On Mon, Jan 11, 2016 at 3:04 PM, Jeff Reback wrote: > I am in favor of the Wes refactoring, but for some slightly different > reasons. > > I am including some in-line comments. > > On Mon, Jan 11, 2016 at 2:55 PM, Stephan Hoyer wrote: >>> >>> I don't see alternative ways for pandas to have a truly healthy >>> relationship with more general purpose array / scientific computing >>> libraries without being able to add new pandas functionality in a >>> clean way, and without requiring us to get patches accepted (and >>> released) in NumPy or DyND. >> >> >> Indeed, I think my disagreement is mostly about the order in which we >> approach these problems. > > > I agree here. I had started on *some* of this to enable swappable numpy to > DyND to support IntNA (all in python, > but the fundamental change was to provide an API layer to the back-end). > >> >> >>> >>> Can you clarify what aspects of this plan are disagreeable / >>> contentious? >> >> >> See my comments below. >> >>> >>> Are you arguing for pandas becoming more of a companion >>> tool / user interface layer for NumPy or DyND? >> >> >> Not quite. Pandas has some fantastic and highly useable data (Series, >> DataFrame, Index). These certainly don't belong in NumPy or DyND. >> >> However, the array-based ecosystem certainly could use improvements to >> dtypes (e.g., datetime and categorical) and dtype specific methods (e.g., >> for strings) just as much as pandas. I do firmly believe that pushing these >> types of improvements upstream, rather than implementing them independently >> for pandas, would yield benefits for the broader ecosystem. With the right >> infrastructure, generalizing things to arrays is not much more work. > > > I dont' think Wes nor I disagree here at all. The problem was (and is), the > pace of change in the underlying libraries. It is simply too slow > for pandas development efforts. > > I think the pandas efforts (and other libraries) can result in more powerful > fundamental libraries > that get pushed upstream. However, it would not benefit ANYONE to slow down > downstream efforts. I am not sure why you suggest that we WAIT for the > upstream libraries to change? We have been waiting forever for that. Now we > have a concrete implementation of certain data types that are useful. They > (upstream) can take > this and build on (or throw it away and make a better one or whatever). But > I don't think it benefits anyone to WAIT for someone to change numpy first. > Look at how long it took them to (partially) fix datetimes. 
> > xarray in particular has done the same thing to pandas, e.g. you have added > additional selection operators and syntax (e.g. passing dicts of named > axes). These changes are in fact propagating to pandas. This has taken time > (but much, much less than it took for any of pandas's changes to numpy). > Further, look at how long you have advocated (correctly) for labeled arrays > in numpy (for which we are still waiting). > >> >> >> I'd like to see pandas itself focus more on the data-structures and >> less >> on the data types. This would let us share more work with the "general >> purpose array / scientific computing libraries". >> >> > Pandas IS about specifying the correct data types. It is simply > incorrect to > decouple this problem from the data-structures. A lot of effort over the > years has gone into > making all dtypes play nicely with each other and within pandas. > >>> >> >>> 1) Introduce a proper (from a software engineering perspective) >> >>> logical data type abstraction that models the way that pandas already >> >>> works, but cleaning up all the mess (implicit upcasts, lack of a real >> >>> "NA" scalar value, making pandas-specific methods like unique, >> >>> factorize, match, etc. true "array methods") >> >> >> >> New abstractions have a cost. A new logical data type abstraction is >> >> better than no proper abstraction at all, but (in principle), one data >> >> type >> >> abstraction should be enough to share. >> >> >> > >> >> >> >> A proper logical data type abstraction would be an improvement over the >> >> current situation, but if there's a way we could introduce one less >> >> abstraction (by improving things upstream in a general purpose array >> >> library) that would help even more. >> >> >> > >> > This is just pushing a problem upstream, which ultimately, given numpy's >> > track record, won't be solved at all. We will be here 1 year from >> > now >> > with the exact same discussion. Why are we waiting on upstream for >> > anything? >> > As I said above, if something is created which upstream finds useful on a >> > general level, great. The great cost here is time. >> > >> >> >> >> For example, we could imagine pushing to make DyND the new core for >> >> pandas. This could be enough of a push to make DyND generally useful -- >> I >> >> know it still has a few kinks to work out. >> >> >> > >> > Maybe, but DyND has to have full compatibility with what is currently out there >> > (soonish). Then I agree this could be possible. But wouldn't it be even >> > better >> > for pandas to be able to swap back-ends? Why limit ourselves to a >> particular >> > backend if it's not that difficult? >> > >> I think Jeff and I are on the same page here. 5 years ago we were having the *exact same* discussions around NumPy and adding new data type functionality. 5 years is a staggering amount of time in open source. It was less than 5 years between pandas not existing and being a super popular project with 2/3 of a best-selling O'Reilly book written about it. To wit, DyND exists in large part because of the difficulty in making progress within NumPy. Now, as 5 years ago, I think we should be acting in the best interests of pandas users, and what I've been describing is intended as a straightforward (though definitely labor-intensive) and relatively low-risk plan that will "future-proof" the pandas user API for at least the next few years, and probably much longer. If we find that enabling some internals to use DyND is the right choice, we can do that in a non-invasive way while carefully minding data interoperability. Meaningful performance benefits would be a clear motivation.
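To make "carefully minding data interoperability" a little more concrete, here is a toy Python sketch (the wrapper class is made up for illustration, not proposed pandas code) of the invariant I would want to hold no matter which library owns the memory: handing data to the NumPy world and back should not copy.

import numpy as np

class WrappedArray(object):
    """Toy stand-in for a backend-owned array (hypothetical name)."""
    def __init__(self, values):
        self.values = values

    def __array__(self):
        # NumPy interop hook: hand back the same memory, no copy
        return self.values

data = np.arange(10, dtype='float64')
roundtrip = np.asarray(WrappedArray(data))
assert np.may_share_memory(roundtrip, data)  # no defensive copy was made

If that property holds, swapping the owner of the bytes is an implementation detail rather than a user-visible change.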
To be 100% open and transparent (in the spirit of pandas's new governance docs): Before committing to using DyND in any binding way (i.e. required, as opposed to opt-in) in pandas, I'd really like to see more evidence from 3rd parties without direct financial interest (i.e. employment or equity from Continuum) that DyND is "the future of Python array computing"; in the absence of significant user and community code contribution, it still feels like a political quagmire leftover from the Continuum-Enthought rift in 2011. - Wes >>> >>> 4) Give pandas objects a real C API so that users can manipulate and >>> create pandas objects with their own native (C/C++/Cython) code. >> >> >>> 5) Yes, absolutely improve NumPy and DyND and transition to improved >>> NumPy and DyND facilities as soon as they are available and shipped >> >> >> I like the sound of both of these. > > > > Further you made a point above > >> You are right that pandas has started to supplant numpy as a high level >> API for data analysis, but of course the robust (and often numpy based) >> Python ecosystem is part of what has made pandas so successful. In practice, >> ecosystem projects often want to work with more primitive objects than >> series/dataframes in their internal data structures and without numpy this >> becomes more difficult. For example, how do you concatenate a list of >> categoricals? If these were numpy arrays, we could use np.concatenate, but >> the current implementation of categorical would require a custom solution. >> First class compatibility with pandas is harder when pandas data cannotbe >> used with a full ndarray API. > > > I disagree entirely here. I think that Series/DataFrame ARE becoming > primitive objects. Look at seaborn, statsmodels, and xarray These are first > class users of these structures, whom need the additional meta-data > attached. > > Yes categorical are useful in numpy, and they should support them. But lots > of libraries can simply use pandas and do lots of really useful stuff. > However, why reinvent the wheel and use numpy, when you have DataFrames. > > From a user point of view, I don't think they even care about numpy (or > whatever drives pandas). It solves a very general problem of working with > labeled data. > > Jeff From shoyer at gmail.com Mon Jan 11 19:34:00 2016 From: shoyer at gmail.com (Stephan Hoyer) Date: Mon, 11 Jan 2016 16:34:00 -0800 Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap In-Reply-To: References: Message-ID: On Mon, Jan 11, 2016 at 4:19 PM, Jeff Reback wrote: > Seaborn does use Series/DataFrame internally as first class data >> structures. But for xarray and statsmodels it is the other way around -- >> pandas objects are accepted as input, but coerced into NumPy arrays >> internally for storage and manipulation. This presents issues for new types >> with metadata like categorical. > > > > care to elaborate on the xarray decision to keep data as numpy arrays, > rather than Series in DataArray? (as you do keep the Index objects intact). > Sure -- the main point of xarray is that we need N-dimensional data structures, so we definitely need to support NumPy as a backend. Xarray operations are defined in terms of NumPy (or dask) arrays. In principle, we could store data as a Series, but for the sake of sanity we would need to convert to NumPy arrays before doing any operations. Duck typing compatibility is nice in theory, but in practice lots of subtle issues tend to come up. 
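A toy illustration of the kind of subtlety I mean -- nothing xarray-specific here, just what the coercion itself does to a categorical:

import numpy as np
import pandas as pd

s = pd.Series(pd.Categorical(['a', 'b', 'a', 'c']))
print(s.dtype)              # category
print(np.asarray(s).dtype)  # object -- the codes and categories are gone

Once you are holding the object array, recovering the categorical means re-inferring everything on the way back; that is the cost of coercing at the boundary.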
The alternative is to write our own ndarray abstraction internally to xarray that could handle special types like Categorical, but I'm pretty reluctant to do that. It seems like a lot of work, and numpy is "good enough" in most cases. And, of course, I'd rather solve those problems upstream :). Stephan -------------- next part -------------- An HTML attachment was scrubbed... URL: From izaid at continuum.io Tue Jan 12 16:32:23 2016 From: izaid at continuum.io (Irwin Zaid) Date: Tue, 12 Jan 2016 15:32:23 -0600 Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap In-Reply-To: References: Message-ID: Hi all, Stephan Hoyer asked me to comment on DyND and it's relation to the changes in Pandas that we're discussing here, so I'd like to do that. But, before I do, I want to clear up some misconceptions about DyND's history from Wes' most recent email. To be 100% open and transparent (in the spirit of pandas's new > governance docs): Before committing to using DyND in any binding way > (i.e. required, as opposed to opt-in) in pandas, I'd really like to > see more evidence from 3rd parties without direct financial interest > (i.e. employment or equity from Continuum) that DyND is "the future of > Python array computing"; in the absence of significant user and > community code contribution, it still feels like a political quagmire > leftover from the Continuum-Enthought rift in 2011. > Let's be very clear about the history (and present) of DyND -- and I think Travis Oliphant captured it well in his email to the NumPy list some months ago: https://mail.scipy.org/pipermail/numpy-discussion/2015-August/073412.html DyND was started as a personal project of Mark Wiebe in September 2011, and you can see the first commit at https://github.com/libdynd/libdynd/commit/768ac9a30cdb4619d09f4656bfd895ab2b91185d. At the time, Mark was at the University of British Columbia. He joined Continuum part-time when it was founded in January 2012, and later became full-time in the spring of 2012. DyND, therefore, predates Continuum and never had any relationship with Enthought. As Travis said in his email to the NumPy list (link above), after that "Continuum supported DyND with some fraction of Mark's time". Mark can speak more about this if he wishes, but the point is that DyND's origins are not "a political quagmire leftover from the Continuum-Enthought rift in 2011". Also, Mark left Continuum in December 2014, so everything contributed after that had nothing to do with Continuum. Now let's move to the other main DyND developers, me and Ian Henriksen. Until June 29, 2015, I had no relationship with Continuum, Enthought, or even the people we're speaking about in this thread. I knew Mark and that was it. I started working on DyND in January 2014, meaning I contributed to it just by choice for 1.5 years. And, if you look at my commit contributions at https://github.com/libdynd/libdynd/graphs/contributors, you'll see that represents about 50% of all of my contributions. And I've contributed a lot. Ian was originally a Google Summer of Code student that DyND applied for as an open-source project, through NumFOCUS, in the summer of 2015. He started on May 25, 2015 and went until the end of August. Anything he contributed in this time had nothing to do with Continuum. He formally joined Continuum on September 1, 2015. So, basically, a majority of DyND's commits were given freely by Mark, myself, and Ian. Now, at present, both Ian and I are sponsored by Continuum. 
And, yes, they are very graciously supporting us to work on DyND, like they did in the past with Mark. While I understand that, in theory, that could potentially be a conflict of interest, let me be very clear about one thing: Continuum has always approached DyND in a very balanced way, letting it grow as it needs while encouraging interaction with Pandas and other open-source projects in the ecosystem. The decisions we make for DyND have been decisions we've taken for the good of the project. And, yes, the eventual goal of DyND is to move from incubation at Continuum to a NumFOCUS-sponsored project. And we'll do that as soon as we can. Irwin -------------- next part -------------- An HTML attachment was scrubbed... URL: From wesmckinn at gmail.com Tue Jan 12 17:57:28 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Tue, 12 Jan 2016 14:57:28 -0800 Subject: [Pandas-dev] DyND and pandas [was Rewriting some of internals of pandas in C/C++? / Roadmap] Message-ID: hi, This discussion doesn't belong on this mailing list, but a couple of brief points. On Tue, Jan 12, 2016 at 1:32 PM, Irwin Zaid wrote: > Hi all, > > Stephan Hoyer asked me to comment on DyND and it's relation to the changes > in Pandas that we're discussing here, so I'd like to do that. But, before I > do, I want to clear up some misconceptions about DyND's history from Wes' > most recent email. > >> To be 100% open and transparent (in the spirit of pandas's new >> governance docs): Before committing to using DyND in any binding way >> (i.e. required, as opposed to opt-in) in pandas, I'd really like to >> see more evidence from 3rd parties without direct financial interest >> (i.e. employment or equity from Continuum) that DyND is "the future of >> Python array computing"; in the absence of significant user and >> community code contribution, it still feels like a political quagmire >> leftover from the Continuum-Enthought rift in 2011. > > > Let's be very clear about the history (and present) of DyND -- and I think > Travis Oliphant captured it well in his email to the NumPy list some months > ago: > https://mail.scipy.org/pipermail/numpy-discussion/2015-August/073412.html > > DyND was started as a personal project of Mark Wiebe in September 2011, and > you can see the first commit at > https://github.com/libdynd/libdynd/commit/768ac9a30cdb4619d09f4656bfd895ab2b91185d. > At the time, Mark was at the University of British Columbia. He joined > Continuum part-time when it was founded in January 2012, and later became > full-time in the spring of 2012. DyND, therefore, predates Continuum and > never had any relationship with Enthought. As Travis said in his email to > the NumPy list (link above), after that "Continuum supported DyND with some > fraction of Mark's time". Mark can speak more about this if he wishes, but > the point is that DyND's origins are not "a political quagmire leftover from > the Continuum-Enthought rift in 2011". Also, Mark left Continuum in December > 2014, so everything contributed after that had nothing to do with Continuum. > I was approached by Travis and Peter about being a part of Continuum Analytics in late 2011. According to my e-mail records we were having these discussions at least as early as October 2011. The phrase "NumPy 2.0" was spoken in this epoch (referring to -the-project-now-known-as-DyND). 
So, I have quite a bit of first- and second-hand information from this time period, including many of the details of Mark's Enthought-sponsored NumPy development and the problems that occurred online and offline. > Now let's move to the other main DyND developers, me and Ian Henriksen. > > Until June 29, 2015, I had no relationship with Continuum, Enthought, or > even the people we're speaking about in this thread. I knew Mark and that > was it. I started working on DyND in January 2014, meaning I contributed to > it just by choice for 1.5 years. And, if you look at my commit contributions > at https://github.com/libdynd/libdynd/graphs/contributors, you'll see that > represents about 50% of all of my contributions. And I've contributed a lot. > > Ian was originally a Google Summer of Code student that DyND applied for as > an open-source project, through NumFOCUS, in the summer of 2015. He started > on May 25, 2015 and went until the end of August. Anything he contributed in > this time had nothing to do with Continuum. He formally joined Continuum on > September 1, 2015. > > So, basically, a majority of DyND's commits were given freely by Mark, > myself, and Ian. > > Now, at present, both Ian and I are sponsored by Continuum. And, yes, they > are very graciously supporting us to work on DyND, like they did in the past > with Mark. While I understand that, in theory, that could potentially be a > conflict of interest, let me be very clear about one thing: Continuum has > always approached DyND in a very balanced way, letting it grow as it needs > while encouraging interaction with Pandas and other open-source projects in > the ecosystem. The decisions we make for DyND have been decisions we've > taken for the good of the project. > > And, yes, the eventual goal of DyND is to move from incubation at Continuum > to a NumFOCUS-sponsored project. And we'll do that as soon as we can. > I applaud Continuum for using R&D budget to build something new and forward thinking that is also permissively licensed open source software. However, it is well known that open source projects driven by for-profit organizations can run into governance problems that place them in conflict with the community. Since DyND is a large project that I would not be comfortable forking (if that were required in the future), building an outside developer and user community is essential if pandas is to consider using it as a hard dependency in the future. The Apache Software Foundation exists for this reason and others, and if you wish to place a community-oriented and merit-based governance structure around DyND to assist with its incubation, the ASF may be worth pursuing. NumFOCUS provides a fiscal sponsorship apparatus but does not really address the governance questions. Whether or not the governance issues are real doesn't really matter; it's about setting people's minds at ease. Thanks, Wes > Irwin From izaid at continuum.io Tue Jan 12 18:20:13 2016 From: izaid at continuum.io (Irwin Zaid) Date: Tue, 12 Jan 2016 17:20:13 -0600 Subject: [Pandas-dev] DyND and pandas [was Rewriting some of internals of pandas in C/C++? / Roadmap] In-Reply-To: References: Message-ID: > This discussion doesn't belong on this mailing list, but a couple of > brief points. > Wes, if you don't want this discussion on this mailing list then don't say things like: "it still feels like a political quagmirie leftover from the Continuum-Enthought rift in 2011". My email reply to that was simply a statement of facts, as this one will also be. 
I was approached by Travis and Peter about being a part of Continuum > Analytics in late 2011. According to my e-mail records we were having > these discussions at least as early as October 2011. The phrase "NumPy > 2.0" was spoken in this epoch (referring to > -the-project-now-known-as-DyND). So, I have quite a bit of first- and > second-hand information from this time period, including many of the > details of Mark's Enthought-sponsored NumPy development and the > problems that occurred online and offline. > The phrase "NumPy 2.0" means a number of things, and DyND was not one of them. Yes, you have some first-hand knowledge, but it's not relevant. Even IF it was, a lot of modern DyND also came from my massive contribution before I joined Continuum. Mark will speak up here as well. > I applaud Continuum for using R&D budget to build something new and > forward thinking that is also permissively licensed open source > software. However, it is well known that open source projects driven > by for-profit organizations can run into governance problems that > place them in conflict with the community. Since DyND is a large > project that I would not be comfortable forking (if that were required > in the future), building an outside developer and user community is > essential if pandas is to consider using it as a hard dependency in > the future. > > The Apache Software Foundation exists for this reason and others, and > if you wish to place a community-oriented and merit-based governance > structure around DyND to assist with its incubation, the ASF may be > worth pursuing. NumFOCUS provides a fiscal sponsorship apparatus but > does not really address the governance questions. Whether or not the > governance issues are real doesn't really matter; it's about setting > people's minds at ease. > Okay, let me state again: The majority of DyND's contributions (as net from Mark, myself, and Ian) came without Continuum funding. Just because Continuum is funding DyND now does not make it a "Continuum project", whatever this means. Some of your other points are valid, and we'll address them as best we can as time goes on. DyND clearly needs a community, but it's a chicken-and-egg problem. If you try and build something hard, it takes time and users come when things work. The issue of refactoring Pandas is a different one that I'll add comments to in another email. Irwin -------------- next part -------------- An HTML attachment was scrubbed... URL: From jeffreback at gmail.com Tue Jan 12 18:41:45 2016 From: jeffreback at gmail.com (Jeff Reback) Date: Tue, 12 Jan 2016 18:41:45 -0500 Subject: [Pandas-dev] DyND and pandas [was Rewriting some of internals of pandas in C/C++? / Roadmap] In-Reply-To: References: Message-ID: So this thread is off-topic, but I believe the gist of what Wes is proposing, from a technical point of view, for libpandas is:

- the user-facing pandas API will not change (except better perf / copy-on-write etc)
- the back-end API should not change much either
- a C API for the back-end
- allows swappable / agnostic numpy-like back-ends
- ideally libpandas won't rewrite a completely new dtype system; maybe it could co-opt datashape / pluribus for extensible dtypes

If the above are met by a back-end, e.g. numpy, potentially DyND, then a back-end should be allowed (certainly as an optional dep, whether it's required or not can be a choice made down the road).
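To illustrate the swappable back-end point above in python terms (all names here are hypothetical -- this is a sketch of the shape of the layer, not a design):

import numpy as np

class ArrayBackend(object):
    """The small surface libpandas would require of a back-end."""
    def isnull(self, values):
        raise NotImplementedError
    def take(self, values, indexer):
        raise NotImplementedError
    def to_numpy(self, values):
        raise NotImplementedError

class NumPyBackend(ArrayBackend):
    """Sentinel-based NA, as pandas does today for float64."""
    def isnull(self, values):
        return np.isnan(values)
    def take(self, values, indexer):
        return values.take(indexer)
    def to_numpy(self, values):
        return values

backend = NumPyBackend()
backend.isnull(np.array([1.0, np.nan]))  # -> array([False,  True])

# a DyND-backed (or bitmask-NA) implementation would provide the same
# methods, and Series/DataFrame would only ever talk to this interface

Whether a given back-end is then required or optional really is just a packaging decision.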
I think that during implementation, Wes will be cognizant of these points, and leave things as wide open as possible w/o going down the road we are currently in (where lots of different APIs are intermixed). Jeff On Tue, Jan 12, 2016 at 6:20 PM, Irwin Zaid wrote: > > This discussion doesn't belong on this mailing list, but a couple of >> brief points. >> > > Wes, if you don't want this discussion on this mailing list then don't say > things like: "it still feels like a political quagmirie leftover from the > Continuum-Enthought rift in 2011". My email reply to that was simply a > statement of facts, as this one will also be. > > I was approached by Travis and Peter about being a part of Continuum >> Analytics in late 2011. According to my e-mail records we were having >> these discussions at least as early as October 2011. The phrase "NumPy >> 2.0" was spoken in this epoch (referring to >> -the-project-now-known-as-DyND). So, I have quite a bit of first- and >> second-hand information from this time period, including many of the >> details of Mark's Enthought-sponsored NumPy development and the >> problems that occurred online and offline. >> > > The phrase "NumPy 2.0" means a number of things, and DyND was not one of > them. Yes, you have some first-hand knowledge, > but it's not relevant. Even IF it was, a lot of modern DyND also came from > my massive contribution before I joined Continuum. > > Mark will speak up here as well. > > >> I applaud Continuum for using R&D budget to build something new and >> forward thinking that is also permissively licensed open source >> software. However, it is well known that open source projects driven >> by for-profit organizations can run into governance problems that >> place them in conflict with the community. Since DyND is a large >> project that I would not be comfortable forking (if that were required >> in the future), building an outside developer and user community is >> essential if pandas is to consider using it as a hard dependency in >> the future. >> >> The Apache Software Foundation exists for this reason and others, and >> if you wish to place a community-oriented and merit-based governance >> structure around DyND to assist with its incubation, the ASF may be >> worth pursuing. NumFOCUS provides a fiscal sponsorship apparatus but >> does not really address the governance questions. Whether or not the >> governance issues are real doesn't really matter; it's about setting >> people's minds at ease. >> > > Okay, let me state again: The majority of DyND's contributions (as net > from Mark, myself, and Ian) came without Continuum funding. Just because > Continuum is funding DyND now does not make it a "Continuum project", > whatever this means. > > Some of your other points are valid, and we'll address them as best we can > as time goes on. DyND clearly needs a community, but it's a chicken-and-egg > problem. If you try and build something hard, it takes time and users come > when things work. > > The issue of refactoring Pandas is a different one that I'll add comments > to in another email. > > Irwin > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From wesmckinn at gmail.com Tue Jan 12 18:50:33 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Tue, 12 Jan 2016 15:50:33 -0800 Subject: [Pandas-dev] DyND and pandas [was Rewriting some of internals of pandas in C/C++? / Roadmap] In-Reply-To: References: Message-ID: On Tue, Jan 12, 2016 at 3:41 PM, Jeff Reback wrote: > So this thread is off-topic, but I believe the gist of what wes is proposing > from a technical point of view for libpandas: > > - the user facing pandas API will not change (except better perf / > copy-on-write etc) > - the back-end API should not change much either > - c-API for the back-end. > - allows swappable / agnostic numpy-like back-ends. > - ideally libpandas won't rewrite a completely new dtype system, maybe could > co-op datashape / pluribus for extensible dtypes > > If the above are met by a back-end, e.g. numpy, potentially DyND, then it a > back-end should be allowed > (certainly as an optional dep, whether its required or not can be a choice > made down the road). > > I think during implementation, that wes will be congnizant of these points, > and leave things as wide open as > possible w/o going down the road we are currently in (where lots of > different API's are intermixed). > Yep, you nailed it. > Jeff > > > On Tue, Jan 12, 2016 at 6:20 PM, Irwin Zaid wrote: >> >> >>> This discussion doesn't belong on this mailing list, but a couple of >>> brief points. >> >> >> Wes, if you don't want this discussion on this mailing list then don't say >> things like: "it still feels like a political quagmirie leftover from the >> Continuum-Enthought rift in 2011". My email reply to that was simply a >> statement of facts, as this one will also be. >> >>> I was approached by Travis and Peter about being a part of Continuum >>> Analytics in late 2011. According to my e-mail records we were having >>> these discussions at least as early as October 2011. The phrase "NumPy >>> 2.0" was spoken in this epoch (referring to >>> -the-project-now-known-as-DyND). So, I have quite a bit of first- and >>> second-hand information from this time period, including many of the >>> details of Mark's Enthought-sponsored NumPy development and the >>> problems that occurred online and offline. >> >> >> The phrase "NumPy 2.0" means a number of things, and DyND was not one of >> them. Yes, you have some first-hand knowledge, >> but it's not relevant. Even IF it was, a lot of modern DyND also came from >> my massive contribution before I joined Continuum. >> >> Mark will speak up here as well. >> >>> >>> I applaud Continuum for using R&D budget to build something new and >>> forward thinking that is also permissively licensed open source >>> software. However, it is well known that open source projects driven >>> by for-profit organizations can run into governance problems that >>> place them in conflict with the community. Since DyND is a large >>> project that I would not be comfortable forking (if that were required >>> in the future), building an outside developer and user community is >>> essential if pandas is to consider using it as a hard dependency in >>> the future. >>> >>> The Apache Software Foundation exists for this reason and others, and >>> if you wish to place a community-oriented and merit-based governance >>> structure around DyND to assist with its incubation, the ASF may be >>> worth pursuing. NumFOCUS provides a fiscal sponsorship apparatus but >>> does not really address the governance questions. 
Whether or not the >>> governance issues are real doesn't really matter; it's about setting >>> people's minds at ease. >> >> >> Okay, let me state again: The majority of DyND's contributions (as net >> from Mark, myself, and Ian) came without Continuum funding. Just because >> Continuum is funding DyND now does not make it a "Continuum project", >> whatever this means. >> >> Some of your other points are valid, and we'll address them as best we can >> as time goes on. DyND clearly needs a community, but it's a chicken-and-egg >> problem. If you try and build something hard, it takes time and users come >> when things work. >> >> The issue of refactoring Pandas is a different one that I'll add comments >> to in another email. >> >> Irwin >> >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > From izaid at continuum.io Tue Jan 12 18:54:06 2016 From: izaid at continuum.io (Irwin Zaid) Date: Tue, 12 Jan 2016 17:54:06 -0600 Subject: [Pandas-dev] DyND and pandas [was Rewriting some of internals of pandas in C/C++? / Roadmap] In-Reply-To: References: Message-ID: Thanks, Jeff. Let's talk about this. > So this thread is off-topic, but I believe the gist of what wes is > proposing from a technical point of view for libpandas: > > - the user facing pandas API will not change (except better perf / > copy-on-write etc) > - the back-end API should not change much either > - c-API for the back-end. > - allows swappable / agnostic numpy-like back-ends. > - ideally libpandas won't rewrite a completely new dtype system, maybe > could co-op datashape / pluribus for extensible dtypes > For the most part, I think these are good ideas, but I share many of Stephan's concerns. I'd much rather we improve the array ecosystem in general and, very specifically, I don't think new dtypes should be added to pandas via libpandas. What I'd really like to see is for Wes and I to collaborate on *something* that solves the dtype problem and can be shared across libraries. I think Wes and I working together could result in potentially phenomenal things, both for pandas and other projects. I believe that the DyND type system is pretty close to a solution here, I think it could be spun out as an independent data description system. If for some reason the DyND type system is not sufficient, I'd *still* be happy to work together on a solution that has nothing to do with DyND. Of course, I'm not a pandas developer. But, at the same time, I'm offering to do free work here to help pandas. If the above are met by a back-end, e.g. numpy, potentially DyND, then it a > back-end should be allowed > (certainly as an optional dep, whether its required or not can be a choice > made down the road). > > I think during implementation, that wes will be congnizant of these > points, and leave things as wide open as > possible w/o going down the road we are currently in (where lots of > different API's are intermixed). > If the above is true, that sounds great. Wes, I'd appreciate it if you left opinions about Continuum funding DyND out of it -- we've both had our say now. Irwin -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Tue Jan 12 19:06:55 2016 From: shoyer at gmail.com (Stephan Hoyer) Date: Tue, 12 Jan 2016 16:06:55 -0800 Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? 
/ Roadmap In-Reply-To: References: Message-ID: I think I'm mostly on the same page as well. Five years has certainly been too long. I agree that it would be premature to commit to using DyND in a binding way in pandas. A lot seems to be up in the air with regards to dtypes in Python right now (yes, particularly from projects sponsored by Continuum). So I would advocate for proceeding with the refactor for now (which will have numerous other benefits), and see how the situation evolves. If it seems like we are in a plausible position to unify the dtype system with a tool like DyND, then let's seriously consider that down the road. Either way, explicit interfaces (e.g., to_numpy(), to_dynd()) will help. On Mon, Jan 11, 2016 at 4:23 PM, Wes McKinney wrote: > On Mon, Jan 11, 2016 at 3:04 PM, Jeff Reback wrote: > > I am in favor of the Wes refactoring, but for some slightly different > > reasons. > > > > I am including some in-line comments. > > > > On Mon, Jan 11, 2016 at 2:55 PM, Stephan Hoyer wrote: > >>> > >>> I don't see alternative ways for pandas to have a truly healthy > >>> relationship with more general purpose array / scientific computing > >>> libraries without being able to add new pandas functionality in a > >>> clean way, and without requiring us to get patches accepted (and > >>> released) in NumPy or DyND. > >> > >> > >> Indeed, I think my disagreement is mostly about the order in which we > >> approach these problems. > > > > > > I agree here. I had started on *some* of this to enable swappable numpy > to > > DyND to support IntNA (all in python, > > but the fundamental change was to provide an API layer to the back-end). > > > >> > >> > >>> > >>> Can you clarify what aspects of this plan are disagreeable / > >>> contentious? > >> > >> > >> See my comments below. > >> > >>> > >>> Are you arguing for pandas becoming more of a companion > >>> tool / user interface layer for NumPy or DyND? > >> > >> > >> Not quite. Pandas has some fantastic and highly useable data (Series, > >> DataFrame, Index). These certainly don't belong in NumPy or DyND. > >> > >> However, the array-based ecosystem certainly could use improvements to > >> dtypes (e.g., datetime and categorical) and dtype specific methods > (e.g., > >> for strings) just as much as pandas. I do firmly believe that pushing > these > >> types of improvements upstream, rather than implementing them > independently > >> for pandas, would yield benefits for the broader ecosystem. With the > right > >> infrastructure, generalizing things to arrays is not much more work. > > > > > > I dont' think Wes nor I disagree here at all. The problem was (and is), > the > > pace of change in the underlying libraries. It is simply too slow > > for pandas development efforts. > > > > I think the pandas efforts (and other libraries) can result in more > powerful > > fundamental libraries > > that get pushed upstream. However, it would not benefit ANYONE to slow > down > > downstream efforts. I am not sure why you suggest that we WAIT for the > > upstream libraries to change? We have been waiting forever for that. Now > we > > have a concrete implementation of certain data types that are useful. > They > > (upstream) can take > > this and build on (or throw it away and make a better one or whatever). > But > > I don't think it benefits anyone to WAIT for someone to change numpy > first. > > Look at how long it took them to (partially) fix datetimes. > > > > xarray in particular has done the same thing to pandas, e.g. 
you have > added > > additional selection operators and syntax (e.g. passing dicts of named > > axes). These changes are in fact propogating to pandas. This has taken > time > > (but much much less that this took for any of pandas changes to numpy). > > Further look at how long you have advocated (correctly) for labeled > arrays > > in numpy (which we are still waiting). > > > >> > >> > >> I'd like to see pandas itself focus more on the data-structures and less > >> on the data types. This would let us share more work with the "general > >> purpose array / scientific computing libraries". > >> > > Pandas IS about specifying the correct data types. It is simply > incorrect to > > decouple this problem from the data-structures. A lot of effort over the > > years has gone into > > making all dtypes playing nice with each other and within pandas. > > > >>> > >>> 1) Introduce a proper (from a software engineering perspective) > >>> logical data type abstraction that models the way that pandas already > >>> works, but cleaning up all the mess (implicit upcasts, lack of a real > >>> "NA" scalar value, making pandas-specific methods like unique, > >>> factorize, match, etc. true "array methods") > >> > >> > >> New abstractions have a cost. A new logical data type abstraction is > >> better than no proper abstraction at all, but (in principle), one data > type > >> abstraction should be enough to share. > >> > > > >> > >> A proper logical data type abstraction would be an improvement over the > >> current situation, but if there's a way we could introduce one less > >> abstraction (by improving things upstream in a general purpose array > >> library) that would help even more. > >> > > > > This is just pushing a problem upstream, which ultimately, given the > track > > history of numpy, won't be solved at all. We will be here 1 year from now > > with the exact same discussion. Why are we waiting on upstream for > anything? > > As I said above, if something is created which upstream finds useful on a > > general level. great. The great cost here is time. > > > >> > >> For example, we could imagine pushing to make DyND the new core for > >> pandas. This could be enough of a push to make DyND generally useful -- > I > >> know it still has a few kinks to work out. > >> > > > > maybe, but DyND has to have full compat with what currently is out there > > (soonish). Then I agree this could be possible. But wouldn't it be even > > better > > for pandas to be able to swap back-ends. Why limit ourselves to a > particular > > backend if its not that difficult. > > > > I think Jeff and I are on the same page here. 5 years ago we were > having the *exact same* discussions around NumPy and adding new data > type functionality. 5 years is a staggering amount of time in open > source. It was less than 5 years between pandas not existing and being > a super popular project with 2/3 of a best-selling O'Reilly book > written about it. To whit, DyND exists in large part because of the > difficulty in making progress within NumPy. > > Now, as 5 years ago, I think we should be acting in the best interests > of pandas users, and what I've been describing is intended as a > straightforward (though definitely labor intensive) and relatively > low-risk plan that will "future-proof" the pandas user API for at > least the next few years, and probably much longer. If we find that > enabling some internals to use DyND is the right choice, we can do > that in a non-invasive way while carefully minding data > interoperability. 
Meaningful performance benefits would be a clear > motivation. > > To be 100% open and transparent (in the spirit of pandas's new > governance docs): Before committing to using DyND in any binding way > (i.e. required, as opposed to opt-in) in pandas, I'd really like to > see more evidence from 3rd parties without direct financial interest > (i.e. employment or equity from Continuum) that DyND is "the future of > Python array computing"; in the absence of significant user and > community code contribution, it still feels like a political quagmire > leftover from the Continuum-Enthought rift in 2011. > > - Wes > > >>> > >>> 4) Give pandas objects a real C API so that users can manipulate and > >>> create pandas objects with their own native (C/C++/Cython) code. > >> > >> > >>> 5) Yes, absolutely improve NumPy and DyND and transition to improved > >>> NumPy and DyND facilities as soon as they are available and shipped > >> > >> > >> I like the sound of both of these. > > > > > > > > Further you made a point above > > > >> You are right that pandas has started to supplant numpy as a high level > >> API for data analysis, but of course the robust (and often numpy based) > >> Python ecosystem is part of what has made pandas so successful. In > practice, > >> ecosystem projects often want to work with more primitive objects than > >> series/dataframes in their internal data structures and without numpy > this > >> becomes more difficult. For example, how do you concatenate a list of > >> categoricals? If these were numpy arrays, we could use np.concatenate, > but > >> the current implementation of categorical would require a custom > solution. > >> First class compatibility with pandas is harder when pandas data > cannotbe > >> used with a full ndarray API. > > > > > > I disagree entirely here. I think that Series/DataFrame ARE becoming > > primitive objects. Look at seaborn, statsmodels, and xarray These are > first > > class users of these structures, whom need the additional meta-data > > attached. > > > > Yes categorical are useful in numpy, and they should support them. But > lots > > of libraries can simply use pandas and do lots of really useful stuff. > > However, why reinvent the wheel and use numpy, when you have DataFrames. > > > > From a user point of view, I don't think they even care about numpy (or > > whatever drives pandas). It solves a very general problem of working with > > labeled data. > > > > Jeff > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wesmckinn at gmail.com Tue Jan 12 19:49:33 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Tue, 12 Jan 2016 16:49:33 -0800 Subject: [Pandas-dev] DyND and pandas [was Rewriting some of internals of pandas in C/C++? / Roadmap] In-Reply-To: References: Message-ID: On Tue, Jan 12, 2016 at 3:54 PM, Irwin Zaid wrote: > Thanks, Jeff. Let's talk about this. > >> >> So this thread is off-topic, but I believe the gist of what wes is >> proposing from a technical point of view for libpandas: >> >> - the user facing pandas API will not change (except better perf / >> copy-on-write etc) >> - the back-end API should not change much either >> - c-API for the back-end. >> - allows swappable / agnostic numpy-like back-ends. >> - ideally libpandas won't rewrite a completely new dtype system, maybe >> could co-op datashape / pluribus for extensible dtypes > > > For the most part, I think these are good ideas, but I share many of > Stephan's concerns. 
I'd much rather we improve the array ecosystem in > general and, very specifically, I don't think new dtypes should be added to > pandas via libpandas. > > What I'd really like to see is for Wes and I to collaborate on *something* > that solves the dtype problem and can be shared across libraries. I think > Wes and I working together could result in potentially phenomenal things, > both for pandas and other projects. I believe that the DyND type system is > pretty close to a solution here, I think it could be spun out as an > independent data description system. If for some reason the DyND type system > is not sufficient, I'd *still* be happy to work together on a solution that > has nothing to do with DyND. > I am happy to collaborate and propagate requirements and ideas upstream. I absolutely think we should be doing the work necessary to make DyND a suitable optional backend for pandas right now. The libpandas refactoring effort will provide a TODO list of array backend requirements that should help with that. But: I'm not comfortable with pandas and DyND getting married, so to speak, right now. Once DyND gains more broad mindshare as a NumPy replacement, let's re-evaluate as a team and decide whether maintaining pandas's NumPy-based array backend is worth our time. That leaves us at a slight impasse about how to fix pandas's data type woes with NumPy as the internal data container. A lightweight "pass-through" logical type apparatus (which dispatches to NumPy or DyND or native pandas code, as needed) is the simplest way to do that. This is already the way that pandas works (with a hodgepodge of NumPy data type objects and pandas data type objects weakly proxying for logical types), but it will be much cleaner / better abstracted. It also has the benefit of both: - making array backends "swappable" and - hiding level level details of the array backend from the pandas user I see both of these points as justifications for the implementation approach. It will also help DyND "cut its teeth" on the pandas unit test suite and fill in feature gaps (and build a performance test suite, too), and when it's ready we can "flip the switch". The logical type abstraction and the choice of array backend are orthogonal issues for me. The details of NumPy that have "leaked" through to pandas have harmed its users, so independent of the DyND-backend discussion I feel that the cleaner abstraction will improve the library's accessibility and make its users more productive. To summarize this: it should be enough to "just learn pandas". I wish I'd done this originally, but early on it seemed better to cut a few corners and get the software shipped rather than taking more time to build abstractions. At that time I was "funding" the project out of my savings account. - Wes > Of course, I'm not a pandas developer. But, at the same time, I'm offering > to do free work here to help pandas. > >> If the above are met by a back-end, e.g. numpy, potentially DyND, then it >> a back-end should be allowed >> (certainly as an optional dep, whether its required or not can be a choice >> made down the road). >> >> I think during implementation, that wes will be congnizant of these >> points, and leave things as wide open as >> possible w/o going down the road we are currently in (where lots of >> different API's are intermixed). > > > If the above is true, that sounds great. Wes, I'd appreciate it if you left > opinions about Continuum funding DyND out of it -- we've both had our say > now. 
> > Irwin > From wesmckinn at gmail.com Tue Jan 12 20:42:07 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Tue, 12 Jan 2016 17:42:07 -0800 Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap In-Reply-To: References: Message-ID: On Tue, Jan 12, 2016 at 4:06 PM, Stephan Hoyer wrote: > I think I'm mostly on the same page as well. Five years has certainly been > too long. > > I agree that it would be premature to commit to using DyND in a binding way > in pandas. A lot seems to be up in the air with regard to dtypes in Python > right now (yes, particularly from projects sponsored by Continuum). > > So I would advocate for proceeding with the refactor for now (which will > have numerous other benefits), and see how the situation evolves. If it > seems like we are in a plausible position to unify the dtype system with a > tool like DyND, then let's seriously consider that down the road. Either > way, explicit interfaces (e.g., to_numpy(), to_dynd()) will help. > +1 -- I think our long term goal should be to have a common physical memory representation. If pandas internally stays slightly malleable (in a non-user-visible way) we can conform to a standard (presuming one develops) with less user-land disruption. If a standard does not develop we can just shrug our shoulders and do what's best for pandas. We'll have to think about how this will affect pandas's future C API (zero-copy interop guarantees): we might make the C API in the first release more clearly not-for-production use. Aside: There doesn't even seem to be consensus at the moment on missing data representation. Sentinels, for example, cause interoperability problems with ODBC / databases, and Apache ecosystem projects (e.g. HDFS file formats, Thrift, Spark, Kafka, etc.). If we build a C interface to Avro or Parquet in pandas right now we'll have to convert bitmasks to pandas's bespoke sentinels. To be clear, R has this problem too. I see good arguments for even nixing NaN in floating point arrays, as heretical as that might sound. Ironically I used to be in favor of sentinels but I realized it was an isolationist view. -W > On Mon, Jan 11, 2016 at 4:23 PM, Wes McKinney wrote: >> >> On Mon, Jan 11, 2016 at 3:04 PM, Jeff Reback wrote: >> > I am in favor of the Wes refactoring, but for some slightly different >> > reasons. >> > >> > I am including some in-line comments. >> > >> > On Mon, Jan 11, 2016 at 2:55 PM, Stephan Hoyer wrote: >> >>> >> >>> I don't see alternative ways for pandas to have a truly healthy >> >>> relationship with more general purpose array / scientific computing >> >>> libraries without being able to add new pandas functionality in a >> >>> clean way, and without requiring us to get patches accepted (and >> >>> released) in NumPy or DyND. >> >> >> >> >> >> Indeed, I think my disagreement is mostly about the order in which we >> >> approach these problems. >> > >> > >> > I agree here. I had started on *some* of this to enable swappable numpy >> > to >> > DyND to support IntNA (all in python, >> > but the fundamental change was to provide an API layer to the back-end). >> > >> >> >> >> >> >>> >> >>> Can you clarify what aspects of this plan are disagreeable / >> >>> contentious? >> >> >> >> >> >> See my comments below. >> >> >> >>> >> >>> Are you arguing for pandas becoming more of a companion >> >>> tool / user interface layer for NumPy or DyND? >> >> >> >> >> >> Not quite. Pandas has some fantastic and highly usable data structures (Series, >> >> DataFrame, Index).
These certainly don't belong in NumPy or DyND. >> >> >> >> However, the array-based ecosystem certainly could use improvements to >> >> dtypes (e.g., datetime and categorical) and dtype specific methods >> >> (e.g., >> >> for strings) just as much as pandas. I do firmly believe that pushing >> >> these >> >> types of improvements upstream, rather than implementing them >> >> independently >> >> for pandas, would yield benefits for the broader ecosystem. With the >> >> right >> >> infrastructure, generalizing things to arrays is not much more work. >> > >> > >> > I don't think Wes nor I disagree here at all. The problem was (and is), >> > the >> > pace of change in the underlying libraries. It is simply too slow >> > for pandas development efforts. >> > >> > I think the pandas efforts (and other libraries) can result in more >> > powerful >> > fundamental libraries >> > that get pushed upstream. However, it would not benefit ANYONE to slow >> > down >> > downstream efforts. I am not sure why you suggest that we WAIT for the >> > upstream libraries to change? We have been waiting forever for that. Now >> > we >> > have a concrete implementation of certain data types that are useful. >> > They >> > (upstream) can take >> > this and build on (or throw it away and make a better one or whatever). >> > But >> > I don't think it benefits anyone to WAIT for someone to change numpy >> > first. >> > Look at how long it took them to (partially) fix datetimes. >> > >> > xarray in particular has done the same thing to pandas, e.g. you have >> > added >> > additional selection operators and syntax (e.g. passing dicts of named >> > axes). These changes are in fact propagating to pandas. This has taken >> > time >> > (but much, much less than this took for any of pandas changes to numpy). >> > Further, look at how long you have advocated (correctly) for labeled >> > arrays >> > in numpy (for which we are still waiting). >> > >> >> >> >> >> >> I'd like to see pandas itself focus more on the data-structures and >> >> less >> >> on the data types. This would let us share more work with the "general >> >> purpose array / scientific computing libraries". >> >> >> > Pandas IS about specifying the correct data types. It is simply >> > incorrect to >> > decouple this problem from the data-structures. A lot of effort over the >> > years has gone into >> > making all dtypes play nice with each other and within pandas. >> > >> >>> >> >>> 1) Introduce a proper (from a software engineering perspective) >> >>> logical data type abstraction that models the way that pandas already >> >>> works, but cleaning up all the mess (implicit upcasts, lack of a real >> >>> "NA" scalar value, making pandas-specific methods like unique, >> >>> factorize, match, etc. true "array methods") >> >> >> >> >> >> New abstractions have a cost. A new logical data type abstraction is >> >> better than no proper abstraction at all, but (in principle), one data >> >> type >> >> abstraction should be enough to share. >> >> >> > >> >> >> >> A proper logical data type abstraction would be an improvement over the >> >> current situation, but if there's a way we could introduce one less >> >> abstraction (by improving things upstream in a general purpose array >> >> library) that would help even more. >> >> >> > >> > This is just pushing a problem upstream, which ultimately, given the >> > track >> > record of numpy, won't be solved at all. We will be here 1 year from >> > now >> > with the exact same discussion.
Why are we waiting on upstream for >> > anything? >> > As I said above, if something is created which upstream finds useful on >> > a >> > general level, great. The great cost here is time. >> > >> >> >> >> For example, we could imagine pushing to make DyND the new core for >> >> pandas. This could be enough of a push to make DyND generally useful -- >> >> I >> >> know it still has a few kinks to work out. >> >> >> > >> > maybe, but DyND has to have full compat with what currently is out there >> > (soonish). Then I agree this could be possible. But wouldn't it be even >> > better >> > for pandas to be able to swap back-ends? Why limit ourselves to a >> > particular >> > backend if it's not that difficult? >> > >> >> I think Jeff and I are on the same page here. 5 years ago we were >> having the *exact same* discussions around NumPy and adding new data >> type functionality. 5 years is a staggering amount of time in open >> source. It was less than 5 years between pandas not existing and being >> a super popular project with 2/3 of a best-selling O'Reilly book >> written about it. To wit, DyND exists in large part because of the >> difficulty in making progress within NumPy. >> >> Now, as 5 years ago, I think we should be acting in the best interests >> of pandas users, and what I've been describing is intended as a >> straightforward (though definitely labor intensive) and relatively >> low-risk plan that will "future-proof" the pandas user API for at >> least the next few years, and probably much longer. If we find that >> enabling some internals to use DyND is the right choice, we can do >> that in a non-invasive way while carefully minding data >> interoperability. Meaningful performance benefits would be a clear >> motivation. >> >> To be 100% open and transparent (in the spirit of pandas's new >> governance docs): Before committing to using DyND in any binding way >> (i.e. required, as opposed to opt-in) in pandas, I'd really like to >> see more evidence from 3rd parties without direct financial interest >> (i.e. employment or equity from Continuum) that DyND is "the future of >> Python array computing"; in the absence of significant user and >> community code contribution, it still feels like a political quagmire >> leftover from the Continuum-Enthought rift in 2011. >> >> - Wes >> >> >>> >> >>> 4) Give pandas objects a real C API so that users can manipulate and >> >>> create pandas objects with their own native (C/C++/Cython) code. >> >> >> >> >> >>> 5) Yes, absolutely improve NumPy and DyND and transition to improved >> >>> NumPy and DyND facilities as soon as they are available and shipped >> >> >> >> >> >> I like the sound of both of these. >> > >> > >> > >> > Further you made a point above >> > >> >> You are right that pandas has started to supplant numpy as a high level >> >> API for data analysis, but of course the robust (and often numpy based) >> >> Python ecosystem is part of what has made pandas so successful. In >> >> practice, >> >> ecosystem projects often want to work with more primitive objects than >> >> series/dataframes in their internal data structures and without numpy >> >> this >> >> becomes more difficult. For example, how do you concatenate a list of >> >> categoricals? If these were numpy arrays, we could use np.concatenate, >> >> but >> >> the current implementation of categorical would require a custom >> >> solution. >> >> First class compatibility with pandas is harder when pandas data >> >> cannot be >> >> used with a full ndarray API.
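The "custom solution" for concatenating categoricals mentioned above boils down to unioning the categories and recoding against the result -- a simplified sketch, not pandas's actual implementation:

import numpy as np
import pandas as pd

def concat_categoricals(pieces):
    # Union all of the categories, then recode each piece's values
    # against the combined set -- np.concatenate alone can't do this.
    categories = pd.Index(pd.unique(np.concatenate(
        [np.asarray(p.categories) for p in pieces])))
    codes = np.concatenate(
        [categories.get_indexer(np.asarray(p)) for p in pieces])
    return pd.Categorical.from_codes(codes, categories)

a = pd.Categorical(['x', 'y'])
b = pd.Categorical(['y', 'z'])
print(concat_categoricals([a, b]))  # [x, y, y, z], categories [x, y, z]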
>> > >> > I disagree entirely here. I think that Series/DataFrame ARE becoming >> > primitive objects. Look at seaborn, statsmodels, and xarray. These are >> > first >> > class users of these structures, who need the additional meta-data >> > attached. >> > >> > Yes, categoricals are useful in numpy, and they should support them. But >> > lots >> > of libraries can simply use pandas and do lots of really useful stuff. >> > However, why reinvent the wheel with numpy when you have DataFrames? >> > >> > From a user point of view, I don't think they even care about numpy (or >> > whatever drives pandas). It solves a very general problem of working >> > with >> > labeled data. >> > >> > Jeff > > From wesmckinn at gmail.com Wed Jan 13 16:16:07 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Wed, 13 Jan 2016 13:16:07 -0800 Subject: [Pandas-dev] Unit test reorganization In-Reply-To: References: Message-ID: OK, I got started with the biggest offender: https://github.com/pydata/pandas/pull/12032 It would be great to take the same approach with the other large test modules, with a special eye for quarantining "leaky" internals code and segregating NumPy interoperability contracts. I didn't completely do this with test_frame.py but it's a good start. There's definitely plenty of code in the other top level test modules which may nest under tests/frame or tests/series - Wes On Mon, Jan 11, 2016 at 8:47 AM, Wes McKinney wrote: > On Sun, Jan 10, 2016 at 6:06 PM, Stephan Hoyer wrote: >> On Fri, Jan 8, 2016 at 5:34 PM, Wes McKinney wrote: >>> >>> Big #1 question is, how strongly do you feel about *shipping* the test >>> suite in site-packages? Some other libraries with sprawling and >>> complex test suites have chosen not to ship them: >>> https://github.com/zzzeek/sqlalchemy >> >> >> I would prefer to include the test suite if possible, because the ability to >> type "nosetests pandas" makes it easy both for users to verify installations >> are working properly and for downstream distributors to identify and report >> bugs. The complete pandas test suite still runs in 20-30 minutes, so I think >> it's still fairly reasonable to use it for these purposes. >> > > Got it. I wasn't sure if this was something people still wanted to do > in practice with the burgeoning test suite. > >>> >>> Independently, I would support and help with starting a judicious >>> reorganization of the contents of pandas/tests. So I'm thinking like >>> >>> tests/ >>> dataframe/ >>> series/ >>> algorithms/ >>> internals/ >>> tseries/ >>> >>> and so forth. >> >> >> This sounds like a great idea -- these files have really gotten out of >> control! >> > > Sounds good. I've been sorting through points of contact between >> Series/DataFrame's implementation and internal matters (e.g. the >> BlockManager) and figured it would be good to "quarantine" code that >> makes assumptions about what's under the hood. I'll get the first >> couple patches started and it can be a slow burn to break apart these >> large files. >> >>> Cheers, >>> Stephan From wesmckinn at gmail.com Wed Jan 13 20:51:28 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Wed, 13 Jan 2016 17:51:28 -0800 Subject: [Pandas-dev] Unit test reorganization In-Reply-To: References: Message-ID: Another idea here I've been toying with to achieve better logical test organization is to place all tests in the whole project under pandas/tests.
This way we can centralize all the tests relating to some functional aspect of pandas in one place, rather than the status quo where test code tends to be fairly close to its implementation (but not always). A prime example of where I let this get disorganized early on is that time series functionality tests are somewhat scattered across pandas/tests, pandas/tseries, etc. This way we can also collect a single directory of "quarantined" pandas 0.x behavior that we are contemplating changing in a 1.0 release. Thoughts on this + other ideas on how to organize the tests so that refactoring and internal changes are easier to approach mentally? - Wes On Wed, Jan 13, 2016 at 1:16 PM, Wes McKinney wrote: > OK, I got started with the biggest offender: > > https://github.com/pydata/pandas/pull/12032 > > It would be great to take the same approach with the other large test > modules, with a special eye for quarantining "leaky" internals code > and segregating NumPy interoperability contracts. I didn't completely > do this with test_frame.py but it's a good start. > > There's definitely plenty of code in the other top level test modules > which may nest under tests/frame or tests/series > > - Wes > > On Mon, Jan 11, 2016 at 8:47 AM, Wes McKinney wrote: >> On Sun, Jan 10, 2016 at 6:06 PM, Stephan Hoyer wrote: >>> On Fri, Jan 8, 2016 at 5:34 PM, Wes McKinney wrote: >>>> >>>> Big #1 question is, how strongly do you feel about *shipping* the test >>>> suite in site-packages? Some other libraries with sprawling and >>>> complex test suites have chosen not to ship them: >>>> https://github.com/zzzeek/sqlalchemy >>> >>> >>> I would prefer to include the test suite if possible, because the ability to >>> type "nosetests pandas" makes it easy both for users to verify installations >>> are working properly and for downstream distributors to identify and report >>> bugs. The complete pandas test suite still runs in 20-30 minutes, so I think >>> it's still fairly reasonable to use it for these purposes. >>> >> >> Got it. I wasn't sure if this was something people still wanted to do >> in practice with the burgeoning test suite. >> >>>> >>>> Independently, I would support and help with starting a judicious >>>> reorganization of the contents of pandas/tests. So I'm thinking like >>>> >>>> tests/ >>>> dataframe/ >>>> series/ >>>> algorithms/ >>>> internals/ >>>> tseries/ >>>> >>>> and so forth. >>> >>> >>> This sounds like a great idea -- these files have really gotten out of >>> control! >>> >> >> Sounds good. I've been sorting through points of contact between >> Series/DataFrame's implementation and internal matters (e.g. the >> BlockManager) and figured it would be good to "quarantine" code that >> makes assumptions about what's under the hood. I'll get the first >> couple patches started and it can be a slow burn to break apart these >> large files.
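A sketch of what the "quarantine" split discussed above might look like in practice (hypothetical file layout; _data is the attribute that holds the BlockManager):

import numpy as np
import pandas as pd

# tests/frame/test_constructors.py (hypothetical): exercises only the
# public API, so it survives any rewrite of the internals.
def test_constructor_roundtrip():
    df = pd.DataFrame({'a': [1, 2, 3]})
    assert list(df['a']) == [1, 2, 3]

# tests/internals/test_blocks.py (hypothetical): deliberately "leaky" --
# it reaches into the BlockManager -- so it lives with the other tests
# that assume what's under the hood.
def test_homogeneous_frame_is_one_block():
    df = pd.DataFrame({'a': np.arange(3), 'b': np.arange(3)})
    assert len(df._data.blocks) == 1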
>> >>> Cheers, >>> Stephan From jeffreback at gmail.com Wed Jan 13 21:01:50 2016 From: jeffreback at gmail.com (Jeff Reback) Date: Wed, 13 Jan 2016 21:01:50 -0500 Subject: [Pandas-dev] Unit test reorganization In-Reply-To: References: Message-ID: so I agree +1 on moving all to pandas/tests - the indexing tests, which *mostly* are in test_indexing.py, though quite a few are in test_series/test_frame.py, should ideally be merged into a set of tests/indexing - io tests could be left alone I think - stats tests are *mostly* deprecated - since going to deprecate panel + nd soon, I think it makes sense to move these tests & code to pandas/deprecated, to keep separate - test_tslib.py should be integrated into tseries/test_timeseries.py - almost all of the Index tests are now in test_index (with each sub-class being somewhat generically tested), but the time-series ones are in tseries/test_base, so these could be merged as well. Jeff On Wed, Jan 13, 2016 at 8:51 PM, Wes McKinney wrote: > Another idea here I've been toying with to achieve better logical test > organization is to place all tests in the whole project under > pandas/tests. This way we can centralize all the tests relating to > some functional aspect of pandas in one place, rather than the status > quo where test code tends to be fairly close to its implementation > (but not always). A prime example of where I let this get disorganized > early on is that time series functionality tests are somewhat scattered > across pandas/tests, pandas/tseries, etc. This way we can also collect > a single directory of "quarantined" pandas 0.x behavior that we are > contemplating changing in a 1.0 release. > > Thoughts on this + other ideas on how to organize the tests so that > refactoring and internal changes are easier to approach mentally? > > - Wes > > On Wed, Jan 13, 2016 at 1:16 PM, Wes McKinney wrote: > > OK, I got started with the biggest offender: > > > > https://github.com/pydata/pandas/pull/12032 > > > > It would be great to take the same approach with the other large test > > modules, with a special eye for quarantining "leaky" internals code > > and segregating NumPy interoperability contracts. I didn't completely > > do this with test_frame.py but it's a good start. > > > > There's definitely plenty of code in the other top level test modules > > which may nest under tests/frame or tests/series > > > > - Wes > > > > On Mon, Jan 11, 2016 at 8:47 AM, Wes McKinney > wrote: > >> On Sun, Jan 10, 2016 at 6:06 PM, Stephan Hoyer > wrote: > >>> On Fri, Jan 8, 2016 at 5:34 PM, Wes McKinney > wrote: > >>>> > >>>> Big #1 question is, how strongly do you feel about *shipping* the test > >>>> suite in site-packages? Some other libraries with sprawling and > >>>> complex test suites have chosen not to ship them: > >>>> https://github.com/zzzeek/sqlalchemy > >>> > >>> > >>> I would prefer to include the test suite if possible, because the > ability to > >>> type "nosetests pandas" makes it easy both for users to verify > installations > >>> are working properly and for downstream distributors to identify and > report > >>> bugs. The complete pandas test suite still runs in 20-30 minutes, so I > think > >>> it's still fairly reasonable to use it for these purposes. > >>> > >> > >> Got it. I wasn't sure if this was something people still wanted to do > >> in practice with the burgeoning test suite. > >> > >>>> > >>>> Independently, I would support and help with starting a judicious > >>>> reorganization of the contents of pandas/tests.
So I'm thinking like > >>>> > >>>> tests/ > >>>> dataframe/ > >>>> series/ > >>>> algorithms/ > >>>> internals/ > >>>> tseries/ > >>>> > >>>> and so forth. > >>> > >>> > >>> This sounds like a great idea -- these files have really gotten out of > >>> control! > >>> > >> > >> Sounds good. I've been sorting through points of contact between > >> Series/DataFrame's implementation and internal matters (e.g. the > >> BlockManager) and figured it would be good to "quarantine" code that > >> makes assumptions about what's under the hood. I'll get the first > >> couple patches started and it can be a slow burn to break apart these > >> large files. > >> > >>> Cheers, > >>> Stephan > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wesmckinn at gmail.com Wed Jan 13 21:06:57 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Wed, 13 Jan 2016 18:06:57 -0800 Subject: [Pandas-dev] Unit test reorganization In-Reply-To: References: Message-ID: On Wed, Jan 13, 2016 at 6:01 PM, Jeff Reback wrote: > so I agree +1 on moving all to pandas/tests > > - the indexing tests, which *mostly* are in test_indexing.py, though quite a > few are in test_series/test_frame.py, should ideally be > merged into a set of tests/indexing > > - io tests could be left alone I think > Yeah, I think pandas/io/tests is the one definite exception where there isn't much benefit > - stats tests are *mostly* deprecated > > - since going to deprecate panel + nd soon, I think it makes sense to move > these tests & code to pandas/deprecated, to keep separate > > - test_tslib.py should be integrated into tseries/test_timeseries.py > > - almost all of the Index tests are now in test_index (with each sub-class being > somewhat generically tested), but the time-series ones > are in tseries/test_base, so these could be merged as well. > Yep, it specifically would be good to collect 100% of the index data structure machinery (including Datetime/Timedelta/PeriodIndex) in one place (same for axis indexing as you said, since it got pretty scattered) > > Jeff > > > > > On Wed, Jan 13, 2016 at 8:51 PM, Wes McKinney wrote: >> >> Another idea here I've been toying with to achieve better logical test >> organization is to place all tests in the whole project under >> pandas/tests. This way we can centralize all the tests relating to >> some functional aspect of pandas in one place, rather than the status >> quo where test code tends to be fairly close to its implementation >> (but not always). A prime example of where I let this get disorganized >> early on is that time series functionality tests are somewhat scattered >> across pandas/tests, pandas/tseries, etc. This way we can also collect >> a single directory of "quarantined" pandas 0.x behavior that we are >> contemplating changing in a 1.0 release. >> >> Thoughts on this + other ideas on how to organize the tests so that >> refactoring and internal changes are easier to approach mentally? >> >> - Wes >> >> On Wed, Jan 13, 2016 at 1:16 PM, Wes McKinney wrote: >> > OK, I got started with the biggest offender: >> > >> > https://github.com/pydata/pandas/pull/12032 >> > >> > It would be great to take the same approach with the other large test >> > modules, with a special eye for quarantining "leaky" internals code >> > and segregating NumPy interoperability contracts.
I didn't completely >> > do this with test_frame.py but it's a good start. >> > >> > There's definitely plenty of code in the other top level test modules >> > which may nest under tests/frame or tests/series >> > >> > - Wes >> > >> > On Mon, Jan 11, 2016 at 8:47 AM, Wes McKinney >> > wrote: >> >> On Sun, Jan 10, 2016 at 6:06 PM, Stephan Hoyer >> >> wrote: >> >>> On Fri, Jan 8, 2016 at 5:34 PM, Wes McKinney >> >>> wrote: >> >>>> >> >>>> Big #1 question is, how strongly do you feel about *shipping* the >> >>>> test >> >>>> suite in site-packages? Some other libraries with sprawling and >> >>>> complex test suites have chosen not to ship them: >> >>>> https://github.com/zzzeek/sqlalchemy >> >>> >> >>> >> >>> I would prefer to include the test suite if possible, because the >> >>> ability to >> >>> type "nosetests pandas" makes it easy both for users to verify >> >>> installations >> >>> are working properly and for downstream distributors to identify and >> >>> report >> >>> bugs. The complete pandas test suite still runs in 20-30 minutes, so I >> >>> think >> >>> it's still fairly reasonable to use it for these purposes. >> >>> >> >> >> >> Got it. I wasn't sure if this was something people still wanted to do >> >> in practice with the burgeoning test suite. >> >> >> >>>> >> >>>> Independently, I would support and help with starting a judicious >> >>>> reorganization of the contents of pandas/tests. So I'm thinking like >> >>>> >> >>>> tests/ >> >>>> dataframe/ >> >>>> series/ >> >>>> algorithms/ >> >>>> internals/ >> >>>> tseries/ >> >>>> >> >>>> and so forth. >> >>> >> >>> >> >>> This sounds like a great idea -- these files have really gotten out of >> >>> control! >> >>> >> >> >> >> Sounds good. I've been sorting through points of contact between >> >> Series/DataFrame's implementation and internal matters (e.g. the >> >> BlockManager) and figured it would be good to "quarantine" code that >> >> makes assumptions about what's under the hood. I'll get the first >> >> couple patches started and it can be a slow burn to break apart these >> >> large files. >> >> >> >>> Cheers, >> >>> Stephan >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev > > From yoh at onerussian.com Fri Jan 8 22:35:59 2016 From: yoh at onerussian.com (Yaroslav Halchenko) Date: Fri, 08 Jan 2016 22:35:59 -0500 Subject: [Pandas-dev] Unit test reorganization In-Reply-To: References: <5E86966B-F624-4ECB-AC72-2F9DCEBC7B14@gmail.com> Message-ID: <49059D46-865A-4D9B-85A9-8F0FF826E9B3@onerussian.com> I prefer and do ship all the tests for Debian packages wherever possible and not prohibitive. Do as you see fit; I will adjust for it. FWIW, in my code bases I started to place them under tests directories where tested files reside (as sklearn and others do), not overall top/tests. Much more manageable, and it makes it easier to test the submodules affected by changes. On January 8, 2016 9:04:13 PM EST, Wes McKinney wrote: >It looks like the debian packaging scripts would need to change. + >Yaroslav to see if this would be onerous > >On Fri, Jan 8, 2016 at 5:53 PM, Jeff Reback >wrote: >> no idea >> >>> On Jan 8, 2016, at 8:47 PM, Wes McKinney >wrote: >>> >>> + mailing list >>> >>> Do the distros run them _after_ installation? I'm talking about >>> installing the unit tests during `python setup.py install`, but >still >>> including them in the tarball.
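For reference, "shipping the test suite" comes down to listing the test packages in setup.py so they land in site-packages -- a trimmed sketch, not pandas's actual setup.py:

from setuptools import setup

setup(
    name='pandas',
    version='0.0.0.dev0',   # placeholder
    # Listing the tests/ subpackages here is what makes
    # `nosetests pandas` work against an installed copy; the sdist
    # tarball contents are governed separately by MANIFEST.in.
    packages=['pandas', 'pandas.core', 'pandas.io',
              'pandas.tests', 'pandas.io.tests'],
)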
>>> >>>> On Fri, Jan 8, 2016 at 5:43 PM, Jeff Reback >wrote: >>>> all for reorging into subdirs as these have grown pretty big >>>> >>>> what's the big deal with shipping the tests? >>>> >>>> I suspect some of the Linux distros do run them >>>> >>>> and just merged https://github.com/pydata/pandas/pull/11913 >>>> though we could configure a subset that ships, I suppose >>>> >>>> >>>>> On Jan 8, 2016, at 8:34 PM, Wes McKinney >wrote: >>>>> >>>>> hi folks, >>>>> >>>>> I have a few questions about the test suite. As context, I note >that >>>>> test_series.py is now 8200 lines and test_frame.py 17000 lines. >>>>> >>>>> Big #1 question is, how strongly do you feel about *shipping* the >test >>>>> suite in site-packages? Some other libraries with sprawling and >>>>> complex test suites have chosen not to ship them: >>>>> https://github.com/zzzeek/sqlalchemy >>>>> >>>>> Independently, I would support and help with starting a judicious >>>>> reorganization of the contents of pandas/tests. So I'm thinking >like >>>>> >>>>> tests/ >>>>> dataframe/ >>>>> series/ >>>>> algorithms/ >>>>> internals/ >>>>> tseries/ >>>>> >>>>> and so forth. >>>>> >>>>> Thoughts? >>>>> >>>>> - Wes >>>>> _______________________________________________ >>>>> Pandas-dev mailing list >>>>> Pandas-dev at python.org >>>>> https://mail.python.org/mailman/listinfo/pandas-dev -- Sent from a phone which beats iPhone. From mwwiebe at gmail.com Tue Jan 12 19:20:15 2016 From: mwwiebe at gmail.com (Mark Wiebe) Date: Tue, 12 Jan 2016 16:20:15 -0800 Subject: [Pandas-dev] DyND and pandas [was Rewriting some of internals of pandas in C/C++? / Roadmap] In-Reply-To: References: Message-ID: On Tue, Jan 12, 2016 at 3:20 PM, Irwin Zaid wrote: > > This discussion doesn't belong on this mailing list, but a couple of >> brief points. >> > > Wes, if you don't want this discussion on this mailing list then don't say > things like: "it still feels like a political quagmire leftover from the > Continuum-Enthought rift in 2011". My email reply to that was simply a > statement of facts, as this one will also be. > > I was approached by Travis and Peter about being a part of Continuum >> Analytics in late 2011. According to my e-mail records we were having >> these discussions at least as early as October 2011. The phrase "NumPy >> 2.0" was spoken in this epoch (referring to >> -the-project-now-known-as-DyND). So, I have quite a bit of first- and >> second-hand information from this time period, including many of the >> details of Mark's Enthought-sponsored NumPy development and the >> problems that occurred online and offline. >> > > The phrase "NumPy 2.0" means a number of things, and DyND was not one of > them. Yes, you have some first-hand knowledge, > but it's not relevant. Even IF it was, a lot of modern DyND also came from > my massive contribution before I joined Continuum. > > Mark will speak up here as well. > It's certainly true that the phrase "NumPy 2.0" was spoken a lot during the formation and early days of Continuum, but that's a term that was used commonly even before the NumPy 1.6 release. It has long been the vehicle for discussions about doing big refactoring and breaking changes in NumPy. The discussions you're referring to were about a mixture of two things: a NumPy 2.0 developed within the NumPy development process, and re-conceptualizing NumPy at a higher level towards abstractions that could be out of core, distributed, etc.
The former is represented by emails like https://mail.scipy.org/pipermail/numpy-discussion/2012-February/060623.html and work that Continuum sponsored within NumPy. The latter is what became branded as Blaze. DyND itself began life as "dynamicndarray," and was a place to experiment with some of the ideas I had about how the dtypes could be structured, how things could work as a C++ library. It was started after all my involvement with Enthought was completed and before Continuum began. It was completely independent of either company. It was not adopted as part of development at Continuum immediately, I did my best to present a solid case about how such a thing would fit into Blaze, and the decision to open source the code and include it as a component of the Blaze development was later made in one swoop. My hope during that time frame was that NumPy's internals could be refactored in a way that isolated them more from its interface, and then could begin a faster evolution without breaking that interface. I wanted NumPy to transition ever so slowly into C++. Even if all of that occurred, NumPy's evolution would have still been slow, and I knew that, so I saw DyND as a place to boldly try things, to really experiment with how a dynamic array programming library could look. We were particularly sensitive to avoiding a recreation of the numeric vs numarray schism, and DyND's Python bindings are separate from NumPy but interact naturally where we found a way to do it. The idea that DyND should have broad support from multiple companies is something I strongly agree with, and I think specifically that should extend to multiple industries. I believe the current development push led by Irwin is bringing it close to a threshold where it's possible for that to start happening, and developing it in close co-operation with Pandas would be amazing for both DyND and Pandas. I'm reading this thread mostly with hope that this possibility has a good chance of working, and a desire that any decisions are made with an accurate picture of what DyND is and aims to become. -Mark > > >> I applaud Continuum for using R&D budget to build something new and >> forward thinking that is also permissively licensed open source >> software. However, it is well known that open source projects driven >> by for-profit organizations can run into governance problems that >> place them in conflict with the community. Since DyND is a large >> project that I would not be comfortable forking (if that were required >> in the future), building an outside developer and user community is >> essential if pandas is to consider using it as a hard dependency in >> the future. >> >> The Apache Software Foundation exists for this reason and others, and >> if you wish to place a community-oriented and merit-based governance >> structure around DyND to assist with its incubation, the ASF may be >> worth pursuing. NumFOCUS provides a fiscal sponsorship apparatus but >> does not really address the governance questions. Whether or not the >> governance issues are real doesn't really matter; it's about setting >> people's minds at ease. >> > > Okay, let me state again: The majority of DyND's contributions (as net > from Mark, myself, and Ian) came without Continuum funding. Just because > Continuum is funding DyND now does not make it a "Continuum project", > whatever this means. > > Some of your other points are valid, and we'll address them as best we can > as time goes on. DyND clearly needs a community, but it's a chicken-and-egg > problem. 
If you try and build something hard, it takes time and users come > when things work. > > The issue of refactoring Pandas is a different one that I'll add comments > to in another email. > > Irwin > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wesmckinn at gmail.com Fri Jan 15 10:22:19 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Fri, 15 Jan 2016 07:22:19 -0800 Subject: [Pandas-dev] Reorganizing the megamodules Message-ID: As part of improving our code organization, I'd like to look at splitting up modules exceeding 3000 lines into subpackages. Obvious targets are core/frame.py core/generic.py core/index.py core/series.py For the "big" classes like Series and DataFrame, this amounts mainly to having a common pattern for adding new instance methods that aren't nested under the main class: header (or in one of their subclasses). Thoughts? - Wes From wesmckinn at gmail.com Fri Jan 15 10:52:57 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Fri, 15 Jan 2016 07:52:57 -0800 Subject: [Pandas-dev] Reorganizing the megamodules In-Reply-To: References: Message-ID: I was thinking we could promote all of the index-related code to pandas/index/ and same for pandas/core/frame.py -> pandas/frame/ and so forth. We'd have to keep around the old files for pickles, but perhaps we can do a "pickle cleanup" (remove all sources of pickle backward compatibility) with 1.0. On Fri, Jan 15, 2016 at 7:32 AM, Jeff Reback wrote: > Index is also fairly straightforward to do this > > eg. we already have sub-class based modules for > ``DatetimeIndex,TimedeltaIndex,PeriodIndex``. > > only caveat is have to have a ``Base`` type so imports are crazy. But sure > > for dir: ``pandas/core/index`` > ``categorical``, ``numeric``, ``multi`` are prob candidates. > > On Fri, Jan 15, 2016 at 10:22 AM, Wes McKinney wrote: >> >> As part of improving our code organization, I'd like to look at >> splitting up modules exceeding 3000 lines into subpackages. Obvious >> targets are >> >> core/frame.py >> core/generic.py >> core/index.py >> core/series.py >> >> For the "big" classes like Series and DataFrame, this amounts mainly >> to having a common pattern for adding new instance methods that aren't >> nested under the main class: header (or in one of their subclasses). >> >> Thoughts? >> >> - Wes >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev > > From jorisvandenbossche at gmail.com Fri Jan 15 10:58:43 2016 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Fri, 15 Jan 2016 16:58:43 +0100 Subject: [Pandas-dev] Reorganizing the megamodules In-Reply-To: References: Message-ID: 2016-01-15 16:22 GMT+01:00 Wes McKinney : > As part of improving our code organization, I'd like to look at > splitting up modules exceeding 3000 lines into subpackages. Obvious > targets are > > core/frame.py > core/generic.py > core/index.py > core/series.py > > For the "big" classes like Series and DataFrame, this amounts mainly > to having a common pattern for adding new instance methods that aren't > nested under the main class: header (or in one of their subclasses). > > Thoughts? > How would you like to split up eg frame.py? As the majority of that file consists of the DataFrame class definition. 
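One pattern that makes this kind of split workable -- and which Wes describes in his reply below -- relies on methods being plain attributes of the class object, so each functional group can live in its own module and be attached afterwards (hypothetical layout, not pandas's actual structure):

# In a hypothetical pandas/frame/io.py -- one functional group of
# methods, defined as plain functions taking `self`:
def frame_to_csv(self, path_or_buf, **kwargs):
    """Write the frame to CSV (implementation elided)."""
    raise NotImplementedError

# In a hypothetical pandas/frame/core.py -- the much smaller class
# definition, which attaches the groups at import time:
class DataFrame(object):
    pass

DataFrame.to_csv = frame_to_csv   # methods are just class attributes

df = DataFrame()
assert callable(df.to_csv)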
> > - Wes > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wesmckinn at gmail.com Fri Jan 15 11:26:01 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Fri, 15 Jan 2016 08:26:01 -0800 Subject: [Pandas-dev] Reorganizing the megamodules In-Reply-To: References: Message-ID: On Fri, Jan 15, 2016 at 7:58 AM, Joris Van den Bossche wrote: > 2016-01-15 16:22 GMT+01:00 Wes McKinney : >> >> As part of improving our code organization, I'd like to look at >> splitting up modules exceeding 3000 lines into subpackages. Obvious >> targets are >> >> core/frame.py >> core/generic.py >> core/index.py >> core/series.py >> >> For the "big" classes like Series and DataFrame, this amounts mainly >> to having a common pattern for adding new instance methods that aren't >> nested under the main class: header (or in one of their subclasses). >> >> Thoughts? > > > How would you like to split up eg frame.py? As the majority of that file > consists of the DataFrame class definition. Into modules containing groups of functionally-related methods (for example: all IO methods together). Class methods are just attributes of the class object (which can be assigned elsewhere), so they don't need to be in the same module as the class definition. > > >> >> >> - Wes >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev > > > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > From wesmckinn at gmail.com Sat Jan 16 18:20:17 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Sat, 16 Jan 2016 15:20:17 -0800 Subject: [Pandas-dev] Thoughts on adopting a 1-PR-1-commit policy? Message-ID: I've grown very fond of the PR cherry-picking style used in many Apache projects. Here's an example of a very large commit to Apache Spark that was performed in this fashion: https://github.com/apache/spark/commit/2fe0a1aaeebbf7f60bd4130847d738c29f1e3d53#diff-e1e1d3d40573127e9ee0480caf1283d6 If you compare pandas's commit history with a project like this, you'll see it is much easier to follow because there is one commit for each patch to the project, rather than a merge commit plus 1 or more merged commits (depending on whether the person merging the PR did an interactive rebase). The script to do this is not too complex, and is even less complex for pandas because we do not use JIRA: https://github.com/apache/spark/blob/master/dev/merge_spark_pr.py I've been using a pared down version of the script in Ibis: https://github.com/cloudera/ibis/blob/master/dev/merge-pr.py Here is an example of what a merge commit with multiple subcommits looks like using this tool: https://github.com/cloudera/ibis/commit/eafabe060dcaaea0a6076342eaa374929b91cf47 It's pretty easy to use: run the script and enter the PR # you are merging. It automatically squashes and closes the merged PR. Let me know if this is something that would interest the team. I know there are varying opinions on the GitHub Green Button =) - Wes From wesmckinn at gmail.com Sat Jan 16 18:46:51 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Sat, 16 Jan 2016 15:46:51 -0800 Subject: [Pandas-dev] Thoughts on adopting a 1-PR-1-commit policy? 
In-Reply-To: References: Message-ID: Copying the mailing list. Indeed makes rebasing unnecessary if there are no cherry-pick conflicts. On Sat, Jan 16, 2016 at 3:34 PM, Tom Augspurger wrote: > Rebasing can be tough for new contributors, so for that alone I'd say let's try it. > > -Tom > >> On Jan 16, 2016, at 5:20 PM, Wes McKinney wrote: >> >> I've grown very fond of the PR cherry-picking style used in many >> Apache projects. >> >> Here's an example of a very large commit to Apache Spark that was >> performed in this fashion: >> >> https://github.com/apache/spark/commit/2fe0a1aaeebbf7f60bd4130847d738c29f1e3d53#diff-e1e1d3d40573127e9ee0480caf1283d6 >> >> If you compare pandas's commit history with a project like this, >> you'll see it is much easier to follow because there is one commit for >> each patch to the project, rather than a merge commit plus 1 or more >> merged commits (depending on whether the person merging the PR did an >> interactive rebase). >> >> The script to do this is not too complex, and is even less complex for >> pandas because we do not use JIRA: >> >> https://github.com/apache/spark/blob/master/dev/merge_spark_pr.py >> >> I've been using a pared down version of the script in Ibis: >> >> https://github.com/cloudera/ibis/blob/master/dev/merge-pr.py >> >> Here is an example of what a merge commit with multiple subcommits >> looks like using this tool: >> >> https://github.com/cloudera/ibis/commit/eafabe060dcaaea0a6076342eaa374929b91cf47 >> >> It's pretty easy to use: run the script and enter the PR # you are >> merging. It automatically squashes and closes the merged PR. >> >> Let me know if this is something that would interest the team. I know >> there are varying opinions on the GitHub Green Button =) >> >> - Wes >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev From jeffreback at gmail.com Sat Jan 16 23:11:55 2016 From: jeffreback at gmail.com (Jeff Reback) Date: Sat, 16 Jan 2016 23:11:55 -0500 Subject: [Pandas-dev] Thoughts on adopting a 1-PR-1-commit policy? In-Reply-To: References: Message-ID: <6D657DD5-1A6C-4D74-9B42-E603AE27B675@gmail.com> seems find to use the cherry picking script though I don't think should relax users from squashing > On Jan 16, 2016, at 6:46 PM, Wes McKinney wrote: > > Copying the mailing list. Indeed makes rebasing unnecessary if there > are no cherry-pick conflicts. > > On Sat, Jan 16, 2016 at 3:34 PM, Tom Augspurger > wrote: >> Rebasing can be tough for new contributors, so for that alone I'd say let's try it. >> >> -Tom >> >>> On Jan 16, 2016, at 5:20 PM, Wes McKinney wrote: >>> >>> I've grown very fond of the PR cherry-picking style used in many >>> Apache projects. >>> >>> Here's an example of a very large commit to Apache Spark that was >>> performed in this fashion: >>> >>> https://github.com/apache/spark/commit/2fe0a1aaeebbf7f60bd4130847d738c29f1e3d53#diff-e1e1d3d40573127e9ee0480caf1283d6 >>> >>> If you compare pandas's commit history with a project like this, >>> you'll see it is much easier to follow because there is one commit for >>> each patch to the project, rather than a merge commit plus 1 or more >>> merged commits (depending on whether the person merging the PR did an >>> interactive rebase). 
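The mechanics of such a merge script reduce to a handful of git operations. A pared-down sketch (illustration only; the linked Spark and Ibis scripts are the real tools and also handle author attribution, commit message assembly, and error recovery):

import subprocess

def run(*cmd):
    # Run a git command, raising on failure.
    subprocess.check_call(cmd)

def merge_pr(pr_num, upstream='upstream', target='master'):
    branch = 'PR_TOOL_MERGE_PR_%d' % pr_num
    # GitHub exposes every pull request as a fetchable ref.
    run('git', 'fetch', upstream, 'pull/%d/head:%s' % (pr_num, branch))
    run('git', 'checkout', target)
    # Squash the PR into a single staged change on the target branch,
    # then commit it with a message assembled from the PR title,
    # description, and original commit hashes.
    run('git', 'merge', '--squash', branch)
    run('git', 'commit', '-m', 'Squashed commit for PR #%d' % pr_num)
    run('git', 'branch', '-D', branch)

if __name__ == '__main__':
    merge_pr(int(input('Which pull request would you like to merge? ')))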
>> >> The script to do this is not too complex, and is even less complex for >> pandas because we do not use JIRA: >> >> https://github.com/apache/spark/blob/master/dev/merge_spark_pr.py >> >> I've been using a pared down version of the script in Ibis: >> >> https://github.com/cloudera/ibis/blob/master/dev/merge-pr.py >> >> Here is an example of what a merge commit with multiple subcommits >> looks like using this tool: >> >> https://github.com/cloudera/ibis/commit/eafabe060dcaaea0a6076342eaa374929b91cf47 >> >> It's pretty easy to use: run the script and enter the PR # you are >> merging. It automatically squashes and closes the merged PR. >> >> Let me know if this is something that would interest the team. I know >> there are varying opinions on the GitHub Green Button =) >> >> - Wes >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev From jeffreback at gmail.com Sat Jan 16 23:11:55 2016 From: jeffreback at gmail.com (Jeff Reback) Date: Sat, 16 Jan 2016 23:11:55 -0500 Subject: [Pandas-dev] Thoughts on adopting a 1-PR-1-commit policy? In-Reply-To: References: Message-ID: <6D657DD5-1A6C-4D74-9B42-E603AE27B675@gmail.com> seems fine to use the cherry picking script though I don't think it should relax users from squashing > On Jan 16, 2016, at 6:46 PM, Wes McKinney wrote: > > Copying the mailing list. Indeed makes rebasing unnecessary if there > are no cherry-pick conflicts. > > On Sat, Jan 16, 2016 at 3:34 PM, Tom Augspurger > wrote: >> Rebasing can be tough for new contributors, so for that alone I'd say let's try it. >> >> -Tom >> >>> On Jan 16, 2016, at 5:20 PM, Wes McKinney wrote: >>> >>> I've grown very fond of the PR cherry-picking style used in many >>> Apache projects. >>> >>> Here's an example of a very large commit to Apache Spark that was >>> performed in this fashion: >>> >>> https://github.com/apache/spark/commit/2fe0a1aaeebbf7f60bd4130847d738c29f1e3d53#diff-e1e1d3d40573127e9ee0480caf1283d6 >>> >>> If you compare pandas's commit history with a project like this, >>> you'll see it is much easier to follow because there is one commit for >>> each patch to the project, rather than a merge commit plus 1 or more >>> merged commits (depending on whether the person merging the PR did an >>> interactive rebase).
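For reference, the squashed commit message such a tool produces looks roughly like the following -- a made-up example; the Ibis commit linked above is a real one:

    ENH: add widget frobnication

    Longer description carried over from the pull request body.

    Author: A. Contributor <contributor@example.com>

    Closes #NNNN from a-contributor/frobnication and squashes the
    following commits:

      abc1234 [A. Contributor] address review comments
      def5678 [A. Contributor] add frobnication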
I know >>>> there are varying opinions on the GitHub Green Button =) >>>> >>>> - Wes >>>> _______________________________________________ >>>> Pandas-dev mailing list >>>> Pandas-dev at python.org >>>> https://mail.python.org/mailman/listinfo/pandas-dev >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev From wesmckinn at gmail.com Sun Jan 17 08:34:07 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Sun, 17 Jan 2016 05:34:07 -0800 Subject: [Pandas-dev] Thoughts on adopting a 1-PR-1-commit policy? In-Reply-To: References: <6D657DD5-1A6C-4D74-9B42-E603AE27B675@gmail.com> Message-ID: hey Jan, I'm adding the mailing list back. Several comments inline On Sun, Jan 17, 2016 at 1:09 AM, Jan Schulz wrote: > Hi, > > Just a different opinion: I like having commits do one logical thing > and not squash multiple "logical complete"things together (this means > that commit is a logical step, not the PR and that the commits should > be clean and not contain "fixup", "typo" style commits). During the > categorical work, I found that a few times I regretted that I couldn't > go back to look up the specific change in a commit and look up what > and why that commit was done because it was all mixed up with the rest > of the squashed commits in that PR. > I'm not suggesting that you should have to squash your commits inside the PR. This only concerns how *patches are applied to pandas's master branch". Ideally, no squashing occurs inside the developer branch (so the "story", so to speak, about the patch is preserved), but what the Apache patch tool does is - Turns a multi-commit PR into a single-commit patch - Puts the individual commit hashes in the commit message; so you can always visit the original commits - Puts the description from the PR into the commit message - Cherry-picks instead of merging, so you can observe evolution of pandas/master in a clear and linear way > My feeling was always that squashes are performed because the rebases > are so hard because of multiple PRs fixing stuff in the same files at > the same time. If this refactorings come through (both better > separation of backend-frontend specific code and the suggested split > up of frame.py, etc), I think this is not so much a problem anymore. > > I'm not sure where the commit history with merges is a problem: during > balme (in github, never used git itself), I don't see any merges?! > Here is something that would be very hard: create release notes given the commit history. > So my suggestion would go in a different direction: better commits in > the PRs with proper commit messages (not only the headline but also > explanations in the message. > > Jan > -- > Jan Schulz > mail: jasc at gmx.net > web: http://www.katzien.de > > > On 17 January 2016 at 07:29, Wes McKinney wrote: >> The script performs the squashing automatically, so users can squash >> manually if they wish or let us do it. >> >> On Sat, Jan 16, 2016 at 8:11 PM, Jeff Reback wrote: >>> seems find to use the cherry picking script >>> >>> though I don't think should relax users from squashing >>> >>>> On Jan 16, 2016, at 6:46 PM, Wes McKinney wrote: >>>> >>>> Copying the mailing list. Indeed makes rebasing unnecessary if there >>>> are no cherry-pick conflicts. >>>> >>>> On Sat, Jan 16, 2016 at 3:34 PM, Tom Augspurger >>>> wrote: >>>>> Rebasing can be tough for new contributors, so for that alone I'd say let's try it. 
>>>>> >>>>> -Tom >>>>> >>>>>> On Jan 16, 2016, at 5:20 PM, Wes McKinney wrote: >>>>>> >>>>>> I've grown very fond of the PR cherry-picking style used in many >>>>>> Apache projects. >>>>>> >>>>>> Here's an example of a very large commit to Apache Spark that was >>>>>> performed in this fashion: >>>>>> >>>>>> https://github.com/apache/spark/commit/2fe0a1aaeebbf7f60bd4130847d738c29f1e3d53#diff-e1e1d3d40573127e9ee0480caf1283d6 >>>>>> >>>>>> If you compare pandas's commit history with a project like this, >>>>>> you'll see it is much easier to follow because there is one commit for >>>>>> each patch to the project, rather than a merge commit plus 1 or more >>>>>> merged commits (depending on whether the person merging the PR did an >>>>>> interactive rebase). >>>>>> >>>>>> The script to do this is not too complex, and is even less complex for >>>>>> pandas because we do not use JIRA: >>>>>> >>>>>> https://github.com/apache/spark/blob/master/dev/merge_spark_pr.py >>>>>> >>>>>> I've been using a pared down version of the script in Ibis: >>>>>> >>>>>> https://github.com/cloudera/ibis/blob/master/dev/merge-pr.py >>>>>> >>>>>> Here is an example of what a merge commit with multiple subcommits >>>>>> looks like using this tool: >>>>>> >>>>>> https://github.com/cloudera/ibis/commit/eafabe060dcaaea0a6076342eaa374929b91cf47 >>>>>> >>>>>> It's pretty easy to use: run the script and enter the PR # you are >>>>>> merging. It automatically squashes and closes the merged PR. >>>>>> >>>>>> Let me know if this is something that would interest the team. I know >>>>>> there are varying opinions on the GitHub Green Button =) >>>>>> >>>>>> - Wes >>>>>> _______________________________________________ >>>>>> Pandas-dev mailing list >>>>>> Pandas-dev at python.org >>>>>> https://mail.python.org/mailman/listinfo/pandas-dev >>>> _______________________________________________ >>>> Pandas-dev mailing list >>>> Pandas-dev at python.org >>>> https://mail.python.org/mailman/listinfo/pandas-dev >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev From jeffreback at gmail.com Sun Jan 17 11:56:48 2016 From: jeffreback at gmail.com (Jeff Reback) Date: Sun, 17 Jan 2016 11:56:48 -0500 Subject: [Pandas-dev] Thoughts on adopting a 1-PR-1-commit policy? In-Reply-To: References: <6D657DD5-1A6C-4D74-9B42-E603AE27B675@gmail.com> Message-ID: ok this is now implemented, ./merge-pr.py will merge things via cherry-picking. Though still think that most users should have a clean PR going in. This will be more useful for bigger patches where you do want to preserve the history. On Sun, Jan 17, 2016 at 8:34 AM, Wes McKinney wrote: > hey Jan, > > I'm adding the mailing list back. Several comments inline > > On Sun, Jan 17, 2016 at 1:09 AM, Jan Schulz wrote: > > Hi, > > > > Just a different opinion: I like having commits do one logical thing > > and not squash multiple "logical complete"things together (this means > > that commit is a logical step, not the PR and that the commits should > > be clean and not contain "fixup", "typo" style commits). During the > > categorical work, I found that a few times I regretted that I couldn't > > go back to look up the specific change in a commit and look up what > > and why that commit was done because it was all mixed up with the rest > > of the squashed commits in that PR. > > > > I'm not suggesting that you should have to squash your commits inside > the PR. 
This only concerns how *patches are applied to pandas's master > branch". Ideally, no squashing occurs inside the developer branch (so > the "story", so to speak, about the patch is preserved), but what the > Apache patch tool does is > > - Turns a multi-commit PR into a single-commit patch > - Puts the individual commit hashes in the commit message; so you can > always visit the original commits > - Puts the description from the PR into the commit message > - Cherry-picks instead of merging, so you can observe evolution of > pandas/master in a clear and linear way > > > My feeling was always that squashes are performed because the rebases > > are so hard because of multiple PRs fixing stuff in the same files at > > the same time. If this refactorings come through (both better > > separation of backend-frontend specific code and the suggested split > > up of frame.py, etc), I think this is not so much a problem anymore. > > > > I'm not sure where the commit history with merges is a problem: during > > balme (in github, never used git itself), I don't see any merges?! > > > > Here is something that would be very hard: create release notes given > the commit history. > > > So my suggestion would go in a different direction: better commits in > > the PRs with proper commit messages (not only the headline but also > > explanations in the message. > > > > Jan > > -- > > Jan Schulz > > mail: jasc at gmx.net > > web: http://www.katzien.de > > > > > > On 17 January 2016 at 07:29, Wes McKinney wrote: > >> The script performs the squashing automatically, so users can squash > >> manually if they wish or let us do it. > >> > >> On Sat, Jan 16, 2016 at 8:11 PM, Jeff Reback > wrote: > >>> seems find to use the cherry picking script > >>> > >>> though I don't think should relax users from squashing > >>> > >>>> On Jan 16, 2016, at 6:46 PM, Wes McKinney > wrote: > >>>> > >>>> Copying the mailing list. Indeed makes rebasing unnecessary if there > >>>> are no cherry-pick conflicts. > >>>> > >>>> On Sat, Jan 16, 2016 at 3:34 PM, Tom Augspurger > >>>> wrote: > >>>>> Rebasing can be tough for new contributors, so for that alone I'd > say let's try it. > >>>>> > >>>>> -Tom > >>>>> > >>>>>> On Jan 16, 2016, at 5:20 PM, Wes McKinney > wrote: > >>>>>> > >>>>>> I've grown very fond of the PR cherry-picking style used in many > >>>>>> Apache projects. > >>>>>> > >>>>>> Here's an example of a very large commit to Apache Spark that was > >>>>>> performed in this fashion: > >>>>>> > >>>>>> > https://github.com/apache/spark/commit/2fe0a1aaeebbf7f60bd4130847d738c29f1e3d53#diff-e1e1d3d40573127e9ee0480caf1283d6 > >>>>>> > >>>>>> If you compare pandas's commit history with a project like this, > >>>>>> you'll see it is much easier to follow because there is one commit > for > >>>>>> each patch to the project, rather than a merge commit plus 1 or more > >>>>>> merged commits (depending on whether the person merging the PR did > an > >>>>>> interactive rebase). 
From jasc at gmx.net Sun Jan 17 12:04:37 2016
From: jasc at gmx.net (Jan Schulz)
Date: Sun, 17 Jan 2016 18:04:37 +0100
Subject: [Pandas-dev] Thoughts on adopting a 1-PR-1-commit policy?
In-Reply-To:
References: <6D657DD5-1A6C-4D74-9B42-E603AE27B675@gmail.com>
Message-ID:

Hi,

On 17 January 2016 at 14:34, Wes McKinney wrote:
> I'm adding the mailing list back. Several comments inline

Oops, sorry!

> I'm not suggesting that you should have to squash your commits inside the PR. This only concerns how *patches are applied to pandas's master branch*. Ideally, no squashing occurs inside the developer branch (so the "story", so to speak, about the patch is preserved), but what the Apache patch tool does is

I would still argue against this: the master branch is what is used in blame, and figuring out why something was done in that way is much harder if you always have to get back to some commits in obscure branches which might even be removed from the repo.

IMO, all this squashing is an incentive not to write good commit messages, as these are in the end more or less discarded as they are all mixed up :-(

At least that's what happened with me: I tried to write "unit of change" commits (`rebase -i` all "typo" and "fixup" commits + good commit messages), but then these got squashed and I stopped writing such messages because it felt that this was simply wasted and not appreciated.

The result was that some decisions which I took are not explained in the commits and were lost when the topic was revisited half a year later.

> Here is something that would be very hard: create release notes given the commit history.

I don't think this gets any easier as long as some manual things are done. There will still be simple commits which correct a typo and which should not show up in the release notes.
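To make the first option below concrete, such a generator could be as simple as this rough sketch (illustrative only; it assumes one merge commit per PR and that the merge message body carries the PR description):

    import subprocess

    def release_notes(prev_release, head="HEAD"):
        # one NUL-separated message body per merge commit in the range
        raw = subprocess.check_output([
            "git", "log", "--merges", "--format=%b%x00",
            "%s..%s" % (prev_release, head),
        ]).decode()
        bodies = [b.strip() for b in raw.split("\x00") if b.strip()]
        # the first line of each merge body becomes one bullet
        return "\n".join("* " + b.splitlines()[0] for b in bodies)

    print(release_notes("v0.17.1"))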
Whatever technical solution is used for this, some discipline has to be maintained to make that successful.

If release notes generation is what this is all about, then there could be several solutions:

* only generate release notes from merge commits (remove headline, only use message)
* only generate release notes from commits which include a tag
* only generate release notes from merge commits which include a tag

Jan

From shoyer at gmail.com Sun Jan 17 13:34:27 2016
From: shoyer at gmail.com (Stephan Hoyer)
Date: Sun, 17 Jan 2016 10:34:27 -0800 (PST)
Subject: [Pandas-dev] Thoughts on adopting a 1-PR-1-commit policy?
In-Reply-To:
References:
Message-ID: <1453055667593.2ea32751@Nodemailer>

I actually have a soft spot for the Green Button, although I'm rarely the one hitting merge these days. In particular, I like that it preserves the identities of individual patch authors who contributed to a big change, and assures they all get credit on github.

On Saturday, Jan 16, 2016 at 3:21 PM, Wes McKinney wrote:
> Let me know if this is something that would interest the team. I know there are varying opinions on the GitHub Green Button =)

From wesmckinn at gmail.com Sun Jan 17 13:37:02 2016
From: wesmckinn at gmail.com (Wes McKinney)
Date: Sun, 17 Jan 2016 10:37:02 -0800
Subject: [Pandas-dev] Thoughts on adopting a 1-PR-1-commit policy?
In-Reply-To:
References: <6D657DD5-1A6C-4D74-9B42-E603AE27B675@gmail.com>
Message-ID:

On Sun, Jan 17, 2016 at 9:04 AM, Jan Schulz wrote:
> Hi,
>
> On 17 January 2016 at 14:34, Wes McKinney wrote:
>> I'm adding the mailing list back. Several comments inline
>
> Oops, sorry!
>
>> I'm not suggesting that you should have to squash your commits inside the PR. This only concerns how *patches are applied to pandas's master branch*. Ideally, no squashing occurs inside the developer branch (so the "story", so to speak, about the patch is preserved), but what the Apache patch tool does is
>
> I would still argue against this: the master branch is what is used in blame, and figuring out why something was done in that way is much harder if you always have to get back to some commits in obscure branches which might even be removed from the repo.

These issues should be addressed during the code review process. It is worse, in my opinion, to have a mix of intermediate (possibly broken) and verified commits in master as opposed to atomic, verified commits.

Additionally, lack of "atomicity" with patches has more issues:

- Difficult to revert patches
- Difficult to port patches into maintenance branches

Requiring patches to be atomic is common practice in large software teams because otherwise codebase maintenance is a nightmare with > 5-10 developers working in parallel.

> IMO, all this squashing is an incentive not to write good commit messages, as these are in the end more or less discarded as they are all mixed up :-(

Squash or no squash, the only way to have good commit messages is to expect a certain level of professionalism from pandas contributors. If the commit / PR description is inadequate, this is the responsibility of the code reviewer to address with the developer proposing the patch.

> At least that's what happened with me: I tried to write "unit of change" commits (`rebase -i` all "typo" and "fixup" commits + good commit messages), but then these got squashed and I stopped writing such messages because it felt that this was simply wasted and not appreciated.
> The result was that some decisions which I took are not explained in the commits and were lost when the topic was revisited half a year later.
>
>> Here is something that would be very hard: create release notes given the commit history.
>
> I don't think this gets any easier as long as some manual things are done. There will still be simple commits which correct a typo and which should not show up in the release notes. Whatever technical solution is used for this, some discipline has to be maintained to make that successful.
>
> If release notes generation is what this is all about, then there could be several solutions:
>
> * only generate release notes from merge commits (remove headline, only use message)
> * only generate release notes from commits which include a tag
> * only generate release notes from merge commits which include a tag

No, the release notes aren't a major factor, but one of many issues caused by a non-atomic, non-linear commit history.

> Jan
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev

From wesmckinn at gmail.com Sun Jan 17 13:39:23 2016
From: wesmckinn at gmail.com (Wes McKinney)
Date: Sun, 17 Jan 2016 10:39:23 -0800
Subject: [Pandas-dev] Thoughts on adopting a 1-PR-1-commit policy?
In-Reply-To: <1453055667593.2ea32751@Nodemailer>
References: <1453055667593.2ea32751@Nodemailer>
Message-ID:

On Sun, Jan 17, 2016 at 10:34 AM, Stephan Hoyer wrote:
> I actually have a soft spot for the Green Button, although I'm rarely the one hitting merge these days.
> In particular, I like that it preserves the identities of individual patch authors who contributed to a big change, and assures they all get credit on github.

The patch tool does not occlude the patch author identity on GitHub (i.e. the patches will show up on the user profile just the same).

> On Saturday, Jan 16, 2016 at 3:21 PM, Wes McKinney wrote:
>> Let me know if this is something that would interest the team. I know there are varying opinions on the GitHub Green Button =)

From shoyer at gmail.com Sun Jan 17 13:46:12 2016
From: shoyer at gmail.com (Stephan Hoyer)
Date: Sun, 17 Jan 2016 10:46:12 -0800 (PST)
Subject: [Pandas-dev] Thoughts on adopting a 1-PR-1-commit policy?
In-Reply-To:
References:
Message-ID: <1453056371875.b7c59591@Nodemailer>

On Sunday, Jan 17, 2016 at 10:40 AM, Wes McKinney wrote:
> The patch tool does not occlude the patch author identity on GitHub (i.e. the patches will show up on the user profile just the same).

Yes, but not on the "contributors" page for the github project itself.

That said, I agree that atomic commits are useful for large projects. This is part of why we encourage/require squashing. I'm not entirely sure that pandas has enough synchronous development that this is necessary.

If this helps maintenance branches, it would certainly be a win -- we haven't been very good about maintaining bug-fix-only branches, which would be a healthy thing to do for a mature project.

From wesmckinn at gmail.com Sun Jan 17 13:48:02 2016
From: wesmckinn at gmail.com (Wes McKinney)
Date: Sun, 17 Jan 2016 10:48:02 -0800
Subject: [Pandas-dev] Thoughts on adopting a 1-PR-1-commit policy?
In-Reply-To: <1453056371875.b7c59591@Nodemailer>
References: <1453056371875.b7c59591@Nodemailer>
Message-ID:

On Sun, Jan 17, 2016 at 10:46 AM, Stephan Hoyer wrote:
> On Sunday, Jan 17, 2016 at 10:40 AM, Wes McKinney wrote:
>> The patch tool does not occlude the patch author identity on GitHub (i.e. the patches will show up on the user profile just the same).
>
> Yes, but not on the "contributors" page for the github project itself.

No, this is not true. See https://github.com/apache/spark/graphs/contributors

> That said, I agree that atomic commits are useful for large projects. This is part of why we encourage/require squashing. I'm not entirely sure that pandas has enough synchronous development that this is necessary.
> If this helps maintenance branches, it would certainly be a win -- we haven't been very good about maintaining bug-fix-only branches, which would be a healthy thing to do for a mature project.

From jeffreback at gmail.com Sun Jan 17 13:56:30 2016
From: jeffreback at gmail.com (Jeff Reback)
Date: Sun, 17 Jan 2016 13:56:30 -0500
Subject: [Pandas-dev] Thoughts on adopting a 1-PR-1-commit policy?
In-Reply-To:
References: <1453056371875.b7c59591@Nodemailer>
Message-ID:

if this preserves the individual authors' commits (and it appears that way), then I am all for a single commit, even for large patches.

On Sun, Jan 17, 2016 at 1:48 PM, Wes McKinney wrote:
> No, this is not true. See https://github.com/apache/spark/graphs/contributors
>
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev

From wesmckinn at gmail.com Sun Jan 17 13:58:30 2016
From: wesmckinn at gmail.com (Wes McKinney)
Date: Sun, 17 Jan 2016 10:58:30 -0800
Subject: [Pandas-dev] Thoughts on adopting a 1-PR-1-commit policy?
In-Reply-To:
References: <1453056371875.b7c59591@Nodemailer>
Message-ID:

I just made my first patch to Parquet, which was committed with this method, and you can see it here

https://github.com/apache/parquet-cpp/graphs/contributors

On Sun, Jan 17, 2016 at 10:56 AM, Jeff Reback wrote:
> if this preserves the individual authors' commits (and it appears that way), then I am all for a single commit, even for large patches.
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev

From jorisvandenbossche at gmail.com Sun Jan 17 14:02:05 2016
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Sun, 17 Jan 2016 20:02:05 +0100
Subject: [Pandas-dev] Thoughts on adopting a 1-PR-1-commit policy?
In-Reply-To:
References: <1453056371875.b7c59591@Nodemailer>
Message-ID:

Stephan, the way the script works is that the commit author is still the original contributor from the PR, but the committer is the one from the core team running the script (IIUC)

2016-01-17 19:58 GMT+01:00 Wes McKinney :
> I just made my first patch to Parquet, which was committed with this method, and you can see it here
>
> https://github.com/apache/parquet-cpp/graphs/contributors
>
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev

From shoyer at gmail.com Sun Jan 17 14:04:27 2016
From: shoyer at gmail.com (Stephan Hoyer)
Date: Sun, 17 Jan 2016 11:04:27 -0800 (PST)
Subject: [Pandas-dev] Thoughts on adopting a 1-PR-1-commit policy?
In-Reply-To:
References:
Message-ID: <1453057467070.3bb357c@Nodemailer>

Does it preserve credit for all the authors on a PR (beyond the first), even if they don't end up as the author on the squashed commit? That would surprise me. I do agree this is a bit of an edge case, though it does include big merges like the one in your original email.

On Sun, Jan 17, 2016 at 10:59 AM, Wes McKinney wrote:
> I just made my first patch to Parquet, which was committed with this method, and you can see it here
>
> https://github.com/apache/parquet-cpp/graphs/contributors
>
> On Sun, Jan 17, 2016 at 10:56 AM, Jeff Reback wrote:
>> if this preserves the individual authors' commits (and it appears that way), then I am all for a single commit, even for large patches.
From jasc at gmx.net Sun Jan 17 15:53:33 2016
From: jasc at gmx.net (Jan Schulz)
Date: Sun, 17 Jan 2016 21:53:33 +0100
Subject: [Pandas-dev] Thoughts on adopting a 1-PR-1-commit policy?
In-Reply-To:
References: <6D657DD5-1A6C-4D74-9B42-E603AE27B675@gmail.com>
Message-ID:

Hi,

On 17 January 2016 at 19:37, Wes McKinney wrote:
> These issues should be addressed during the code review process. It is worse, in my opinion, to have a mix of intermediate (possibly broken) and verified commits in master as opposed to atomic, verified commits.
>
> Additionally, lack of "atomicity" with patches has more issues:

I agree, but there is also `rebase -i` to make these commits be atomic...

E.g. https://github.com/pydata/pandas/pull/11582

The 4 commits work on their own and AFAIR each commit tested green.

So: IMO important changes (especially everything which changes behaviour) should get one commit (or PR/cherry-pick as per your proposal) per behaviour change, and this behaviour change should be explained in the commit message.

If the above PR had been squashed, then 4 messages would be appended to each other and one couldn't separate which description would belong to which...

Jan

From wesmckinn at gmail.com Sun Jan 17 18:34:28 2016
From: wesmckinn at gmail.com (Wes McKinney)
Date: Sun, 17 Jan 2016 15:34:28 -0800
Subject: [Pandas-dev] Thoughts on adopting a 1-PR-1-commit policy?
In-Reply-To:
References: <6D657DD5-1A6C-4D74-9B42-E603AE27B675@gmail.com>
Message-ID:

Yeah, the point of this issue is not "there shall be no more than 1 commit per PR" but rather that smaller patches (i.e., most patches) should not degrade the signal-to-noise ratio of our commit history. Further, we should avoid merging commits that don't stand on their own. Lastly, merge commits generally only serve to degrade the SnR.

Let's look at a sample of yesterday's commits:

https://www.dropbox.com/s/mp5yfp76h6h8z3y/commit-log-20160116.png?dl=0

No mistakes were made here, except that our current process (which Jeff has been following diligently) is resulting in a commit history that is less useful than it could be.
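As a crude way to put a number on the noise -- an illustrative sketch, not something we would need to ship:

    import subprocess

    def count(extra=()):
        # total (or merge-only) commit count reachable from master
        out = subprocess.check_output(
            ["git", "rev-list", "--count"] + list(extra) + ["master"])
        return int(out.decode())

    total, merges = count(), count(["--merges"])
    print("%d of %d commits on master are merge commits" % (merges, total))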
My preference is:

- Use the patch tool for smaller patches and large patches that haven't been split out into a series of incremental, standalone patches
- For large patches that make sense as multiple incremental commits, none of which breaks the build, merge with --ff-only (rebasing as needed). I expect this to be rare.

I really like to avoid "edge-case driven development" -- we are bound to have patches where this guidance doesn't feel right, and we definitely don't have to dogmatically follow it.

- Wes

On Sun, Jan 17, 2016 at 12:53 PM, Jan Schulz wrote:
> I agree, but there is also `rebase -i` to make these commits be atomic...

From njs at pobox.com Sun Jan 17 15:02:45 2016
From: njs at pobox.com (Nathaniel Smith)
Date: Sun, 17 Jan 2016 12:02:45 -0800
Subject: [Pandas-dev] Thoughts on adopting a 1-PR-1-commit policy?
In-Reply-To:
References: <6D657DD5-1A6C-4D74-9B42-E603AE27B675@gmail.com>
Message-ID:

On Jan 17, 2016 10:37, "Wes McKinney" wrote:
> [...]
> Requiring patches to be atomic is common practice in large software teams because otherwise codebase maintenance is a nightmare with > 5-10 developers working in parallel.

A few thoughts:

For what it's worth, possibly the largest/most parallel software collaboration in the world is the Linux kernel, and they mandate that complex patches must *not* be squashed, but must instead be broken up into a series of self-contained incremental patches (as Jan is advocating).

BTW I think you'll find that if you consistently merge using --no-ff (which is what the green button does), then "git log --first-parent" will give you *exactly* the same linear squashed history that you are hoping for, as in the diffs will be byte-for-byte identical. This approach keeps all the history in the repository, and discards the distracting parts at access time rather than commit time. In numpy we mandate that all PRs go via the green button for this reason.

I suspect that the projects you're thinking of do what they do because of a combination of (a) not being very large in the grand scheme of things, so that the linearization itself doesn't become a bottleneck the way it would for a project like the kernel, (b) not understanding git terribly well, and (c) having to assume an even lower level of git knowledge in individual contributors. (Versus the kernel, where they have the "luxury" of imposing arbitrarily high standards and then abusing anyone who doesn't meet them until they figure it out or quit.)

Note that it isn't a great idea to assume that the individual commits that you squashed will still be findable later, even given their id.
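To check whether a given hash is still resolvable in a local clone, a quick sketch (illustrative only; `git cat-file -e` exits non-zero once the object is gone):

    import subprocess

    def commit_exists(sha):
        # exit status 0 only if the object exists and is a commit
        rc = subprocess.call(["git", "cat-file", "-e", sha + "^{commit}"])
        return rc == 0

A lookup that 404s on github is the remote-side version of this check failing.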
It's actually a good idea, generally, to delete no-longer-relevant branches from a personal fork, to avoid getting lost among hundreds of similarly named branches. I actually need to get more in the habit of doing this :-). And even if one disagrees about it being a good idea, people do it and there's no way to stop them. But when this happens, if the commit was cherry-picked into the main repo and the branch is gone from the personal fork, then github will eventually garbage collect the original commits, and trying to look up those commit hashes, maybe years later, will give you a 404.

Anyway, both processes can obviously work, and what works for you is what works for you. I'm not an absolutist :-). But I thought it might be helpful to at least be aware of some of these points while making the decision.

Cheers,
-n

From wesmckinn at gmail.com Mon Jan 18 17:59:35 2016
From: wesmckinn at gmail.com (Wes McKinney)
Date: Mon, 18 Jan 2016 14:59:35 -0800
Subject: [Pandas-dev] Thoughts on adopting a 1-PR-1-commit policy?
In-Reply-To:
References: <6D657DD5-1A6C-4D74-9B42-E603AE27B675@gmail.com>
Message-ID:

On Sun, Jan 17, 2016 at 3:34 PM, Wes McKinney wrote:
> I really like to avoid "edge-case driven development" -- we are bound to have patches where this guidance doesn't feel right, and we definitely don't have to dogmatically follow it.

For the record -- I spent some time reviewing the major category dtype pull requests that were merged in 2014, and given the sprawling nature of those changes and the huge amount of collaboration that took place, I agree it would have been preferable to fast-forward merge the incremental commits instead of squashing them into a couple of monolithic commits.

https://github.com/pydata/pandas/commit/0f62d3fc62f317538044ed3d349bfb89fb7ee9de
https://github.com/pydata/pandas/commit/ea0a13c172761348d08285a19ebf731cdabb2db3
From wesmckinn at gmail.com Mon Jan 25 11:47:42 2016
From: wesmckinn at gmail.com (Wes McKinney)
Date: Mon, 25 Jan 2016 08:47:42 -0800
Subject: [Pandas-dev] Unit test reorganization
In-Reply-To:
References:
Message-ID:

hi all,

As part of code cleanup and reorganization, let's start creating a "quarantine" of test code for functionality (like Panel classes) that we are contemplating deprecating and later removing in 1.0, if that sounds like a good idea?

- Wes

On Wed, Jan 13, 2016 at 6:06 PM, Wes McKinney wrote:
> On Wed, Jan 13, 2016 at 6:01 PM, Jeff Reback wrote:
>> so I agree +1 on moving all to pandas/tests
>>
>> - the indexing tests, which *mostly* are in test_indexing.py, though quite a few are in test_series/test_frame.py, should ideally be merged into a set of tests/indexing
>>
>> - io tests could be left alone I think
>
> Yeah, I think pandas/io/tests is the one definite exception where there isn't much benefit
>
>> - stats tests are *mostly* deprecated
>>
>> - since we're going to deprecate panel + nd soon, I think it makes sense to move these tests & code to pandas/deprecated, to keep them separate
>>
>> - test_tslib.py should be integrated into tseries/test_timeseries.py
>>
>> - almost all of the Index tests are now in test_index (with each sub-class being somewhat generically tested), but the time-series ones are in tseries/test_base, so these could be merged as well.
>
> Yep, it specifically would be good to collect 100% of the index data structure machinery (including Datetime/Timedelta/PeriodIndex) in one place (same for axis indexing as you said, since it got pretty scattered)
>
>> Jeff
>>
>> On Wed, Jan 13, 2016 at 8:51 PM, Wes McKinney wrote:
>>> Another idea here I've been toying with to achieve better logical test organization is to place all tests in the whole project under pandas/tests. This way we can centralize all the tests relating to some functional aspect of pandas in one place, rather than the status quo where test code tends to be fairly close to its implementation (but not always). A prime example of where I let this get disorganized early on is time series functionality tests are somewhat scattered across pandas/tests, pandas/tseries, etc. This way we can also collect a single directory of "quarantined" pandas 0.x behavior that we are contemplating changing in a 1.0 release.
>>>
>>> Thoughts on this + other ideas on how to help organize the tests to help mentally in approaching refactoring and internal changes?
>>>
>>> - Wes
>>>
>>> On Wed, Jan 13, 2016 at 1:16 PM, Wes McKinney wrote:
>>> > OK, I got started with the biggest offender:
>>> >
>>> > https://github.com/pydata/pandas/pull/12032
>>> >
>>> > It would be great to take the same approach with the other large test modules, with a special eye for quarantining "leaky" internals code and segregating NumPy interoperability contracts.
>>> > I didn't completely do this with test_frame.py but it's a good start.
>>> >
>>> > There's definitely plenty of code in the other top-level test modules which may nest under tests/frame or tests/series
>>> >
>>> > - Wes
>>> >
>>> > On Mon, Jan 11, 2016 at 8:47 AM, Wes McKinney wrote:
>>> >> On Sun, Jan 10, 2016 at 6:06 PM, Stephan Hoyer wrote:
>>> >>> On Fri, Jan 8, 2016 at 5:34 PM, Wes McKinney wrote:
>>> >>>> Big #1 question is, how strongly do you feel about *shipping* the test suite in site-packages? Some other libraries with sprawling and complex test suites have chosen not to ship them: https://github.com/zzzeek/sqlalchemy
>>> >>>
>>> >>> I would prefer to include the test suite if possible, because the ability to type "nosetests pandas" makes it easy both for users to verify installations are working properly and for downstream distributors to identify and report bugs. The complete pandas test suite still runs in 20-30 minutes, so I think it's still fairly reasonable to use it for these purposes.
>>> >>
>>> >> Got it. I wasn't sure if this was something people still wanted to do in practice with the burgeoning test suite.
>>> >>
>>> >>>> Independently, I would support and help with starting a judicious reorganization of the contents of pandas/tests. So I'm thinking like
>>> >>>>
>>> >>>> tests/
>>> >>>> dataframe/
>>> >>>> series/
>>> >>>> algorithms/
>>> >>>> internals/
>>> >>>> tseries/
>>> >>>>
>>> >>>> and so forth.
>>> >>>
>>> >>> This sounds like a great idea -- these files have really gotten out of control!
>>> >>
>>> >> Sounds good. I've been sorting through points of contact between Series/DataFrame's implementation and internal matters (e.g. the BlockManager) and figured it would be good to "quarantine" code that makes assumptions about what's under the hood. I'll get the first couple patches started and it can be a slow burn to break apart these large files.
>>> >>
>>> >>> Cheers,
>>> >>> Stephan
>>> _______________________________________________
>>> Pandas-dev mailing list
>>> Pandas-dev at python.org
>>> https://mail.python.org/mailman/listinfo/pandas-dev

From jeffreback at gmail.com Mon Jan 25 11:59:25 2016
From: jeffreback at gmail.com (Jeff Reback)
Date: Mon, 25 Jan 2016 11:59:25 -0500
Subject: [Pandas-dev] Unit test reorganization
In-Reply-To:
References:
Message-ID:

sounds good to me.

on other notes: planning on doing the 0.18.0 RC in say 2 weeks' time. I think that adding to_xarray to 0.18.0 is realistic, but I think we need to push deprecating Panel to 0.19.0, simply to have some time for this to brew (and for to_xarray to mature). (*could* simply delay 0.18.0 for say a month otherwise). any objections?

On Mon, Jan 25, 2016 at 11:47 AM, Wes McKinney wrote:
> hi all,
>
> As part of code cleanup and reorganization, let's start creating a "quarantine" of test code for functionality (like Panel classes) that we are contemplating deprecating and later removing in 1.0, if that sounds like a good idea?
>
> - Wes
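fwiw, tagging the quarantined tests so they can be excluded (or run on their own) is straightforward with nose's attrib plugin -- a rough sketch, illustrative only and not committed code:

    # mark tests for functionality we may deprecate (e.g. Panel)
    from nose.plugins.attrib import attr

    @attr('quarantine')
    def test_panel_transpose():
        import numpy as np
        import pandas as pd

        p = pd.Panel(np.arange(24).reshape(2, 3, 4))
        assert p.transpose(2, 0, 1).shape == (4, 2, 3)

    # run the suite without quarantined tests:
    #   nosetests -a '!quarantine' pandas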
_______________________________________________
Pandas-dev mailing list
Pandas-dev at python.org
https://mail.python.org/mailman/listinfo/pandas-dev