From jorisvandenbossche at gmail.com Fri Dec 18 12:04:17 2015
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Fri, 18 Dec 2015 18:04:17 +0100
Subject: [Pandas-dev] Notice: backwards incompatible change to the 'resample' method considered for 0.18
Message-ID:

Dear all,

At the moment we are considering a change to the API of the resample method. See the issue "API: change .resample to be a groupby-like API" and Jeff's corresponding PR 11841.

Basically, we want to make it more similar to groupby. Code that is now written as:

    s.resample('D', how='max')

would become a two-step operation:

    s.resample('D').max()

This change makes it more consistent with groupby (as downsampling can be seen as a special case of groupby), and at the same time enables the more powerful features of the groupby API for resampling.
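As a sketch against the proposed API (exact method availability may differ in the final PR), the deferred object composes like a groupby result:

    import pandas as pd
    import numpy as np

    s = pd.Series(np.arange(48.),
                  index=pd.date_range('2015-01-01', periods=48, freq='H'))

    r = s.resample('D')      # deferred, groupby-like object
    r.max()                  # equivalent to the old s.resample('D', how='max')
    r.agg(['min', 'max'])    # groupby-style aggregation also becomes available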
In the current version of the PR, it will not silently break your code: there is a deprecation warning when using resample in the old way, and a clear error is raised in some cases (when assigning to the result of resample).

Feedback always welcome!

Regards,
Joris

From wesmckinn at gmail.com Thu Dec 24 19:18:46 2015
From: wesmckinn at gmail.com (Wes McKinney)
Date: Thu, 24 Dec 2015 16:18:46 -0800
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++?
Message-ID:

Deep thoughts during the holidays.

I might be out of line here, but the interpreter-heaviness of the inside of pandas objects is likely to be a long-term liability and source of performance problems and technical debt.

Has anyone put any thought into planning and beginning to execute on a rewrite that moves as much as possible of the internals into native / compiled code? I'm talking about:

- pandas/core/internals
- indexing and assignment
- much of pandas/core/common
- categorical and custom dtypes
- all indexing mechanisms

I'm concerned we've already exposed too much of the internals to users, so this might lead to a lot of API breakage, but it might be for the Greater Good. As a first step, beginning a partial migration of internals into some C++ classes that encapsulate the insides of DataFrame objects and implement indexing and block-level manipulations would be a good place to start. I think you could do this without too much disruption.

As part of this internal retooling we might give consideration to alternative data structures for representing data internal to pandas objects. Now in 2015/2016, continuing to be hamstrung by NumPy's limitations feels somewhat anachronistic. User code is riddled with workarounds for data type fidelity issues and the like. Like, really, why not add a bitndarray (similar to ilanschnell/bitarray) for storing nullness for problematic types and hide this from the user? =)

Since we are now a NumFOCUS-sponsored project, I feel like we might consider establishing some formal governance over pandas and publishing roadmap documents describing plans for the project and meeting notes from committers. There's no real "committer culture" for NumFOCUS projects like there is with the Apache Software Foundation, but we might try leading by example!

Also, I believe pandas as a project has reached a level of importance where we ought to consider planning and executing larger-scale undertakings such as this, to safeguard the future.

As for myself, well, I have my hands full in Big Data-land. I wish I could be helping more with pandas, but there are quite a few fundamental issues (like data interoperability, nested data handling, and file format support - e.g. Parquet, see http://wesmckinney.com/blog/the-problem-with-the-data-science-language-wars/) preventing Python from being more useful in industry analytics applications.

Aside: one of the bigger mistakes I made with pandas's API design was making it acceptable to call class constructors - like pandas.DataFrame - directly (versus factory functions). Sorry about that! If we could convince everyone to start writing pandas.data_frame or dataframe instead of using the class reference it would help a lot with code cleanup. It's hard to plan for these things - NumPy interoperability seemed a lot more important in 2008 than it does now, so I forgive myself.

cheers and best wishes for 2016,
Wes

From jeffreback at gmail.com Fri Dec 25 17:14:35 2015
From: jeffreback at gmail.com (Jeff Reback)
Date: Fri, 25 Dec 2015 17:14:35 -0500
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
Message-ID:

Here are some of my thoughts about the pandas Roadmap / status and some responses to Wes's thoughts.

In the last few (and upcoming) major releases we have made the following changes:

- dtype enhancements (Categorical, Timedelta, Datetime w/tz) & making these first class objects
- code refactoring to remove subclassing of ndarrays for Series & Index
- carving out / deprecating non-core parts of pandas
  - datareader
  - SparsePanel, WidePanel & other aliases (TimeSeries)
  - rpy, rplot, irow et al.
  - google-analytics
- API changes to make things more consistent
  - pd.rolling/expanding * -> .rolling/expanding (this is in master now)
  - .resample becoming a fully deferred object, like groupby.
  - multi-index slicing along any level (obviates need for .xs) and allows assignment (see the sketch after this list)
  - .loc/.iloc - for the most part obviates use of .ix
  - .pipe & .assign
  - plotting accessors
  - fixing of the sorting API
- many performance enhancements, both micro & macro (e.g. release GIL)
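As a minimal sketch of the multi-index slicing item above (toy data; pd.IndexSlice is the existing indexer helper):

    import pandas as pd
    import numpy as np

    idx = pd.MultiIndex.from_product([['a', 'b'], [1, 2, 3]],
                                     names=['key', 'num'])
    df = pd.DataFrame({'x': np.arange(6.)}, index=idx)

    df.loc[pd.IndexSlice[:, [1, 3]], :]      # slice the inner level; no .xs needed
    df.loc[pd.IndexSlice[:, 2], 'x'] = 0.0   # assignment through the same indexer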
Some on-deck enhancements are (meaning these are basically ready to go in):

- IntervalIndex (and eventually make PeriodIndex just a sub-class of this)
- RangeIndex

So: lots of changes, though nothing really earth-shaking, just more convenience, reducing magicness somewhat and providing flexibility.

Of course we are getting an increasing number of issues, mostly bug reports (and lots of dupes), some edge-case enhancements which can add to the existing APIs, and of course requests to expand the (already) large codebase to other use cases. Balancing this are a good many pull requests from many different users, some even deep into the internals.

Here are some things that I have talked about and could be considered for the roadmap. Disclaimer: I do work for Continuum, but these views are of course my own; furthermore I am obviously a bit more familiar with some of the 'sponsored' open-source libraries, but always open to new things.

- integration / automatic deferral to numba for JIT (this would be thru .apply)
- automatic deferral to dask from groupby where appropriate / maybe a .to_parallel (to simply return a dask.DataFrame object)
- incorporation of quantities / units (as part of the dtype)
- use of DyND to allow missing values for int dtypes
- make Period a first class dtype.
- provide some copy-on-write semantics to alleviate the chained-indexing issues which occasionally come up with the mis-use of the indexing API
- allow a 'policy' to automatically provide column blocks for dict-like input (e.g. each column would be a block); this would allow a pass-thru API where you could put in numpy arrays where you have views and have them preserved rather than copied automatically. Note that this would also allow what I call 'split', where a passed-in multi-dim numpy array could be split up into individual blocks (which actually gives a nice perf boost after the splitting costs).

In working towards some of these goals, I have come to the opinion that it would make sense to have a neutral API protocol layer that would allow us to swap out different engines as needed, for particular dtypes, or *maybe* out-of-core type computations. E.g. imagine that we replaced the in-memory block structure with a bcolz / memmap type; in theory this should be 'easy' and just work. I could also see us adopting *some* of the SFrame code to allow easier interop with this API layer.

In practice, I think a nice API layer would need to be created to make this clean / nice.

So this comes around to Wes's point about creating a C++ library for the internals (and possibly even some of the indexing routines). In an ideal world, of course this would be desirable. Getting there is a bit non-trivial I think, and IMHO might not be worth the effort. I don't really see big performance bottlenecks. We *already* defer much of the computation to libraries like numexpr & bottleneck (where appropriate). Adding numba / dask to the list would be helpful.

I think that almost all performance issues are the result of:

a) gross misuse of the pandas API. How much code have you seen that does df.apply(lambda x: x.sum())? (see the sketch below)
b) routines which operate column-by-column rather than block-by-block and are in python space (e.g. we have an issue right now about .quantile)
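To spell out (a) with a toy example (exact timings vary; the shape of the problem does not):

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randn(100000, 10))

    # one interpreter round-trip per column, plus a Series construction each time
    df.apply(lambda x: x.sum())

    # the vectorized block-level reduction: same result, dramatically faster
    df.sum()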
So I am glossing over a big goal of having a C++ library that represents the pandas internals. This would by definition have a C API, so that you *could* use pandas-like semantics in C/C++ and just have it work (and then pandas would be a thin wrapper around this library).

I am not averse to this, but I think it would be quite a big effort, and not a huge perf boost IMHO. Further, there are a number of API issues w.r.t. indexing which need to be clarified / worked out (e.g. should we simply deprecate []?) that are much easier to test / figure out in python space.

I also think that we have quite a large number of contributors. Moving to C++ might make the internals a bit more impenetrable than the current internals (though this would allow C++ people to contribute, so that might balance out).

We have a limited core of devs who right now are familiar with things. If someone happened to have a starting base for a C++ library, then I might change opinions here.

my 4c.

Jeff

On Thu, Dec 24, 2015 at 7:18 PM, Wes McKinney wrote:
> [...]

From wesmckinn at gmail.com Tue Dec 29 14:49:49 2015
From: wesmckinn at gmail.com (Wes McKinney)
Date: Tue, 29 Dec 2015 11:49:49 -0800
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: References: Message-ID:

I will write a more detailed response to some of these things after the new year, but, in particular, re: missing values, can you or someone tell me why creating an object that contains a NumPy array and a bitmap is not sufficient? If we can add a lightweight C/C++ class layer between NumPy function calls (e.g. arithmetic) and pandas function calls, then I see no reason why we cannot have

Int32Array->add

and

Float32Array->add

do the right thing (the former would be responsible for bitmasking to propagate NA values; the latter would defer to NumPy). If we can put all the internals of pandas objects inside a black box, we can add layers of virtual function indirection without a performance penalty (whereas adding more interpreter overhead with more abstraction layers does add up to a perf penalty).
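A toy NumPy sketch of those semantics (a boolean mask stands in for a packed bitmap; the real layer would live behind the C++ black box):

    import numpy as np

    class Int32Array(object):
        def __init__(self, values, valid):
            self.values = np.asarray(values, dtype=np.int32)
            self.valid = np.asarray(valid, dtype=bool)  # a real impl would pack bits

        def add(self, other):
            # arithmetic on the values; ANDing the validity bitmaps propagates NA
            return Int32Array(self.values + other.values,
                              self.valid & other.valid)

    a = Int32Array([1, 2, 3], [True, False, True])
    b = Int32Array([10, 20, 30], [True, True, True])
    c = a.add(b)  # c.valid is [True, False, True]; no upcast to float64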
I don't think this is too scary -- I would be willing to create a small POC C++ library to prototype something like what I'm talking about.

Since pandas has limited points of contact with NumPy, I don't think this would end up being too onerous.

For the record, I'm pretty allergic to "advanced C++"; I think it is a useful tool: if you pick a sane 20% subset of the C++11 spec and follow Google C++ style, it's not very inaccessible to intermediate developers. More or less "C plus OOP and easier object lifetime management (shared/unique_ptr, etc.)". As soon as you add a lot of template metaprogramming, C++ library development quickly becomes inaccessible except to the C++-Jedi.

Maybe let's start a Google document on "pandas roadmap" where we can break down the 1-2 year goals and some of these infrastructure issues and have our discussion there? (obviously publish this someplace once we're done)

- Wes

On Fri, Dec 25, 2015 at 2:14 PM, Jeff Reback wrote:
> [...]
From jeffreback at gmail.com Tue Dec 29 14:56:08 2015
From: jeffreback at gmail.com (Jeff Reback)
Date: Tue, 29 Dec 2015 14:56:08 -0500
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: References: Message-ID:

OK, certainly not averse to using bitfields; I agree that would solve the problem. In fact Stephan Hoyer and I briefly discussed this w.r.t. IntervalIndex, and it turns out it is just as easy to use a sentinel. In fact that was my original idea (for int NA); it is really similar to how we handle Datetime et al.
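For reference, the Datetime analogy: datetime64 columns mark nulls with the iNaT sentinel (the minimum int64) instead of carrying a mask. A sketch of the same trick for int NA:

    import numpy as np

    INT64_NA = np.iinfo(np.int64).min      # sentinel, analogous to pandas' iNaT

    values = np.array([1, INT64_NA, 3], dtype=np.int64)
    mask = values == INT64_NA              # the null mask is derived on demand

    values[~mask].sum()                    # null-aware reduction, no float upcast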
So I will create a google doc for discussion points.

I agree creating a minimalist C++ library is not too hard. But my original question stands: what are the use cases? I can enumerate some here:

- 1) performance (I am not convinced of this, but could be wrong)
- 2) a C API is always a good thing & other lang bindings

I suspect you are in the #2 camp?

On Tue, Dec 29, 2015 at 2:49 PM, Wes McKinney wrote:
> [...]
From jeffreback at gmail.com Tue Dec 29 14:59:33 2015
From: jeffreback at gmail.com (Jeff Reback)
Date: Tue, 29 Dec 2015 14:59:33 -0500
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: References: Message-ID:

Here's a link where we can discuss the roadmap:

https://docs.google.com/document/d/151ct8jcZWwh7XStptjbLsda6h2b3C0IuiH_hfZnUA58/edit?usp=sharing

On Tue, Dec 29, 2015 at 2:56 PM, Jeff Reback wrote:
> [...]
More or less "C plus OOP and easier object lifetime >> management (shared/unique_ptr, etc.)". As soon as you add a lot of >> template metaprogramming C++ library development quickly becomes >> inaccessible except to the C++-Jedi. >> >> Maybe let's start a Google document on "pandas roadmap" where we can >> break down the 1-2 year goals and some of these infrastructure issues >> and have our discussion there? (obviously publish this someplace once >> we're done) >> >> - Wes >> >> On Fri, Dec 25, 2015 at 2:14 PM, Jeff Reback >> wrote: >> > Here are some of my thoughts about pandas Roadmap / status and some >> > responses to Wes's thoughts. >> > >> > In the last few (and upcoming) major releases we have been made the >> > following changes: >> > >> > - dtype enhancements (Categorical, Timedelta, Datetime w/tz) & making >> these >> > first class objects >> > - code refactoring to remove subclassing of ndarrays for Series & Index >> > - carving out / deprecating non-core parts of pandas >> > - datareader >> > - SparsePanel, WidePanel & other aliases (TImeSeries) >> > - rpy, rplot, irow et al. >> > - google-analytics >> > - API changes to make things more consistent >> > - pd.rolling/expanding * -> .rolling/expanding (this is in master now) >> > - .resample becoming a full defered like groupby. >> > - multi-index slicing along any level (obviates need for .xs) and >> allows >> > assignment >> > - .loc/.iloc - for the most part obviates use of .ix >> > - .pipe & .assign >> > - plotting accessors >> > - fixing of the sorting API >> > - many performance enhancements both micro & macro (e.g. release GIL) >> > >> > Some on-deck enhancements are (meaning these are basically ready to go >> in): >> > - IntervalIndex (and eventually make PeriodIndex just a sub-class of >> this) >> > - RangeIndex >> > >> > so lots of changes, though nothing really earth shaking, just more >> > convenience, reducing magicness somewhat >> > and providing flexibility. >> > >> > Of course we are getting increasing issues, mostly bug reports (and >> lots of >> > dupes), some edge case enhancements >> > which can add to the existing API's and of course, requests to expand >> the >> > (already) large code to other usecases. >> > Balancing this are a good many pull-requests from many different users, >> some >> > even deep into the internals. >> > >> > Here are some things that I have talked about and could be considered >> for >> > the roadmap. Disclaimer: I do work for Continuum >> > but these views are of course my own; furthermore obviously I am a bit >> more >> > familiar with some of the 'sponsored' open-source >> > libraries, but always open to new things. >> > >> > - integration / automatic deferral to numba for JIT (this would be thru >> > .apply) >> > - automatic deferal to dask from groubpy where appropriate / maybe a >> > .to_parallel (to simply return a dask.DataFrame object) >> > - incorporation of quantities / units (as part of the dtype) >> > - use of DyND to allow missing values for int dtypes >> > - make Period a first class dtype. >> > - provide some copy-on-write semantics to alleviate the chained-indexing >> > issues which occasionaly come up with the mis-use of the indexing API >> > - allow a 'policy' to automatically provide column blocks for dict-like >> > input (e.g. each column would be a block), this would allow a pass-thru >> API >> > where you could >> > put in numpy arrays where you have views and have them preserved rather >> than >> > copied automatically. 
Note that this would also allow what I call >> 'split' >> > where a passed in >> > multi-dim numpy array could be split up to individual blocks (which >> actually >> > gives a nice perf boost after the splitting costs). >> > >> > In working towards some of these goals. I have come to the opinion that >> it >> > would make sense to have a neutral API protocol layer >> > that would allow us to swap out different engines as needed, for >> particular >> > dtypes, or *maybe* out-of-core type computations. E.g. >> > imagine that we replaced the in-memory block structure with a bclolz / >> memap >> > type; in theory this should be 'easy' and just work. >> > I could also see us adopting *some* of the SFrame code to allow easier >> > interop with this API layer. >> > >> > In practice, I think a nice API layer would need to be created to make >> this >> > clean / nice. >> > >> > So this comes around to Wes's point about creating a c++ library for the >> > internals (and possibly even some of the indexing routines). >> > In an ideal world, or course this would be desirable. Getting there is >> a bit >> > non-trivial I think, and IMHO might not be worth the effort. I don't >> > really see big performance bottlenecks. We *already* defer much of the >> > computation to libraries like numexpr & bottleneck (where appropriate). >> > Adding numba / dask to the list would be helpful. >> > >> > I think that almost all performance issues are the result of: >> > >> > a) gross misuse of the pandas API. How much code have you seen that does >> > df.apply(lambda x: x.sum()) >> > b) routines which operate column-by-column rather block-by-block and >> are in >> > python space (e.g. we have an issue right now about .quantile) >> > >> > So I am glossing over a big goal of having a c++ library that >> represents the >> > pandas internals. This would by definition have a c-API that so >> > you *could* use pandas like semantics in c/c++ and just have it work >> (and >> > then pandas would be a thin wrapper around this library). >> > >> > I am not averse to this, but I think would be quite a big effort, and >> not a >> > huge perf boost IMHO. Further there are a number of API issues w.r.t. >> > indexing >> > which need to be clarified / worked out (e.g. should we simply >> deprecate []) >> > that are much easier to test / figure out in python space. >> > >> > I also thing that we have quite a large number of contributors. Moving >> to >> > c++ might make the internals a bit more impenetrable that the current >> > internals. >> > (though this would allow c++ people to contribute, so that might balance >> > out). >> > >> > We have a limited core of devs whom right now are familar with things. >> If >> > someone happened to have a starting base for a c++ library, then I might >> > change >> > opinions here. >> > >> > >> > my 4c. >> > >> > Jeff >> > >> > >> > >> > >> > On Thu, Dec 24, 2015 at 7:18 PM, Wes McKinney >> wrote: >> >> >> >> Deep thoughts during the holidays. >> >> >> >> I might be out of line here, but the interpreter-heaviness of the >> >> inside of pandas objects is likely to be a long-term liability and >> >> source of performance problems and technical debt. >> >> >> >> Has anyone put any thought into planning and beginning to execute on a >> >> rewrite that moves as much as possible of the internals into native / >> >> compiled code? 
I'm talking about: >> >> >> >> - pandas/core/internals >> >> - indexing and assignment >> >> - much of pandas/core/common >> >> - categorical and custom dtypes >> >> - all indexing mechanisms >> >> >> >> I'm concerned we've already exposed too much internals to users, so >> >> this might lead to a lot of API breakage, but it might be for the >> >> Greater Good. As a first step, beginning a partial migration of >> >> internals into some C++ classes that encapsulate the insides of >> >> DataFrame objects and implement indexing and block-level manipulations >> >> would be a good place to start. I think you could do this wouldn't too >> >> much disruption. >> >> >> >> As part of this internal retooling we might give consideration to >> >> alternative data structures for representing data internal to pandas >> >> objects. Now in 2015/2016, continuing to be hamstrung by NumPy's >> >> limitations feels somewhat anachronistic. User code is riddled with >> >> workarounds for data type fidelity issues and the like. Like, really, >> >> why not add a bitndarray (similar to ilanschnell/bitarray) for storing >> >> nullness for problematic types and hide this from the user? =) >> >> >> >> Since we are now a NumFOCUS-sponsored project, I feel like we might >> >> consider establishing some formal governance over pandas and >> >> publishing meetings notes and roadmap documents describing plans for >> >> the project and meetings notes from committers. There's no real >> >> "committer culture" for NumFOCUS projects like there is with the >> >> Apache Software Foundation, but we might try leading by example! >> >> >> >> Also, I believe pandas as a project has reached a level of importance >> >> where we ought to consider planning and execution on larger scale >> >> undertakings such as this for safeguarding the future. >> >> >> >> As for myself, well, I have my hands full in Big Data-land. I wish I >> >> could be helping more with pandas, but there a quite a few fundamental >> >> issues (like data interoperability nested data handling and file >> >> format support ? e.g. Parquet, see >> >> >> >> >> http://wesmckinney.com/blog/the-problem-with-the-data-science-language-wars/ >> ) >> >> preventing Python from being more useful in industry analytics >> >> applications. >> >> >> >> Aside: one of the bigger mistakes I made with pandas's API design was >> >> making it acceptable to call class constructors ? like >> >> pandas.DataFrame ? directly (versus factory functions). Sorry about >> >> that! If we could convince everyone to start writing pandas.data_frame >> >> or dataframe instead of using the class reference it would help a lot >> >> with code cleanup. It's hard to plan for these things ? NumPy >> >> interoperability seemed a lot more important in 2008 than it does now, >> >> so I forgive myself. >> >> >> >> cheers and best wishes for 2016, >> >> Wes >> >> _______________________________________________ >> >> Pandas-dev mailing list >> >> Pandas-dev at python.org >> >> https://mail.python.org/mailman/listinfo/pandas-dev >> > >> > >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wesmckinn at gmail.com Tue Dec 29 15:07:26 2015 From: wesmckinn at gmail.com (Wes McKinney) Date: Tue, 29 Dec 2015 12:07:26 -0800 Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? 
In-Reply-To: References: Message-ID:

Yeah, basically creating a "libpandas" with a C API for Series and DataFrame objects (and maybe a roadmap for more interchangeable internals) is definitely what I'm talking about. We can probably move a lot of the Cython guts there, too. I think better microperformance will fall out of this naturally, but the big goal is a more maintainable and extensible core.

I'll try to find some time to hack together a CMake file that creates a libpandas suitable for static linking with a Cython extension and that links dynamically with NumPy's multiarray.so and libpythonXX. The library setup is honestly the most tedious part.

Aside: I'm working a lot on nested / Parquet-type data these days, and while this is not a "pandas problem", I want to make sure the tooling develops a reasonable C API so that interoperability between pandas and systems with different, non-NumPy-like data models will have minimal performance overhead.

- Wes

On Tue, Dec 29, 2015 at 11:59 AM, Jeff Reback wrote:
> [...]
From cpcloud at gmail.com  Tue Dec 29 15:14:06 2015
From: cpcloud at gmail.com (Phillip Cloud)
Date: Tue, 29 Dec 2015 20:14:06 +0000
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: References: Message-ID:

Maybe this is saying the same thing as Wes, but how far would something
like this get us?

// warning: things are probably not this simple

struct data_array_t {
  void *primitive;                 // scalar data
  data_array_t *nested;            // nested data
  boost::dynamic_bitset<> isnull;  // might have to create our own to avoid boost
  schema_t schema;                 // not sure exactly what this looks like
};

typedef std::map<std::string, data_array_t> data_frame_t; // probably not this simple

To answer Jeff's use-case question: I think that the use cases are 1)
freedom from numpy (mostly) and 2) no more block manager, which frees us
from the limitations of the block memory layout. In particular, the
ability to take advantage of memory-mapped IO would be a big win IMO.

On Tue, Dec 29, 2015 at 2:50 PM Wes McKinney wrote:
> I will write a more detailed response to some of these things after
> the new year, but, in particular, re: missing values, can you or
> someone tell me why creating an object that contains a NumPy array and
> a bitmap is not sufficient? If we can add a lightweight C/C++ class
> layer between NumPy function calls (e.g. arithmetic) and pandas
> function calls, then I see no reason why we cannot have
>
> Int32Array->add
>
> and
>
> Float32Array->add
>
> do the right thing (the former would be responsible for bitmasking to
> propagate NA values; the latter would defer to NumPy).
>
> [rest of quoted message snipped]
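To make the bitmap idea concrete, here is a minimal, self-contained
sketch in the spirit of Phillip's struct and Wes's Int32Array->add:
values add blindly while the validity bitmaps AND together, so an NA in
either input propagates. std::vector<bool> stands in for
boost::dynamic_bitset, and every name here is illustrative, not actual
pandas code.

#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Illustrative only -- a column is a raw values buffer plus a
// validity bitmap (true = present, false = NA).
struct Int32Column {
  std::vector<int32_t> values;
  std::vector<bool> valid;
};

// Null-aware add: values add blindly, validity bitmaps AND together,
// so an NA in either input yields NA in the output.
Int32Column add(const Int32Column& a, const Int32Column& b) {
  Int32Column out;
  out.values.resize(a.values.size());
  out.valid.resize(a.valid.size());
  for (size_t i = 0; i < a.values.size(); ++i) {
    out.values[i] = a.values[i] + b.values[i];
    out.valid[i] = a.valid[i] && b.valid[i];
  }
  return out;
}

int main() {
  Int32Column a{{1, 2, 3}, {true, false, true}};
  Int32Column b{{10, 20, 30}, {true, true, false}};
  Int32Column c = add(a, b);
  for (size_t i = 0; i < c.values.size(); ++i)
    std::cout << (c.valid[i] ? std::to_string(c.values[i]) : "NA") << "\n";
  // prints 11, NA, NA
  return 0;
}

The float case needs no bitmap at all, since NaN already propagates
through arithmetic, which is why Float32Array->add could simply defer
to NumPy.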
From wesmckinn at gmail.com  Tue Dec 29 16:02:50 2015
From: wesmckinn at gmail.com (Wes McKinney)
Date: Tue, 29 Dec 2015 13:02:50 -0800
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: References: Message-ID:

Basically the approach is

1) Base dtype type
2) Base array type with K >= 1 dimensions
3) Base scalar type
4) Base index type
5) "Wrapper" subclasses for all NumPy types fitting into categories
   #1, #2, #3, #4
6) Subclasses for pandas-specific types like category, datetimeTZ, etc.
7) NDFrame as cpcloud wrote is just a list of these

Indexes and axis labels / column names can get layered on top.

After we do all this we can look at adding nested types (arrays, maps,
structs) to better support JSON.

- Wes

On Tue, Dec 29, 2015 at 12:14 PM, Phillip Cloud wrote:
> Maybe this is saying the same thing as Wes, but how far would something
> like this get us?
>
> [rest of quoted message snipped]
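Read as a class hierarchy, those seven layers might hang together as in
the following sketch; every name is a hypothetical stand-in, not a real
or proposed pandas design.

#include <cstdint>
#include <memory>
#include <string>
#include <vector>

struct DataType { virtual ~DataType() = default; };   // 1) base dtype type
struct Int32Type final : DataType {};                 //    a NumPy-backed dtype
struct CategoryType final : DataType {};              // 6) pandas-specific dtype

struct Array {                                        // 2) base array type
  virtual ~Array() = default;
  virtual const DataType& type() const = 0;
  virtual int64_t length() const = 0;
};

// 5) a "wrapper" subclass that would hold a NumPy buffer; a plain
// vector stands in here.
struct Int32Array final : Array {
  std::vector<int32_t> data;
  Int32Type dtype;
  const DataType& type() const override { return dtype; }
  int64_t length() const override { return static_cast<int64_t>(data.size()); }
};

// 7) NDFrame is "just a list of these"; 4) the index and axis labels
// get layered on top (3, the base scalar type, is omitted for brevity).
struct NDFrame {
  std::vector<std::string> column_names;
  std::vector<std::shared_ptr<Array>> columns;
};

int main() {
  NDFrame df;
  df.column_names.push_back("a");
  df.columns.push_back(std::make_shared<Int32Array>());
  return 0;
}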
From wesmckinn at gmail.com  Tue Dec 29 16:12:52 2015
From: wesmckinn at gmail.com (Wes McKinney)
Date: Tue, 29 Dec 2015 13:12:52 -0800
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: References: Message-ID:

The other huge thing this will enable us to do is copy-on-write for
various kinds of views, which should cut down on some of the defensive
copying in the library and reduce memory usage.

On Tue, Dec 29, 2015 at 1:02 PM, Wes McKinney wrote:
> Basically the approach is
>
> [rest of quoted message snipped]
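A minimal single-threaded sketch of that copy-on-write scheme, assuming
a refcounted buffer (all names illustrative): a view only bumps a
reference count, and a write makes the one defensive copy exactly when
the buffer is still shared.

#include <cstdint>
#include <memory>
#include <vector>

// Illustrative copy-on-write column: views share one buffer; a write
// copies only if someone else still holds the buffer.
class CowColumn {
 public:
  explicit CowColumn(std::vector<int32_t> v)
      : buf_(std::make_shared<std::vector<int32_t>>(std::move(v))) {}

  // A view is just another handle on the same buffer: no data copied.
  CowColumn view() const { return *this; }

  int32_t get(size_t i) const { return (*buf_)[i]; }

  void set(size_t i, int32_t x) {
    if (buf_.use_count() > 1)  // shared with a view -> copy before writing
      buf_ = std::make_shared<std::vector<int32_t>>(*buf_);
    (*buf_)[i] = x;            // now uniquely owned; safe to mutate
  }

 private:
  std::shared_ptr<std::vector<int32_t>> buf_;
};

int main() {
  CowColumn a(std::vector<int32_t>{1, 2, 3});
  CowColumn b = a.view();  // cheap: refcount bump only
  b.set(0, 99);            // triggers the single defensive copy
  // a.get(0) == 1 still; b.get(0) == 99
  return 0;
}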
From jeffreback at gmail.com  Tue Dec 29 16:20:05 2015
From: jeffreback at gmail.com (Jeff Reback)
Date: Tue, 29 Dec 2015 16:20:05 -0500
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: References: Message-ID:

Wes, your last point is noted as well. I *think* we can actually do this
now (well, there is a PR out there).

On Tue, Dec 29, 2015 at 4:12 PM, Wes McKinney wrote:
> The other huge thing this will enable us to do is copy-on-write for
> various kinds of views, which should cut down on some of the defensive
> copying in the library and reduce memory usage.
>
> [rest of quoted message snipped]
From wesmckinn at gmail.com  Tue Dec 29 18:18:04 2015
From: wesmckinn at gmail.com (Wes McKinney)
Date: Tue, 29 Dec 2015 15:18:04 -0800
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: References: Message-ID:

Can you link to the PR you're talking about?
I will see about spending a few hours setting up a libpandas.so as a C++
shared library where we can run some experiments and validate whether it
can solve the integer-NA problem and be a place to put new data types
(categorical and friends). I'm +1 on targeting

Would it also be worth making a wish list of APIs we might consider
breaking in a pandas 1.0 release that also features this new "native
core"? Might as well right some wrongs while we're doing some invasive
work on the internals; some breakage might be unavoidable. We can always
maintain a pandas legacy 0.x.x maintenance branch (providing a conda
binary build) for legacy users where showstopper bugs can get fixed.

On Tue, Dec 29, 2015 at 1:20 PM, Jeff Reback wrote:
> Wes your last is noted as well. I *think* we can actually do this now
> (well there is a PR out there).
>
> On Tue, Dec 29, 2015 at 4:12 PM, Wes McKinney wrote:
>>
>> The other huge thing this will enable is copy-on-write for various
>> kinds of views, which should cut down on some of the defensive copying
>> in the library and reduce memory usage.
>>
>> On Tue, Dec 29, 2015 at 1:02 PM, Wes McKinney wrote:
>> > Basically the approach is
>> >
>> > 1) Base dtype type
>> > 2) Base array type with K >= 1 dimensions
>> > 3) Base scalar type
>> > 4) Base index type
>> > 5) "Wrapper" subclasses for all NumPy types fitting into categories
>> > #1, #2, #3, #4
>> > 6) Subclasses for pandas-specific types like category, datetimeTZ, etc.
>> > 7) NDFrame as cpcloud wrote is just a list of these
>> >
>> > Indexes and axis labels / column names can get layered on top.
>> >
>> > After we do all this we can look at adding nested types (arrays,
>> > maps, structs) to better support JSON.
>> >
>> > - Wes
>> >
>> > On Tue, Dec 29, 2015 at 12:14 PM, Phillip Cloud wrote:
>> >> Maybe this is saying the same thing as Wes, but how far would
>> >> something like this get us?
>> >>
>> >> // warning: things are probably not this simple
>> >>
>> >> struct data_array_t {
>> >>   void *primitive;                // scalar data
>> >>   data_array_t *nested;           // nested data
>> >>   boost::dynamic_bitset<> isnull; // might have to create our own to avoid boost
>> >>   schema_t schema;                // not sure exactly what this looks like
>> >> };
>> >>
>> >> typedef std::map<std::string, data_array_t> data_frame_t; // probably not this simple
>> >>
>> >> To answer Jeff's use-case question: I think that the use cases are
>> >> 1) freedom from numpy (mostly) 2) no more block manager, which frees
>> >> us from the limitations of the block memory layout. In particular,
>> >> the ability to take advantage of memory-mapped IO would be a big win
>> >> IMO.
>> >>
>> >> On Tue, Dec 29, 2015 at 2:50 PM Wes McKinney wrote:
>> >>>
>> >>> I will write a more detailed response to some of these things after
>> >>> the new year, but, in particular, re: missing values, can you or
>> >>> someone tell me why creating an object that contains a NumPy array
>> >>> and a bitmap is not sufficient? If we can add a lightweight C/C++
>> >>> class layer between NumPy function calls (e.g. arithmetic) and
>> >>> pandas function calls, then I see no reason why we cannot have
>> >>>
>> >>> Int32Array->add
>> >>>
>> >>> and
>> >>>
>> >>> Float32Array->add
>> >>>
>> >>> do the right thing (the former would be responsible for bitmasking
>> >>> to propagate NA values; the latter would defer to NumPy).
>> >>> If we can put all the internals of pandas objects inside a black
>> >>> box, we can add layers of virtual function indirection without a
>> >>> performance penalty (whereas in interpreted Python, each added
>> >>> abstraction layer does add up to a perf penalty).
>> >>>
>> >>> I don't think this is too scary -- I would be willing to create a
>> >>> small POC C++ library to prototype something like what I'm talking
>> >>> about.
>> >>>
>> >>> Since pandas has limited points of contact with NumPy I don't think
>> >>> this would end up being too onerous.
>> >>>
>> >>> For the record, I'm pretty allergic to "advanced C++"; I think it is
>> >>> a useful tool, and if you pick a sane 20% subset of the C++11 spec
>> >>> and follow Google C++ style it's quite accessible to intermediate
>> >>> developers. More or less "C plus OOP and easier object lifetime
>> >>> management (shared/unique_ptr, etc.)". As soon as you add a lot of
>> >>> template metaprogramming, C++ library development quickly becomes
>> >>> inaccessible except to the C++-Jedi.
>> >>>
>> >>> Maybe let's start a Google document on "pandas roadmap" where we can
>> >>> break down the 1-2 year goals and some of these infrastructure
>> >>> issues and have our discussion there? (obviously publish this
>> >>> someplace once we're done)
>> >>>
>> >>> - Wes
>> >>>
>> >>> On Fri, Dec 25, 2015 at 2:14 PM, Jeff Reback wrote:
>> >>> > Here are some of my thoughts about pandas Roadmap / status and
>> >>> > some responses to Wes's thoughts.
>> >>> >
>> >>> > In the last few (and upcoming) major releases we have made the
>> >>> > following changes:
>> >>> >
>> >>> > - dtype enhancements (Categorical, Timedelta, Datetime w/tz) &
>> >>> >   making these first class objects
>> >>> > - code refactoring to remove subclassing of ndarrays for Series &
>> >>> >   Index
>> >>> > - carving out / deprecating non-core parts of pandas
>> >>> >   - datareader
>> >>> >   - SparsePanel, WidePanel & other aliases (TimeSeries)
>> >>> >   - rpy, rplot, irow et al.
>> >>> >   - google-analytics
>> >>> > - API changes to make things more consistent
>> >>> >   - pd.rolling/expanding * -> .rolling/expanding (this is in
>> >>> >     master now)
>> >>> >   - .resample becoming a fully deferred object, like groupby
>> >>> >   - multi-index slicing along any level (obviates need for .xs)
>> >>> >     and allows assignment
>> >>> >   - .loc/.iloc - for the most part obviates use of .ix
>> >>> >   - .pipe & .assign
>> >>> >   - plotting accessors
>> >>> >   - fixing of the sorting API
>> >>> > - many performance enhancements both micro & macro (e.g. release
>> >>> >   GIL)
>> >>> >
>> >>> > Some on-deck enhancements are (meaning these are basically ready
>> >>> > to go in):
>> >>> > - IntervalIndex (and eventually make PeriodIndex just a sub-class
>> >>> >   of this)
>> >>> > - RangeIndex
>> >>> >
>> >>> > so lots of changes, though nothing really earth shaking, just more
>> >>> > convenience, reducing magicness somewhat and providing flexibility.
>> >>> >
>> >>> > Of course we are getting increasing issues, mostly bug reports
>> >>> > (and lots of dupes), some edge case enhancements which can add to
>> >>> > the existing API's and of course, requests to expand the (already)
>> >>> > large code to other usecases. Balancing this are a good many
>> >>> > pull-requests from many different users, some even deep into the
>> >>> > internals.
>> >>> > Here are some things that I have talked about and could be
>> >>> > considered for the roadmap. Disclaimer: I do work for Continuum
>> >>> > but these views are of course my own; furthermore obviously I am
>> >>> > a bit more familiar with some of the 'sponsored' open-source
>> >>> > libraries, but always open to new things.
>> >>> >
>> >>> > - integration / automatic deferral to numba for JIT (this would
>> >>> >   be thru .apply)
>> >>> > - automatic deferral to dask from groupby where appropriate /
>> >>> >   maybe a .to_parallel (to simply return a dask.DataFrame object)
>> >>> > - incorporation of quantities / units (as part of the dtype)
>> >>> > - use of DyND to allow missing values for int dtypes
>> >>> > - make Period a first class dtype.
>> >>> > - provide some copy-on-write semantics to alleviate the
>> >>> >   chained-indexing issues which occasionally come up with the
>> >>> >   mis-use of the indexing API
>> >>> > - allow a 'policy' to automatically provide column blocks for
>> >>> >   dict-like input (e.g. each column would be a block); this would
>> >>> >   allow a pass-thru API where you could put in numpy arrays where
>> >>> >   you have views and have them preserved rather than copied
>> >>> >   automatically. Note that this would also allow what I call
>> >>> >   'split', where a passed-in multi-dim numpy array could be split
>> >>> >   up into individual blocks (which actually gives a nice perf
>> >>> >   boost after the splitting costs).
>> >>> >
>> >>> > In working towards some of these goals, I have come to the
>> >>> > opinion that it would make sense to have a neutral API protocol
>> >>> > layer that would allow us to swap out different engines as
>> >>> > needed, for particular dtypes, or *maybe* out-of-core type
>> >>> > computations. E.g. imagine that we replaced the in-memory block
>> >>> > structure with a bcolz / memmap type; in theory this should be
>> >>> > 'easy' and just work. I could also see us adopting *some* of the
>> >>> > SFrame code to allow easier interop with this API layer.
>> >>> >
>> >>> > In practice, I think a nice API layer would need to be created to
>> >>> > make this clean / nice.
>> >>> >
>> >>> > So this comes around to Wes's point about creating a c++ library
>> >>> > for the internals (and possibly even some of the indexing
>> >>> > routines). In an ideal world, of course this would be desirable.
>> >>> > Getting there is a bit non-trivial I think, and IMHO might not be
>> >>> > worth the effort. I don't really see big performance bottlenecks.
>> >>> > We *already* defer much of the computation to libraries like
>> >>> > numexpr & bottleneck (where appropriate). Adding numba / dask to
>> >>> > the list would be helpful.
>> >>> >
>> >>> > I think that almost all performance issues are the result of:
>> >>> >
>> >>> > a) gross misuse of the pandas API. How much code have you seen
>> >>> >    that does df.apply(lambda x: x.sum())
>> >>> > b) routines which operate column-by-column rather than
>> >>> >    block-by-block and are in python space (e.g. we have an issue
>> >>> >    right now about .quantile)
>> >>> >
>> >>> > So I am glossing over a big goal of having a c++ library that
>> >>> > represents the pandas internals.
>> >>> > This would by definition have a C API, so you *could* use
>> >>> > pandas-like semantics in c/c++ and just have it work (and then
>> >>> > pandas would be a thin wrapper around this library).
>> >>> >
>> >>> > I am not averse to this, but I think it would be quite a big
>> >>> > effort, and not a huge perf boost IMHO. Further there are a
>> >>> > number of API issues w.r.t. indexing which need to be clarified /
>> >>> > worked out (e.g. should we simply deprecate []) that are much
>> >>> > easier to test / figure out in python space.
>> >>> >
>> >>> > I also think that we have quite a large number of contributors.
>> >>> > Moving to c++ might make the internals a bit more impenetrable
>> >>> > than the current internals (though this would allow c++ people to
>> >>> > contribute, so that might balance out).
>> >>> >
>> >>> > We have a limited core of devs who right now are familiar with
>> >>> > things. If someone happened to have a starting base for a c++
>> >>> > library, then I might change opinions here.
>> >>> >
>> >>> > my 4c.
>> >>> >
>> >>> > Jeff
>> >>> >
>> >>> > On Thu, Dec 24, 2015 at 7:18 PM, Wes McKinney wrote:
>> >>> >> [...]
>> >>> _______________________________________________
>> >>> Pandas-dev mailing list
>> >>> Pandas-dev at python.org
>> >>> https://mail.python.org/mailman/listinfo/pandas-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
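For concreteness, a minimal sketch of the NA-bitmask idea discussed above.
Everything here is a hypothetical illustration, not actual pandas or
libpandas API: it assumes equal-length inputs and uses a plain bool vector
where a real implementation would use a packed bitmap.

#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch only -- not real pandas API.
struct Int32Array {
    std::vector<int32_t> values;
    std::vector<bool> valid;  // true where the value is not NA

    // NA-aware add: an output slot is valid only if both input slots are
    // valid; valid slots use ordinary integer addition (the Float32Array
    // analogue would instead defer to NumPy, per the message above).
    Int32Array add(const Int32Array& other) const {
        Int32Array out;
        out.values.resize(values.size());
        out.valid.resize(values.size());
        for (std::size_t i = 0; i < values.size(); ++i) {
            out.valid[i] = valid[i] && other.valid[i];
            out.values[i] = out.valid[i] ? values[i] + other.values[i] : 0;
        }
        return out;
    }
};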
From wesmckinn at gmail.com Tue Dec 29 18:25:54 2015
From: wesmckinn at gmail.com (Wes McKinney)
Date: Tue, 29 Dec 2015 15:25:54 -0800
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: References: Message-ID:

Hit send by accident. I meant to say targeting pandas/core/internals.py
with the initial explorations.

On Tuesday, December 29, 2015, Wes McKinney wrote:
> Can you link to the PR you're talking about?
>
> [...]
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
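To make the 1)-7) layering Wes outlines upthread concrete, a minimal
sketch of the base types and wrappers it implies. All names are
hypothetical, illustration only:

#include <cstdint>
#include <memory>
#include <string>
#include <vector>

struct DataType {                      // 1) base dtype type
    virtual ~DataType() = default;
    virtual std::string name() const = 0;
};

struct Int64Type : DataType {          // 5) wrapper for a NumPy dtype
    std::string name() const override { return "int64"; }
};

struct CategoryType : DataType {       // 6) pandas-specific dtype
    std::string name() const override { return "category"; }
};

struct Array {                         // 2) base array type (K >= 1 dims)
    virtual ~Array() = default;
    virtual std::shared_ptr<DataType> type() const = 0;
    virtual std::int64_t length() const = 0;
};

struct Scalar { virtual ~Scalar() = default; };  // 3) base scalar type
struct Index  { virtual ~Index()  = default; };  // 4) base index type

// 7) an NDFrame is then "just a list of these", with axis labels /
// column names layered on top.
struct NDFrame {
    std::vector<std::string> column_names;
    std::vector<std::shared_ptr<Array>> columns;
};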
From jeffreback at gmail.com Tue Dec 29 18:25:31 2015
From: jeffreback at gmail.com (Jeff Reback)
Date: Tue, 29 Dec 2015 18:25:31 -0500
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: References: Message-ID:

https://github.com/pydata/pandas/pull/11500. I annotated in the shared
google doc as well. There is a section on some pandas 1.0 things to do.

On Tue, Dec 29, 2015 at 6:18 PM, Wes McKinney wrote:
> Can you link to the PR you're talking about?
>
> [...]
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
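A minimal sketch of the copy-on-write view semantics mentioned upthread
(Wes's point about cutting down on defensive copying). Hypothetical names,
and it assumes a single-threaded setting:

#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

class CowArray {
  public:
    explicit CowArray(std::vector<int64_t> data)
        : buffer_(std::make_shared<std::vector<int64_t>>(std::move(data))) {}

    // Taking a view is O(1): the new view just shares the buffer.
    CowArray view() const { return *this; }

    int64_t get(std::size_t i) const { return (*buffer_)[i]; }

    // Writing copies the buffer first if any other view still shares it,
    // so views never observe each other's mutations.
    void set(std::size_t i, int64_t v) {
        if (buffer_.use_count() > 1)
            buffer_ = std::make_shared<std::vector<int64_t>>(*buffer_);
        (*buffer_)[i] = v;
    }

  private:
    std::shared_ptr<std::vector<int64_t>> buffer_;
};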
From izaid at continuum.io Tue Dec 29 18:31:59 2015
From: izaid at continuum.io (Irwin Zaid)
Date: Tue, 29 Dec 2015 17:31:59 -0600
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: References: Message-ID:

Hi Wes (and others),

I've been following this conversation with interest. I do think it would
be worth exploring DyND, rather than setting up yet another rewrite of
NumPy functionality. Especially because DyND is already an optional
dependency of Pandas.

For things like Integer NA and new dtypes, DyND is there and ready to do
this.

Irwin

On Tue, Dec 29, 2015 at 5:18 PM, Wes McKinney wrote:
> Can you link to the PR you're talking about?
>
> [...]
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From wesmckinn at gmail.com Tue Dec 29 19:01:33 2015
From: wesmckinn at gmail.com (Wes McKinney)
Date: Tue, 29 Dec 2015 16:01:33 -0800
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: References: Message-ID:

I'm not suggesting a rewrite of NumPy functionality but rather pandas
functionality that is currently written in a mishmash of Cython and
Python. Happy to experiment with changing the internal compute
infrastructure and data representation to DyND after this first stage of
cleanup is done. Even if we use DyND a pretty extensive pandas wrapper
layer will be necessary.

On Tuesday, December 29, 2015, Irwin Zaid wrote:
> Hi Wes (and others),
>
> [...]
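A rough sketch of the "neutral API protocol layer" Jeff describes and the
wrapper layer Wes mentions here: pandas-level code programs against a
small abstract interface, and a NumPy-, DyND-, or bcolz-backed engine
plugs in behind it. All names are hypothetical illustrations:

#include <cstdint>
#include <string>

struct ArrayBackend {
    virtual ~ArrayBackend() = default;
    virtual std::string engine_name() const = 0;
    virtual std::int64_t length() const = 0;
    virtual bool is_null(std::int64_t i) const = 0;
};

// One toy engine; a DyND-backed engine would implement the same interface.
struct InMemoryBackend : ArrayBackend {
    explicit InMemoryBackend(std::int64_t n) : n_(n) {}
    std::string engine_name() const override { return "in-memory"; }
    std::int64_t length() const override { return n_; }
    bool is_null(std::int64_t i) const override { (void)i; return false; }
  private:
    std::int64_t n_;
};

// Wrapper-layer code like this never sees which engine is underneath, so
// engines can be swapped per dtype or for out-of-core use.
std::int64_t count_valid(const ArrayBackend& a) {
    std::int64_t n = 0;
    for (std::int64_t i = 0; i < a.length(); ++i)
        if (!a.is_null(i)) ++n;
    return n;
}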
>> > >> > On Tue, Dec 29, 2015 at 4:12 PM, Wes McKinney >> wrote: >> >> >> >> The other huge thing this will enable is to do is copy-on-write for >> >> various kinds of views, which should cut down on some of the defensive >> >> copying in the library and reduce memory usage. >> >> >> >> On Tue, Dec 29, 2015 at 1:02 PM, Wes McKinney >> wrote: >> >> > Basically the approach is >> >> > >> >> > 1) Base dtype type >> >> > 2) Base array type with K >= 1 dimensions >> >> > 3) Base scalar type >> >> > 4) Base index type >> >> > 5) "Wrapper" subclasses for all NumPy types fitting into categories >> >> > #1, #2, #3, #4 >> >> > 6) Subclasses for pandas-specific types like category, datetimeTZ, >> etc. >> >> > 7) NDFrame as cpcloud wrote is just a list of these >> >> > >> >> > Indexes and axis labels / column names can get layered on top. >> >> > >> >> > After we do all this we can look at adding nested types (arrays, >> maps, >> >> > structs) to better support JSON. >> >> > >> >> > - Wes >> >> > >> >> > On Tue, Dec 29, 2015 at 12:14 PM, Phillip Cloud >> >> > wrote: >> >> >> Maybe this is saying the same thing as Wes, but how far would >> something >> >> >> like >> >> >> this get us? >> >> >> >> >> >> // warning: things are probably not this simple >> >> >> >> >> >> struct data_array_t { >> >> >> void *primitive; // scalar data >> >> >> data_array_t *nested; // nested data >> >> >> boost::dynamic_bitset isnull; // might have to create our own >> to >> >> >> avoid >> >> >> boost >> >> >> schema_t schema; // not sure exactly what this looks like >> >> >> }; >> >> >> >> >> >> typedef std::map data_frame_t; // probably >> not >> >> >> this >> >> >> simple >> >> >> >> >> >> To answer Jeff?s use-case question: I think that the use cases are >> 1) >> >> >> freedom from numpy (mostly) 2) no more block manager which frees us >> >> >> from the >> >> >> limitations of the block memory layout. In particular, the ability >> to >> >> >> take >> >> >> advantage of memory mapped IO would be a big win IMO. >> >> >> >> >> >> >> >> >> On Tue, Dec 29, 2015 at 2:50 PM Wes McKinney >> >> >> wrote: >> >> >>> >> >> >>> I will write a more detailed response to some of these things after >> >> >>> the new year, but, in particular, re: missing values, can you or >> >> >>> someone tell me why creating an object that contains a NumPy array >> and >> >> >>> a bitmap is not sufficient? If we we can add a lightweight C/C++ >> class >> >> >>> layer between NumPy function calls (e.g. arithmetic) and pandas >> >> >>> function calls, then I see no reason why we cannot have >> >> >>> >> >> >>> Int32Array->add >> >> >>> >> >> >>> and >> >> >>> >> >> >>> Float32Array->add >> >> >>> >> >> >>> do the right thing (the former would be responsible for bitmasking >> to >> >> >>> propagate NA values; the latter would defer to NumPy). If we can >> put >> >> >>> all the internals of pandas objects inside a black box, we can add >> >> >>> layers of virtual function indirection without a performance >> penalty >> >> >>> (e.g. adding more interpreter overhead with more abstraction layers >> >> >>> does add up to a perf penalty). >> >> >>> >> >> >>> I don't think this is too scary -- I would be willing to create a >> >> >>> small POC C++ library to prototype something like what I'm talking >> >> >>> about. >> >> >>> >> >> >>> Since pandas has limited points of contact with NumPy I don't think >> >> >>> this would end up being too onerous. 
>> >> >>>
>> >> >>> For the record, I'm pretty allergic to "advanced C++"; I think
>> >> >>> it is a useful tool: if you pick a sane 20% subset of the C++11
>> >> >>> spec and follow Google C++ style, it's not inaccessible to
>> >> >>> intermediate developers. More or less "C plus OOP and easier
>> >> >>> object lifetime management (shared/unique_ptr, etc.)". As soon
>> >> >>> as you add a lot of template metaprogramming, C++ library
>> >> >>> development quickly becomes inaccessible except to the
>> >> >>> C++ Jedi.
>> >> >>>
>> >> >>> Maybe let's start a Google document on "pandas roadmap" where
>> >> >>> we can break down the 1-2 year goals and some of these
>> >> >>> infrastructure issues and have our discussion there? (obviously
>> >> >>> we'd publish this someplace once we're done)
>> >> >>>
>> >> >>> - Wes
>> >> >>>
>> >> >>> On Fri, Dec 25, 2015 at 2:14 PM, Jeff Reback wrote:
>> >> >>> > Here are some of my thoughts about pandas Roadmap / status
>> >> >>> > and some responses to Wes's thoughts.
>> >> >>> > [...]
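To make the bitmask idea quoted above concrete, here is a minimal C++
sketch of NA propagation through a wrapped add. The Int32Array name
comes from the discussion, but the field layout, the vector<bool>
validity mask, and the add signature are illustrative assumptions
only, not actual pandas or libpandas code (a real implementation would
wrap a NumPy buffer and use a packed bitmap):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Illustrative only: a value buffer plus a per-element validity
    // mask, where valid[i] == false means "this entry is NA".
    struct Int32Array {
      std::vector<int32_t> values;
      std::vector<bool> valid;

      // Element-wise add, assuming equal lengths. The result is NA
      // wherever either input is NA; values at invalid slots are
      // unspecified garbage, hidden behind the mask.
      Int32Array add(const Int32Array& other) const {
        Int32Array out;
        out.values.resize(values.size());
        out.valid.resize(values.size());
        for (std::size_t i = 0; i < values.size(); ++i) {
          out.valid[i] = valid[i] && other.valid[i];
          out.values[i] = values[i] + other.values[i];
        }
        return out;
      }
    };

The floating-point counterpart (Float32Array->add in the quoted text)
could skip the mask entirely and defer to the NumPy kernel, since NaN
already propagates through floating-point arithmetic.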
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From izaid at continuum.io  Tue Dec 29 19:17:25 2015
From: izaid at continuum.io (Irwin Zaid)
Date: Tue, 29 Dec 2015 18:17:25 -0600
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: 
References: 
Message-ID: 

Yeah, that seems reasonable, and I totally agree a pandas wrapper layer
would be necessary.

I'll keep an eye on this and I'd like to help if I can.
Irwin

On Tue, Dec 29, 2015 at 6:01 PM, Wes McKinney wrote:
> I'm not suggesting a rewrite of NumPy functionality but rather of the
> pandas functionality that is currently written in a mishmash of Cython
> and Python.
> [...]
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From wesmckinn at gmail.com  Wed Dec 30 21:04:01 2015
From: wesmckinn at gmail.com (Wes McKinney)
Date: Wed, 30 Dec 2015 18:04:01 -0800
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To: 
References: 
Message-ID: 

I cobbled together an ugly start of a c++->cython->pandas toolchain
here:

https://github.com/wesm/pandas/tree/libpandas-native-core

I used a mix of Kudu, Impala, and dynd-python cmake sources, so it's a
bit messy at the moment, but it should be sufficient to run some real
experiments with a little more work. I reckon it's about a 6-month
project to tear out the insides of Series and DataFrame and replace
them with a new "native core", but we should be able to get enough info
to see whether it's a viable plan within a month or so.

The end goal is to create "private" extension types in Cython that can
be the new base classes for Series and NDFrame; these will hold a
reference to a C++ object that contains wrapped NumPy arrays and other
metadata (like pandas-only dtypes).
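A rough sketch of the shape such a C++ core object could take follows;
SeriesCore and Array are hypothetical names for illustration, not the
actual libpandas design:

    #include <memory>
    #include <utility>

    // Stand-in for a wrapped NumPy array plus dtype / null-mask
    // metadata; purely illustrative.
    struct Array {
      void* buffer = nullptr;  // borrowed pointer into the ndarray data
      // ... dtype tag, length, optional validity bitmap, etc.
    };

    // The C++ object a "private" Cython extension type would hold a
    // reference to; Series/NDFrame subclasses in Python would delegate
    // indexing and block-level manipulation to it.
    class SeriesCore {
     public:
      SeriesCore(std::shared_ptr<Array> data,
                 std::shared_ptr<Array> index)
          : data_(std::move(data)), index_(std::move(index)) {}

      const Array& data() const { return *data_; }
      const Array& index() const { return *index_; }

     private:
      std::shared_ptr<Array> data_;   // column values
      std::shared_ptr<Array> index_;  // axis labels / row index
    };

One design note: reference-counted ownership like this is one way the
copy-on-write semantics mentioned earlier in the thread could be
implemented cheaply, copying an Array only when it is shared and about
to be mutated.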
It might be too hard to try to replace a single usage of the block
manager as a first experiment, so I'll try to create a minimal
"SeriesLite" that supports 3 dtypes:

1) float64 with NaNs
2) int64 with a bitmask for NAs
3) category type for one of these

I just want to get a feel for the extensibility, and to offer an NA
singleton Python object (a la None) for getting and setting NAs across
these 3 dtypes (see the sketch following this message).

If we end up going down this route, is there any way to place a
moratorium on invasive work on pandas internals (outside bug fixes)?

Pedantic aside: I'd rather avoid shipping third-party C/C++ libraries
like googletest and friends in pandas if we can. Cloudera folks have
been working on a portable C++ library toolchain for Impala and other
projects at https://github.com/cloudera/native-toolchain, but it is
only being tested on Linux and OS X. Most Google libraries should build
out of the box on MSVC, but it'll be something to keep an eye on.

BTW, thanks to the libdynd developers for pioneering the C++ lib <->
Python-C++ lib <-> Cython toolchain; being able to build Cython
extensions directly from cmake is a godsend.

HNY all
Wes

On Tue, Dec 29, 2015 at 4:17 PM, Irwin Zaid wrote:
> Yeah, that seems reasonable and I totally agree a pandas wrapper layer
> would be necessary.
>
> I'll keep an eye on this and I'd like to help if I can.
>
> Irwin
> [...]
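Returning to the "SeriesLite" idea above, here is a minimal sketch of
how one NA convention could span the two numeric dtypes. All names are
illustrative assumptions, and std::optional (C++17) merely stands in
for the proposed Python-level NA singleton:

    #include <cmath>
    #include <cstddef>
    #include <cstdint>
    #include <optional>
    #include <utility>
    #include <vector>

    // float64 flavor: NaN itself encodes NA, as in pandas today.
    class Float64Column {
     public:
      explicit Float64Column(std::vector<double> v)
          : values_(std::move(v)) {}
      std::optional<double> get(std::size_t i) const {
        if (std::isnan(values_[i])) return std::nullopt;  // NA
        return values_[i];
      }
      void set_na(std::size_t i) { values_[i] = std::nan(""); }
     private:
      std::vector<double> values_;
    };

    // int64 flavor: a separate validity mask encodes NA, since there
    // is no NaN for integers.
    class Int64Column {
     public:
      explicit Int64Column(std::vector<int64_t> v)
          : values_(std::move(v)), valid_(values_.size(), true) {}
      std::optional<int64_t> get(std::size_t i) const {
        if (!valid_[i]) return std::nullopt;  // NA
        return values_[i];
      }
      void set_na(std::size_t i) { valid_[i] = false; }
     private:
      std::vector<int64_t> values_;
      std::vector<bool> valid_;
    };

The point of the sketch is the uniform interface: callers use get and
set_na the same way for both columns, while each dtype hides its own
storage convention for nullness, which is exactly the role the NA
singleton would play at the Python level.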