From jeffreback at gmail.com Tue Mar 8 12:48:56 2016 From: jeffreback at gmail.com (Jeff Reback) Date: Tue, 8 Mar 2016 12:48:56 -0500 Subject: [Pandas-dev] GSOC 2016 Message-ID: Ok, so we are setup with NUMFocus for 2016 to participate in GSOC 2016. If you'd like to be a mentor, pls e-mail me & raniere at rgaiacs.com (from NUMFocus) and he will send you an invite. myself and Stephan are already signed up. thanks Jeff -------------- next part -------------- An HTML attachment was scrubbed... URL: From wesmckinn at gmail.com Tue Mar 8 17:48:43 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Tue, 8 Mar 2016 14:48:43 -0800 Subject: [Pandas-dev] On bug-fix releases and maintenance branches In-Reply-To: References: Message-ID: hey Jeff, On Tue, Feb 23, 2016 at 12:11 PM, Jeff Reback wrote: > Thanks for bringing this up joris, here are some thoughts. > > 1) I agree that the next releases should probably focus on bug fixes. So > this might mean > we should shoot for 0.18.2....3 etc. > > However, we do need a 0.19.0 in order to provide any big deprecations > (Panel) and API changes that > are needed. > > 2) I am a bit hesitant to even make a big break (1.0) because I have seen > this just bifurcating people (e.g. do I upgrade now, what if I want > compat). This just creates less community. So I think this should be a goal, > that even though its called 1.0 it is as back-compat as possible. > Yeah, with more significant internal refactoring the goal would be to not break API compatibility unless absolutely necessary. However, fixing such horror shows as this In [2]: import pandas as pd In [3]: s = pd.Series([1,2,3]) In [4]: s Out[4]: 0 1 1 2 2 3 dtype: int64 In [5]: import numpy as np In [6]: s[1] = np.nan In [7]: s Out[7]: 0 1 1 NaN 2 3 dtype: float64 should be fair game. > 3) Releases can be big, and do fix lots of bugs, and usually introduce new > ones. 
This is almost inevitable as we add new features, changes, and even > bug fixes which occasionally have regressions (though the test suite is pretty > good, so hopefully not too often). > > 4) I don't relish backporting things. I think this could lead to lots of > headaches and IMHO doesn't really buy much. > I think what we are talking about is backporting bug fixes for major brokenness (e.g. serious correctness issues) or regressions that aren't caught by major release time. I think what's been happening in practice is that people are creating their own patched bugfix versions of releases to avoid the pain induced by API-breakage in major releases. Obviously we should continue to innovate and clean up the API (with judicious breakage where absolutely necessary -- I think the resampling cleanup is a good example where the net benefit in the long run will be high), but we have to take care of the user base, many of whom depend on pandas in production applications. This is all made more difficult because there isn't any direct cash flow funding pandas development AFAICT. Where I work, for example, we have many employees who are responsible for creating patched builds and handling backports for otherwise API-stable branches of major Apache open source projects. But we can afford to do this because customers are paying for this (priority support and backports / patched builds). So what I would suggest, in lieu of financial support for backports and maintenance builds, is that we consider maint-0.XX.X branches for backporting only the most serious of serious bug fixes ("Bad Bugs"). Major regressions and correctness issues should go into this bucket. Perhaps we can start doing this with 0.18.x -- as a matter of process if any PR appears to fix a Bad Bug it should be brought up here on the mailing list so we can decide whether it should be backported. > 5) We don't want to just go into maintenance mode because we still have a > fair amount of feature requests.
(though these are often pretty targeted), > but off of the top of my head, nothing really *new*, mainly some API changes > to bring consistency. E.g. ``.agg`` on a DataFrame is a long-requested > feature, which actually after 0.18.0 is quite trivial to do. > Yeah, I think we should try to stick with https://en.wikipedia.org/wiki/Open/closed_principle -- so conveniences, extensions to existing APIs, and other helpful new features are fair game, but breaking API changes should be avoided. > 6) I think we telegraph any API changes and really really try to have > back-compat, so people do have the ability to upgrade at their leisure. > >> API changes are most painful for users who do not write tests for >> their code that depends on pandas. That problem is probably not >> fixable =) > > > of course this is a telling point. pandas upgrades often expose bugs in user > code. I view this as a good thing! > > So given all of the somewhat contradictory points above, what do I really > think we should do? > > In order for pandas to be (even more) of a force in leading the scientific > community, I think we have to grow. So having more contributors is a great > thing. People do like / appreciate fixing bugs, but even more appreciated (IMHO) are > performance enhancements and *some* new features. > > I will probably try to do more bug-fixing (rather than large API-ish > fixes) I think. There is quite a back-log. This should *slow* the issue of > the BIG API changes. > > So I am kind of -1 on backports, mostly because of 2), it seems to just slow things > down, and 4) it can often lead to MORE things being inconsistent (you need > machinery to ensure that what is backported is correct and is included). I > can easily foresee that we decide to create 'stable' branches, which in fact > are stable but might have inconsistent fixes; this is even more confusing in > my view. > Let me know what you think about my Bad Bug = backport policy.
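As an editorial aside on the int-to-float "horror show" Wes demonstrates earlier in this message: the upcast happens because NumPy-backed integer arrays have no missing-value sentinel. The nullable ``Int64`` extension dtype that pandas grew much later (0.24+, long after this thread) is one way around it -- a sketch for reference, not something proposed in the thread:

```python
import numpy as np
import pandas as pd

# A NumPy-backed int64 Series has no NA sentinel, so introducing a missing
# value silently upcasts the whole Series to float64.
s = pd.Series([1, np.nan, 3])
assert s.dtype == np.float64

# The nullable integer extension dtype (pandas 0.24+, added well after this
# discussion) stores the NA mask separately and keeps the integer dtype.
t = pd.Series([1, None, 3], dtype="Int64")
assert str(t.dtype) == "Int64"
assert bool(t.isna()[1])
```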
This is mostly about communication and keeping track of serious issues that should necessitate upgrading. I also think we should try to keep minor releases API stable from here on out; this may result in our version numbers increasing more quickly, but that's OK for the improved communication about "what is a minor release" (major release plus bug fix backports). - Wes > I think we have a fairly aggressive release cycle. We for sure don't want to > debate everything. I am of the opinion that it is much better to put things > out there quicker, than to endlessly debate extremely minor points (not > naming project names here :). > > For the general user what we do w.r.t. release cycles probably doesn't > matter, and for the corporate user, they almost always have a 'fixed' > version anyhow (and then they do of course port to the new ones, but then they > have people upgrade it carefully). I am not so sure we should impose > structure on this. We already have announced major releases and minor > releases. > > All for better 'language' in the minor releases. > > Jeff > > > On Tue, Feb 23, 2016 at 2:21 PM, Wes McKinney wrote: >> >> hi Joris, >> >> I'm sorry it's taken a couple weeks to write a reply -- been really >> busy and wanted to put some thought into this. >> >> This is a really important discussion given how important pandas has >> become to so many people, thank you for bringing it up. >> >> On Tue, Feb 9, 2016 at 4:59 PM, Joris Van den Bossche >> wrote: >> > Hi all, >> > >> > I wanted to stir some discussion on pandas' policy on bug-fix releases >> > and >> > upgrading pains. First some context: >> > >> > Context part 1: Currently we do not use maintenance branches for bugfix >> > releases, and we actually also do not really do bugfix releases. We just >> > develop further on master, and try to not merge breaking changes the >> > first >> > weeks/months, so we can do a minor kind of bug-fix release (but usually >> > also >> > with a lot of new features).
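The release vocabulary being negotiated above (an API-stable 0.X series, with 0.X.Y releases carrying only backported bug fixes) can be sketched as a tiny helper. The function name and scheme here are purely illustrative, not an actual pandas API:

```python
def same_api_series(v1: str, v2: str) -> bool:
    """True if two 0.X.Y version strings share the same 0.X API series,
    i.e. moving between them should only involve backported bug fixes."""
    first_two = lambda v: tuple(int(p) for p in v.split(".")[:2])
    return first_two(v1) == first_two(v2)

assert same_api_series("0.18.0", "0.18.2")      # bug-fix upgrade only
assert not same_api_series("0.18.2", "0.19.0")  # new API series: read the whatsnew
```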
>> > But we don't, for example, backport fixes of regressions if they are >> > fixed >> > after master is pointing to the next major release. >> >> I think in general it would be a good idea to tilt development away >> from new feature development and toward bug fixes and stability. Given >> that we are contemplating making some breaking changes in a 1.x >> development branch (like removing the Panel classes), we should decide >> as some point to create a 0.X.Y maintenance line where we can backport >> bug fixes only, so that "legacy pandas" users can have a "LTS" (in >> Ubuntu parlance) maintenance branch. This introduces some development >> overhead but it seems worth it. >> >> > >> > Context part 2: pandas is not yet that stable, in the sense that there >> > are >> > still quite some breaking changes in each release. I am not arguing for >> > not >> > doing these breaking changes, as some of these changes are really needed >> > to >> > clean up the API (although there are also arguments for that, but I >> > think >> > that is another discussion). This has the consequence that updating your >> > pandas version is not always that pleasant. >> >> Over the years I've heard many horror stories from companies who have >> created and maintained internal 0.7.x, 0.8.x, or 0.9.x pandas forks >> because of the API breakage issues. This is definitely an anti-pattern >> that we should try to avoid happening in the future, but API breakages >> in many cases are the inevitable price of progress. >> >> Some of the API breakage has resulted from experiences accumulated >> over a long period of time -- I made a lot of decisions early on in >> the project that ended up not being the right ones (e.g. resample >> default arguments changed at one point). There wasn't enough community >> engagement at that point to have a thorough design process to >> potentially come up with the "right" design first. In other cases, the >> "right" choice was perhaps more ambiguous. 
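The resample redesign referenced above (shipped in 0.18.0) replaced the old eager ``how=`` keyword with a deferred, groupby-style object. A minimal illustration of the new shape:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(6),
              index=pd.date_range("2016-01-01", periods=6, freq="D"))

# Pre-0.18.0 style was s.resample("2D", how="sum"); since 0.18.0, .resample()
# returns a deferred Resampler and the aggregation is a method call on it.
out = s.resample("2D").sum()
assert list(out) == [1, 5, 9]
```

The deferred object also supports multiple aggregations at once, which is what makes it "groupby-like".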
>> >> API changes are most painful for users who do not write tests for >> their code that depends on pandas. That problem is probably not >> fixable =) >> >> I think having stable releases with backports of serious correctness >> bugs helps mitigate this problem, whereas modest API changes between >> major releases. I would also be in favor of having point releases only >> contain bug fixes rather than the current system of point releases >> being a stable snapshot of trunk. >> >> Since Jeff is the most affected by this on a day to day basis as de >> facto steward of the PR queue I would be curious what process he feels >> would be the most helpful. >> >> - Wes >> >> > >> > Sidenote: I have not that much experience with using pandas in a larger >> > company or in larger codebases that need to be upgraded, rather with >> > just my >> > own code for my PhD. So it is difficult for me to judge on how much this >> > is >> > a problem or if bug-fx releases would help. >> > >> > Questions: >> > >> > What are other people's experiences with upgrading pandas? And would >> > more >> > bug-fix releases actually ease the upgrading? >> > Do we want to do more bug-fix releases? >> > Having a maintenance branch and backporting fixes is extra work. Would >> > we be >> > able to handle this? Would it be worth the effort? >> > >> > (It has been mentioned before, but I think the main point raised was >> > lack of >> > manpower to maintain separate branches) >> > >> > To put it another way. In our whatsnew notice there is "We recommend >> > that >> > all users upgrade to this version", but I am actually not sure we should >> > recommend that. I personally do not always recommend that no matter what >> > without careful consideration. 
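On the point above that API changes bite hardest for users who do not test their own code: a downstream project can pin the handful of pandas behaviors it depends on with a few cheap assertions run in CI. This is a generic sketch of that practice, not a pandas-provided tool:

```python
import pandas as pd

def test_pandas_contract():
    """Pin the pandas behaviors this (hypothetical) codebase relies on,
    so an upgrade that changes them fails loudly in CI instead of silently
    corrupting results."""
    df = pd.DataFrame({"k": ["a", "a", "b"], "v": [1, 2, 3]})
    grouped = df.groupby("k")["v"].sum()
    assert grouped["a"] == 3 and grouped["b"] == 3
    # Label-based scalar lookup still works the way we expect.
    assert df.loc[0, "v"] == 1

test_pandas_contract()
```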
>> > >> > Regards, >> > Joris >> > >> > _______________________________________________ >> > Pandas-dev mailing list >> > Pandas-dev at python.org >> > https://mail.python.org/mailman/listinfo/pandas-dev >> > >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev > > From jorisvandenbossche at gmail.com Tue Mar 8 20:23:07 2016 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Wed, 9 Mar 2016 02:23:07 +0100 Subject: [Pandas-dev] On bug-fix releases and maintenance branches In-Reply-To: References: Message-ID: 2016-03-08 23:48 GMT+01:00 Wes McKinney : > hey Jeff, > > On Tue, Feb 23, 2016 at 12:11 PM, Jeff Reback > wrote: > > Thanks for bringing this up joris, here are some thoughts. > > > > 1) I agree that the next releases should probably focus on bug fixes. So > > this might mean > > we should shoot for 0.18.2....3 etc. > > > > However, we do need a 0.19.0 in order to provide any big deprecations > > (Panel) and API changes that > > are needed. > > > > 2) I am a bit hesitant to even make a big break (1.0) because I have seen > > this just bifurcating people (e.g. do I upgrade now, what if I want > > compat). This just creates less community. So I think this should be a > goal, > > that even though its called 1.0 it is as back-compat as possible. > > > > Yeah, with more significant internal refactoring the goal would be to > not break API compatibility unless absolutely necessary. > Jeff, to be clear, my initial mail was not to discuss whether to do a major breaking release or not, or whether to go into general maintenance mode or not (that's certainly an interesting discussion, but another one I think). The fact is that we are still cleaning things up and so introduce breakages in 0.X releases (like resample now), and that won't stop right away.
But given that context, we can think about how to do 0.X.X releases that help users as much as possible to upgrade smoothly. We now put quite a lot into micro 0.X.X bug-fix releases (including new features), which can have the consequence of introducing new bugs. > > 3) Releases can be big, and do fix lots of bugs, and usually introduce > new > > ones. This is almost inevitable as we add new features, changes, and even > > bug fixes which occasionally have regressions (though the test suite is > pretty > > good, so hopefully not too often). > > > > 4) I don't relish backporting things. I think this could lead to lots of > > headaches and IMHO doesn't really buy much. > > > > I think what we are talking about is backporting bug fixes for major > brokenness (e.g. serious correctness issues) or regressions that > aren't caught by major release time. I think what's been happening in > practice is that people are creating their own patched bugfix versions > of releases to avoid the pain induced by API-breakage in major > releases. > > Obviously we should continue to innovate and clean up the API (with judicious > breakage where absolutely necessary -- I think the resampling cleanup > is a good example where the net benefit in the long run will be high), > but we have to take care of the user base, many of whom depend on > pandas in production applications. > > This is all made more difficult because there isn't any direct cash > flow funding pandas development AFAICT. Where I work, for example, we > have many employees who are responsible for creating patched builds > and handling backports for otherwise API-stable branches of major > Apache open source projects. But we can afford to do this because > customers are paying for this (priority support and backports / > patched builds).
> > So what I would suggest, in lieu of financial support for backports > and maintenance builds, is that we consider maint-0.XX.X branches for > backporting only the most serious of serious bug fixes ("Bad Bugs"). > Major regressions and correctness issues should go into this bucket. > Perhaps we can start doing this with 0.18.x -- as a matter of process > if any PR appears to fix a Bad Bug it should be brought up here on the > mailing list so we can decide whether it should be backported. > > With regard to the possible concern of "this is too much work": I don't think many bug fixes would need to be backported. For example, the last micro release, 0.17.1, had quite a lot of new features, and the whatsnew notes listed 50 bug fixes. But a lot of these bug fixes were not regressions; they were bugs that were also in the previous releases. So if we restrict the 0.xx.x release to only regressions, it would be a much smaller set of maybe 10 to 15 bug fixes (rough estimate, I didn't look into detail). But in any case I think this would be a rather manageable amount. So that would make our bug-fix releases smaller, and we also wouldn't have to keep breaking changes/larger new features out of master until one or two bug-fix releases are out. For me, the fixes that could go into such a bug-fix release are: - bug fixes or clean-up of rough edges of major new features in the 0.X release (for example, for 0.18.1, possible changes to the newly introduced RangeIndex) - regressions: issues that were not present in the previous 0.X release and could therefore make it more difficult to upgrade, plus the correctness issues that Wes mentioned. > > 5) We don't want to just go into maintenance mode because we still have a > > fair amount of feature requests. (though these are often pretty > targeted), > > but off of the top of my head, nothing really *new*, mainly some API > changes > > to bring consistency. E.g.
``.agg`` on a DataFrame is a long-requested > > feature, which actually after 0.18.0 is quite trivially to do. > > > > Yeah, I think we should try to stick with > https://en.wikipedia.org/wiki/Open/closed_principle -- so > conveniences, extensions to existing APIs, and other helpful new > features are fair game, but breaking API changes should be > > > 6) I think we telegraph any API changes and really really try to have > > back-compat, so people do have the ability to upgrade at their leisure. > > > >> API changes are most painful for users who do not write tests for > >> their code that depends on pandas. That problem is probably not > >> fixable =) > > > > > > of course this is a telling point. pandas upgrades often expose bugs in > user > > code. I view this as a good thing! > > > > So given all of the somewhat contradictory points above, what do I really > > think we should do? > > > > In order for pandas to be (even more) of a force in leading the > scientific > > community. I think we have to grow. So having more contributors is a > great > > thing. People do like / appreciate fixing bugs, but even more (IMHO), are > > performance enhancements and *some* new features. > > > > I will probably try to do more bug-fixing (rather than large API's ish > > fixes) I think. There is quite a back-log. This should *slow* the issue > of > > the BIG API changes. > > > > So I am kind of -1 on backports for mostly 2), it seems to just slow > things > > down, and 4) it can often lead to MORE things being inconcistent (you > need > > machinery to ensure that what is backported is correct and is included). > I > > can easily forsee that we decide to create 'stable' branches, which in > fact > > are stable but might have inconsistent fixes, this is even more > confusing in > > my view. > > > > Let me know what you think about my Bad Bug = backport policy. This is > mostly about communication and keeping track of serious issues that > should necessitate upgrading. 
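For the record, the long-requested ``.agg`` on DataFrame discussed above did land later, in pandas 0.20.0 (after this thread). A short illustration of its shape, mirroring the groupby aggregation API:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# DataFrame.agg (pandas 0.20.0+) applies one or more reductions per column;
# with a list of functions the result is indexed by function name.
out = df.agg(["sum", "mean"])
assert out.loc["sum", "a"] == 6
assert out.loc["mean", "b"] == 5.0
```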
> > I also think we should try to keep minor releases API stable from here > on out; so this may result in our version numbers increasing more > quickly but that's OK for the improved communication about "what is a > minor release (major release plus bug fix backports)" > Just for clarity, with minor release, do you mean the 0.X releases? (because 0.X.X matches more the 'major release plus bug fix backports' description) Joris > > - Wes > > > I think we have a fairly aggressive release cycle. We for sure don't > want to > > debate everything. I am of the opinion that it is much better to put > things > > out there quicker, then to endlessly debate extremely minor points (not > > naming project names here :). > > > > For the general user what we do w.r.t. release cycles probably doesn't > > matter, and for the corporate user, they almost always have a 'fixed' > > version anyhow (and then they do of course port the new ones, but then > they > > have people upgraded it carefully). I am not so sure we should impose > > structure on this. We already have announced major releases and minor > > releases. > > > > All for better 'language' in the minor releases. > > > > Jeff > > > > > > On Tue, Feb 23, 2016 at 2:21 PM, Wes McKinney > wrote: > >> > >> hi Joris, > >> > >> I'm sorry it's taken a couple weeks to write a reply -- been really > >> busy and wanted to put some thought into this. > >> > >> This is a really important discussion given how important pandas has > >> become to so many people, thank you for bringing it up. > >> > >> On Tue, Feb 9, 2016 at 4:59 PM, Joris Van den Bossche > >> wrote: > >> > Hi all, > >> > > >> > I wanted to stir some discussion on pandas its policy on bug-fx > releases > >> > and > >> > upgrading pains. First some context: > >> > > >> > Context part 1: Currently we do not use maintenance branches for > bugfix > >> > releases, and we actually also do not really do bugfix releases. 
We > just > >> > develop further on master, and try to not merge breaking changes the > >> > first > >> > weeks/months, so we can do a minor kind of bug-fix release (but > usually > >> > also > >> > with a lot of new features). > >> > But we don't, for example, backport fixes of regressions if they are > >> > fixed > >> > after master is pointing to the next major release. > >> > >> I think in general it would be a good idea to tilt development away > >> from new feature development and toward bug fixes and stability. Given > >> that we are contemplating making some breaking changes in a 1.x > >> development branch (like removing the Panel classes), we should decide > >> as some point to create a 0.X.Y maintenance line where we can backport > >> bug fixes only, so that "legacy pandas" users can have a "LTS" (in > >> Ubuntu parlance) maintenance branch. This introduces some development > >> overhead but it seems worth it. > >> > >> > > >> > Context part 2: pandas is not yet that stable, in the sense that there > >> > are > >> > still quite some breaking changes in each release. I am not arguing > for > >> > not > >> > doing these breaking changes, as some of these changes are really > needed > >> > to > >> > clean up the API (although there are also arguments for that, but I > >> > think > >> > that is another discussion). This has the consequence that updating > your > >> > pandas version is not always that pleasant. > >> > >> Over the years I've heard many horror stories from companies who have > >> created and maintained internal 0.7.x, 0.8.x, or 0.9.x pandas forks > >> because of the API breakage issues. This is definitely an anti-pattern > >> that we should try to avoid happening in the future, but API breakages > >> in many cases are the inevitable price of progress. 
> >> > >> Some of the API breakage has resulted from experiences accumulated > >> over a long period of time -- I made a lot of decisions early on in > >> the project that ended up not being the right ones (e.g. resample > >> default arguments changed at one point). There wasn't enough community > >> engagement at that point to have a thorough design process to > >> potentially come up with the "right" design first. In other cases, the > >> "right" choice was perhaps more ambiguous. > >> > >> API changes are most painful for users who do not write tests for > >> their code that depends on pandas. That problem is probably not > >> fixable =) > >> > >> I think having stable releases with backports of serious correctness > >> bugs helps mitigate this problem, whereas modest API changes between > >> major releases. I would also be in favor of having point releases only > >> contain bug fixes rather than the current system of point releases > >> being a stable snapshot of trunk. > >> > >> Since Jeff is the most affected by this on a day to day basis as de > >> facto steward of the PR queue I would be curious what process he feels > >> would be the most helpful. > >> > >> - Wes > >> > >> > > >> > Sidenote: I have not that much experience with using pandas in a > larger > >> > company or in larger codebases that need to be upgraded, rather with > >> > just my > >> > own code for my PhD. So it is difficult for me to judge on how much > this > >> > is > >> > a problem or if bug-fx releases would help. > >> > > >> > Questions: > >> > > >> > What are other people's experiences with upgrading pandas? And would > >> > more > >> > bug-fix releases actually ease the upgrading? > >> > Do we want to do more bug-fix releases? > >> > Having a maintenance branch and backporting fixes is extra work. Would > >> > we be > >> > able to handle this? Would it be worth the effort? 
> >> > > >> > (It has been mentioned before, but I think the main point raised was > >> > lack of > >> > manpower to maintain separate branches) > >> > > >> > To put it another way. In our whatsnew notice there is "We recommend > >> > that > >> > all users upgrade to this version", but I am actually not sure we > should > >> > recommend that. I personally do not always recommend that no matter > what > >> > without careful consideration. > >> > > >> > Regards, > >> > Joris > >> > > >> > _______________________________________________ > >> > Pandas-dev mailing list > >> > Pandas-dev at python.org > >> > https://mail.python.org/mailman/listinfo/pandas-dev > >> > > >> _______________________________________________ > >> Pandas-dev mailing list > >> Pandas-dev at python.org > >> https://mail.python.org/mailman/listinfo/pandas-dev > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jeffreback at gmail.com Wed Mar 9 10:05:39 2016 From: jeffreback at gmail.com (Jeff Reback) Date: Wed, 9 Mar 2016 10:05:39 -0500 Subject: [Pandas-dev] ANN: pandas v0.18.0rc2 - RELEASE CANDIDATE Message-ID: Hi, I'm pleased to announce the availability of the second release candidate of Pandas 0.18.0. Please try this RC and report any issues here: Pandas Issues . Compared to RC1, we have added updated read_sas and fixed float indexing. We will be releasing officially very shortly. THIS IS NOT A PRODUCTION RELEASE This is a major release from 0.17.1 and includes a small number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version. Highlights include: - pandas >= 0.18.0 will no longer support compatibility with Python version 2.6 GH7718 or version 3.3 GH11273 - Moving and expanding window functions are now methods on Series and DataFrame similar to .groupby like objects, see here . 
- Adding support for a RangeIndex as a specialized form of the Int64Index for memory savings, see here . - API breaking .resample changes to make it more .groupby like, see here . - Removal of support for positional indexing with floats, which was deprecated since 0.14.0. This will now raise a TypeError, see here . - The .to_xarray() function has been added for compatibility with the xarray package, see here . - The read_sas() function has been enhanced to read sas7bdat files, see here . - Addition of the .str.extractall() method , and API changes to the .str.extract() method , and the .str.cat() method - pd.test() top-level nose test runner is available GH4327 See the Whatsnew for much more information. Best way to get this is to install via conda from our development channel. Builds for osx-64, linux-64, win-64 for Python 2.7 and Python 3.5 are all available. conda install pandas=v0.18.0rc2 -c pandas Thanks to all who made this release happen. It is a very large release! Jeff From jeffreback at gmail.com Sat Mar 12 11:13:49 2016 From: jeffreback at gmail.com (Jeff Reback) Date: Sat, 12 Mar 2016 11:13:49 -0500 Subject: [Pandas-dev] ANN: pandas v0.18.0 Final released Message-ID: Hi, This is a major release from 0.17.1 and includes a small number of API changes, several new features, enhancements, and performance improvements along with a large number of bug fixes. We recommend that all users upgrade to this version. This was a release of 3.5 months with 381 commits by 100 authors encompassing 465 issues and 290 pull-requests. *What is it:* *pandas* is a Python package providing fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python.
Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. *Highlights*: - pandas >= 0.18.0 will no longer support compatibility with Python version 2.6 GH7718 or version 3.3 GH11273 - Moving and expanding window functions are now methods on Series and DataFrame similar to .groupby like objects, see here . - Adding support for a RangeIndex as a specialized form of the Int64Index for memory savings, see here . - API breaking .resample changes to make it more .groupby like, see here . - Removal of support for positional indexing with floats, which was deprecated since 0.14.0. This will now raise a TypeError, see here . - The .to_xarray() function has been added for compatibility with the xarray package, see here . - The read_sas() function has been enhanced to read sas7bdat files, see here . - Addition of the .str.extractall() method , and API changes to the .str.extract() method , and the .str.cat() method - pd.test() top-level nose test runner is available GH4327 See the Whatsnew for much more information and the full Documentation link. *How to get it:* Source tarballs, Windows wheels, and macOS wheels are available on PyPI Installation via conda is: - conda install pandas Windows wheels are courtesy of Christoph Gohlke and are built on NumPy 1.10 macOS wheels are courtesy of Matthew Brett. *Issues:* Please report any issues on our issue tracker : Jeff *Thanks to all of the contributors* - ARF - Alex Alekseyev - Andrew McPherson - Andrew Rosenfeld - Anthonios Partheniou - Anton I. Sipos - Ben - Ben North - Bran Yang - Chris - Chris Carroux - Christopher C. Aycock - Christopher Scanlin - Cody - Da Wang - Daniel Grady - Dorozhko Anton - Dr-Irv - Erik M. Bray - Evan Wright - Francis T.
O'Donovan - Frank Cleary - Gianluca Rossi - Graham Jeffries - Guillaume Horel - Henry Hammond - Isaac Schwabacher - Jean-Mathieu Deschenes - Jeff Reback - Joe Jevnik - John Freeman - John Fremlin - Jonas Hoersch - Joris Van den Bossche - Joris Vankerschaver - Justin Lecher - Justin Lin - Ka Wo Chen - Keming Zhang - Kerby Shedden - Kyle - Marco Farrugia - MasonGallo - MattRijk - Matthew Lurie - Maximilian Roos - Mayank Asthana - Mortada Mehyar - Moussa Taifi - Navreet Gill - Nicolas Bonnotte - Paul Reiners - Philip Gura - Pietro Battiston - RahulHP - Randy Carnevale - Rinoc Johnson - Rishipuri - Sangmin Park - Scott E Lasley - Sereger13 - Shannon Wang - Skipper Seabold - Thierry Moisan - Thomas A Caswell - Toby Dylan Hocking - Tom Augspurger - Travis - Trent Hauck - Tux1 - Varun - Wes McKinney - Will Thompson - Yoav Ram - Yoong Kang Lim - Yoshiki Vázquez Baeza - Young Joong Kim - Younggun Kim - Yuval Langer - alex argunov - behzad nouri - boombard - ian-pantano - chromy - daniel - dgram0 - gfyoung - hack-c - hcontrast - jfoo - kaustuv deolal - llllllllll - ranarag - rockg - scls19fr - seales - sinhrks - srib - surveymedia.ca - tworec From jeffreback at gmail.com Mon Mar 14 16:19:44 2016 From: jeffreback at gmail.com (Jeff Reback) Date: Mon, 14 Mar 2016 16:19:44 -0400 Subject: [Pandas-dev] Fwd: GSoC2016 - NumFOCUS In-Reply-To: <20160314200801.GB26640@pupunha.rgaiacs.com> References: <20160314200801.GB26640@pupunha.rgaiacs.com> Message-ID: If anyone else is interested in being a pandas mentor for GSOC, pls e-mail me ASAP. (and Raniere).
Jeff ---------- Forwarded message ---------- From: Raniere Silva Date: Mon, Mar 14, 2016 at 4:08 PM Subject: GSoC2016 - NumFOCUS To: Bryan Van de Ven , Devasena Inupakutika < devasena.prasad at gmail.com>, Ethan White , Greg Wilson < gvwilson.third.bit at gmail.com>, henry senyondo , Jeff Reback , joehuchette , "lev.konst" , Madeleine , Mark Wiebe , Michael Droettboom , Miles Lubin , piotrb , QuLogic , Radim , sckott < myrmecocystus at gmail.com>, Stephan Hoyer , tacaswell < tcaswell at gmail.com> Hi all, thanks for volunteering to be a mentor for GSoC 2016 under the NumFOCUS umbrella. Could you look at the addressed-to list of this email and verify that all your colleagues that will be mentors are on it? If they aren't, please ask them to send me an email or email me on their behalf. Today the students will start sending their applications. This year they will submit their final proposal as PDF (for better or worse) and we will learn during the process. We suggest that the students that want comments on their proposal send it as a pull request against https://github.com/numfocus/gsoc/ because it makes it easy for some of us to keep track of the proposals that want feedback and shows us that the student knows how to use Git and GitHub. If you want your students to use Google Docs or another tool to get feedback on their proposals, you are free to request that of them. If you need anything, let me know. Cheers, Raniere -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed...
Name: signature.asc Type: application/pgp-signature Size: 836 bytes Desc: not available URL: From wesmckinn at gmail.com Tue Mar 15 23:16:01 2016 From: wesmckinn at gmail.com (Wes McKinney) Date: Tue, 15 Mar 2016 20:16:01 -0700 Subject: [Pandas-dev] On bug-fix releases and maintenance branches In-Reply-To: References: Message-ID: On Tue, Mar 8, 2016 at 5:23 PM, Joris Van den Bossche wrote: > > > 2016-03-08 23:48 GMT+01:00 Wes McKinney : >> >> hey Jeff, >> >> On Tue, Feb 23, 2016 at 12:11 PM, Jeff Reback >> wrote: >> > Thanks for bringing this up joris, here are some thoughts. >> > >> > 1) I agree that the next releases should probably focus on bug fixes. So >> > this might mean >> > we should shoot for 0.18.2....3 etc. >> > >> > However, we do need a 0.19.0 in order to provide any big deprecations >> > (Panel) and API changes that >> > are needed. >> > >> > 2) I am a bit hesitant to even make a big break (1.0) because I have >> > seen >> > this just bifurcating people (e.g. do I upgrade now, what if I want >> > compat). This just creates less community. So I think this should be a >> > goal, >> > that even though its called 1.0 it is as back-compat as possible. >> > >> >> Yeah, with more significant internal refactoring the goal would be to >> not break API compatibility unless absolutely necessary. > > > Jeff, to be clear, my initial mail was not to discuss the issue whether to > do a major breaking release or not, or going into general maintenance mode > or not (that's certainly an interesting discussion, but another one I > think). The fact is that we are still cleaning up things and so do breakages > in 0.X releases (like the resample now), and that won't directly stop. > > But given that context, we can think about how to do 0.X.X releases that > help users as much as possible to upgrade smoothly. > We now put quite a lot in micro 0.X.X bug-fix releases (including new > features), which can have the consequence that it introduces new bugs.
> >> >> > 3) Releases can be big, and do fix lots of bugs, and usually introduce >> > new >> > ones. This is almost inevitable as we add new features, changes, and >> > even >> > bug fixes which occasionally have regressions (though test suite is >> > pretty >> > good, so hopefully not too often). >> > >> > 4) I don't relish backporting things. I think this could lead to lots of >> > headaches and IMHO doesn't really buy much. >> > >> >> I think what we are talking about is backporting bug fixes for major >> brokenness (e.g. serious correctness issues) or regressions that >> aren't caught by major release time. I think what's been happening in >> practice is that people are creating their own patched bugfix versions >> of releases to avoid the pain induced by API-breakage in major >> releases. >> >> Obviously, continuing to innovate and clean up the API (with judicious >> breakage where absolutely necessary -- I think the resampling cleanup >> is a good example where the net benefit in the long run will be high) >> but we have to take care of the user base, many of whom depend on >> pandas in production applications. >> >> This is all made more difficult because there isn't any direct cash >> flow funding pandas development AFAICT. Where I work, for example, we >> have many employees who are responsible for creating patched builds >> and handling backports for otherwise API-stable branches of major >> Apache open source projects. But we can afford to do this because >> customers are paying for this (priority support and backports / >> patched builds). >> >> So what I would suggest, in lieu of financial support for backports >> and maintenance builds, is that we consider maint-0.XX.X branches for >> backporting only the most serious of serious bug fixes ("Bad Bugs"). >> Major regressions and correctness issues should go into this bucket. 
>> Perhaps we can start doing this with 0.18.x -- as a matter of process >> if any PR appears to fix a Bad Bug it should be brought up here on the >> mailing list so we can decide whether it should be backported. >> > > With regard to the possible concern of "this is too much work": I don't > think it would be many bug fixes that would be backported. > For example, the last micro release, 0.17.1 had quite a lot of new features > and the whatsnew notes listed 50 bug fixes. But a lot of these bug fixes > were not regressions, but were bugs that were also in the previous releases. > So if we restrict the 0.xx.x release to only regressions, it would be a much > smaller set of maybe 10 to 15 bug fixes (rough estimate, didn't look into > detail). But in any case I think this would be a rather manageable amount. > > So that would make our bug fix releases smaller, and we also don't have to > hold up master with breaking changes/larger new features until one or two > bug-fix releases are released. > > For me, the fixes that could go in such a bug-fix release: > > - bug fixes or clean-up of rough edges of major new features in the 0.X > release (for example for 0.18.1 possible changes to the newly introduced > RangeIndex) > - regressions, issues that were not present in the previous 0.X release, and > could therefore make it more difficult to upgrade > > + the correctness issues that Wes mentioned. > > > >> >> > 5) We don't want to just go into maintenance mode because we still have >> > a >> > fair amount of feature requests. (though these are often pretty >> > targeted), >> > but off of the top of my head, nothing really *new*, mainly some API >> > changes >> > to bring consistency. E.g. ``.agg`` on a DataFrame is a long-requested >> > feature, which actually after 0.18.0 is quite trivial to do.
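Jeff's remark just above about ``.agg`` on a DataFrame did pan out: DataFrame.agg landed in pandas 0.20.0. A small sketch of the API that was being requested (shown with the eventually released method, which did not exist at the time of this thread):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# One row per aggregation, one column per original column
summary = df.agg(["sum", "mean"])

# Per-column aggregations can also be specified with a dict
per_col = df.agg({"a": "sum", "b": "mean"})
```

The list form mirrors what SeriesGroupBy.agg already supported at the time, which is why the feature became "quite trivial" once the groupby machinery was cleaned up in 0.18.0.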
>> >> Yeah, I think we should try to stick with >> https://en.wikipedia.org/wiki/Open/closed_principle -- so >> conveniences, extensions to existing APIs, and other helpful new >> features are fair game, but breaking API changes should be >> >> > 6) I think we telegraph any API changes and really really try to have >> > back-compat, so people do have the ability to upgrade at their leisure. >> > >> >> API changes are most painful for users who do not write tests for >> >> their code that depends on pandas. That problem is probably not >> >> fixable =) >> > >> > >> > of course this is a telling point. pandas upgrades often expose bugs in >> > user >> > code. I view this as a good thing! >> > >> > So given all of the somewhat contradictory points above, what do I >> > really >> > think we should do? >> > >> > In order for pandas to be (even more) of a force in leading the >> > scientific >> > community, I think we have to grow. So having more contributors is a >> > great >> > thing. People do like / appreciate fixing bugs, but even more (IMHO), >> > are >> > performance enhancements and *some* new features. >> > >> > I will probably try to do more bug-fixing (rather than large API-ish >> > fixes) I think. There is quite a back-log. This should *slow* the issue >> > of >> > the BIG API changes. >> > >> > So I am kind of -1 on backports for mostly 2), it seems to just slow >> > things >> > down, and 4) it can often lead to MORE things being inconsistent (you >> > need >> > machinery to ensure that what is backported is correct and is included). >> > I >> > can easily foresee that we decide to create 'stable' branches, which in >> > fact >> > are stable but might have inconsistent fixes, this is even more >> > confusing in >> > my view. >> > >> >> Let me know what you think about my Bad Bug = backport policy. This is >> mostly about communication and keeping track of serious issues that >> should necessitate upgrading.
>> >> I also think we should try to keep minor releases API stable from here >> on out; so this may result in our version numbers increasing more >> quickly but that's OK for the improved communication about "what is a >> minor release (major release plus bug fix backports)" > > > Just for clarity, with minor release, do you mean the 0.X releases? (because > 0.X.X matches more the 'major release plus bug fix backports' description) > Sorry, I meant that 0.X.Y should be API stable with all other 0.X versions > Joris > > From shoyer at gmail.com Wed Mar 16 12:44:51 2016 From: shoyer at gmail.com (Stephan Hoyer) Date: Wed, 16 Mar 2016 09:44:51 -0700 Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap In-Reply-To: References: Message-ID: After taking a step back and starting a new job, I am coming around to Wes's perspective here. The lack of integer-NAs and the overly complex/unpredictable internal memory model are major shortcomings (along with the indexing API) for using pandas in production software. Compatibility with the rest of the SciPy ecosystem is important, but it shouldn't hold pandas back. There's no good reason why pandas needs to be built on a library for strided n-dimensional arrays -- that's a lot more complexity than we need. Best, Stephan On Tue, Jan 12, 2016 at 5:42 PM, Wes McKinney wrote: > On Tue, Jan 12, 2016 at 4:06 PM, Stephan Hoyer wrote: > > I think I'm mostly on the same page as well. Five years has certainly > been > > too long. > > > > I agree that it would be premature to commit to using DyND in a binding > way > > in pandas. A lot seems to be up in the air with regards to dtypes in > Python > > right now (yes, particularly from projects sponsored by Continuum). > > > > So I would advocate for proceeding with the refactor for now (which will > > have numerous other benefits), and see how the situation evolves.
If it > > seems like we are in a plausible position to unify the dtype system with > a > > tool like DyND, then let's seriously consider that down the road. Either > > way, explicit interfaces (e.g., to_numpy(), to_dynd()) will help. > > > > +1 -- I think our long term goal should be to have a common physical > memory representation. If pandas internally stays slightly malleable > (in a non-user-visible-way) we can conform to a standard (presuming > one develops) with less user-land disruption. If a standard does not > develop we can just shrug our shoulders and do what's best for pandas. > We'll have to think about how this will affect pandas's future C API > (zero-copy interop guarantees): we might make the C API in the first > release more clearly not-for-production use. > > Aside: There doesn't even seem to be consensus at the moment on > missing data representation. Sentinels, for example, causes > interoperability problems with ODBC / databases, and Apache ecosystem > projects (e.g. HDFS file formats, Thrift, Spark, Kafka, etc.). If we > build a C interface to Avro or Parquet in pandas right now we'll have > to convert bitmasks to pandas's bespoke sentinels. To be clear, R has > this problem too. I see good arguments for even nixing NaN in floating > point arrays, as heretical as that might sound. Ironically I used to > be in favor of sentinels but I realized it was an isolationist view. > > -W > > > On Mon, Jan 11, 2016 at 4:23 PM, Wes McKinney > wrote: > >> > >> On Mon, Jan 11, 2016 at 3:04 PM, Jeff Reback > wrote: > >> > I am in favor of the Wes refactoring, but for some slightly different > >> > reasons. > >> > > >> > I am including some in-line comments. 
> >> > > >> > On Mon, Jan 11, 2016 at 2:55 PM, Stephan Hoyer > wrote: > >> >>> > >> >>> I don't see alternative ways for pandas to have a truly healthy > >> >>> relationship with more general purpose array / scientific computing > >> >>> libraries without being able to add new pandas functionality in a > >> >>> clean way, and without requiring us to get patches accepted (and > >> >>> released) in NumPy or DyND. > >> >> > >> >> > >> >> Indeed, I think my disagreement is mostly about the order in which we > >> >> approach these problems. > >> > > >> > > >> > I agree here. I had started on *some* of this to enable swappable > numpy > >> > to > >> > DyND to support IntNA (all in python, > >> > but the fundamental change was to provide an API layer to the > back-end). > >> > > >> >> > >> >> > >> >>> > >> >>> Can you clarify what aspects of this plan are disagreeable / > >> >>> contentious? > >> >> > >> >> > >> >> See my comments below. > >> >> > >> >>> > >> >>> Are you arguing for pandas becoming more of a companion > >> >>> tool / user interface layer for NumPy or DyND? > >> >> > >> >> > >> >> Not quite. Pandas has some fantastic and highly useable data (Series, > >> >> DataFrame, Index). These certainly don't belong in NumPy or DyND. > >> >> > >> >> However, the array-based ecosystem certainly could use improvements > to > >> >> dtypes (e.g., datetime and categorical) and dtype specific methods > >> >> (e.g., > >> >> for strings) just as much as pandas. I do firmly believe that pushing > >> >> these > >> >> types of improvements upstream, rather than implementing them > >> >> independently > >> >> for pandas, would yield benefits for the broader ecosystem. With the > >> >> right > >> >> infrastructure, generalizing things to arrays is not much more work. > >> > > >> > > >> > I dont' think Wes nor I disagree here at all. The problem was (and > is), > >> > the > >> > pace of change in the underlying libraries. It is simply too slow > >> > for pandas development efforts. 
> >> > > >> > I think the pandas efforts (and other libraries) can result in more > >> > powerful > >> > fundamental libraries > >> > that get pushed upstream. However, it would not benefit ANYONE to slow > >> > down > >> > downstream efforts. I am not sure why you suggest that we WAIT for the > >> > upstream libraries to change? We have been waiting forever for that. > Now > >> > we > >> > have a concrete implementation of certain data types that are useful. > >> > They > >> > (upstream) can take > >> > this and build on (or throw it away and make a better one or > whatever). > >> > But > >> > I don't think it benefits anyone to WAIT for someone to change numpy > >> > first. > >> > Look at how long it took them to (partially) fix datetimes. > >> > > >> > xarray in particular has done the same thing to pandas, e.g. you have > >> > added > >> > additional selection operators and syntax (e.g. passing dicts of named > >> > axes). These changes are in fact propogating to pandas. This has taken > >> > time > >> > (but much much less that this took for any of pandas changes to > numpy). > >> > Further look at how long you have advocated (correctly) for labeled > >> > arrays > >> > in numpy (which we are still waiting). > >> > > >> >> > >> >> > >> >> I'd like to see pandas itself focus more on the data-structures and > >> >> less > >> >> on the data types. This would let us share more work with the > "general > >> >> purpose array / scientific computing libraries". > >> >> > >> > Pandas IS about specifying the correct data types. It is simply > >> > incorrect to > >> > decouple this problem from the data-structures. A lot of effort over > the > >> > years has gone into > >> > making all dtypes playing nice with each other and within pandas. 
> >> > > >> >>> > >> >>> 1) Introduce a proper (from a software engineering perspective) > >> >>> logical data type abstraction that models the way that pandas > already > >> >>> works, but cleaning up all the mess (implicit upcasts, lack of a > real > >> >>> "NA" scalar value, making pandas-specific methods like unique, > >> >>> factorize, match, etc. true "array methods") > >> >> > >> >> > >> >> New abstractions have a cost. A new logical data type abstraction is > >> >> better than no proper abstraction at all, but (in principle), one > data > >> >> type > >> >> abstraction should be enough to share. > >> >> > >> > > >> >> > >> >> A proper logical data type abstraction would be an improvement over > the > >> >> current situation, but if there's a way we could introduce one less > >> >> abstraction (by improving things upstream in a general purpose array > >> >> library) that would help even more. > >> >> > >> > > >> > This is just pushing a problem upstream, which ultimately, given the > >> > track > >> > history of numpy, won't be solved at all. We will be here 1 year from > >> > now > >> > with the exact same discussion. Why are we waiting on upstream for > >> > anything? > >> > As I said above, if something is created which upstream finds useful > on > >> > a > >> > general level. great. The great cost here is time. > >> > > >> >> > >> >> For example, we could imagine pushing to make DyND the new core for > >> >> pandas. This could be enough of a push to make DyND generally useful > -- > >> >> I > >> >> know it still has a few kinks to work out. > >> >> > >> > > >> > maybe, but DyND has to have full compat with what currently is out > there > >> > (soonish). Then I agree this could be possible. But wouldn't it be > even > >> > better > >> > for pandas to be able to swap back-ends. Why limit ourselves to a > >> > particular > >> > backend if its not that difficult. > >> > > >> > >> I think Jeff and I are on the same page here. 
5 years ago we were > >> having the *exact same* discussions around NumPy and adding new data > >> type functionality. 5 years is a staggering amount of time in open > >> source. It was less than 5 years between pandas not existing and being > >> a super popular project with 2/3 of a best-selling O'Reilly book > >> written about it. To whit, DyND exists in large part because of the > >> difficulty in making progress within NumPy. > >> > >> Now, as 5 years ago, I think we should be acting in the best interests > >> of pandas users, and what I've been describing is intended as a > >> straightforward (though definitely labor intensive) and relatively > >> low-risk plan that will "future-proof" the pandas user API for at > >> least the next few years, and probably much longer. If we find that > >> enabling some internals to use DyND is the right choice, we can do > >> that in a non-invasive way while carefully minding data > >> interoperability. Meaningful performance benefits would be a clear > >> motivation. > >> > >> To be 100% open and transparent (in the spirit of pandas's new > >> governance docs): Before committing to using DyND in any binding way > >> (i.e. required, as opposed to opt-in) in pandas, I'd really like to > >> see more evidence from 3rd parties without direct financial interest > >> (i.e. employment or equity from Continuum) that DyND is "the future of > >> Python array computing"; in the absence of significant user and > >> community code contribution, it still feels like a political quagmire > >> leftover from the Continuum-Enthought rift in 2011. > >> > >> - Wes > >> > >> >>> > >> >>> 4) Give pandas objects a real C API so that users can manipulate and > >> >>> create pandas objects with their own native (C/C++/Cython) code. 
> >> >> > >> >> > >> >>> 5) Yes, absolutely improve NumPy and DyND and transition to improved > >> >>> NumPy and DyND facilities as soon as they are available and shipped > >> >> > >> >> > >> >> I like the sound of both of these. > >> > > >> > > >> > > >> > Further you made a point above > >> > > >> >> You are right that pandas has started to supplant numpy as a high > level > >> >> API for data analysis, but of course the robust (and often numpy > based) > >> >> Python ecosystem is part of what has made pandas so successful. In > >> >> practice, > >> >> ecosystem projects often want to work with more primitive objects > than > >> >> series/dataframes in their internal data structures and without numpy > >> >> this > >> >> becomes more difficult. For example, how do you concatenate a list of > >> >> categoricals? If these were numpy arrays, we could use > np.concatenate, > >> >> but > >> >> the current implementation of categorical would require a custom > >> >> solution. > >> >> First class compatibility with pandas is harder when pandas data > >> >> cannotbe > >> >> used with a full ndarray API. > >> > > >> > > >> > I disagree entirely here. I think that Series/DataFrame ARE becoming > >> > primitive objects. Look at seaborn, statsmodels, and xarray These are > >> > first > >> > class users of these structures, whom need the additional meta-data > >> > attached. > >> > > >> > Yes categorical are useful in numpy, and they should support them. But > >> > lots > >> > of libraries can simply use pandas and do lots of really useful stuff. > >> > However, why reinvent the wheel and use numpy, when you have > DataFrames. > >> > > >> > From a user point of view, I don't think they even care about numpy > (or > >> > whatever drives pandas). It solves a very general problem of working > >> > with > >> > labeled data. > >> > > >> > Jeff > > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jeffreback at gmail.com Wed Mar 23 17:16:47 2016 From: jeffreback at gmail.com (Jeff Reback) Date: Wed, 23 Mar 2016 17:16:47 -0400 Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap In-Reply-To: References: Message-ID: https://github.com/apache/arrow/tree/master/python/pyarrow looking pretty good. assume there is a notion of an extension dtype? (to support dtype/schema that other systems may not) in order to implement things like categorical / datetime tz etc then libpandas becomes a pretty thin wrapper around this > On Mar 16, 2016, at 12:44 PM, Stephan Hoyer wrote: > > After taking a step back and starting a new job, I am coming around to Wes's perspective here. > > The lack of integer-NAs and the overly complex/unpredictable internal memory model are major shortcomings (along with the indexing API) for using pandas in production software. > > Compatibility with the rest of the SciPy ecosystem is important, but it shouldn't hold pandas back. There's no good reason why pandas needs to built on a library for strided n-dimensional arrays -- that's a lot more complexity than we need. > > Best, > Stephan > > >> On Tue, Jan 12, 2016 at 5:42 PM, Wes McKinney wrote: >> On Tue, Jan 12, 2016 at 4:06 PM, Stephan Hoyer wrote: >> > I think I'm mostly on the same page as well. Five years has certainly been >> > too long. >> > >> > I agree that it would be premature to commit to using DyND in a binding way >> > in pandas. A lot seems to be up in the air with regards to dtypes in Python >> > right now (yes, particularly from projects sponsored by Continuum). >> > >> > So I would advocate for proceeding with the refactor for now (which will >> > have numerous other benefits), and see how the situation evolves. If it >> > seems like we are in a plausible position to unify the dtype system with a >> > tool like DyND, then let's seriously consider that down the road. Either >> > way, explicit interfaces (e.g., to_numpy(), to_dynd()) will help. 
>> > >> >> +1 -- I think our long term goal should be to have a common physical >> memory representation. If pandas internally stays slightly malleable >> (in a non-user-visible-way) we can conform to a standard (presuming >> one develops) with less user-land disruption. If a standard does not >> develop we can just shrug our shoulders and do what's best for pandas. >> We'll have to think about how this will affect pandas's future C API >> (zero-copy interop guarantees): we might make the C API in the first >> release more clearly not-for-production use. >> >> Aside: There doesn't even seem to be consensus at the moment on >> missing data representation. Sentinels, for example, causes >> interoperability problems with ODBC / databases, and Apache ecosystem >> projects (e.g. HDFS file formats, Thrift, Spark, Kafka, etc.). If we >> build a C interface to Avro or Parquet in pandas right now we'll have >> to convert bitmasks to pandas's bespoke sentinels. To be clear, R has >> this problem too. I see good arguments for even nixing NaN in floating >> point arrays, as heretical as that might sound. Ironically I used to >> be in favor of sentinels but I realized it was an isolationist view. >> >> -W >> >> > On Mon, Jan 11, 2016 at 4:23 PM, Wes McKinney wrote: >> >> >> >> On Mon, Jan 11, 2016 at 3:04 PM, Jeff Reback wrote: >> >> > I am in favor of the Wes refactoring, but for some slightly different >> >> > reasons. >> >> > >> >> > I am including some in-line comments. >> >> > >> >> > On Mon, Jan 11, 2016 at 2:55 PM, Stephan Hoyer wrote: >> >> >>> >> >> >>> I don't see alternative ways for pandas to have a truly healthy >> >> >>> relationship with more general purpose array / scientific computing >> >> >>> libraries without being able to add new pandas functionality in a >> >> >>> clean way, and without requiring us to get patches accepted (and >> >> >>> released) in NumPy or DyND. 
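Wes's aside above about sentinels versus bitmasks can be made concrete. The sketch below (plain NumPy, no pandas or Arrow API assumed) shows the conversion an interop layer has to perform between pandas-style NaN sentinels and an Arrow-style validity mask:

```python
import numpy as np

# Sentinel encoding (pandas): NaN inside the data array marks "missing"
sentinel = np.array([1.0, np.nan, 3.0])

# Mask encoding (Arrow-style): a payload array plus a validity mask
valid = ~np.isnan(sentinel)
payload = np.where(valid, sentinel, 0.0)  # arbitrary fill where invalid

# Converting back: reinsert the sentinel wherever the mask says "missing"
roundtrip = np.where(valid, payload, np.nan)
```

Note that for integer data there is no NaN sentinel available at all, which is exactly the integer-NA limitation raised earlier in the thread; a mask representation sidesteps that problem entirely.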
>> >> >> >> >> >> >> >> >> Indeed, I think my disagreement is mostly about the order in which we >> >> >> approach these problems. >> >> > >> >> > >> >> > I agree here. I had started on *some* of this to enable swappable numpy >> >> > to >> >> > DyND to support IntNA (all in python, >> >> > but the fundamental change was to provide an API layer to the back-end). >> >> > >> >> >> >> >> >> >> >> >>> >> >> >>> Can you clarify what aspects of this plan are disagreeable / >> >> >>> contentious? >> >> >> >> >> >> >> >> >> See my comments below. >> >> >> >> >> >>> >> >> >>> Are you arguing for pandas becoming more of a companion >> >> >>> tool / user interface layer for NumPy or DyND? >> >> >> >> >> >> >> >> >> Not quite. Pandas has some fantastic and highly useable data (Series, >> >> >> DataFrame, Index). These certainly don't belong in NumPy or DyND. >> >> >> >> >> >> However, the array-based ecosystem certainly could use improvements to >> >> >> dtypes (e.g., datetime and categorical) and dtype specific methods >> >> >> (e.g., >> >> >> for strings) just as much as pandas. I do firmly believe that pushing >> >> >> these >> >> >> types of improvements upstream, rather than implementing them >> >> >> independently >> >> >> for pandas, would yield benefits for the broader ecosystem. With the >> >> >> right >> >> >> infrastructure, generalizing things to arrays is not much more work. >> >> > >> >> > >> >> > I dont' think Wes nor I disagree here at all. The problem was (and is), >> >> > the >> >> > pace of change in the underlying libraries. It is simply too slow >> >> > for pandas development efforts. >> >> > >> >> > I think the pandas efforts (and other libraries) can result in more >> >> > powerful >> >> > fundamental libraries >> >> > that get pushed upstream. However, it would not benefit ANYONE to slow >> >> > down >> >> > downstream efforts. I am not sure why you suggest that we WAIT for the >> >> > upstream libraries to change? We have been waiting forever for that. 
Now >> >> > we >> >> > have a concrete implementation of certain data types that are useful. >> >> > They >> >> > (upstream) can take >> >> > this and build on (or throw it away and make a better one or whatever). >> >> > But >> >> > I don't think it benefits anyone to WAIT for someone to change numpy >> >> > first. >> >> > Look at how long it took them to (partially) fix datetimes. >> >> > >> >> > xarray in particular has done the same thing to pandas, e.g. you have >> >> > added >> >> > additional selection operators and syntax (e.g. passing dicts of named >> >> > axes). These changes are in fact propogating to pandas. This has taken >> >> > time >> >> > (but much much less that this took for any of pandas changes to numpy). >> >> > Further look at how long you have advocated (correctly) for labeled >> >> > arrays >> >> > in numpy (which we are still waiting). >> >> > >> >> >> >> >> >> >> >> >> I'd like to see pandas itself focus more on the data-structures and >> >> >> less >> >> >> on the data types. This would let us share more work with the "general >> >> >> purpose array / scientific computing libraries". >> >> >> >> >> > Pandas IS about specifying the correct data types. It is simply >> >> > incorrect to >> >> > decouple this problem from the data-structures. A lot of effort over the >> >> > years has gone into >> >> > making all dtypes playing nice with each other and within pandas. >> >> > >> >> >>> >> >> >>> 1) Introduce a proper (from a software engineering perspective) >> >> >>> logical data type abstraction that models the way that pandas already >> >> >>> works, but cleaning up all the mess (implicit upcasts, lack of a real >> >> >>> "NA" scalar value, making pandas-specific methods like unique, >> >> >>> factorize, match, etc. true "array methods") >> >> >> >> >> >> >> >> >> New abstractions have a cost. 
A new logical data type abstraction is >> >> >> better than no proper abstraction at all, but (in principle), one data >> >> >> type >> >> >> abstraction should be enough to share. >> >> >> >> >> > >> >> >> >> >> >> A proper logical data type abstraction would be an improvement over the >> >> >> current situation, but if there's a way we could introduce one less >> >> >> abstraction (by improving things upstream in a general purpose array >> >> >> library) that would help even more. >> >> >> >> >> > >> >> > This is just pushing a problem upstream, which ultimately, given the >> >> > track >> >> > history of numpy, won't be solved at all. We will be here 1 year from >> >> > now >> >> > with the exact same discussion. Why are we waiting on upstream for >> >> > anything? >> >> > As I said above, if something is created which upstream finds useful on >> >> > a >> >> > general level. great. The great cost here is time. >> >> > >> >> >> >> >> >> For example, we could imagine pushing to make DyND the new core for >> >> >> pandas. This could be enough of a push to make DyND generally useful -- >> >> >> I >> >> >> know it still has a few kinks to work out. >> >> >> >> >> > >> >> > maybe, but DyND has to have full compat with what currently is out there >> >> > (soonish). Then I agree this could be possible. But wouldn't it be even >> >> > better >> >> > for pandas to be able to swap back-ends. Why limit ourselves to a >> >> > particular >> >> > backend if its not that difficult. >> >> > >> >> >> >> I think Jeff and I are on the same page here. 5 years ago we were >> >> having the *exact same* discussions around NumPy and adding new data >> >> type functionality. 5 years is a staggering amount of time in open >> >> source. It was less than 5 years between pandas not existing and being >> >> a super popular project with 2/3 of a best-selling O'Reilly book >> >> written about it. 
To whit, DyND exists in large part because of the >> >> difficulty in making progress within NumPy. >> >> >> >> Now, as 5 years ago, I think we should be acting in the best interests >> >> of pandas users, and what I've been describing is intended as a >> >> straightforward (though definitely labor intensive) and relatively >> >> low-risk plan that will "future-proof" the pandas user API for at >> >> least the next few years, and probably much longer. If we find that >> >> enabling some internals to use DyND is the right choice, we can do >> >> that in a non-invasive way while carefully minding data >> >> interoperability. Meaningful performance benefits would be a clear >> >> motivation. >> >> >> >> To be 100% open and transparent (in the spirit of pandas's new >> >> governance docs): Before committing to using DyND in any binding way >> >> (i.e. required, as opposed to opt-in) in pandas, I'd really like to >> >> see more evidence from 3rd parties without direct financial interest >> >> (i.e. employment or equity from Continuum) that DyND is "the future of >> >> Python array computing"; in the absence of significant user and >> >> community code contribution, it still feels like a political quagmire >> >> leftover from the Continuum-Enthought rift in 2011. >> >> >> >> - Wes >> >> >> >> >>> >> >> >>> 4) Give pandas objects a real C API so that users can manipulate and >> >> >>> create pandas objects with their own native (C/C++/Cython) code. >> >> >> >> >> >> >> >> >>> 5) Yes, absolutely improve NumPy and DyND and transition to improved >> >> >>> NumPy and DyND facilities as soon as they are available and shipped >> >> >> >> >> >> >> >> >> I like the sound of both of these. 
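Stephan's categorical-concatenation question, quoted nearby in this thread, was eventually answered inside pandas itself: union_categoricals was added in 0.19.0. A sketch using that later API (not something that existed when this email was written):

```python
import pandas as pd
from pandas.api.types import union_categoricals

a = pd.Categorical(["x", "y"])
b = pd.Categorical(["y", "z"])

# np.concatenate would fall back to plain object arrays here;
# union_categoricals merges the category sets and keeps the
# categorical dtype intact
combined = union_categoricals([a, b])
```

This is precisely the kind of "custom solution" the quoted text predicted would be needed for dtypes that NumPy does not model natively.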
>> >> >
>> >> > Further, you made a point above:
>> >> >
>> >> >> You are right that pandas has started to supplant numpy as a high
>> >> >> level API for data analysis, but of course the robust (and often
>> >> >> numpy-based) Python ecosystem is part of what has made pandas so
>> >> >> successful. In practice, ecosystem projects often want to work
>> >> >> with more primitive objects than series/dataframes in their
>> >> >> internal data structures, and without numpy this becomes more
>> >> >> difficult. For example, how do you concatenate a list of
>> >> >> categoricals? If these were numpy arrays, we could use
>> >> >> np.concatenate, but the current implementation of categorical
>> >> >> would require a custom solution. First-class compatibility with
>> >> >> pandas is harder when pandas data cannot be used with a full
>> >> >> ndarray API.
>> >> >
>> >> > I disagree entirely here. I think that Series/DataFrame ARE
>> >> > becoming primitive objects. Look at seaborn, statsmodels, and
>> >> > xarray. These are first-class users of these structures, who need
>> >> > the additional metadata attached.
>> >> >
>> >> > Yes, categoricals are useful in numpy, and it should support them.
>> >> > But lots of libraries can simply use pandas and do lots of really
>> >> > useful stuff. However, why reinvent the wheel and use numpy when
>> >> > you have DataFrames?
>> >> >
>> >> > From a user point of view, I don't think they even care about numpy
>> >> > (or whatever drives pandas). Pandas solves a very general problem
>> >> > of working with labeled data.
>> >> >
>> >> > Jeff
>> >
>
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev

-------------- next part --------------
An HTML attachment was scrubbed...
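The categorical-concatenation gap raised above was real at the time of
this thread; later pandas releases grew a dedicated helper for it. A
minimal sketch, assuming a recent pandas that exposes
`pandas.api.types.union_categoricals` (added after this thread), showing
why `np.concatenate` alone is not enough:

```python
import numpy as np
import pandas as pd
from pandas.api.types import union_categoricals

a = pd.Categorical(["x", "y", "y"])
b = pd.Categorical(["y", "z"])

# np.concatenate only sees the dense values: it produces a plain
# object-dtype ndarray and the categorical type is lost
dense = np.concatenate([np.asarray(a), np.asarray(b)])

# union_categoricals unions the two category sets and returns a
# Categorical, preserving the dtype end to end
combined = union_categoricals([a, b])

print(dense.dtype)              # object
print(combined.categories)      # categories of a, then new ones from b
```

The design point being argued is visible here: without a shared array
abstraction upstream, every container type (categorical, and later
datetime-with-tz, sparse, etc.) needs its own bespoke concatenation
routine.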
URL:

From izaid at continuum.io  Wed Mar 23 17:22:28 2016
From: izaid at continuum.io (Irwin Zaid)
Date: Wed, 23 Mar 2016 16:22:28 -0500
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To:
References:
Message-ID:

If it is of interest, I'll just mention that we recently split off
DyND's type system to be its own independent library -- libdyndt (for
the types) and libdynd (for the callables and array). Distributing them
separately still needs work, but the binaries are there.

DyND / Arrow compatibility would be interesting, and I'm always keen to
avoid duplication of effort.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From wesmckinn at gmail.com  Wed Mar 23 20:01:44 2016
From: wesmckinn at gmail.com (Wes McKinney)
Date: Wed, 23 Mar 2016 17:01:44 -0700
Subject: [Pandas-dev] Rewriting some of internals of pandas in C/C++? / Roadmap
In-Reply-To:
References:
Message-ID:

At the moment, I don't have plans for PyArrow that extend beyond being
the point of contact with systems that use Arrow natively. For example,
pandas users will soon be able to read and write Parquet format files
via pyarrow (which will handle the low-level conversion to/from
pandas's NumPy memory representation).

I'd like to continue the pandas refactoring / reorganization effort
(+ organizing deprecations) to be able to encapsulate pandas's
interactions with NumPy so that alternate backends become conceivable
at all (possibly in 2017-2018). I don't have a lot of bandwidth for
this until the second half of April at the earliest, though. Happy to
respond to inquiries in the meantime.

- Wes

On Wed, Mar 23, 2016 at 2:22 PM, Irwin Zaid wrote:
> If it is of interest, I'll just mention that we recently split off
> DyND's type system to be its own independent library -- libdyndt (for
> the types) and libdynd (for the callables and array). Distributing
> them separately still needs work, but the binaries are there.
> DyND / Arrow compatibility would be interesting, and I'm always keen
> to avoid duplication of effort.
>
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From wesmckinn at gmail.com  Tue Mar 29 03:50:08 2016
From: wesmckinn at gmail.com (Wes McKinney)
Date: Tue, 29 Mar 2016 00:50:08 -0700
Subject: [Pandas-dev] pandas microperformance, do we care?
Message-ID:

I've noticed there's been a slow degradation in pandas
microperformance as time has gone by. I looked into this when I found
that df.icol(i) has been deprecated in favor of df.iloc[:, i].

df = pd.DataFrame(np.random.randn(10, 5))

So here we go:

pandas v0.12

%timeit df.icol(2)
100000 loops, best of 3: 13.5 µs per loop

pandas v0.18

%timeit df.icol(2)
10000 loops, best of 3: 25.4 µs per loop

In [6]: timeit df.iloc[:, 2]
10000 loops, best of 3: 60.8 µs per loop

Once upon a time, I spent a lot of time shaving microseconds off some
of these data accessor methods. For example, pandas v0.12 again:

In [17]: s = df[2]

In [18]: timeit s.get_value(5)
1000000 loops, best of 3: 609 ns per loop

In [21]: timeit s[5]
1000000 loops, best of 3: 860 ns per loop

And pandas v0.18:

In [15]: timeit s.get_value(5)
100000 loops, best of 3: 7.17 µs per loop

In [16]: timeit s[5]
100000 loops, best of 3: 9.31 µs per loop

I understand that the performance was made worse in order to add
various layers of indirection to make new features available (and fix
bugs). I'm hoping, as part of looking at revamping pandas's internals
(and closing the gap to the "metal"), that we are able to tighten up
some of these "inner loop" methods, preferably back to pandas
0.12-level performance. It's true that writing a lot of Python
for-loops isn't optimal for lots of reasons, but we should avoid
overly penalizing users when this does happen.
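These measurements can be re-run today. A minimal sketch, assuming a
modern pandas where `icol()` and `get_value()` are gone and `.iloc` /
`.iat` are the supported accessors; absolute numbers are
machine-dependent, so none are asserted here:

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 5))
s = df[2]

n_col, n_scalar = 10_000, 100_000

# Column access: df.icol(2) was removed; df.iloc[:, 2] is the replacement
col_us = timeit.timeit(lambda: df.iloc[:, 2], number=n_col) / n_col * 1e6

# Scalar access: .iat is the fast positional path that replaced get_value()
scalar_ns = timeit.timeit(lambda: s.iat[5], number=n_scalar) / n_scalar * 1e9

print(f"df.iloc[:, 2]: {col_us:.1f} us per loop")
print(f"s.iat[5]:      {scalar_ns:.0f} ns per loop")
```

Running this on a current release is an easy way to check whether the
"inner loop" accessors have recovered toward the 0.12-era numbers quoted
above.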
Thanks,
Wes

-------------- next part --------------
An HTML attachment was scrubbed...
URL: