From jorisvandenbossche at gmail.com Sun Nov 12 12:24:46 2017
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Sun, 12 Nov 2017 18:24:46 +0100
Subject: [Pandas-dev] Online dev meeting - Wednesday 15th November 6pm UTC
Message-ID:

Hi all,

FYI, we are planning a dev meeting this coming Wednesday at 6-7pm UTC. If you are interested in joining, you are always welcome! (https://appear.in/pandas-dev)

Best,
Joris
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jorisvandenbossche at gmail.com Mon Nov 13 16:54:20 2017
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Mon, 13 Nov 2017 22:54:20 +0100
Subject: [Pandas-dev] Proposal to change PR update policy from rebase to merge
Message-ID:

Hi all,

Currently when PRs get outdated, we often ask to "rebase and update" (although to be honest, it is mainly Jeff who is doing most of the work of pinging stale PRs), and rebasing is also how it is explained in the docs (http://pandas-docs.github.io/pandas-docs-travis/contributing.html#creating-a-branch). Many active contributors also use rebasing while working on a PR to get in sync with changes in master.

I would like to propose changing this policy from rebasing to merging (= merging master into the feature branch, creating a merge commit).

Some reasons for this:

- I personally think this is easier to do (certainly for less experienced git users; conflicts can be easier to solve, ...)
- It makes it easier to follow what has changed (certainly if we extend 'not rebasing' with 'not squash rebasing'), making it easier to review
- It doesn't destroy links to GitHub PR comments
- Since we squash on merge in the end, we don't care about the additional merge commits in the PR's history

Thoughts on this?

If we agree on this, that would mean: update the docs + start doing it ourselves + start asking that of contributors consistently.

Regards,
Joris
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From shoyer at gmail.com Mon Nov 13 16:58:23 2017
From: shoyer at gmail.com (Stephan Hoyer)
Date: Mon, 13 Nov 2017 21:58:23 +0000
Subject: [Pandas-dev] Proposal to change PR update policy from rebase to merge
In-Reply-To: References: Message-ID:

+1 for merging instead of rebasing. Not losing comment history in PRs is a major bonus, and the end result (when we squash on merge) basically looks the same.

In fact, I would say with this workflow GitHub almost works as well as Gerrit ;).

On Mon, Nov 13, 2017 at 1:54 PM Joris Van den Bossche < jorisvandenbossche at gmail.com> wrote:

> Hi all,
>
> Currently when PRs get outdated, we often ask to "rebase and update"
> (although to be honest, it mainly Jeff that is doing most of this work to
> ping stale PRs), and rebasing is also how it is explained in the docs (
> http://pandas-docs.github.io/pandas-docs-travis/contributing.html#creating-a-branch).
>
> And also many active contributors use rebasing while working on a PR to
> get in sync with changes in master.
>
> I would like to propose to change this policy from rebasing to merging (=
> merging master in the feature branch, creating a merge commit).
>
> Some reasons for this:
>
> - I personally think this is easier to do (certainly for less experienced
> git users; conflicts can be easier to solve, ..)
> - It makes it easier to follow what has changed (certainly if we extend > 'not rebasing' with 'not squash rebasing'), making it easier review > - It doesn't destroy links to github PR comments > - Since we squash on merge in the end, we don't care about the additional > merge commits in the PR's history > > Thoughts on this? > > If we would agree on this, that would mean: update the docs + start doing > it ourselves + start asking that of contributors consistently. > > Regards, > Joris > > > > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tom.augspurger88 at gmail.com Mon Nov 13 17:02:43 2017 From: tom.augspurger88 at gmail.com (Tom Augspurger) Date: Mon, 13 Nov 2017 16:02:43 -0600 Subject: [Pandas-dev] Proposal to change PR update policy from rebase to merge In-Reply-To: References: Message-ID: Yes, recommending merging instead of rebasing seems OK now that Github has squash on merge. Tom On Mon, Nov 13, 2017 at 3:58 PM, Stephan Hoyer wrote: > +1 for merging instead of rebasing. Not losing comment history in PRs is a > major bonus, and the end result (when we squash on merge) is basically > looks the same. > > In fact, I would say with this workflow GitHub almost works as well as > Gerrit ;). > > On Mon, Nov 13, 2017 at 1:54 PM Joris Van den Bossche < > jorisvandenbossche at gmail.com> wrote: > >> Hi all, >> >> Currently when PRs get outdated, we often ask to "rebase and update" >> (although to be honest, it mainly Jeff that is doing most of this work to >> ping stale PRs), and rebasing is also how it is explained in the docs ( >> http://pandas-docs.github.io/pandas-docs-travis/ >> contributing.html#creating-a-branch). >> And also many active contributors use rebasing while working on a PR to >> get in sync with changes in master. >> >> I would like to propose to change this policy from rebasing to merging (= >> merging master in the feature branch, creating a merge commit). >> >> Some reasons for this: >> >> - I personally think this is easier to do (certainly for less experienced >> git users; conflicts can be easier to solve, ..) >> - It makes it easier to follow what has changed (certainly if we extend >> 'not rebasing' with 'not squash rebasing'), making it easier review >> - It doesn't destroy links to github PR comments >> - Since we squash on merge in the end, we don't care about the additional >> merge commits in the PR's history >> >> Thoughts on this? >> >> If we would agree on this, that would mean: update the docs + start doing >> it ourselves + start asking that of contributors consistently. >> >> Regards, >> Joris >> >> >> >> >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From jeffreback at gmail.com Mon Nov 13 17:05:18 2017
From: jeffreback at gmail.com (Jeff Reback)
Date: Mon, 13 Nov 2017 17:05:18 -0500
Subject: [Pandas-dev] Proposal to change PR update policy from rebase to merge
In-Reply-To: References: Message-ID:

i don't care what people actually do in their own branches; though i find rebase much easier to read.

as long as github squashes then it's fine.

the issue is that when i need to look at their branches locally they are always a mess and very hard to follow.

i would still recommend rebasing.

On Nov 13, 2017, at 5:02 PM, Tom Augspurger wrote:
>
> Yes, recommending merging instead of rebasing seems OK now that Github has squash on merge.
>
> Tom
>
>> On Mon, Nov 13, 2017 at 3:58 PM, Stephan Hoyer wrote:
>> +1 for merging instead of rebasing. Not losing comment history in PRs is a major bonus, and the end result (when we squash on merge) is basically looks the same.
>>
>> In fact, I would say with this workflow GitHub almost works as well as Gerrit ;).
>>
>>> On Mon, Nov 13, 2017 at 1:54 PM Joris Van den Bossche wrote:
>>> Hi all,
>>>
>>> Currently when PRs get outdated, we often ask to "rebase and update" (although to be honest, it mainly Jeff that is doing most of this work to ping stale PRs), and rebasing is also how it is explained in the docs (http://pandas-docs.github.io/pandas-docs-travis/contributing.html#creating-a-branch).
>>> And also many active contributors use rebasing while working on a PR to get in sync with changes in master.
>>>
>>> I would like to propose to change this policy from rebasing to merging (= merging master in the feature branch, creating a merge commit).
>>>
>>> Some reasons for this:
>>>
>>> - I personally think this is easier to do (certainly for less experienced git users; conflicts can be easier to solve, ..)
>>> - It makes it easier to follow what has changed (certainly if we extend 'not rebasing' with 'not squash rebasing'), making it easier review
>>> - It doesn't destroy links to github PR comments
>>> - Since we squash on merge in the end, we don't care about the additional merge commits in the PR's history
>>>
>>> Thoughts on this?
>>>
>>> If we would agree on this, that would mean: update the docs + start doing it ourselves + start asking that of contributors consistently.
>>>
>>> Regards,
>>> Joris
>>>
>>> _______________________________________________
>>> Pandas-dev mailing list
>>> Pandas-dev at python.org
>>> https://mail.python.org/mailman/listinfo/pandas-dev
>>
>> _______________________________________________
>> Pandas-dev mailing list
>> Pandas-dev at python.org
>> https://mail.python.org/mailman/listinfo/pandas-dev
>>
>
> _______________________________________________
> Pandas-dev mailing list
> Pandas-dev at python.org
> https://mail.python.org/mailman/listinfo/pandas-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jorisvandenbossche at gmail.com Mon Nov 13 17:49:47 2017
From: jorisvandenbossche at gmail.com (Joris Van den Bossche)
Date: Mon, 13 Nov 2017 23:49:47 +0100
Subject: [Pandas-dev] Proposal to change PR update policy from rebase to merge
In-Reply-To: References: Message-ID:

2017-11-13 23:05 GMT+01:00 Jeff Reback :

> i don't care what people actually do in their own branches;
>

The point is a bit that I actually *do* care about what you do in your branches.
I would find it easier for the PRs I am reviewing that people would add commits (and merge master to sync with latest changes) than to rebase and amend or squash commits, or add commit + rebase. > though i find rebase much easier to read > > as long as github squashes then it?s fine > > the issue is that when i need to look at there branches locally they are > always a mess and very hard to follow > Can you give an example of what is hard? Eg if the branch is out of date, I typically just merge master in it. > i would still recommend rebasing > > On Nov 13, 2017, at 5:02 PM, Tom Augspurger > wrote: > > Yes, recommending merging instead of rebasing seems OK now that Github has > squash on merge. > > Tom > > On Mon, Nov 13, 2017 at 3:58 PM, Stephan Hoyer wrote: > >> +1 for merging instead of rebasing. Not losing comment history in PRs is >> a major bonus, and the end result (when we squash on merge) is basically >> looks the same. >> >> In fact, I would say with this workflow GitHub almost works as well as >> Gerrit ;). >> >> On Mon, Nov 13, 2017 at 1:54 PM Joris Van den Bossche < >> jorisvandenbossche at gmail.com> wrote: >> >>> Hi all, >>> >>> Currently when PRs get outdated, we often ask to "rebase and update" >>> (although to be honest, it mainly Jeff that is doing most of this work to >>> ping stale PRs), and rebasing is also how it is explained in the docs ( >>> http://pandas-docs.github.io/pandas-docs-travis/contributin >>> g.html#creating-a-branch). >>> And also many active contributors use rebasing while working on a PR to >>> get in sync with changes in master. >>> >>> I would like to propose to change this policy from rebasing to merging >>> (= merging master in the feature branch, creating a merge commit). >>> >>> Some reasons for this: >>> >>> - I personally think this is easier to do (certainly for less >>> experienced git users; conflicts can be easier to solve, ..) >>> - It makes it easier to follow what has changed (certainly if we extend >>> 'not rebasing' with 'not squash rebasing'), making it easier review >>> - It doesn't destroy links to github PR comments >>> - Since we squash on merge in the end, we don't care about the >>> additional merge commits in the PR's history >>> >>> Thoughts on this? >>> >>> If we would agree on this, that would mean: update the docs + start >>> doing it ourselves + start asking that of contributors consistently. >>> >>> Regards, >>> Joris >>> >>> >>> >>> >>> _______________________________________________ >>> Pandas-dev mailing list >>> Pandas-dev at python.org >>> https://mail.python.org/mailman/listinfo/pandas-dev >>> >> >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> >> > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From wesmckinn at gmail.com Mon Nov 13 23:34:53 2017 From: wesmckinn at gmail.com (Wes McKinney) Date: Mon, 13 Nov 2017 23:34:53 -0500 Subject: [Pandas-dev] Proposal to change PR update policy from rebase to merge In-Reply-To: References: Message-ID: I think as long as clean atomic commits end up in master (which the merge/squash tool takes care of) then whatever is convenient for the contributor is fine. I personally prefer clean rebases but some users struggle with rebasing. - Wes On Mon, Nov 13, 2017 at 5:49 PM, Joris Van den Bossche wrote: > 2017-11-13 23:05 GMT+01:00 Jeff Reback : >> >> i don?t care what people actually do in there own branches; > > > The point is a bit that I actually do care about what you do in your > branches. I would find it easier for the PRs I am reviewing that people > would add commits (and merge master to sync with latest changes) than to > rebase and amend or squash commits, or add commit + rebase. > >> >> though i find rebase much easier to read >> >> as long as github squashes then it?s fine >> >> the issue is that when i need to look at there branches locally they are >> always a mess and very hard to follow > > > Can you give an example of what is hard? Eg if the branch is out of date, I > typically just merge master in it. > >> >> i would still recommend rebasing >> >> On Nov 13, 2017, at 5:02 PM, Tom Augspurger >> wrote: >> >> Yes, recommending merging instead of rebasing seems OK now that Github has >> squash on merge. >> >> Tom >> >> On Mon, Nov 13, 2017 at 3:58 PM, Stephan Hoyer wrote: >>> >>> +1 for merging instead of rebasing. Not losing comment history in PRs is >>> a major bonus, and the end result (when we squash on merge) is basically >>> looks the same. >>> >>> In fact, I would say with this workflow GitHub almost works as well as >>> Gerrit ;). >>> >>> On Mon, Nov 13, 2017 at 1:54 PM Joris Van den Bossche >>> wrote: >>>> >>>> Hi all, >>>> >>>> Currently when PRs get outdated, we often ask to "rebase and update" >>>> (although to be honest, it mainly Jeff that is doing most of this work to >>>> ping stale PRs), and rebasing is also how it is explained in the docs >>>> (http://pandas-docs.github.io/pandas-docs-travis/contributing.html#creating-a-branch). >>>> And also many active contributors use rebasing while working on a PR to >>>> get in sync with changes in master. >>>> >>>> I would like to propose to change this policy from rebasing to merging >>>> (= merging master in the feature branch, creating a merge commit). >>>> >>>> Some reasons for this: >>>> >>>> - I personally think this is easier to do (certainly for less >>>> experienced git users; conflicts can be easier to solve, ..) >>>> - It makes it easier to follow what has changed (certainly if we extend >>>> 'not rebasing' with 'not squash rebasing'), making it easier review >>>> - It doesn't destroy links to github PR comments >>>> - Since we squash on merge in the end, we don't care about the >>>> additional merge commits in the PR's history >>>> >>>> Thoughts on this? >>>> >>>> If we would agree on this, that would mean: update the docs + start >>>> doing it ourselves + start asking that of contributors consistently. 
>>>> >>>> Regards, >>>> Joris >>>> >>>> >>>> >>>> >>>> _______________________________________________ >>>> Pandas-dev mailing list >>>> Pandas-dev at python.org >>>> https://mail.python.org/mailman/listinfo/pandas-dev >>> >>> >>> _______________________________________________ >>> Pandas-dev mailing list >>> Pandas-dev at python.org >>> https://mail.python.org/mailman/listinfo/pandas-dev >>> >> >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> >> >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > From jorisvandenbossche at gmail.com Tue Nov 14 09:49:58 2017 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Tue, 14 Nov 2017 15:49:58 +0100 Subject: [Pandas-dev] Proposal to change PR update policy from rebase to merge In-Reply-To: References: Message-ID: Another advantage (IMO) that I forgot: - when a contributor adds a commit to a branch instead of squash+rebasing/amending new additions, you get notified of that by github. In that way, I know the contributor has actually pushed updates related to my review, and I know I have to look again at the PR (otherwise I have to check the PR from time to time to see if the contributor pushed new changes). 2017-11-14 5:34 GMT+01:00 Wes McKinney : > I think as long as clean atomic commits end up in master (which the > merge/squash tool takes care of) then whatever is convenient for the > contributor is fine. I personally prefer clean rebases but some users > struggle with rebasing. My preference to not rebase is not only from a contributor point of view, but mainly from reviewer point of view. So my point is basically that, for me personally, this "whathever is convenient for the contributor" is not fine for me as a reviewer. But of course, if the different reviewers don't share this preference, we can't ask something specific from the contributors. That's the reason I opened this discussion to see if there would be agreement. > - Wes > > On Mon, Nov 13, 2017 at 5:49 PM, Joris Van den Bossche > wrote: > > 2017-11-13 23:05 GMT+01:00 Jeff Reback : > >> > >> i don?t care what people actually do in there own branches; > > > > > > The point is a bit that I actually do care about what you do in your > > branches. I would find it easier for the PRs I am reviewing that people > > would add commits (and merge master to sync with latest changes) than to > > rebase and amend or squash commits, or add commit + rebase. > > > >> > >> though i find rebase much easier to read > >> > >> as long as github squashes then it?s fine > >> > >> the issue is that when i need to look at there branches locally they are > >> always a mess and very hard to follow > > > > > > Can you give an example of what is hard? Eg if the branch is out of > date, I > > typically just merge master in it. > > > >> > >> i would still recommend rebasing > >> > >> On Nov 13, 2017, at 5:02 PM, Tom Augspurger > > >> wrote: > >> > >> Yes, recommending merging instead of rebasing seems OK now that Github > has > >> squash on merge. > >> > >> Tom > >> > >> On Mon, Nov 13, 2017 at 3:58 PM, Stephan Hoyer > wrote: > >>> > >>> +1 for merging instead of rebasing. 
Not losing comment history in PRs > is > >>> a major bonus, and the end result (when we squash on merge) is > basically > >>> looks the same. > >>> > >>> In fact, I would say with this workflow GitHub almost works as well as > >>> Gerrit ;). > >>> > >>> On Mon, Nov 13, 2017 at 1:54 PM Joris Van den Bossche > >>> wrote: > >>>> > >>>> Hi all, > >>>> > >>>> Currently when PRs get outdated, we often ask to "rebase and update" > >>>> (although to be honest, it mainly Jeff that is doing most of this > work to > >>>> ping stale PRs), and rebasing is also how it is explained in the docs > >>>> (http://pandas-docs.github.io/pandas-docs-travis/ > contributing.html#creating-a-branch). > >>>> And also many active contributors use rebasing while working on a PR > to > >>>> get in sync with changes in master. > >>>> > >>>> I would like to propose to change this policy from rebasing to merging > >>>> (= merging master in the feature branch, creating a merge commit). > >>>> > >>>> Some reasons for this: > >>>> > >>>> - I personally think this is easier to do (certainly for less > >>>> experienced git users; conflicts can be easier to solve, ..) > >>>> - It makes it easier to follow what has changed (certainly if we > extend > >>>> 'not rebasing' with 'not squash rebasing'), making it easier review > >>>> - It doesn't destroy links to github PR comments > >>>> - Since we squash on merge in the end, we don't care about the > >>>> additional merge commits in the PR's history > >>>> > >>>> Thoughts on this? > >>>> > >>>> If we would agree on this, that would mean: update the docs + start > >>>> doing it ourselves + start asking that of contributors consistently. > >>>> > >>>> Regards, > >>>> Joris > >>>> > >>>> > >>>> > >>>> > >>>> _______________________________________________ > >>>> Pandas-dev mailing list > >>>> Pandas-dev at python.org > >>>> https://mail.python.org/mailman/listinfo/pandas-dev > >>> > >>> > >>> _______________________________________________ > >>> Pandas-dev mailing list > >>> Pandas-dev at python.org > >>> https://mail.python.org/mailman/listinfo/pandas-dev > >>> > >> > >> _______________________________________________ > >> Pandas-dev mailing list > >> Pandas-dev at python.org > >> https://mail.python.org/mailman/listinfo/pandas-dev > >> > >> > >> _______________________________________________ > >> Pandas-dev mailing list > >> Pandas-dev at python.org > >> https://mail.python.org/mailman/listinfo/pandas-dev > >> > > > > > > _______________________________________________ > > Pandas-dev mailing list > > Pandas-dev at python.org > > https://mail.python.org/mailman/listinfo/pandas-dev > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wesmckinn at gmail.com Tue Nov 14 10:55:12 2017 From: wesmckinn at gmail.com (Wes McKinney) Date: Tue, 14 Nov 2017 10:55:12 -0500 Subject: [Pandas-dev] Proposal to change PR update policy from rebase to merge In-Reply-To: References: Message-ID: Having now experienced Gerrit and other professional code review tools, it's very hard for me to see GitHub's review in anything but a negative light. I haven't found any version of contributor behavior to make incremental reviews any easier -- this "has the contributor updated the PR" is a problem that is completely solved / a non-issue in tools like Gerrit (the other issues you cited also do not exist in Gerrit). Since I'm not actively maintaining pandas PRs, the solution that makes your lives as maintainers easiest is OK by me. 
It's too bad we're in a position of advocating "messy" PR branches in order to improve the UX of code reviews in GitHub. FWIW, I talked with GitHub employees in person about these exact sort of problems in October and I don't think it's likely to improve anytime soon. - Wes On Tue, Nov 14, 2017 at 9:49 AM, Joris Van den Bossche wrote: > Another advantage (IMO) that I forgot: > > - when a contributor adds a commit to a branch instead of > squash+rebasing/amending new additions, you get notified of that by github. > In that way, I know the contributor has actually pushed updates related to > my review, and I know I have to look again at the PR (otherwise I have to > check the PR from time to time to see if the contributor pushed new > changes). > > 2017-11-14 5:34 GMT+01:00 Wes McKinney : >> >> I think as long as clean atomic commits end up in master (which the >> merge/squash tool takes care of) then whatever is convenient for the >> contributor is fine. I personally prefer clean rebases but some users >> struggle with rebasing. > > > My preference to not rebase is not only from a contributor point of view, > but mainly from reviewer point of view. > So my point is basically that, for me personally, this "whathever is > convenient for the contributor" is not fine for me as a reviewer. > > But of course, if the different reviewers don't share this preference, we > can't ask something specific from the contributors. That's the reason I > opened this discussion to see if there would be agreement. > >> >> - Wes >> >> On Mon, Nov 13, 2017 at 5:49 PM, Joris Van den Bossche >> wrote: >> > 2017-11-13 23:05 GMT+01:00 Jeff Reback : >> >> >> >> i don?t care what people actually do in there own branches; >> > >> > >> > The point is a bit that I actually do care about what you do in your >> > branches. I would find it easier for the PRs I am reviewing that people >> > would add commits (and merge master to sync with latest changes) than to >> > rebase and amend or squash commits, or add commit + rebase. >> > >> >> >> >> though i find rebase much easier to read >> >> >> >> as long as github squashes then it?s fine >> >> >> >> the issue is that when i need to look at there branches locally they >> >> are >> >> always a mess and very hard to follow >> > >> > >> > Can you give an example of what is hard? Eg if the branch is out of >> > date, I >> > typically just merge master in it. >> > >> >> >> >> i would still recommend rebasing >> >> >> >> On Nov 13, 2017, at 5:02 PM, Tom Augspurger >> >> >> >> wrote: >> >> >> >> Yes, recommending merging instead of rebasing seems OK now that Github >> >> has >> >> squash on merge. >> >> >> >> Tom >> >> >> >> On Mon, Nov 13, 2017 at 3:58 PM, Stephan Hoyer >> >> wrote: >> >>> >> >>> +1 for merging instead of rebasing. Not losing comment history in PRs >> >>> is >> >>> a major bonus, and the end result (when we squash on merge) is >> >>> basically >> >>> looks the same. >> >>> >> >>> In fact, I would say with this workflow GitHub almost works as well as >> >>> Gerrit ;). >> >>> >> >>> On Mon, Nov 13, 2017 at 1:54 PM Joris Van den Bossche >> >>> wrote: >> >>>> >> >>>> Hi all, >> >>>> >> >>>> Currently when PRs get outdated, we often ask to "rebase and update" >> >>>> (although to be honest, it mainly Jeff that is doing most of this >> >>>> work to >> >>>> ping stale PRs), and rebasing is also how it is explained in the docs >> >>>> >> >>>> (http://pandas-docs.github.io/pandas-docs-travis/contributing.html#creating-a-branch). 
>> >>>> And also many active contributors use rebasing while working on a PR >> >>>> to >> >>>> get in sync with changes in master. >> >>>> >> >>>> I would like to propose to change this policy from rebasing to >> >>>> merging >> >>>> (= merging master in the feature branch, creating a merge commit). >> >>>> >> >>>> Some reasons for this: >> >>>> >> >>>> - I personally think this is easier to do (certainly for less >> >>>> experienced git users; conflicts can be easier to solve, ..) >> >>>> - It makes it easier to follow what has changed (certainly if we >> >>>> extend >> >>>> 'not rebasing' with 'not squash rebasing'), making it easier review >> >>>> - It doesn't destroy links to github PR comments >> >>>> - Since we squash on merge in the end, we don't care about the >> >>>> additional merge commits in the PR's history >> >>>> >> >>>> Thoughts on this? >> >>>> >> >>>> If we would agree on this, that would mean: update the docs + start >> >>>> doing it ourselves + start asking that of contributors consistently. >> >>>> >> >>>> Regards, >> >>>> Joris >> >>>> >> >>>> >> >>>> >> >>>> >> >>>> _______________________________________________ >> >>>> Pandas-dev mailing list >> >>>> Pandas-dev at python.org >> >>>> https://mail.python.org/mailman/listinfo/pandas-dev >> >>> >> >>> >> >>> _______________________________________________ >> >>> Pandas-dev mailing list >> >>> Pandas-dev at python.org >> >>> https://mail.python.org/mailman/listinfo/pandas-dev >> >>> >> >> >> >> _______________________________________________ >> >> Pandas-dev mailing list >> >> Pandas-dev at python.org >> >> https://mail.python.org/mailman/listinfo/pandas-dev >> >> >> >> >> >> _______________________________________________ >> >> Pandas-dev mailing list >> >> Pandas-dev at python.org >> >> https://mail.python.org/mailman/listinfo/pandas-dev >> >> >> > >> > >> > _______________________________________________ >> > Pandas-dev mailing list >> > Pandas-dev at python.org >> > https://mail.python.org/mailman/listinfo/pandas-dev >> > > > From shoyer at gmail.com Tue Nov 14 11:46:23 2017 From: shoyer at gmail.com (Stephan Hoyer) Date: Tue, 14 Nov 2017 16:46:23 +0000 Subject: [Pandas-dev] Proposal to change PR update policy from rebase to merge In-Reply-To: References: Message-ID: The GitHub pull request model does have some virtues: there is a 1-1 relationship between git commits on a branch and what you see on the pull request. This makes it easier to review/edit a change's history without the review tool, and for multiple users to work on a single changes at once -- as long as contributors stick to adding new commits and merging in updates from master instead of rebasing. The PR interface itself is designed around this, e.g., you can click "Changes since my last review" (or any particular change), get email notifications when new commits are pushed, etc. It would be nice to have the option of Gerrit's single commit model, but as long as we are stuck on GitHub we should use the process that works well with the tool. On Tue, Nov 14, 2017 at 7:56 AM Wes McKinney wrote: > Having now experienced Gerrit and other professional code review > tools, it's very hard for me to see GitHub's review in anything but a > negative light. I haven't found any version of contributor behavior to > make incremental reviews any easier -- this "has the contributor > updated the PR" is a problem that is completely solved / a non-issue > in tools like Gerrit (the other issues you cited also do not exist in > Gerrit). 
Since I'm not actively maintaining pandas PRs, the solution > that makes your lives as maintainers easiest is OK by me. > > It's too bad we're in a position of advocating "messy" PR branches in > order to improve the UX of code reviews in GitHub. FWIW, I talked with > GitHub employees in person about these exact sort of problems in > October and I don't think it's likely to improve anytime soon. > > - Wes > > On Tue, Nov 14, 2017 at 9:49 AM, Joris Van den Bossche > wrote: > > Another advantage (IMO) that I forgot: > > > > - when a contributor adds a commit to a branch instead of > > squash+rebasing/amending new additions, you get notified of that by > github. > > In that way, I know the contributor has actually pushed updates related > to > > my review, and I know I have to look again at the PR (otherwise I have to > > check the PR from time to time to see if the contributor pushed new > > changes). > > > > 2017-11-14 5:34 GMT+01:00 Wes McKinney : > >> > >> I think as long as clean atomic commits end up in master (which the > >> merge/squash tool takes care of) then whatever is convenient for the > >> contributor is fine. I personally prefer clean rebases but some users > >> struggle with rebasing. > > > > > > My preference to not rebase is not only from a contributor point of view, > > but mainly from reviewer point of view. > > So my point is basically that, for me personally, this "whathever is > > convenient for the contributor" is not fine for me as a reviewer. > > > > But of course, if the different reviewers don't share this preference, we > > can't ask something specific from the contributors. That's the reason I > > opened this discussion to see if there would be agreement. > > > >> > >> - Wes > >> > >> On Mon, Nov 13, 2017 at 5:49 PM, Joris Van den Bossche > >> wrote: > >> > 2017-11-13 23:05 GMT+01:00 Jeff Reback : > >> >> > >> >> i don?t care what people actually do in there own branches; > >> > > >> > > >> > The point is a bit that I actually do care about what you do in your > >> > branches. I would find it easier for the PRs I am reviewing that > people > >> > would add commits (and merge master to sync with latest changes) than > to > >> > rebase and amend or squash commits, or add commit + rebase. > >> > > >> >> > >> >> though i find rebase much easier to read > >> >> > >> >> as long as github squashes then it?s fine > >> >> > >> >> the issue is that when i need to look at there branches locally they > >> >> are > >> >> always a mess and very hard to follow > >> > > >> > > >> > Can you give an example of what is hard? Eg if the branch is out of > >> > date, I > >> > typically just merge master in it. > >> > > >> >> > >> >> i would still recommend rebasing > >> >> > >> >> On Nov 13, 2017, at 5:02 PM, Tom Augspurger > >> >> > >> >> wrote: > >> >> > >> >> Yes, recommending merging instead of rebasing seems OK now that > Github > >> >> has > >> >> squash on merge. > >> >> > >> >> Tom > >> >> > >> >> On Mon, Nov 13, 2017 at 3:58 PM, Stephan Hoyer > >> >> wrote: > >> >>> > >> >>> +1 for merging instead of rebasing. Not losing comment history in > PRs > >> >>> is > >> >>> a major bonus, and the end result (when we squash on merge) is > >> >>> basically > >> >>> looks the same. > >> >>> > >> >>> In fact, I would say with this workflow GitHub almost works as well > as > >> >>> Gerrit ;). 
> >> >>> > >> >>> On Mon, Nov 13, 2017 at 1:54 PM Joris Van den Bossche > >> >>> wrote: > >> >>>> > >> >>>> Hi all, > >> >>>> > >> >>>> Currently when PRs get outdated, we often ask to "rebase and > update" > >> >>>> (although to be honest, it mainly Jeff that is doing most of this > >> >>>> work to > >> >>>> ping stale PRs), and rebasing is also how it is explained in the > docs > >> >>>> > >> >>>> ( > http://pandas-docs.github.io/pandas-docs-travis/contributing.html#creating-a-branch > ). > >> >>>> And also many active contributors use rebasing while working on a > PR > >> >>>> to > >> >>>> get in sync with changes in master. > >> >>>> > >> >>>> I would like to propose to change this policy from rebasing to > >> >>>> merging > >> >>>> (= merging master in the feature branch, creating a merge commit). > >> >>>> > >> >>>> Some reasons for this: > >> >>>> > >> >>>> - I personally think this is easier to do (certainly for less > >> >>>> experienced git users; conflicts can be easier to solve, ..) > >> >>>> - It makes it easier to follow what has changed (certainly if we > >> >>>> extend > >> >>>> 'not rebasing' with 'not squash rebasing'), making it easier review > >> >>>> - It doesn't destroy links to github PR comments > >> >>>> - Since we squash on merge in the end, we don't care about the > >> >>>> additional merge commits in the PR's history > >> >>>> > >> >>>> Thoughts on this? > >> >>>> > >> >>>> If we would agree on this, that would mean: update the docs + start > >> >>>> doing it ourselves + start asking that of contributors > consistently. > >> >>>> > >> >>>> Regards, > >> >>>> Joris > >> >>>> > >> >>>> > >> >>>> > >> >>>> > >> >>>> _______________________________________________ > >> >>>> Pandas-dev mailing list > >> >>>> Pandas-dev at python.org > >> >>>> https://mail.python.org/mailman/listinfo/pandas-dev > >> >>> > >> >>> > >> >>> _______________________________________________ > >> >>> Pandas-dev mailing list > >> >>> Pandas-dev at python.org > >> >>> https://mail.python.org/mailman/listinfo/pandas-dev > >> >>> > >> >> > >> >> _______________________________________________ > >> >> Pandas-dev mailing list > >> >> Pandas-dev at python.org > >> >> https://mail.python.org/mailman/listinfo/pandas-dev > >> >> > >> >> > >> >> _______________________________________________ > >> >> Pandas-dev mailing list > >> >> Pandas-dev at python.org > >> >> https://mail.python.org/mailman/listinfo/pandas-dev > >> >> > >> > > >> > > >> > _______________________________________________ > >> > Pandas-dev mailing list > >> > Pandas-dev at python.org > >> > https://mail.python.org/mailman/listinfo/pandas-dev > >> > > > > > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tom.augspurger88 at gmail.com Thu Nov 16 09:22:32 2017 From: tom.augspurger88 at gmail.com (Tom Augspurger) Date: Thu, 16 Nov 2017 08:22:32 -0600 Subject: [Pandas-dev] Label and Milestone cleanup Message-ID: We ran out of time to discuss this in the dev meeting. I want to clean up our Github labels and milestones. ## Milestones I would like to have 4 real milestones: 1. "Next Major Release" for API breaking changes that we'd like to do eventually 2. "Next Release" For bugfix / non-API breaking changes that we'd like to do eventually 3. "0.22.0" (i.e. the actual next major release) for PRs that should go into a specific release 4. 
"0.21.1" (i.e. the actual next minor release) for PRs that should go into a specific release The main change is the "Next Release", which I hope will clear up confusion about what file the release notes should go in. I think we can remove "Interesting Issues", remove "High Level Issue Tracking", and move "won't fix" into "No action". I'm not sure about "1.0" and "2.0" and "Someday", but maybe consolidate those. ## Labels I would like to - Start tagging "Easy" issues with "good first issue". Github gives some prominence to this tag in their UI. - Remove some of the less frequently used issues like "Closed PR, Multi Dimensional, etc." - Start using the "Needs Info" tag more often for incomplete bug reports, and regularly close issues that have been tagged as "Needs Info" and not updated in more that a couple weeks. Thoughts? Objections? Tom -------------- next part -------------- An HTML attachment was scrubbed... URL: From gfyoung17 at gmail.com Thu Nov 16 12:29:35 2017 From: gfyoung17 at gmail.com (G Young) Date: Thu, 16 Nov 2017 09:29:35 -0800 Subject: [Pandas-dev] Label and Milestone cleanup In-Reply-To: References: Message-ID: IMO "Won't fix" and "No action" were both good indicators of deliberate non-action on a PR / issue. We can consolidate those, but I wouldn't remove both. "Someday" doesn't look like it's used anymore nowadays, so I wouldn't mind removing that. As for tagging in general, I think our classification are a little incomplete (e.g. there is no tag for general "DataFrame" issues or their methods). On Thu, Nov 16, 2017 at 6:22 AM, Tom Augspurger wrote: > We ran out of time to discuss this in the dev meeting. I want to clean up > our Github labels and milestones. > > ## Milestones > > I would like to have 4 real milestones: > > 1. "Next Major Release" for API breaking changes that we'd like to do > eventually > 2. "Next Release" For bugfix / non-API breaking changes that we'd like to > do eventually > 3. "0.22.0" (i.e. the actual next major release) for PRs that should go > into a specific release > 4. "0.21.1" (i.e. the actual next minor release) for PRs that should go > into a specific release > > The main change is the "Next Release", which I hope will clear up > confusion about what file the release notes should go in. I think we can > remove "Interesting Issues", remove "High Level Issue Tracking", and move > "won't fix" into "No action". I'm not sure about "1.0" and "2.0" and > "Someday", but maybe consolidate those. > > ## Labels > > I would like to > > - Start tagging "Easy" issues with "good first issue". Github gives some > prominence to this tag in their UI. > - Remove some of the less frequently used issues like "Closed PR, Multi > Dimensional, etc." > - Start using the "Needs Info" tag more often for incomplete bug reports, > and regularly close issues that have been tagged as "Needs Info" and not > updated in more that a couple weeks. > > Thoughts? Objections? > > > Tom > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jeffreback at gmail.com Thu Nov 16 18:43:56 2017 From: jeffreback at gmail.com (Jeff Reback) Date: Thu, 16 Nov 2017 18:43:56 -0500 Subject: [Pandas-dev] Label and Milestone cleanup In-Reply-To: References: Message-ID: <56D136E9-4CDA-4283-9A2D-C5B1270254F1@gmail.com> someday can go ok with your tags/milestones otherwise though leave the interesting ones - these are ones that i tagged - will move them after we realign things (then i will remove) > On Nov 16, 2017, at 12:29 PM, G Young wrote: > > IMO "Won't fix" and "No action" were both good indicators of deliberate non-action on a PR / issue. We can consolidate those, but I wouldn't remove both. > > "Someday" doesn't look like it's used anymore nowadays, so I wouldn't mind removing that. > > As for tagging in general, I think our classification are a little incomplete (e.g. there is no tag for general "DataFrame" issues or their methods). > >> On Thu, Nov 16, 2017 at 6:22 AM, Tom Augspurger wrote: >> We ran out of time to discuss this in the dev meeting. I want to clean up our Github labels and milestones. >> >> ## Milestones >> >> I would like to have 4 real milestones: >> >> 1. "Next Major Release" for API breaking changes that we'd like to do eventually >> 2. "Next Release" For bugfix / non-API breaking changes that we'd like to do eventually >> 3. "0.22.0" (i.e. the actual next major release) for PRs that should go into a specific release >> 4. "0.21.1" (i.e. the actual next minor release) for PRs that should go into a specific release >> >> The main change is the "Next Release", which I hope will clear up confusion about what file the release notes should go in. I think we can remove "Interesting Issues", remove "High Level Issue Tracking", and move "won't fix" into "No action". I'm not sure about "1.0" and "2.0" and "Someday", but maybe consolidate those. >> >> ## Labels >> >> I would like to >> >> - Start tagging "Easy" issues with "good first issue". Github gives some prominence to this tag in their UI. >> - Remove some of the less frequently used issues like "Closed PR, Multi Dimensional, etc." >> - Start using the "Needs Info" tag more often for incomplete bug reports, and regularly close issues that have been tagged as "Needs Info" and not updated in more that a couple weeks. >> >> Thoughts? Objections? >> >> >> Tom >> >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Fri Nov 17 09:49:13 2017 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Fri, 17 Nov 2017 15:49:13 +0100 Subject: [Pandas-dev] Label and Milestone cleanup In-Reply-To: <56D136E9-4CDA-4283-9A2D-C5B1270254F1@gmail.com> References: <56D136E9-4CDA-4283-9A2D-C5B1270254F1@gmail.com> Message-ID: Already answering for the Milestones: ## Milestones >> >> I would like to have 4 real milestones: >> >> 1. "Next Major Release" for API breaking changes that we'd like to do >> eventually >> 2. "Next Release" For bugfix / non-API breaking changes that we'd like to >> do eventually >> 3. "0.22.0" (i.e. the actual next major release) for PRs that should go >> into a specific release >> 4. "0.21.1" (i.e. 
the actual next minor release) for PRs that should go >> into a specific release >> >> The main change is the "Next Release", which I hope will clear up >> confusion about what file the release notes should go in. >> > Would the issues in "Next release" be those issues that, if somebody does a PR, can be included in eg 0.21.1, but are not important to tag them as such? > I think we can remove "Interesting Issues", >> > > Jeff: though leave the interesting ones - these are ones that i tagged - > will move them after we realign things (then i will remove) > You can make a "Jeff's interesting issues" project, and add them to that. I think that is a better use (you still can look at the list of issues you want to specifically tag for attention) and it keeps the Milestones clean for actual milestones. > remove "High Level Issue Tracking", >> > + 1 since we a "master issue" label as well. > and move "won't fix" into "No action". I'm not sure about "1.0" and "2.0" >> and "Someday", but maybe consolidate those. >> > I personally find the notion of "won't fix" informative, but we could also give it a label "won't fix" ? (and then use a single "No action" milestone) I would keep the 1.0 milestone for now. I think "Someday" was originally intended to give some difference in prioritization for the core-devs between "Next major release" and "Someday"? But I think we don't really do it like that, so for me ok to remove. I would suppose that 'no milestone' would then be that. Joris 2017-11-17 0:43 GMT+01:00 Jeff Reback : > someday can g > > ok with your tags/milestones otherwise > > though leave the interesting ones - these are ones that i tagged - will > move them after we realign things (then i will remove) > You can make a "Jeff's interesting issues" project, and add them to that. I think that is a better use (you still can look at the list of issues) and it keeps the Milestones clean for actual milestones. > > > On Nov 16, 2017, at 12:29 PM, G Young wrote: > > IMO "Won't fix" and "No action" were both good indicators of deliberate > non-action on a PR / issue. We can consolidate those, but I wouldn't > remove both. > > "Someday" doesn't look like it's used anymore nowadays, so I wouldn't mind > removing that. > > As for tagging in general, I think our classification are a little > incomplete (e.g. there is no tag for general "DataFrame" issues or their > methods). > > On Thu, Nov 16, 2017 at 6:22 AM, Tom Augspurger < > tom.augspurger88 at gmail.com> wrote: > >> We ran out of time to discuss this in the dev meeting. I want to clean up >> our Github labels and milestones. >> >> ## Milestones >> >> I would like to have 4 real milestones: >> >> 1. "Next Major Release" for API breaking changes that we'd like to do >> eventually >> 2. "Next Release" For bugfix / non-API breaking changes that we'd like to >> do eventually >> 3. "0.22.0" (i.e. the actual next major release) for PRs that should go >> into a specific release >> 4. "0.21.1" (i.e. the actual next minor release) for PRs that should go >> into a specific release >> >> The main change is the "Next Release", which I hope will clear up >> confusion about what file the release notes should go in. I think we can >> remove "Interesting Issues", remove "High Level Issue Tracking", and move >> "won't fix" into "No action". I'm not sure about "1.0" and "2.0" and >> "Someday", but maybe consolidate those. >> >> ## Labels >> >> I would like to >> >> - Start tagging "Easy" issues with "good first issue". 
Github gives some >> prominence to this tag in their UI. >> - Remove some of the less frequently used issues like "Closed PR, Multi >> Dimensional, etc." >> - Start using the "Needs Info" tag more often for incomplete bug reports, >> and regularly close issues that have been tagged as "Needs Info" and not >> updated in more that a couple weeks. >> >> Thoughts? Objections? >> >> >> Tom >> >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> >> > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jeffreback at gmail.com Fri Nov 17 10:09:28 2017 From: jeffreback at gmail.com (Jeff Reback) Date: Fri, 17 Nov 2017 10:09:28 -0500 Subject: [Pandas-dev] Label and Milestone cleanup In-Reply-To: References: <56D136E9-4CDA-4283-9A2D-C5B1270254F1@gmail.com> Message-ID: <45D17807-FEE2-4167-82C8-1D8BE050E7DB@gmail.com> On Nov 17, 2017, at 9:49 AM, Joris Van den Bossche wrote: > > Already answering for the Milestones: > >>> ## Milestones >>> >>> I would like to have 4 real milestones: >>> >>> 1. "Next Major Release" for API breaking changes that we'd like to do eventually >>> 2. "Next Release" For bugfix / non-API breaking changes that we'd like to do eventually >>> 3. "0.22.0" (i.e. the actual next major release) for PRs that should go into a specific release >>> 4. "0.21.1" (i.e. the actual next minor release) for PRs that should go into a specific release >>> >>> The main change is the "Next Release", which I hope will clear up confusion about what file the release notes should go in. > > Would the issues in "Next release" be those issues that, if somebody does a PR, can be included in eg 0.21.1, but are not important to tag them as such? > Here?s the issue (pun intended). We cannot tag things for a specific milestone unless they are close to being merged; because things get stale, not worked on etc it is a heavy burden to then move all things from a specific milestone to the next one (and it?s just plain confusing) basically we have 3 pots issues for a specific major milestone issues for a specific minor milestone everything else > >>> I think we can remove "Interesting Issues", > >> Jeff: though leave the interesting ones - these are ones that i tagged - will move them after we realign things (then i will remove) > > You can make a "Jeff's interesting issues" project, and add them to that. I think that is a better use (you still can look at the list of issues you want to specifically tag for attention) and it keeps the Milestones clean for actual milestones. > >>> remove "High Level Issue Tracking", > > + 1 since we a "master issue" label as well. > >>> and move "won't fix" into "No action". I'm not sure about "1.0" and "2.0" and "Someday", but maybe consolidate those. > > I personally find the notion of "won't fix" informative, but we could also give it a label "won't fix" ? (and then use a single "No action" milestone) > > I would keep the 1.0 milestone for now. > > I think "Someday" was originally intended to give some difference in prioritization for the core-devs between "Next major release" and "Someday"? 
But I think we don't really do it like that, so for me ok to remove. I would suppose that 'no milestone' would then be that. > > Joris > > 2017-11-17 0:43 GMT+01:00 Jeff Reback : >> someday can g >> >> ok with your tags/milestones otherwise >> >> though leave the interesting ones - these are ones that i tagged - will move them after we realign things (then i will remove) > > You can make a "Jeff's interesting issues" project, and add them to that. I think that is a better use (you still can look at the list of issues) and it keeps the Milestones clean for actual milestones. > >> >> >>> On Nov 16, 2017, at 12:29 PM, G Young wrote: >>> >>> IMO "Won't fix" and "No action" were both good indicators of deliberate non-action on a PR / issue. We can consolidate those, but I wouldn't remove both. >>> >>> "Someday" doesn't look like it's used anymore nowadays, so I wouldn't mind removing that. >>> >>> As for tagging in general, I think our classification are a little incomplete (e.g. there is no tag for general "DataFrame" issues or their methods). >>> >>>> On Thu, Nov 16, 2017 at 6:22 AM, Tom Augspurger wrote: >>>> We ran out of time to discuss this in the dev meeting. I want to clean up our Github labels and milestones. >>>> >>>> ## Milestones >>>> >>>> I would like to have 4 real milestones: >>>> >>>> 1. "Next Major Release" for API breaking changes that we'd like to do eventually >>>> 2. "Next Release" For bugfix / non-API breaking changes that we'd like to do eventually >>>> 3. "0.22.0" (i.e. the actual next major release) for PRs that should go into a specific release >>>> 4. "0.21.1" (i.e. the actual next minor release) for PRs that should go into a specific release >>>> >>>> The main change is the "Next Release", which I hope will clear up confusion about what file the release notes should go in. I think we can remove "Interesting Issues", remove "High Level Issue Tracking", and move "won't fix" into "No action". I'm not sure about "1.0" and "2.0" and "Someday", but maybe consolidate those. >>>> >>>> ## Labels >>>> >>>> I would like to >>>> >>>> - Start tagging "Easy" issues with "good first issue". Github gives some prominence to this tag in their UI. >>>> - Remove some of the less frequently used issues like "Closed PR, Multi Dimensional, etc." >>>> - Start using the "Needs Info" tag more often for incomplete bug reports, and regularly close issues that have been tagged as "Needs Info" and not updated in more that a couple weeks. >>>> >>>> Thoughts? Objections? >>>> >>>> >>>> Tom >>>> >>>> _______________________________________________ >>>> Pandas-dev mailing list >>>> Pandas-dev at python.org >>>> https://mail.python.org/mailman/listinfo/pandas-dev >>>> >>> >>> _______________________________________________ >>> Pandas-dev mailing list >>> Pandas-dev at python.org >>> https://mail.python.org/mailman/listinfo/pandas-dev >> >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aivar.annamaa at gmail.com Sat Nov 18 03:21:26 2017 From: aivar.annamaa at gmail.com (Aivar Annamaa) Date: Sat, 18 Nov 2017 10:21:26 +0200 Subject: [Pandas-dev] Pandas lite Message-ID: Hi! I'm going to teach an introduction to pandas to Python newbies, and I'm looking for ways to simplify the the view to the API and/or avoid some of the pitfalls. 
I'd like to identify a minimal set of methods/operations, which are enough for performing most common tasks with simply-indexed data (importing/exporting from csv/Excel, selecting rows and columns by index, boolean indexing of the rows, creating new columns, simple group-by and aggregations, simple plotting, maybe also simple joins) and which have minimal potential for surprises (unexpected copies, unexpected views, confusing warnings, differences with indexing with lists vs tuples etc). Maybe even allowing only "pure" transformations a la relational algebra? We could call it an opinionated and restricted usage-scheme of pandas. The students would use this subset of the API until they gain enough experience to meet the hairier face of pandas. Has anybody tried marking a subset of pandas API for some reasons? I was also thinking about how to enforce the boundaries of this subset: * Just suggest students to stick with it. * Provide a static analysis which disallows (or warns against) the operations/tricks outside the boundaries. * a wrapper library (eg. import pandaslite as pd) which wraps required pandas classes into similar classes which publish only a subset of the pandas capabilities and perform some extra checks (eg. disallow duplicates in the index). When the students grow tough enough or need more power, they would simply replace "import pandaslite as pd" with "import pandas as pd" in their code. At the moment I'm considering experimenting with the third approach. I'd be glad to hear your comments! best regards, Aivar Annamaa -------------- next part -------------- An HTML attachment was scrubbed... URL: From tom.augspurger88 at gmail.com Sun Nov 19 08:04:32 2017 From: tom.augspurger88 at gmail.com (Tom Augspurger) Date: Sun, 19 Nov 2017 07:04:32 -0600 Subject: [Pandas-dev] Pandas lite In-Reply-To: References: Message-ID: I'm not aware of any attempts to do that. Personally, I would recommend the first option as it's the least amount of work, and the least likely to force them to unlearn anything. I suppose another option would be to have them import a module that just monkey patches what shows up in dir(Series/DataFrame) so that tab-completion is less overwhelming. Good luck. - Tom On Sat, Nov 18, 2017 at 2:21 AM, Aivar Annamaa wrote: > Hi! > > I'm going to teach an introduction to pandas to Python newbies, and I'm > looking for ways to simplify the the view to the API and/or avoid some of > the pitfalls. > > I'd like to identify a minimal set of methods/operations, which are enough > for performing most common tasks with simply-indexed data > (importing/exporting from csv/Excel, selecting rows and columns by index, > boolean indexing of the rows, creating new columns, simple group-by and > aggregations, simple plotting, maybe also simple joins) and which have > minimal potential for surprises (unexpected copies, unexpected views, > confusing warnings, differences with indexing with lists vs tuples etc). > Maybe even allowing only "pure" transformations a la relational algebra? We > could call it an opinionated and restricted usage-scheme of pandas. > > The students would use this subset of the API until they gain enough > experience to meet the hairier face of pandas. > > Has anybody tried marking a subset of pandas API for some reasons? > > I was also thinking about how to enforce the boundaries of this subset: > > - Just suggest students to stick with it. > - Provide a static analysis which disallows (or warns against) the > operations/tricks outside the boundaries. 
> - a wrapper library (eg. import pandaslite as pd) which wraps required > pandas classes into similar classes which publish only a subset of the > pandas capabilities and perform some extra checks (eg. disallow duplicates > in the index). When the students grow tough enough or need more power, they > would simply replace "import pandaslite as pd" with "import pandas as pd" > in their code. > > At the moment I'm considering experimenting with the third approach. > > I'd be glad to hear your comments! > > best regards, > Aivar Annamaa > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Sun Nov 19 08:47:12 2017 From: njs at pobox.com (Nathaniel Smith) Date: Sun, 19 Nov 2017 05:47:12 -0800 Subject: [Pandas-dev] Pandas lite In-Reply-To: References: Message-ID: The 'datascience' package is an attempt to solve this problem -- it's a simplified wrapper around pandas written for use in the intro to data science sequence at UC Berkeley: http://data8.org/datascience/ I'm not involved in the course or that package myself, so I can't say how well it's worked. It's definitely worth checking out as prior art, though, and you might find it useful to contact the authors to compare notes. -n On Sat, Nov 18, 2017 at 12:21 AM, Aivar Annamaa wrote: > Hi! > > I'm going to teach an introduction to pandas to Python newbies, and I'm > looking for ways to simplify the the view to the API and/or avoid some of > the pitfalls. > > I'd like to identify a minimal set of methods/operations, which are enough > for performing most common tasks with simply-indexed data > (importing/exporting from csv/Excel, selecting rows and columns by index, > boolean indexing of the rows, creating new columns, simple group-by and > aggregations, simple plotting, maybe also simple joins) and which have > minimal potential for surprises (unexpected copies, unexpected views, > confusing warnings, differences with indexing with lists vs tuples etc). > Maybe even allowing only "pure" transformations a la relational algebra? We > could call it an opinionated and restricted usage-scheme of pandas. > > The students would use this subset of the API until they gain enough > experience to meet the hairier face of pandas. > > Has anybody tried marking a subset of pandas API for some reasons? > > I was also thinking about how to enforce the boundaries of this subset: > > Just suggest students to stick with it. > Provide a static analysis which disallows (or warns against) the > operations/tricks outside the boundaries. > a wrapper library (eg. import pandaslite as pd) which wraps required pandas > classes into similar classes which publish only a subset of the pandas > capabilities and perform some extra checks (eg. disallow duplicates in the > index). When the students grow tough enough or need more power, they would > simply replace "import pandaslite as pd" with "import pandas as pd" in their > code. > > At the moment I'm considering experimenting with the third approach. > > I'd be glad to hear your comments! > > best regards, > Aivar Annamaa > > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -- Nathaniel J. 
Smith -- https://vorpus.org From pmhobson at gmail.com Mon Nov 27 20:21:39 2017 From: pmhobson at gmail.com (Paul Hobson) Date: Mon, 27 Nov 2017 17:21:39 -0800 Subject: [Pandas-dev] Help replacing workflows that used DataFrame.select Message-ID: Hey folks, I noticed that DataFrame.select is now deprecated in favor of DataFrame.loc[index.map(selector_fxn)] PR: https://github.com/pandas-dev/pandas/pull/17633 Issue: https://github.com/pandas-dev/pandas/issues/12401 I have a lot of work flows that look something like this: res = ( data.resample(freq) .agg(agg_dict) .pipe(fxn_that_adds_many_cols) .select(complex_fxn_that_selects_a_few_cols, axis='columns') ) It's not immediately clear to me how to access all of the e.g., columns in the middle or at the end of a chain of dataframe operations. Any tips? -Paul -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Tue Nov 28 05:31:21 2017 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Tue, 28 Nov 2017 11:31:21 +0100 Subject: [Pandas-dev] Help replacing workflows that used DataFrame.select In-Reply-To: References: Message-ID: Hi Paul, That's a good question. I think you can do it with a lambda function, like this: (data. ... (full pipeline) .loc[:, lambda df: complex_fxn_that_selects_a_few_cols(df.columns)] ) Does that work? But personally I am not sure if I find this really an usability improvement compared to the select method. Best, Joris 2017-11-28 2:21 GMT+01:00 Paul Hobson : > Hey folks, > > I noticed that DataFrame.select is now deprecated in favor of > DataFrame.loc[index.map(selector_fxn)] > > PR: https://github.com/pandas-dev/pandas/pull/17633 > Issue: https://github.com/pandas-dev/pandas/issues/12401 > > I have a lot of work flows that look something like this: > > res = ( > data.resample(freq) > .agg(agg_dict) > .pipe(fxn_that_adds_many_cols) > .select(complex_fxn_that_selects_a_few_cols, axis='columns') > ) > > It's not immediately clear to me how to access all of the e.g., columns in > the middle or at the end of a chain of dataframe operations. > > Any tips? > > -Paul > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pmhobson at gmail.com Tue Nov 28 11:52:17 2017 From: pmhobson at gmail.com (Paul Hobson) Date: Tue, 28 Nov 2017 08:52:17 -0800 Subject: [Pandas-dev] Help replacing workflows that used DataFrame.select In-Reply-To: References: Message-ID: Joris, Thanks for the nudge. I didn't understand that the callable could be passed the entire dataframe. That's what I needed. I'll miss the .select() method when it's gone, but it appears my use cases are covered. Cheers, -Paul On Tue, Nov 28, 2017 at 2:31 AM, Joris Van den Bossche < jorisvandenbossche at gmail.com> wrote: > Hi Paul, > > That's a good question. I think you can do it with a lambda function, like > this: > > (data. > ... (full pipeline) > .loc[:, lambda df: complex_fxn_that_selects_a_few_cols(df.columns)] > ) > > Does that work? > > But personally I am not sure if I find this really an usability > improvement compared to the select method. 
> > Best, > Joris > > > > 2017-11-28 2:21 GMT+01:00 Paul Hobson : > >> Hey folks, >> >> I noticed that DataFrame.select is now deprecated in favor of >> DataFrame.loc[index.map(selector_fxn)] >> >> PR: https://github.com/pandas-dev/pandas/pull/17633 >> Issue: https://github.com/pandas-dev/pandas/issues/12401 >> >> I have a lot of work flows that look something like this: >> >> res = ( >> data.resample(freq) >> .agg(agg_dict) >> .pipe(fxn_that_adds_many_cols) >> .select(complex_fxn_that_selects_a_few_cols, axis='columns') >> ) >> >> It's not immediately clear to me how to access all of the e.g., columns >> in the middle or at the end of a chain of dataframe operations. >> >> Any tips? >> >> -Paul >> >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pmhobson at gmail.com Tue Nov 28 11:56:35 2017 From: pmhobson at gmail.com (Paul Hobson) Date: Tue, 28 Nov 2017 08:56:35 -0800 Subject: [Pandas-dev] Help replacing workflows that used DataFrame.select In-Reply-To: References: Message-ID: Follow-up question for the whole group: My recollection is that .loc returns a slice, but .select returns a copy. Is this correct? Are there any implications of that distinction with long, chained workflows switching away from .select? -Paul On Tue, Nov 28, 2017 at 8:52 AM, Paul Hobson wrote: > Joris, > > Thanks for the nudge. I didn't understand that the callable could be > passed the entire dataframe. That's what I needed. > > I'll miss the .select() method when it's gone, but it appears my use cases > are covered. > > Cheers, > > -Paul > > On Tue, Nov 28, 2017 at 2:31 AM, Joris Van den Bossche < > jorisvandenbossche at gmail.com> wrote: > >> Hi Paul, >> >> That's a good question. I think you can do it with a lambda function, >> like this: >> >> (data. >> ... (full pipeline) >> .loc[:, lambda df: complex_fxn_that_selects_a_few_cols(df.columns)] >> ) >> >> Does that work? >> >> But personally I am not sure if I find this really an usability >> improvement compared to the select method. >> >> Best, >> Joris >> >> >> >> 2017-11-28 2:21 GMT+01:00 Paul Hobson : >> >>> Hey folks, >>> >>> I noticed that DataFrame.select is now deprecated in favor of >>> DataFrame.loc[index.map(selector_fxn)] >>> >>> PR: https://github.com/pandas-dev/pandas/pull/17633 >>> Issue: https://github.com/pandas-dev/pandas/issues/12401 >>> >>> I have a lot of work flows that look something like this: >>> >>> res = ( >>> data.resample(freq) >>> .agg(agg_dict) >>> .pipe(fxn_that_adds_many_cols) >>> .select(complex_fxn_that_selects_a_few_cols, axis='columns') >>> ) >>> >>> It's not immediately clear to me how to access all of the e.g., columns >>> in the middle or at the end of a chain of dataframe operations. >>> >>> Any tips? >>> >>> -Paul >>> >>> _______________________________________________ >>> Pandas-dev mailing list >>> Pandas-dev at python.org >>> https://mail.python.org/mailman/listinfo/pandas-dev >>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Tue Nov 28 13:07:31 2017 From: shoyer at gmail.com (Stephan Hoyer) Date: Tue, 28 Nov 2017 18:07:31 +0000 Subject: [Pandas-dev] Help replacing workflows that used DataFrame.select In-Reply-To: References: Message-ID: The biggest reason for deprecating DataFrame.select() was that it was confusingly named. 
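To make the contrast concrete, here is a minimal, hypothetical example (the column names are invented) showing the deprecated call next to the replacements discussed in this thread:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  columns=["amount", "area", "width", "height"])

# Deprecated spelling (emits a FutureWarning since 0.21):
# df.select(lambda col: col.startswith("a"), axis=1)

# Replacement with .loc and a callable, as suggested earlier in the thread:
df.loc[:, lambda d: d.columns.str.startswith("a")]

# For simple label-based selection, .filter already defaults to the columns:
df.filter(regex="^a")
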
On GroupBy objects, it's equivalent to .filter(). Also, SELECT in SQL does something very different, more like DataFrame.filter(). If only we could simply switch the names without causing more confusion! So I think we would potentially be welcome to resurfacing the functionality if necessary, though probably under a different name. For discussion see https://github.com/pandas-dev/pandas/issues/12401 On Tue, Nov 28, 2017 at 4:52 PM Paul Hobson wrote: > Joris, > > Thanks for the nudge. I didn't understand that the callable could be > passed the entire dataframe. That's what I needed. > > I'll miss the .select() method when it's gone, but it appears my use cases > are covered. > > Cheers, > > -Paul > > On Tue, Nov 28, 2017 at 2:31 AM, Joris Van den Bossche < > jorisvandenbossche at gmail.com> wrote: > >> Hi Paul, >> >> That's a good question. I think you can do it with a lambda function, >> like this: >> >> (data. >> ... (full pipeline) >> .loc[:, lambda df: complex_fxn_that_selects_a_few_cols(df.columns)] >> ) >> >> Does that work? >> >> But personally I am not sure if I find this really an usability >> improvement compared to the select method. >> >> Best, >> Joris >> >> >> >> 2017-11-28 2:21 GMT+01:00 Paul Hobson : >> >>> Hey folks, >>> >>> I noticed that DataFrame.select is now deprecated in favor of >>> DataFrame.loc[index.map(selector_fxn)] >>> >>> PR: https://github.com/pandas-dev/pandas/pull/17633 >>> Issue: https://github.com/pandas-dev/pandas/issues/12401 >>> >>> I have a lot of work flows that look something like this: >>> >>> res = ( >>> data.resample(freq) >>> .agg(agg_dict) >>> .pipe(fxn_that_adds_many_cols) >>> .select(complex_fxn_that_selects_a_few_cols, axis='columns') >>> ) >>> >>> It's not immediately clear to me how to access all of the e.g., columns >>> in the middle or at the end of a chain of dataframe operations. >>> >>> Any tips? >>> >>> -Paul >>> >>> _______________________________________________ >>> Pandas-dev mailing list >>> Pandas-dev at python.org >>> https://mail.python.org/mailman/listinfo/pandas-dev >>> >>> >> > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pmhobson at gmail.com Tue Nov 28 13:34:14 2017 From: pmhobson at gmail.com (Paul Hobson) Date: Tue, 28 Nov 2017 10:34:14 -0800 Subject: [Pandas-dev] Help replacing workflows that used DataFrame.select In-Reply-To: References: Message-ID: Hey Stephen, Thanks for the info. While .select on the default axis (index) is indeed very different than SQL, operating on the columns is very similar (jn my twisted brain at least). I totally understand the deprecation, and remember rumblings about poor performance back in the early days. So I can't even say I'm surprised. If the devs find time to bring similar functionality back, might I suggest a name+sig as simple as DataFrame.keep_only(index=None, columns=None). Just to reiterate: I'm not complaining. I'm just trying to keep up :) -paul On Tue, Nov 28, 2017 at 10:07 AM, Stephan Hoyer wrote: > The biggest reason for deprecating DataFrame.select() was that it was > confusingly named. On GroupBy objects, it's equivalent to .filter(). Also, > SELECT in SQL does something very different, more like DataFrame.filter(). > If only we could simply switch the names without causing more confusion! 
> > So I think we would potentially be welcome to resurfacing the > functionality if necessary, though probably under a different name. For > discussion see https://github.com/pandas-dev/pandas/issues/12401 > > On Tue, Nov 28, 2017 at 4:52 PM Paul Hobson wrote: > >> Joris, >> >> Thanks for the nudge. I didn't understand that the callable could be >> passed the entire dataframe. That's what I needed. >> >> I'll miss the .select() method when it's gone, but it appears my use >> cases are covered. >> >> Cheers, >> >> -Paul >> >> On Tue, Nov 28, 2017 at 2:31 AM, Joris Van den Bossche < >> jorisvandenbossche at gmail.com> wrote: >> >>> Hi Paul, >>> >>> That's a good question. I think you can do it with a lambda function, >>> like this: >>> >>> (data. >>> ... (full pipeline) >>> .loc[:, lambda df: complex_fxn_that_selects_a_few_cols(df.columns)] >>> ) >>> >>> Does that work? >>> >>> But personally I am not sure if I find this really an usability >>> improvement compared to the select method. >>> >>> Best, >>> Joris >>> >>> >>> >>> 2017-11-28 2:21 GMT+01:00 Paul Hobson : >>> >>>> Hey folks, >>>> >>>> I noticed that DataFrame.select is now deprecated in favor of >>>> DataFrame.loc[index.map(selector_fxn)] >>>> >>>> PR: https://github.com/pandas-dev/pandas/pull/17633 >>>> Issue: https://github.com/pandas-dev/pandas/issues/12401 >>>> >>>> I have a lot of work flows that look something like this: >>>> >>>> res = ( >>>> data.resample(freq) >>>> .agg(agg_dict) >>>> .pipe(fxn_that_adds_many_cols) >>>> .select(complex_fxn_that_selects_a_few_cols, >>>> axis='columns') >>>> ) >>>> >>>> It's not immediately clear to me how to access all of the e.g., columns >>>> in the middle or at the end of a chain of dataframe operations. >>>> >>>> Any tips? >>>> >>>> -Paul >>>> >>>> _______________________________________________ >>>> Pandas-dev mailing list >>>> Pandas-dev at python.org >>>> https://mail.python.org/mailman/listinfo/pandas-dev >>>> >>>> >>> >> _______________________________________________ >> Pandas-dev mailing list >> Pandas-dev at python.org >> https://mail.python.org/mailman/listinfo/pandas-dev >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shoyer at gmail.com Tue Nov 28 13:58:05 2017 From: shoyer at gmail.com (Stephan Hoyer) Date: Tue, 28 Nov 2017 18:58:05 +0000 Subject: [Pandas-dev] Help replacing workflows that used DataFrame.select In-Reply-To: References: Message-ID: On Tue, Nov 28, 2017 at 6:34 PM Paul Hobson wrote: > Thanks for the info. While .select on the default axis (index) is indeed > very different than SQL, operating on the columns is very similar (jn my > twisted brain at least). > Agreed, but sadly .select() didn't default to axis=1. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Tue Nov 28 18:28:00 2017 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Wed, 29 Nov 2017 00:28:00 +0100 Subject: [Pandas-dev] Help replacing workflows that used DataFrame.select In-Reply-To: References: Message-ID: Would there be a way in keeping .select() but only deprecating the (default) `axis=0` ? Or would that only be more confusing? Because if we would find a name for such a method that defaults to the columns, we would come up with 'select' ... 2017-11-28 19:58 GMT+01:00 Stephan Hoyer : > On Tue, Nov 28, 2017 at 6:34 PM Paul Hobson wrote: > >> Thanks for the info. 
While .select on the default axis (index) is indeed >> very different than SQL, operating on the columns is very similar (jn my >> twisted brain at least). >> > > Agreed, but sadly .select() didn't default to axis=1. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorisvandenbossche at gmail.com Thu Nov 30 20:09:10 2017 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Fri, 1 Dec 2017 02:09:10 +0100 Subject: [Pandas-dev] Feedback request for return value of empty or all-NA sum (0 or NA?) Message-ID: *[Note for those reading it on the pydata mailing list, please answer to pandas-dev at python.org to keep discussion centralised there]* Hi list, In pandas 0.21.0 we changed the behaviour of the sum method for empty or all-NaN Series (to consistently return NaN), see the what's note . This change lead to some discussion on github whether this was the right choice we made. But the reach of github is of course limited, and therefore we wanted to solicit some more feedback on the mailing list. Below is given an overview of the background of the issue and the different options. Please keep in mind that we are not really interested in theoretical reasons why one of the other option is better or more correct. Each of the options has it advantages / disadvantages in practice. But it would be very interesting to hear the consequences in actual example analysis pipelines. Best, Joris Background Before pandas 0.21.0, the behaviour of the sum of an all-NA Series depended on whether the optional bottleneck dependency was installed. This inconsistency was in place since the bottleneck 1.0.0 release (February 2015), and you can read more background on it in the github issue #9422 . With bottleneck, the sum of all-NA was zero; without bottleneck, the sum was NaN. In [2]: pd.__version__ Out[2]: '0.20.3' In [3]: pd.options.compute.use_bottleneck = True In [4]: Series([np.nan]).sum() Out[4]: 0.0 In [5]: pd.options.compute.use_bottleneck = False In [6]: Series([np.nan]).sum() Out[6]: nan The sum of an empty series was always 0, with or without bottleneck. In [7]: Series([]).sum() Out[7]: 0 For pandas 0.21, we wanted to fix this inconsistency. The return value should not depend on whether an optional dependency is installed. After a lengthy discussion, we opted for the original pandas behaviour to return NaN. As a result, also the sum of an empty Series was changed to return NaN (see the what?s new notice here ): In [2]: pd.__version__ Out[2]: '0.21.0' In [3]: pd.Series([np.nan]).sum() Out[3]: nan In [4]: pd.Series([]).sum() Out[4]: nan However, after the 0.21.0 release more feedback was received about cases where this choice is not desirable, and due to this feedback, we are reconsidering the decision. Options We see three different options for the default behaviour of sum for those two cases of empty and all-NA series: 1. Empty / all-NA sum is always zero: SUM([]) = 0 and SUM([NA]) = 0 - Behaviour of pandas < 0.21 + bottleneck installed - Consistent with NumPy, R, MATLAB, etc. (given you use the variant that is NA aware: nansum for numpy, na.rm=TRUE for R, ...) 1. Empty / all-NA sum is always NA: SUM([]) = NA and SUM([NA]) = NA - The behaviour that is introduced in 0.21.0 - Consistent with SQL (although often (rightly or not) complained about) 1. 
Mixed behaviour: SUM([]) = 0 and SUM([NA]) = NA - Behaviour of pandas < 0.21 (without bottleneck installed) - A practicable compromise (having SUM([NA]) keep the information of NA, while SUM([]) = 0 does not introduce NAs when there were no in the data) - But somewhat inconsistent and unique to pandas ? We have to stress that each of those choices can be preferable depending on the use case and has its advantages and disadvantages. Some might be more mathematical sound, others might preserve more information about having missing data, each can be be more consistent with a certain ecosystem, ? It is clear that there is no ?best? option for all case. While we can only choose one of those options as the default behaviour, each choice can be accompanied by new features that can make it easier for the user to opt for a different behaviour: - When choosing option 1 or 2, we can introduce a new method (eg .total()) or a keyword to .sum() (eg min_count) to obtain the other behaviour. - When choosing for option 2, we could provide a pd.zeroifna(..) to be able to convert NaN values from aggregation results into zero?s if desired (similar to COALESCE(expr, 0) in SQL) -------------- next part -------------- An HTML attachment was scrubbed... URL: From clemens.brunner at gmail.com Tue Nov 28 05:57:39 2017 From: clemens.brunner at gmail.com (Clemens Brunner) Date: Tue, 28 Nov 2017 11:57:39 +0100 Subject: [Pandas-dev] Changing the default max_columns and max_rows Message-ID: <034945C6-2D66-4F51-ACA5-50DC01DDDA71@gmail.com> Hello! We're currently discussing a change in how data frames are displayed by default in https://github.com/pandas-dev/pandas/pull/17023. There are two proposed changes: (1) Set pd.options.display.max_columns=0 (previously this was set to 20). (2) Set pd.options.display.max_rows=20 (previously this was set to 60). Change (1) means that the number of printed columns is adapted to fit within the width of the terminal. If there are too many columns, ellipsis will be shown to indicate collapsed columns in the middle of the data frame. This doesn't work if Python is run as a Jupyter kernel (e.g. in a Jupyter notebook or in IPython QtConsole), in which case the maximum columns remain 20. Example: ======== import pandas as pd import numpy as np pd.DataFrame(np.random.rand(5, 10)) Output before (in a terminal with 100 chars width): --------------------------------------------------- 0 1 2 3 4 5 6 \ 0 0.643979 0.690414 0.018603 0.991478 0.707534 0.376765 0.670848 1 0.547836 0.810972 0.054448 0.415112 0.268120 0.904528 0.839258 2 0.582256 0.732149 0.284208 0.405197 0.213591 0.715367 0.150106 3 0.197348 0.317159 0.051669 0.738405 0.821046 0.179270 0.245793 4 0.483466 0.583330 0.999213 0.882883 0.315169 0.045712 0.897048 7 8 9 0 0.891467 0.494220 0.713369 1 0.601304 0.449880 0.266205 2 0.113262 0.360580 0.238833 3 0.798063 0.077769 0.471169 4 0.262779 0.530565 0.992084 Output after: ------------- 0 1 2 3 ... 6 7 8 9 0 0.673621 0.211505 0.943201 0.946548 ... 0.900453 0.612182 0.861933 0.710967 1 0.670855 0.834449 0.796273 0.785976 ... 0.609954 0.686663 0.684582 0.837505 2 0.544736 0.814827 0.352893 0.459556 ... 0.650993 0.735943 0.279110 0.840203 3 0.440125 0.554323 0.745462 0.940896 ... 0.544576 0.224175 0.852603 0.509837 4 0.225551 0.791834 0.476059 0.321857 ... 0.391165 0.423213 0.290683 0.954423 [5 rows x 10 columns] Change (2) implies fewer rows are displayed before auto-hiding takes place. 
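For anyone who wants to preview the proposed defaults before anything changes, both settings can already be adjusted in a running session with the existing options API; a small illustrative snippet:

import pandas as pd

# Preview the proposed defaults:
pd.set_option("display.max_columns", 0)   # 0 = auto-detect the terminal width
pd.set_option("display.max_rows", 20)

# Revert to the shipped defaults at any time:
pd.reset_option("display.max_columns")
pd.reset_option("display.max_rows")
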
I find that 60 rows almost always causes the terminal to scroll (most terminals have between 25-40 rows), so reducing the value to 20 increases the chance that a data frame can be observed on one terminal page. I'm not including a before/after output since it should be easy to imagine how this change affects the output. Both changes would make Pandas behave similar to R's Tidyverse (which I really like), but this should not be the main reason why these changes are a good idea. I mainly like them because these settings make (large) data frames much nicer to look at. Note that these changes affect the default values. Of course, users are free to change them back in their active Python session. Comments to both proposed changes are highly welcome (either here on the mailing list or at https://github.com/pandas-dev/pandas/pull/17023. Clemens From jon.mease at gmail.com Tue Nov 28 19:29:44 2017 From: jon.mease at gmail.com (Jon Mease) Date: Tue, 28 Nov 2017 19:29:44 -0500 Subject: [Pandas-dev] Help replacing workflows that used DataFrame.select In-Reply-To: References: Message-ID: Perhaps for versions 0.21.1 and 0.22 a warning could be issued when .select() is used without an explicit `axis` parameter. The warning would state that the current default is `axis=0` but that this will change to `axis=1` in the next major release. If the user wants the current default behavior then they could suppress the warning and future-proof their code by passing `axis=0` explicitly. -Jon On Tue, Nov 28, 2017 at 6:28 PM, Joris Van den Bossche < jorisvandenbossche at gmail.com> wrote: > Would there be a way in keeping .select() but only deprecating the > (default) `axis=0` ? Or would that only be more confusing? > > Because if we would find a name for such a method that defaults to the > columns, we would come up with 'select' ... > > 2017-11-28 19:58 GMT+01:00 Stephan Hoyer : > >> On Tue, Nov 28, 2017 at 6:34 PM Paul Hobson wrote: >> >>> Thanks for the info. While .select on the default axis (index) is indeed >>> very different than SQL, operating on the columns is very similar (jn my >>> twisted brain at least). >>> >> >> Agreed, but sadly .select() didn't default to axis=1. >> > > > _______________________________________________ > Pandas-dev mailing list > Pandas-dev at python.org > https://mail.python.org/mailman/listinfo/pandas-dev > > -------------- next part -------------- An HTML attachment was scrubbed... URL:
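To make the suggestion above concrete, here is a rough, hypothetical sketch (not actual pandas code, and the helper name is invented) of how a select() shim could detect that a caller relied on the current axis default and emit the proposed warning:

import warnings

import numpy as np
import pandas as pd

_no_default = object()  # sentinel: distinguishes "axis omitted" from an explicit axis=0


def select_with_warning(df, crit, axis=_no_default):
    # Warn whenever the caller relies on the current default of axis=0.
    if axis is _no_default:
        warnings.warn(
            "select() currently defaults to axis=0 (the index); the default "
            "may change to axis=1 (the columns) in a future release. "
            "Pass axis explicitly to keep the current behavior.",
            FutureWarning,
            stacklevel=2,
        )
        axis = 0
    labels = df.index if axis in (0, "index") else df.columns
    keep = [label for label in labels if crit(label)]
    return df.loc[keep] if axis in (0, "index") else df.loc[:, keep]


df = pd.DataFrame(np.arange(8).reshape(2, 4),
                  columns=["a1", "a2", "b1", "b2"])

select_with_warning(df, lambda col: col.startswith("a"), axis="columns")  # no warning
select_with_warning(df, lambda idx: idx == 0)  # relies on the default -> FutureWarning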