Python Enhancement Proposals

PEP 512 – Migrating from hg.python.org to GitHub

Author:
Brett Cannon <brett at python.org>
Discussions-To:
Core-Workflow list
Status:
Final
Type:
Process
Created:
17-Jan-2015
Post-History:
17-Jan-2016, 19-Jan-2016, 23-Jan-2016

Note

CPython’s development process moved to https://github.com/python/cpython on 2017-02-10.

Abstract

This PEP outlines the steps required to migrate Python’s development process from Mercurial [3] as hosted at hg.python.org [1] to Git [4] on GitHub [2]. Meeting the minimum goals of this PEP should allow for the development process of Python to be as productive as it currently is, and meeting its extended goals should improve the development process from its status quo.

Rationale

In 2014, it became obvious that Python’s custom development process was becoming a hindrance. As an example, for an external contributor to submit a fix for a bug that eventually was committed, the basic steps were:

  1. Open an issue for the bug at bugs.python.org [5].
  2. Check out the CPython source code from hg.python.org [1].
  3. Make the fix.
  4. Upload a patch.
  5. Have a core developer review the patch using our fork of the Rietveld code review tool [6].
  6. Download the patch to make sure it still applies cleanly.
  7. Run the test suite manually.
  8. Update the NEWS, ACKS, and “What’s New” documents as necessary.
  9. Pull changes to avoid a merge race.
  10. Commit the change manually.
  11. If the change was for a bugfix release, merge into the in-development branch.
  12. Run the test suite manually again.
  13. Commit the merge.
  14. Push the changes.

This is a very heavy, manual process for core developers. Even in the simple case, you could only possibly skip the code review step, as you would still need to build the documentation. This led to patches languishing on the issue tracker due to core developers not being able to work through the backlog fast enough to keep up with submissions. In turn, that led to a side-effect issue of discouraging outside contribution due to frustration from lack of attention, which is a dangerous problem for an open source project with no corporate backing as it runs counter to having a viable future for the project. While allowing patches to be uploaded to bugs.python.org [5] is potentially simple for an external contributor, it is as slow and burdensome as it gets for a core developer to work with.

Hence the decision was made in late 2014 that a move to a new development process was needed. A request for PEPs proposing new workflows was made, in the end leading to two: PEP 481 and PEP 507 proposing GitHub [2] and GitLab [7], respectively.

The year 2015 was spent off-and-on working on those proposals and trying to tease out details of what made them different from each other on the core-workflow mailing list [8]. PyCon US 2015 also showed that the community was a bit frustrated with our process due to both cognitive overhead for new contributors and how long it was taking for core developers to look at a patch (see the end of Guido van Rossum’s keynote at PyCon US 2015 [9] as an example of the frustration).

On January 1, 2016, the decision was made by Brett Cannon to move the development process to GitHub. The key reasons for choosing GitHub were [10]:

  • Maintaining custom infrastructure has been a burden on volunteers (e.g., an unmaintained, custom fork of Rietveld [6] is currently being used).
  • The custom workflow is very time-consuming for core developers (not enough automated tooling built to help support it).
  • The custom workflow is a hindrance to external contributors (acts as a barrier to entry due to the time required to ramp up on a development process unique to CPython itself).
  • There is no feature differentiating GitLab from GitHub beyond GitLab being open source.
  • Familiarity with GitHub is far higher among core developers and external contributors than with GitLab.
  • Our BDFL prefers GitHub (who would be the first person to tell you that his opinion shouldn’t matter, but the person making the decision felt it was important that the BDFL feel comfortable with the workflow of his own programming language to encourage his continued participation).

There’s even already an unofficial logo to represent the migration to GitHub [22].

The overarching goal of this migration is to improve the development process to the extent that a core developer can go from external contribution submission through all the steps leading to committing said contribution from within a browser on a tablet with WiFi using some development process (this does not inherently mean GitHub’s default workflow). The final solution will also allow an external contributor to contribute even if they choose not to use GitHub (although there is no guarantee of feature parity).

Repositories to Migrate

While hg.python.org [1] hosts many repositories, there are only five key repositories that need to move:

  1. devinabox [12] (done)
  2. benchmarks [11] (skipped)
  3. peps [13] (done)
  4. devguide [14] (done)
  5. cpython [15]

The devinabox repository is code-only. The peps and devguide repositories involve the generation of webpages. And the cpython repository has special requirements for integration with bugs.python.org [5].

Migration Plan

The migration plan is separated into sections based on what is required to migrate the repositories listed in the Repositories to Migrate section. Completion of requirements outlined in each section should unblock the migration of the related repositories. The sections are expected to be completed in order, but not necessarily the requirements within a section.

Requirements for Code-Only Repositories

Completion of the requirements in this section will allow the devinabox repository to move to GitHub.

Create a ‘Python core’ team

To manage permissions, a ‘Python core’ team will be created as part of the python organization [16]. Any repository that is moved will have the ‘Python core’ team added to it with write permissions [17]. Anyone who previously had rights to manage SSH keys on hg.python.org will become a team maintainer for the ‘Python core’ team.

Define commands to move a Mercurial repository to Git

Since moving to GitHub also entails moving to Git [4], we must decide what tools and commands we will run to translate a Mercurial repository to Git. The tools developed specifically for this migration are hosted at https://github.com/orsenthil/cpython-hg-to-git .

CLA enforcement

A key part of any open source project is making sure that its source code can be properly licensed. This requires making sure all people making contributions have signed a contributor license agreement (CLA) [18]. Up until now, CLA signing for contributed code has been enforced by core developers checking whether someone had an * by their username on bugs.python.org [5]. With this migration, the plan is to start off with automated checking and enforcement of contributors signing the CLA.

Adding GitHub username support to bugs.python.org

To keep tracking of CLA signing under the direct control of the PSF, tracking who has signed the PSF CLA will be continued by marking that fact as part of someone’s bugs.python.org user profile. What this means is that an association will be needed between a person’s bugs.python.org [5] account and their GitHub account, which will be done through a new field in a user’s profile. This does implicitly require that contributors will need both a GitHub [2] and bugs.python.org account in order to sign the CLA and contribute through GitHub.

An API is provided to query bugs.python.org to see if a GitHub username corresponds to someone who has signed the CLA. Making a GET request to e.g. http://bugs.python.org/user?@template=clacheck&github_names=brettcannon,notanuser returns a JSON dictionary with the keys of the usernames requested and a true value if they have signed the CLA, false if they have not, and null if no corresponding GitHub username was found.
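As a rough illustration of how a client might interpret that response, the sketch below parses a payload shaped like the description above. It does not make the HTTP request; the `CLAStatus` enum, the function name, and the example payload are illustrative assumptions, not part of bugs.python.org.

```python
import json
from enum import Enum
from typing import Dict


class CLAStatus(Enum):
    """Hypothetical names for the three values the API can return."""
    SIGNED = "signed"                 # JSON true
    NOT_SIGNED = "not signed"         # JSON false
    UNKNOWN_USER = "unknown user"     # JSON null: no matching GitHub username


def parse_cla_response(payload: str) -> Dict[str, CLAStatus]:
    """Map each requested GitHub username to a CLAStatus."""
    statuses = {}
    for username, signed in json.loads(payload).items():
        if signed is None:
            statuses[username] = CLAStatus.UNKNOWN_USER
        elif signed:
            statuses[username] = CLAStatus.SIGNED
        else:
            statuses[username] = CLAStatus.NOT_SIGNED
    return statuses


# Assumed example payload; real responses may differ in detail.
example_payload = '{"brettcannon": true, "notanuser": null}'
print(parse_cla_response(example_payload))
```

Keeping the three outcomes distinct lets a caller treat "not signed" and "no linked account" differently if it ever needs to, even though both currently lead to the same outcome for the contributor.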

A bot to enforce CLA signing

With an association between someone’s GitHub account and their bugs.python.org [5] account, which has the data as to whether someone has signed the CLA, a bot can monitor pull requests on GitHub and denote whether the contributor has signed the CLA.

If the contributor has signed the CLA, the bot will add a positive label to the pull request to denote that it has no CLA issues (e.g., a green label stating, “CLA signed”). If the contributor has not signed the CLA, a negative label will be added to the pull request (e.g., a red label stating, “CLA not signed”). If a contributor lacks a bugs.python.org account, that will lead to the negative label being used as well. Using a label for both positive and negative cases provides a fallback signal if the bot happens to fail, preventing potential false-positives or false-negatives. It also allows for an easy way to trigger the bot again by simply removing a CLA-related label (this is in contrast to using a GitHub status check [40] which is only triggered on code changes).
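The labelling rule can be sketched in a few lines. This is only an illustration of the decision logic described above, not the bot’s actual code; the function names are hypothetical and the label strings are taken from the examples in the text.

```python
from typing import Iterable, Optional

CLA_SIGNED = "CLA signed"
CLA_NOT_SIGNED = "CLA not signed"


def cla_label(signed: Optional[bool]) -> str:
    """Choose the label for a pull request.

    `signed` is True/False for a known contributor, or None when no
    bugs.python.org account is linked to the GitHub username; the text
    says that case gets the negative label as well.
    """
    return CLA_SIGNED if signed is True else CLA_NOT_SIGNED


def needs_check(current_labels: Iterable[str]) -> bool:
    """A pull request with no CLA label should be (re)checked.

    This models "remove the label to trigger the bot again".
    """
    return not ({CLA_SIGNED, CLA_NOT_SIGNED} & set(current_labels))
```

Because every checked pull request carries exactly one of the two labels, the absence of both is an unambiguous signal that the bot has not yet run or should run again.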

As no pre-existing bot meets our needs, a custom bot will be written. It will be hosted on Heroku [39] and written to target Python 3.5 to act as a showcase for asynchronous programming. The code for the bot is hosted in the Knights Who Say Ni project [41].

Make old repository read-only

Updating .hg/hgrc in the now-old Mercurial repository in the [hooks] section with:

pretxnchangegroup.reject = echo " * This repo has been migrated to github.com/python/peps and does not accept new commits in Mercurial!" 2>&1; exit 1

will make the repository read-only.

Requirements for the cpython Repository

Obviously the most active and important repository currently hosted at hg.python.org [1] is the cpython repository [15]. Because of its importance and high-frequency use, it requires more tooling before being moved to GitHub compared to the other repositories mentioned in this PEP.

Document steps to commit a pull request

During the process of choosing a new development workflow, it was decided that a linear history is desired. People preferred having a single commit representing a single change instead of having a set of unrelated commits lead to a merge commit that represented a single change. This means that the convenient “Merge” button in GitHub pull requests will be set to only do squash commits and not merge commits.

A second set of recommended commands will also be written for committing a contribution from a patch file uploaded to bugs.python.org [5]. This will obviously help keep the linear history, but the resulting commit will need to carry attribution to the patch author.

The exact sequence of commands that will be given as guidelines to core developers is an open issue: Git CLI commands for committing a pull request to cpython.

Linking pull requests to issues

Historically, external contributions were attached to an issue on bugs.python.org [5] thanks to the fact that all external contributions were uploaded as a file. For changes committed directly by a core developer, specifying an issue number in the format Issue # at the start of the commit message led to a comment being posted to the issue linking to the commit.

Linking a pull request to an issue

An association between a pull request and an issue is needed to track when a fix has been proposed. The association needs to be many-to-one as it can take multiple pull requests to solve a single issue (technically it should be a many-to-many association for when a single fix solves multiple issues, but this is fairly rare and issues can be merged into one using the Superseder field on the issue tracker).

The association between a pull request and an issue will be done based on detecting an issue number. If the issue is specified in either the title or in the body of a message on a pull request then a connection will be made on bugs.python.org [5]. Some visible notification – e.g. label or message – will be made to the pull request to notify that the association was successfully made.
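The text does not fix the exact syntax used to mention an issue, so the following sketch assumes two plausible spellings, “Issue #1234” and “bpo-1234”, purely for illustration. The pattern and function name are hypothetical.

```python
import re
from typing import List

# Assumed spellings: "issue 123", "Issue #123", or "bpo-123".
ISSUE_RE = re.compile(r"\b(?:issue\s*#?|bpo-)(\d+)\b", re.IGNORECASE)


def find_issue_numbers(title: str, body: str = "") -> List[int]:
    """Return the unique issue numbers mentioned in a pull request,
    in order of first appearance (title first, then body)."""
    found: List[int] = []
    for text in (title, body):
        for match in ISSUE_RE.finditer(text):
            number = int(match.group(1))
            if number not in found:
                found.append(number)
    return found


print(find_issue_numbers("Issue #12345: fix crash",
                         "Also related to bpo-678 and issue #12345."))
```

Returning every match rather than only the first keeps the door open for the rarer many-to-many case mentioned earlier, while the common case simply yields a one-element list.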

Notify the issue if a commit is made

Once a commit is made, the corresponding issue should be updated to reflect this fact. This should work regardless of whether the commit came from a pull request or a direct commit.

Update the linking service for mapping commit IDs to URLs

Currently you can use https://hg.python.org/lookup/ with a revision ID from either the Subversion or Mercurial copies of the cpython repo [15] to get redirected to the URL for that revision in the Mercurial repository. The URL rewriter will need to be updated to redirect to the Git repository and to support the new revision IDs created for the Git repository.

The most likely design is to statically know all the Mercurial changeset numbers once the migration has occurred. The lookup code will then be updated to accept hashes from 7 to 40 hexadecimal digits. Any hexadecimal of length 12 or 40 will be compared against the Mercurial changeset numbers. If the number doesn’t match or is of some other length between 7 and 40 then it will be assumed to be a Git hash.
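The lookup rule can be sketched as follows. The function name and the way known Mercurial changesets are stored (a set holding both the 12-digit and 40-digit forms) are assumptions made for the example; Subversion revision numbers, which the existing service also handles, are left out.

```python
import re
from typing import Set

HEX_RE = re.compile(r"[0-9a-fA-F]{7,40}")


def classify_revision(rev: str, hg_changesets: Set[str]) -> str:
    """Return "hg" or "git" for a revision ID.

    `hg_changesets` is assumed to hold every known Mercurial changeset
    in lowercase, in both its 12-digit and 40-digit forms.
    """
    if not HEX_RE.fullmatch(rev):
        raise ValueError("expected 7 to 40 hexadecimal digits: %r" % rev)
    rev = rev.lower()
    # Only IDs of length 12 or 40 are compared against Mercurial.
    if len(rev) in (12, 40) and rev in hg_changesets:
        return "hg"
    # No match, or any other length from 7 to 40: assume a Git hash.
    return "git"


full_id = "0123456789abcdef0123456789abcdef01234567"  # made-up changeset
known = {full_id, full_id[:12]}
print(classify_revision(full_id[:12], known), classify_revision("abcdef1", known))
```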

The bugs.python.org commit number rewriter will also need to be updated to accept hashes as short as 7 digits as Git will match on hashes that short or longer.

Deprecate sys._mercurial

Once Python is no longer kept in Mercurial, the sys._mercurial attribute will need to be changed to return ('CPython', '', ''). An equivalent sys._git attribute will be added which fulfills the same use-cases.
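Code that reports build provenance could read whichever attribute carries real data. The helper below is a hypothetical sketch; it uses `getattr` with a default so it does not assume either attribute is present on a given interpreter.

```python
import sys


def vcs_build_info():
    """Return (vcs_name, info_tuple) for the first attribute that holds
    non-empty branch/revision data, or None if neither does."""
    for attr in ("_git", "_mercurial"):
        info = getattr(sys, attr, None)
        # An emptied attribute looks like ('CPython', '', '').
        if info and any(info[1:]):
            return attr.lstrip("_"), info
    return None


print(vcs_build_info())
```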

Update the devguide

The devguide will need to be updated with details of the new workflow. Most likely, work will take place in a separate branch until the migration actually occurs.

Update PEP 101

The release process will need to be updated as necessary.

Optional, Planned Features

Once the cpython repository [15] is migrated, all repositories will have been moved to GitHub [2] and the development process should be on equal footing with where it was before the move. But a key reason for this migration is to improve the development process, making it better than it has ever been. This section outlines some plans on how to improve things.

It should be mentioned that overall feature planning for bugs.python.org [5] – which includes plans independent of this migration – is tracked on its own wiki page [23].

Handling Misc/NEWS

Traditionally the Misc/NEWS file [19] has been problematic for changes which spanned Python releases. Oftentimes there will be merge conflicts when committing a change between e.g., 3.5 and 3.6 only in the Misc/NEWS file. It’s so common, in fact, that the example instructions in the devguide explicitly mention how to resolve conflicts in the Misc/NEWS file [21]. As part of our tool modernization, working with the Misc/NEWS file will be simplified.

The planned approach is to use an individual file per news entry, containing the text for the entry. In this scenario, each feature release would have its own directory for news entries and a separate file would be created in that directory that was either named after the issue it closed or a timestamp value (which prevents collisions). Merges across branches would have no issue as the news entry file would still be uniquely named and in the directory of the latest version that contained the fix. A script would collect all news entry files no matter what directory they reside in and create an appropriate news file (the release directory can be ignored as the mere fact that the file exists is enough to represent that the entry belongs to the release). Classification can either be done by keyword in the news entry file itself or by using subdirectories representing each news entry classification in each release directory (or classification of news entries could be dropped since critical information is captured by the “What’s New” documents which are organized).

The benefit of this approach is that it keeps the changes with the code that was actually changed. It also ties the message to being part of the commit which introduced the change. For a commit made through the CLI, a script could be provided to help generate the file. In a bot-driven scenario, the merge bot could have a way to specify a specific news entry and create the file as part of its flattened commit (while most likely also supporting using the first line of the commit message if no specific news entry was specified). If a web-based workflow is used then a status check could be used to verify that a news entry file is in the pull request to act as a reminder that the file is missing.

Code for this approach has been written previously for the Mercurial workflow at http://bugs.python.org/issue18967. There are also tools from the community like https://pypi.python.org/pypi/towncrier, https://github.com/twisted/newsbuilder, and http://docs.openstack.org/developer/reno/.
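A minimal sketch of the collection script, assuming a layout of one directory per release under a common root with one `.rst` file per entry. The layout, the file extension, and the function name are assumptions for illustration only; classification is ignored.

```python
from pathlib import Path


def build_news(news_root) -> str:
    """Concatenate every news entry file under `news_root` into one
    bulleted text block.

    The release directory a file lives in is deliberately ignored: the
    file existing at all means the entry belongs to this release.
    """
    entries = []
    # Sort by file name so the output order is stable across runs.
    for path in sorted(Path(news_root).rglob("*.rst"), key=lambda p: p.name):
        text = path.read_text(encoding="utf-8").strip()
        if text:
            entries.append("- " + text)
    return ("\n".join(entries) + "\n") if entries else ""
```

Since each entry lives in its own uniquely named file, two branches adding entries never touch the same lines, which is what removes the merge conflicts described above.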

Discussions at the Sep 2016 Python core-dev sprints led to this decision compared to the rejected approaches outlined in the Rejected Ideas section of this PEP. The separate files approach seems to have the right balance of flexibility and potential tooling out of the various options while solving the motivating problem.

Work for this is being tracked at https://github.com/python/core-workflow/issues/6.

Handling Misc/ACKS

Traditionally the Misc/ACKS file [20] has been managed by hand. But thanks to Git supporting an author value as well as a committer value per commit, authorship of a commit can be part of the history of the code itself.

As such, manual management of Misc/ACKS will become optional. A script will be written that will collect all author and committer names and merge them into Misc/ACKS with all of the names listed prior to the move to Git. Running this script will become part of the release process.

The script should also generate a list of all people who contributed since the last execution. This will allow having a list of those who contributed to a specific release so they can be explicitly thanked.
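The merging step might look like the sketch below. It assumes the author and committer names since the last run have already been extracted (for example from `git log` output) and are passed in as plain strings; the function name is hypothetical, and the simple case-insensitive sort is a simplification of however Misc/ACKS is actually ordered.

```python
from typing import Iterable, List, Tuple


def update_acks(existing_names: Iterable[str],
                names_since_last_run: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (merged_acks, contributors_since_last_run).

    Names are compared case-insensitively after stripping whitespace.
    """
    existing = list(existing_names)

    # Unique contributors since the last run, in first-seen order.
    contributors: List[str] = []
    seen = set()
    for raw in names_since_last_run:
        name = raw.strip()
        if name and name.casefold() not in seen:
            seen.add(name.casefold())
            contributors.append(name)

    # Add only the names that are not already acknowledged.
    known = {name.casefold() for name in existing}
    merged = existing + [n for n in contributors if n.casefold() not in known]
    merged.sort(key=str.casefold)
    return merged, contributors
```

The second return value is the per-release thank-you list; the first is what would be written back to the file.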

Work for this is being tracked at https://github.com/python/core-workflow/issues/7.

Create https://git.python.org

Just as hg.python.org [1] currently points to the Mercurial repository for Python, git.python.org should do the equivalent for the Git repository.

Backup of pull request data

Since GitHub [2] is going to be used for code hosting and code review, those two things need to be backed up. In the case of code hosting, the backup is implicit as all non-shallow Git [4] clones contain the full history of the repository, hence there will be many backups of the repository.

The code review history does not have the same implicit backup mechanism as the repository itself. That means a daily backup of code review history should be done so that it is not lost in case of any issues with GitHub. It also helps guarantee that a migration from GitHub to some other code review system is feasible were GitHub to disappear overnight.

Bot to generate cherry-pick pull requests

Since the decision has been made to work with cherry-picks instead of forward merging of branches, it would be convenient to have a bot that would generate pull requests based on cherry-picking for any pull requests that affect multiple branches. The most likely design is a bot that monitors merged pull requests with key labels applied that delineate what branches the pull request should be cherry-picked into. The bot would then generate cherry-pick pull requests for each label and remove the labels as the pull requests are created (this allows for easy detection when automatic cherry-picking failed).
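The label-driven part of that design could be sketched like this. The label wording (“needs backport to X.Y”) and the function name are assumptions for the example; the text only says that key labels delineate the target branches.

```python
import re
from typing import Iterable, List, Set, Tuple

# Hypothetical label format; the real labels may be worded differently.
BACKPORT_LABEL_RE = re.compile(r"^needs backport to (\d+\.\d+)$")


def plan_cherry_picks(labels: Iterable[str],
                      open_branches: Set[str]) -> Tuple[List[str], List[str]]:
    """Split backport labels on a merged pull request into branches to
    cherry-pick into and branches that are not open for changes."""
    to_create: List[str] = []
    unknown: List[str] = []
    for label in labels:
        match = BACKPORT_LABEL_RE.match(label)
        if not match:
            continue  # not a backport label
        branch = match.group(1)
        (to_create if branch in open_branches else unknown).append(branch)
    return to_create, unknown


print(plan_cherry_picks(["needs backport to 3.6", "type-bug"], {"2.7", "3.5", "3.6"}))
```

A bot following the text would open one cherry-pick pull request per entry in the first list and remove the matching label once it succeeds, so any label still present afterwards marks a cherry-pick that failed.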

Work for this is being tracked at https://github.com/python/core-workflow/issues/8.

Pull request commit queue

This would linearly apply accepted pull requests and verify that the commits did not interfere with each other by running the test suite and backing out commits if the test run failed. To help facilitate the speed of testing, all patches committed since the last test run can be applied at once under a single test run as the optimistic assumption is that the patches will work in tandem. Some mechanism to re-run the tests in case of test flakiness will be needed, whether it is from removing a “test failed” label, web interface for core developers to trigger another testing event, etc.
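The optimistic batching idea can be modelled without any real repository. In the sketch below a “tree” is just a list of patch names and `tests_pass` is a caller-supplied stand-in for running the test suite. Falling back to one-at-a-time testing when the batch fails is an assumption about how backing out might work, not something the text specifies.

```python
from typing import Callable, List, Sequence, Tuple


def land_batch(tree: Sequence[str],
               pending: Sequence[str],
               tests_pass: Callable[[List[str]], bool]) -> Tuple[List[str], List[str]]:
    """Return (new_tree, backed_out) after trying to land `pending`."""
    # Optimistic path: apply everything and run the tests once.
    candidate = list(tree) + list(pending)
    if tests_pass(candidate):
        return candidate, []

    # Assumed fallback: apply patches linearly, testing after each one
    # and backing out any patch whose test run fails.
    landed = list(tree)
    backed_out: List[str] = []
    for patch in pending:
        if tests_pass(landed + [patch]):
            landed.append(patch)
        else:
            backed_out.append(patch)
    return landed, backed_out
```

In the common case where every patch is fine, this costs a single test run for the whole batch; the slower linear path is only paid when something fails.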

Inspiration or basis of the bot could be taken from pre-existing bots such as Homu [31] or Zuul [32].

The name given to this bot in order to give it commands is an open issue: Naming the bots.

A CI service

There are various CI services that provide free support for open source projects hosted on GitHub [2]. After experimenting with a couple of CI services, the decision was made to go with Travis [33].

The current CI service for Python is Pypatcher [38]. A request can be made in IRC to try a patch from bugs.python.org [5]. The results can be viewed at https://ci.centos.org/job/cPython-build-patch/ .

Work for this is being tracked at https://github.com/python/core-workflow/issues/1.

Test coverage report

Getting an up-to-date test coverage report for Python’s standard library would be extremely beneficial, as such a report can take quite a while to generate.

There are a couple of pre-existing services that provide free test coverage for open source projects. In the end, Codecov [37] was chosen as the best option.

Work for this is being tracked at https://github.com/python/core-workflow/issues/2.

Notifying issues of pull request comments

The current development process does not include notifying an issue on bugs.python.org [5] when a review comment is left on Rietveld [6]. It would be nice to fix this so that people can subscribe only to comments at bugs.python.org and not GitHub [2] and yet still know when something occurs on GitHub in terms of review comments on relevant pull requests. Current thinking is to post a comment to bugs.python.org to the relevant issue when at least one review comment has been made over a certain period of time (e.g., 15 or 30 minutes, although with GitHub now supporting reviews the time aspect may be unnecessary). This keeps the email volume down for those that receive both GitHub and bugs.python.org email notifications while still making sure that those only following bugs.python.org know when there might be a review comment to address.
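One way to read the time-window idea: the first review comment opens a window, and a single notification is posted when that window closes, covering every comment made in the meantime. The sketch below models only the timing; the function name and the exact windowing rule are assumptions.

```python
from datetime import datetime, timedelta
from typing import Iterable, List


def notification_times(comment_times: Iterable[datetime],
                       window: timedelta = timedelta(minutes=15)) -> List[datetime]:
    """Return when a comment would be posted to the issue, given the
    timestamps of review comments on a pull request."""
    notifications: List[datetime] = []
    window_end = None
    for when in sorted(comment_times):
        # A comment after the current window has closed opens a new one.
        if window_end is None or when > window_end:
            window_end = when + window
            notifications.append(window_end)
    return notifications
```

Under this rule a burst of review comments produces one message on the issue rather than one per comment, which is the email-volume saving the paragraph is after.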

Allow bugs.python.org to use GitHub as a login provider

As of right now, bugs.python.org [5] allows people to log in using Google, Launchpad, or OpenID credentials. It would be good to expand this to GitHub credentials.

Web hooks for re-generating web content

The content at https://docs.python.org/, https://docs.python.org/devguide, and https://www.python.org/dev/peps/ is all derived from files kept in one of the repositories to be moved as part of this migration. As such, it would be nice to set up appropriate webhooks to trigger rebuilding the appropriate web content when the files they are based on change instead of having to wait for, e.g., a cronjob to trigger.

This can partially be solved if the documentation is a Sphinx project as then the site can have an unofficial mirror on Read the Docs, e.g. http://cpython-devguide.readthedocs.io/.

Work for this is being tracked at https://github.com/python/core-workflow/issues/9.

Splitting out parts of the documentation into their own repositories

While certain parts of the documentation at https://docs.python.org change with the code, other parts are fairly static and are not tightly bound to the CPython code itself. The following sections of the documentation fit this category of slow-changing, loosely-coupled:

These parts of the documentation could be broken out into their own repositories to simplify their maintenance and to expand who has commit rights to them.

It has also been suggested to split out the What’s New documents. That would require deciding whether a workflow could be developed where it would be difficult to forget to update What’s New (potentially through a label added to PRs, like “What’s New needed”).

Backup of Git repositories

While not necessary, it would be good to have official backups of the various Git repositories for disaster protection. It will be up to the PSF infrastructure committee to decide if this is worthwhile or unnecessary.

Identify potential new core developers

The Python development team has long-standing guidelines for selecting new core developers. The key part of the guidelines is that a person needs to have contributed multiple patches which have been accepted and are of high enough quality and size to demonstrate an understanding of Python’s development process. A bot could be written which tracks patch acceptance rates and generates a report to help identify contributors who warrant consideration for becoming core developers. This work doesn’t even necessarily require GitHub integration as long as the committer field in all Git commits is filled in properly.

Work is being tracked at https://github.com/python/core-workflow/issues/10.

Status

Requirements for migrating the devinabox [12] repository:

Repositories whose build steps need updating:

cpython repo [15]

Required:

Optional features:

Open Issues

For this PEP, open issues are ones where a decision needs to be made as to how to approach or solve a problem. Open issues do not entail coordination issues such as who is going to write a certain bit of code.

The fate of hg.python.org

With the code repositories moving over to Git [4], there is no technical need to keep hg.python.org [1] running. Having said that, some in the community would like to have it stay functioning as a Mercurial [3] mirror of the Git repositories. Others have said that they still want a mirror, but one using Git.

As maintaining hg.python.org is not necessary, it will be up to the PSF infrastructure committee to decide if they want to spend the time and resources to keep it running. They may also choose whether they want to host a Git mirror on PSF infrastructure.

Depending on the decision reached, other ancillary repositories will either be forced to migrate or they can choose to simply stay on hg.python.org.

Git CLI commands for committing a pull request to cpython

Because Git [4] may be a new version control system for core developers, the commands people are expected to run will need to be written down. These commands also need to keep a linear history while giving proper attribution to the pull request author.

Another set of commands will also be necessary for working with a patch file uploaded to bugs.python.org [5]. Here the linear history will be kept implicitly, but the commands will need to keep or add attribution.

Naming the bots

As naming things can lead to bikeshedding of epic proportions, Brett Cannon will choose the final name of the various bots (the name of the project for the bots themselves can be anything, this is purely for the name used in giving commands to the bot or the account name). The names must come from Monty Python, which is only fitting since Python is named after the comedy troupe.

Rejected Ideas

Separate Python 2 and Python 3 repositories

It was discussed whether separate repositories for Python 2 and Python 3 were desired. The thinking was that this would shrink the overall repository size which benefits people with slow Internet connections or small bandwidth caps.

In the end it was decided that it was easier logistically to simply keep all of CPython’s history in a single repository.

Commit multi-release changes in bugfix branch first

As the current development process has changes committed in the oldest branch first and then merged up to the default branch, the question came up as to whether this workflow should be perpetuated. In the end it was decided that committing in the newest branch and then cherry-picking changes into older branches would work best as most people will instinctively work off the newest branch and it is a more common workflow when using Git [4].

Cherry-picking is also more bot-friendly for an in-browser workflow. In the merge-up scenario, if you were to request a bot to do a merge and it failed, then you would have to make sure to immediately solve the merge conflicts if you still allowed the main commit, else you would need to postpone the entire commit until all merges could be handled. With a cherry-picking workflow, the main commit could proceed while postponing the merge-failing cherry-picks. This allows for possibly distributing the work of managing conflicting merges.

Lastly, cherry-picking should help avoid merge races. Currently, when one is doing work that spans branches, it takes time to commit in the older branch, possibly push to another clone representing the default branch, merge the change, and then push upstream. Cherry-picking should decouple this so that you don’t have to rush your multi-branch changes as the cherry-pick can be done separately.

Deriving Misc/NEWS from the commit logs

As part of the discussion surrounding Handling Misc/NEWS, the suggestion has come up of deriving the file from the commit logs themselves. In this scenario, the first line of a commit message would be taken to represent the news entry for the change. Some heuristic to tie in whether a change warranted a news entry would be used, e.g., whether an issue number is listed.

This idea has been rejected due to some core developers preferring to write a news entry separate from the commit message. The argument is that the first line of a commit message and a news entry have different requirements in terms of brevity, what should be said, etc.

Deriving Misc/NEWS from bugs.python.org

A rejected solution to the NEWS file problem was to specify the entry on bugs.python.org [5]. This would mean an issue that is marked as “resolved” could not be closed until a news entry is added in the “news” field in the issue tracker. The benefit of tying the news entry to the issue is it makes sure that all changes worthy of a news entry have an accompanying issue. It also makes classifying a news entry automatic thanks to the Component field of the issue. The Versions field of the issue also ties the news entry to which Python releases were affected. A script would be written to query bugs.python.org for relevant news entries for a release and to produce the output needed to be checked into the code repository. This approach is agnostic to whether a commit was done by CLI or bot. A drawback is that there’s a disconnect between the actual commit that made the change and the news entry by having them live in separate places (in this case, GitHub and bugs.python.org). This would mean making a commit would then require remembering to go back to bugs.python.org to add the news entry.

References


Source: https://github.com/python/peps/blob/main/peps/pep-0512.rst

Last modified: 2023-09-09 17:39:29 GMT