[Distutils] Reviving PEP 470 - Removing External Hosting support on PyPI

Donald Stufft donald at stufft.io
Thu Aug 27 03:24:05 CEST 2015


While developing Warehouse, one of the things I wanted to get done was a final ruling on PEP 470. With that in mind I’d like to bring it back up for discussion and hopefully ultimately a ruling.

There are two major differences in this version of PEP 470, and I’d like to point them out explicitly.

Removal of the “External Repository Discovery” feature. I’ve been thinking about this for a while, and I finally removed it. I’ve always been uncomfortable with this feature, and I finally realized why. Essentially, the major use case for not hosting things on PyPI that I think PyPI can reasonably be expected to accommodate is people who cannot publish their software to the US for various reasons. At the time I came up with that solution, it was an attempt to placate the folks who were against PEP 470 while assuming very few people would ever actually use it: essentially a junk feature to push the PEP through. I think the feature itself is a bad one and that it presents a poor experience for people who want to use it, so I’ve removed it from the PEP. Instead, the PEP now focuses on explicitly recommending that all installers implement the ability to specify multiple repositories, and on deprecating and removing the ability to find anything on /simple/ other than files hosted by the repository itself.

I recognize this is a regression for anyone who *does* have concerns with uploading their projects to a server hosted in the US. If there is someone who has this concern, and is also willing to put in the effort and legwork required, I will happily collaborate with them to design a solution that both follows whatever legal requirements they might have and provides a good experience for people using PyPI and pip. I have some rough ideas on what this could look like, but I think it’s really a separate discussion, since I believe externally hosted files as we have had them are an overall bad experience for people and are largely a historic accident of how PyPI and Python packaging have evolved. I don’t want to derail this thread or PEP exploring these ideas (some of which I don’t even know would satisfy the requirements, since it’s all dealing with legal jurisdictions other than my own), but I wanted to make explicit that someone who knows the legalities and is willing to put in the work can reach out to me.

The other major difference is that I’ve shortened the time schedule from 6 months to 3 months. Given that authors are either going to upload their projects to PyPI or not, and there is no longer a need to set up an external index, I think a shorter time schedule is fine, especially since they will be given a script they can run that will spider their projects for any installable files and upload them to PyPI in a quick one-shot deal that would require very little effort from them.

Everything else in the PEP is basically the same except for rewordings.

I do need a BDFL Delegate for this PEP. Richard does not have the time to do it, and the other logical candidate for a PyPI-centric PEP is myself, but I don’t feel it’s appropriate to be BDFL Delegate for my own PEP.

You can see the PEP online at https://www.python.org/dev/peps/pep-0470/ (make sure it’s updated and you see the one that has Aug 26 2015 in its Post History).

The PEP has also been inlined below.

-----------------

Abstract
========

This PEP proposes the deprecation and removal of support for hosting files
externally to PyPI as well as the deprecation and removal of the functionality
added by PEP 438, particularly rel information to classify different types of
links and the meta-tag to indicate API version.


Rationale
=========

Historically PyPI did not have any method of hosting files nor any method of
automatically retrieving installables; it was instead focused on providing a
central registry of names, to prevent naming collisions, and as a means of
discovery for finding projects to use. In the course of time setuptools began
to scrape these human facing pages, as well as pages linked from those pages,
looking for things it could automatically download and install. Eventually this
became the "Simple" API which used a similar URL structure however it
eliminated any of the extraneous links and information to make the API more
efficient. Additionally PyPI grew the ability for a project to upload release
files directly to PyPI enabling PyPI to act as a repository in addition to an
index.

This gives PyPI two equally important roles that it plays in the Python
ecosystem, that of index to enable easy discovery of Python projects and
central repository to enable easy hosting, download, and installation of Python
projects. Due to the history behind PyPI and the very organic growth it has
experienced the lines between these two roles are blurry, and this blurring has
caused confusion for the end users of both of these roles and this has in turn
caused ire between people attempting to use PyPI in different capacities, most
often when end users want to use PyPI as a repository but the author wants to
use PyPI solely as an index.

This confusion comes down to end users of projects not realizing if a project
is hosted on PyPI or if it relies on an external service. This often manifests
itself when the external service is down but PyPI is not. People will see that
PyPI works, and other projects work, but this one specific one does not. They
often times do not realize who they need to contact in order to get this fixed
or what their remediation steps are.

PEP 438 attempted to solve this issue by allowing projects to explicitly
declare if they were using the repository features or not, and if they were
not, it had the installers classify the links it found as either "internal",
"verifiable external" or "unverifiable external". PEP 438 was accepted and
implemented in pip 1.4 (released on Jul 23, 2013) with the final transition
implemented in pip 1.5 (released on Jan 2, 2014).

PEP 438 succeeded in getting more people to use PyPI's repository features,
an altogether good thing given that the global CDN powering PyPI provides speed
ups for a lot of people; however, it did so by introducing a new point of
confusion and pain for both the end users and the authors.

By moving to using explicit multiple repositories we can make the lines between
these two roles much more explicit and remove the "hidden" surprises caused by
the current implementation of handling people who do not want to use PyPI as a
repository.


Key User Experience Expectations
--------------------------------

#. Easily allow external hosting to "just work" when appropriately configured
   at the system, user or virtual environment level.
#. Eliminate any and all references to the confusing "verifiable external" and
   "unverifiable external" distinction from the user experience (both when
   installing and when releasing packages).
#. The repository aspects of PyPI should become *just* the default package
   hosting location (i.e. the only one that is treated as opt-out rather than
   opt-in by most client tools in their default configuration). Aside from that
   aspect, hosting on PyPI should not otherwise provide an enhanced user
   experience over hosting your own package repository.
#. Do all of the above while providing default behaviour that is secure against
   most attackers below the nation state adversary level.


Why Additional Repositories?
----------------------------

The two common installer tools, pip and easy_install/setuptools, both support
the concept of additional locations to search for files to satisfy the
installation requirements and have done so for many years. This means that
there is no need to "phase" in a new flag or concept and the solution to
installing a project from a repository other than PyPI will function regardless
of how old (within reason) the end user's installer is. Not only has this
concept existed in the Python tooling for some time, but it is a concept that
exists across languages and even extending to the OS level with OS package
tools almost universally using multiple repository support making it extremely
likely that someone is already familiar with the concept.

Additionally, the multiple repository approach is a concept that is useful
outside of the narrow scope of allowing projects that wish to be included on
the index portion of PyPI but do not wish to utilize the repository portion of
PyPI. This includes places where a company may wish to host a repository that
contains their internal packages or where a project may wish to have multiple
"channels" of releases, such as alpha, beta, release candidate, and final
release. This could also be used for projects wishing to host files which
cannot be uploaded to PyPI, such as multi-gigabyte data files or, currently at
least, Linux Wheels.


Why Not PEP 438 or Similar?
---------------------------

While the additional search location support has existed in pip and setuptools
for quite some time, support for PEP 438 has only existed in pip since the 1.4
version, and still has yet to be implemented in setuptools. The design of
PEP 438 did mean that users still benefited for projects which did not require
external files even with older installers, however for projects which *did*
require external files, users are still silently being given either potentially
unreliable or, even worse, unsafe files to download. This system is also unique
to Python as it arises out of the history of PyPI, which means that it is
almost certain that this concept will be foreign to most, if not all, users
until they encounter it while attempting to use the Python toolchain.

Additionally, the classification system proposed by PEP 438 has, in practice,
turned out to be extremely confusing to end users, so much so that it is a
position of this PEP that the situation as it stands is completely untenable.
The common pattern for a user with this system is to attempt to install a
project and possibly get an error message (or maybe not, if the project ever
uploaded something to PyPI but later switched without removing old files), see
that the error message suggests ``--allow-external``, reissue the command with
that flag, most likely get another error message, see that this time the error
message suggests also adding ``--allow-unverified``, and issue the command a
third time, this time finally getting the thing they wish to install.

This UX failure exists for several reasons.

#. If pip can locate files at all for a project on the Simple API it will
   simply use that instead of attempting to locate more. This is generally the
   right thing to do as attempting to locate more would erase a large part of
   the benefit of PEP 438. This means that if a project *ever* uploaded a file
   that matches what the user has requested for install that will be used
   regardless of how old it is.
#. PEP 438 makes an implicit assumption that most projects would either upload
   themselves to PyPI or would update themselves to directly linking to release
   files. While a large number of projects did ultimately decide to upload to
   PyPI, some of them did so only because the UX around PEP 438 was so bad
   that they felt forced to do so. More concerning however, is the fact that
   very few projects have opted to directly and safely link to files and
   instead they still simply link to pages which must be scraped in order to
   find the actual files, thus rendering the safe variant
   (``--allow-external``) largely useless.
#. Even if an author wishes to directly link to their files, doing so safely is
   non-obvious. It requires the inclusion of an MD5 hash (for historical
   reasons) in the fragment of the URL. If they do not include this then their
   files will be considered "unverified".
#. PEP 438 takes a security centric view and disallows any form of a global opt
   in for unverified projects. While this is generally a good thing, it creates
   extremely verbose and repetitive command invocations such as::

      $ pip install --allow-external myproject --allow-unverified myproject myproject
      $ pip install --allow-all-external --allow-unverified myproject myproject
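
The hash requirement in item 3 can be illustrated with a small Python sketch.
It is a deliberate simplification and not pip's actual logic: the host name
``pypi.python.org`` and the function ``classify_link`` are assumptions made
for illustration, and the sketch does not check whether a URL points directly
at a file or at a page that would have to be scraped.

```python
from urllib.parse import parse_qs, urlsplit

# Assumption: host that serves files stored on PyPI itself.
PYPI_HOST = "pypi.python.org"


def classify_link(url):
    """Return 'internal', 'verifiable external' or 'unverifiable external'."""
    parts = urlsplit(url)
    if parts.hostname == PYPI_HOST:
        return "internal"
    # The hash lives in the URL fragment, e.g. "...tar.gz#md5=<32 hex digits>".
    digest = parse_qs(parts.fragment).get("md5", [""])[0].lower()
    is_md5 = len(digest) == 32 and all(c in "0123456789abcdef" for c in digest)
    return "verifiable external" if is_md5 else "unverifiable external"
```

An author who links to ``https://example.com/myproject-1.0.tar.gz`` without
the ``#md5=...`` fragment ends up in the "unverifiable external" bucket, which
is what forces users to add ``--allow-unverified``.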


Multiple Repository/Index Support
=================================

Installers SHOULD implement, or continue to offer, the ability to point the
installer at multiple URL locations. The exact mechanisms for a user to
indicate they wish to use an additional location is left up to each individual
implementation.

Additionally the mechanism for discovering an installation candidate when
multiple repositories are being used is also up to each individual
implementation, however once configured an implementation should not
discourage, warn, or otherwise cast a negative light upon the use of a
repository simply because it is not the default repository.

Currently both pip and setuptools implement multiple repository support by
using the best installation candidate they can find from any of the configured
repositories, essentially treating them as if they were one large repository.
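
A minimal sketch of the "one large repository" behaviour, in Python. This is
not the code pip or setuptools use: ``best_candidate`` and ``parse_version``
are hypothetical helpers, the repository contents are passed in as plain
dictionaries, and version parsing is reduced to dotted numeric versions.

```python
def parse_version(text):
    # Simplification: handles only dotted numeric versions such as "1.10.2".
    return tuple(int(part) for part in text.split("."))


def best_candidate(project, repositories):
    """Pick the highest version of `project` found in any repository.

    `repositories` maps a repository URL to {project: [version, ...]}.
    Returns (version, repository_url), or None when nothing is found.
    """
    candidates = [
        (parse_version(version), version, url)
        for url, index in repositories.items()
        for version in index.get(project, [])
    ]
    if not candidates:
        return None
    # All candidates are pooled, so the origin repository does not matter.
    _, version, url = max(candidates)
    return version, url
```

When the same version appears in more than one repository, this sketch breaks
the tie arbitrarily (by comparing URLs); a real installer would need an
explicit policy for that case.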

Installers SHOULD also implement some mechanism for removing or otherwise
disabling use of the default repository. The exact specifics of how that is
achieved is up to each individual implementation.

Installers SHOULD also implement some mechanism for whitelisting and
blacklisting which projects a user wishes to install from a particular
repository. The exact specifics of how that is achieved is up to each
individual implementation.
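
Since the PEP leaves the whitelisting and blacklisting mechanism to each
implementation, the following Python sketch shows only one possible shape.
The ``rules`` structure and the function ``repositories_for`` are assumptions
made for illustration and do not correspond to any existing installer option.

```python
def repositories_for(project, rules):
    """Return the repository URLs that may be searched for `project`.

    `rules` maps a repository URL to a dict with optional "allow" and "deny"
    sets of project names. A missing "allow" set means every project is
    allowed; an entry in "deny" always wins.
    """
    selected = []
    for url, rule in rules.items():
        if project in rule.get("deny", set()):
            continue
        allow = rule.get("allow")
        if allow is None or project in allow:
            selected.append(url)
    return selected
```

With rules such as ``{"deny": {"internal-tool"}}`` for the default repository
and ``{"allow": {"internal-tool"}}`` for a company repository, the internal
project is only ever looked up on the company repository.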


Deprecation and Removal of Link Spidering
=========================================

A new hosting mode will be added to PyPI. This hosting mode will be called
``pypi-only`` and will be in addition to the three that PEP 438 has already
given us which are ``pypi-explicit``, ``pypi-scrape``, ``pypi-scrape-crawl``.
This new hosting mode will modify a project's simple api page so that it only
lists the files which are directly hosted on PyPI and will not link to anything
else.
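
As a rough illustration of the effect of ``pypi-only`` on a project's simple
API page, the following Python sketch filters a list of links down to those
served by the repository itself. It is not PyPI's implementation: the function
name, the default host ``pypi.python.org``, and the treatment of relative URLs
as repository-hosted are all assumptions.

```python
from urllib.parse import urlsplit


def pypi_only_links(links, repository_host="pypi.python.org"):
    """Keep only links whose file is served by the repository itself."""
    kept = []
    for url in links:
        host = urlsplit(url).hostname
        # Assumption: a relative URL (no host) points back at the repository.
        if host is None or host == repository_host:
            kept.append(url)
    return kept
```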

Upon acceptance of this PEP and the addition of the ``pypi-only`` mode, all new
projects will be defaulted to the PyPI only mode and they will be locked to
this mode and unable to change this particular setting.

An email will then be sent out to all of the projects which are hosted only on
PyPI informing them that in one month their project will be automatically
converted to the ``pypi-only`` mode. A month after these emails have been sent
any of those projects which were emailed, which still are hosted only on PyPI
will have their mode set permanently to ``pypi-only``.

At the same time, an email will be sent to projects which rely on hosting
external to PyPI. This email will warn these projects that externally hosted
files have been deprecated on PyPI and that 3 months from the time of that
email all external links will be removed from the installer APIs. This email
**MUST** include instructions for converting their projects to be hosted on
PyPI and **MUST** include links to a script or package that will enable them
to enter their PyPI credentials and package name and have it automatically
download and re-host all of their files on PyPI. This email **MUST** also
include instructions for setting up their own index page. This email must also
contain a link to the Terms of Service for PyPI as many users may have signed
up a long time ago and may not recall what those terms are. Finally this email
must also contain a list of the links registered with PyPI where we were able
to detect an installable file was located.

Two months after the initial email, another email must be sent to any projects
still relying on external hosting. This email will include all of the same
information that the first email contained, except that the removal date will
be one month away instead of three.

Finally a month later all projects will be switched to the ``pypi-only`` mode
and PyPI will be modified to remove the externally linked files functionality.
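
The schedule described in this section can be laid out with a short Python
sketch. The helper names and dictionary keys are hypothetical, and the PEP
does not fix a start date, so the date of the first email is a parameter.

```python
import calendar
from datetime import date


def add_months(day, months):
    """Shift `day` by whole months, clamping to the end of shorter months."""
    index = day.year * 12 + (day.month - 1) + months
    year, month = divmod(index, 12)
    last_day = calendar.monthrange(year, month + 1)[1]
    return date(year, month + 1, min(day.day, last_day))


def deprecation_schedule(first_email):
    """Milestones relative to the day the first emails are sent."""
    return {
        # Projects already hosted only on PyPI are locked to pypi-only.
        "pypi_hosted_projects_locked": add_months(first_email, 1),
        # Second email to projects still relying on external hosting.
        "reminder_email": add_months(first_email, 2),
        # All projects switched to pypi-only, external links removed.
        "external_links_removed": add_months(first_email, 3),
    }
```

Calling ``deprecation_schedule(date(2015, 9, 1))``, with an arbitrary example
date, yields the three milestones one, two and three months later.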


Summary of Changes
==================

Repository side
---------------

#. Deprecate and remove the hosting modes as defined by PEP 438.
#. Restrict simple API to only list the files that are contained within the
   repository.


Client side
-----------

#. Implement multiple repository support.
#. Implement some mechanism for removing/disabling the default repository.
#. Deprecate / Remove PEP 438


Impact
======

To determine impact, we've looked at all projects using a method of searching
PyPI which is similar to what pip and setuptools use and searched for all
files available on PyPI, safely linked from PyPI, unsafely linked from PyPI,
and finally unsafely available outside of PyPI. When the same file was found
in multiple locations it was deduplicated and only counted in one location
based on the following preferences: PyPI > Safely Off PyPI > Unsafely Off PyPI.
This gives us the broadest possible definition of impact: a project is counted
as affected if any single file for it may no longer be visible by default, even
though that file could be years old, or could be a binary file while there is
an sdist available on PyPI. This means that the *real* impact will likely be
much smaller, but in an attempt not to miscount we take the broadest possible
definition.

At the time of this writing there are 65,232 projects hosted on PyPI and of
those, 59 of them rely on external files that are safely hosted outside of PyPI
and 931 of them rely on external files which are unsafely hosted outside of
PyPI. This shows us that 1.5% of projects will be affected in some way by this
change while 98.5% will continue to function as they always have. In addition,
only about 6% of the projects affected are using the features provided by
PEP 438 to safely host outside of PyPI while about 94% of them are exposing
their users to Remote Code Execution via a Man In The Middle attack.
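
The shares above follow from the three counts quoted in the text. A short
check in Python, using only those figures (the variable names are chosen for
illustration):

```python
total_projects = 65232
safe_external = 59      # rely on files safely hosted outside of PyPI
unsafe_external = 931   # rely on files unsafely hosted outside of PyPI

affected = safe_external + unsafe_external
affected_share = 100 * affected / total_projects
safe_share = 100 * safe_external / affected
unsafe_share = 100 * unsafe_external / affected

print(f"affected: {affected} projects ({affected_share:.1f}% of all projects)")
print(f"safely hosted externally: {safe_share:.1f}% of affected projects")
print(f"unsafely hosted externally: {unsafe_share:.1f}% of affected projects")
```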


Data Sovereignty
================

In the discussions around previous versions of this PEP, one of the key use
cases for wanting to host files externally to PyPI was due to data sovereignty
requirements for people living in jurisdictions outside of the USA, where PyPI
is currently hosted. The author of this PEP is not blind to these concerns and
realizes that this PEP represents a regression for the people that have these
concerns, however the current situation is presenting an extremely poor user
experience and the feature is only being used by a small percentage of
projects. In addition, the data sovereignty problem requires familiarity with
the laws outside of the home jurisdiction of the author of this PEP, who is
also the principal developer and operator of PyPI. For these reasons, a
solution for the problem of data sovereignty has been deferred and is
considered outside of the scope for this PEP.

If someone for whom the issue of data sovereignty matters wishes to put forth
the effort, then at that time a system can be designed, implemented, and
ultimately deployed and operated that would satisfy both the needs of non-US
users who cannot upload their projects to a system on US soil and the quality
of user experience that PyPI aims to provide.


Rejected Proposals
==================

Allow easier discovery of externally hosted indexes
---------------------------------------------------

A previous version of this PEP included a new feature added to both PyPI and
installers that would allow project authors to enter into PyPI a list of
URLs that would instruct installers to ignore any files uploaded to PyPI and
instead return an error telling the end user about these extra URLs that they
can add to their installer to make the installation work.

This idea is rejected because it provides a similarly painful end user experience
where people will first attempt to install something, get an error, then have
to re-run the installation with the correct options.


Keep the current classification system but adjust the options
-------------------------------------------------------------

This PEP rejects several related proposals which attempt to fix some of the
usability problems with the current system while still keeping the general
gist of PEP 438.

This includes:

* Default to allowing safely externally hosted files, but disallow unsafely
  hosted.

* Default to disallowing safely externally hosted files with only a global flag
  to enable them, but disallow unsafely hosted.

* Continue on the suggested path of PEP 438 and remove the option to unsafely
  host externally but continue to allow the option to safely host externally.

These proposals are rejected because:

* The classification system introduced in PEP 438 is an entirely unique concept
  to PyPI which is not generically applicable even in the context of Python
  packaging. Adding additional concepts comes at a cost.

* The classification system itself is non-obvious to explain, and
  pre-determining what classification of link a project will require entails
  inspecting the project's ``/simple/<project>/`` page, and possibly any URLs
  linked from that page.

* The ability to host externally while still being linked for automatic
  discovery is mostly a historic relic which causes a fair amount of pain and
  complexity for little reward.

* The installer's ability to optimize or clean up the user interface is limited
  due to the nature of the implicit link scraping which would need to be done.
  This extends to the ``--allow-*`` options as well as the inability to
  determine if a link is expected to fail or not.

* The mechanism paints with a very broad brush when enabling an option. While
  PEP 438 attempts to limit this with per-package options, a project that has
  existed for an extended period of time may often have several different URLs
  listed in its simple index. It is not unusual for at least
  one of these to no longer be under control of the project. While an
  unregistered domain will sit there relatively harmless most of the time, pip
  will continue to attempt to install from it on every discovery phase. This
  means that an attacker simply needs to look at projects which rely on unsafe
  external URLs and register expired domains to attack users.


Implement this PEP, but Do Not Remove the Existing Links
--------------------------------------------------------

This is essentially the backwards compatible version of this PEP. It attempts
to allow people using older clients, or clients which do not implement this
PEP to continue on as if nothing had changed. This proposal is rejected because
the vast bulk of those scenarios are unsafe uses of the deprecated features. It
is the opinion of this PEP that silently allowing unsafe actions to take place
on behalf of end users is simply not an acceptable solution.


Copyright
=========

This document has been placed in the public domain.


-----------------
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA



