[Distutils] PEP 470 Round 2 - Using Multi Index Support for External to PyPI Package File Hosting

Donald Stufft donald at stufft.io
Fri Jun 6 04:08:44 CEST 2014


Here's round 2 of PEP 470.

You can see it online at https://python.org/dev/peps/pep-0470/ or below.

Notable changes:

- Ensure it's obvious this strictly deals with the installer API and does not
  affect a project's ability to register their project on PyPI for human
  consumption.

- Mention that the functional mechanisms that make it possible for an end user
  to specify the additional locations have existed for a long time across many
  versions of the installers.

- Explicitly mention that the installer changes from PEP 438 should be
  deprecated and removed as part of this PEP.

- Explicitly mention pythonhosted.org as a location that authors can use to
  host an index if they do not wish to purchase a TLS certificate or host
  additional infrastructure.

- Include that a link to PyPI ToS should be included in the emails sent to
  authors to remind them of the PyPI ToS.

- Special case PIL as it is an outlier in terms of impact.

- Fill out the impact sections further to provide more detail.


Abstract
========

This PEP proposes that the official means of having an installer locate and
find package files which are hosted externally to PyPI become the use of
multi index support instead of the practice of using external links on the
simple installer API.

It is important to remember that this is **not** about forcing anyone to host
their files on PyPI. If someone does not wish to do so they will never be under
any obligation to. They can still list their project in PyPI as an index, and
the tooling will still allow them to host it elsewhere.

This PEP is strictly concerned with the Simple Installer API and how automated
installers interact with PyPI. It has no bearing on the informational pages,
which are primarily for human consumption.


Rationale
=========

There is a long history documented in PEP 438 that explains why externally
hosted files exist today in the state that they do on PyPI. For the sake of
brevity I will not duplicate that and instead urge readers to first take a look
at PEP 438 for background.

There are currently two primary ways for a project to make itself available
without directly hosting the package files on PyPI. They can either include
links to the package files in the simple installer API or they can publish
a custom package index which contains their project.


Custom Additional Index
-----------------------

Each installer which speaks to PyPI offers a mechanism for the user invoking
that installer to provide additional custom locations to search for files
during the dependency resolution phase. For pip these locations can be
configured per invocation, per shell environment, per requirements file, per
virtual environment, and per user. The mechanisms for specifying additional
locations have existed within pip and setuptools for many years. By comparison,
the mechanisms in PEP 438 and any other new mechanism will have existed for
only a short period of time (if they exist at all currently).

The use of additional indexes instead of external links on the simple
installer API provides a simple, clean interface which is consistent with the
way most Linux package systems work (apt-get, yum, etc.). More importantly, it
works the same even for projects which are commercial or otherwise have their
access restricted in some form (private networks, passwords, IP ACLs, etc.),
while the external links method only realistically works for projects which
do not have their access restricted.

Compared to the complex rules which a project must be aware of to prevent
itself from being considered unsafely hosted, setting up an index is fairly
trivial. In the simplest case it does not require anything more than a
filesystem and a standard web server such as Nginx or Twisted Web. Even if
using simple static hosting without autoindexing support, it is still
straightforward to generate appropriate index pages as static HTML.
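
As a rough illustration of that last point, the sketch below writes an
``index.html`` into the root directory and into each project directory. It is
a minimal example, not a reference implementation: the function names
``render_listing`` and ``build_static_index`` are made up for illustration,
and the page layout (one plain anchor per file) is an assumption about what an
installer needs in order to find the links.

```python
import html
import os
from urllib.parse import quote


def render_listing(title, names):
    """Return a minimal HTML page with one anchor per entry in ``names``."""
    links = "\n".join(
        '<a href="{}">{}</a><br>'.format(
            html.escape(quote(name), quote=True), html.escape(name)
        )
        for name in sorted(names)
    )
    return (
        "<!DOCTYPE html>\n<html><head><title>{}</title></head>\n"
        "<body>\n{}\n</body></html>\n"
    ).format(html.escape(title), links)


def build_static_index(root):
    """Write index.html for ``root`` and for each project directory in it."""
    projects = [
        name for name in os.listdir(root)
        if os.path.isdir(os.path.join(root, name))
    ]
    for project in projects:
        project_dir = os.path.join(root, project)
        # List package files only; skip a previously generated index page.
        files = [
            name for name in os.listdir(project_dir)
            if name != "index.html"
            and os.path.isfile(os.path.join(project_dir, name))
        ]
        with open(os.path.join(project_dir, "index.html"), "w") as fh:
            fh.write(render_listing("Links for " + project, files))
    # The root page links to each project directory.
    with open(os.path.join(root, "index.html"), "w") as fh:
        fh.write(render_listing("Simple index", [p + "/" for p in projects]))
```

Calling ``build_static_index("/var/www/index.example.com/")`` after adding or
removing package files would regenerate the pages, which any static file
server could then serve.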

Example Index with Twisted Web
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1. Create a root directory for your index. For the purposes of the example
   I'll assume you've chosen ``/var/www/index.example.com/``.
2. Inside of this root directory, create a directory for each project such
   as ``mkdir -p /var/www/index.example.com/{foo,bar,other}/``.
3. Place the package files for each project in their respective folder,
   creating paths like ``/var/www/index.example.com/foo/foo-1.0.tar.gz``.
4. Configure Twisted Web to serve the root directory, ideally with TLS.

::

    $ twistd -n web --path /var/www/index.example.com/


Examples of Additional indexes with pip
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Invocation:**

::

    $ pip install --extra-index-url https://pypi.example.com/ foobar

**Shell Environment:**

::

    $ export PIP_EXTRA_INDEX_URL=https://pypi.example.com/
    $ pip install foobar

**Requirements File:**

::

    $ printf '%s\n' '--extra-index-url https://pypi.example.com/' 'foobar' > requirements.txt
    $ pip install -r requirements.txt

**Virtual Environment:**

::

    $ python -m venv myvenv
    $ printf '[global]\nextra-index-url = https://pypi.example.com/\n' > myvenv/pip.conf
    $ myvenv/bin/pip install foobar

**User:**

::

    $ printf '[global]\nextra-index-url = https://pypi.example.com/\n' > ~/.pip/pip.conf
    $ pip install foobar


External Links on the Simple Installer API
------------------------------------------

PEP 438 proposed a system of classifying file links as either internal,
external, or unsafe. It recommended that by default installers would only
install from internal links; however, users could opt into external links on
either a global or a per package basis. Additionally, they could opt into
unsafe links on a per package basis.

This system has turned out to be *extremely* unfriendly towards the end users
and it is the position of this PEP that the situation has become untenable. The
situation as provided by PEP 438 requires an end user to be aware not only of
the difference between internal, external, and unsafe, but also to be aware of
what hosting mode the package they are trying to install is in, what links are
available on that project's /simple/ page, whether or not those links have
a properly formatted hash fragment, and what links are available from pages
linked to from that project's /simple/ page.

There are a number of common confusion/pain points with this system that I
have witnessed:

* Users are unaware of what the simple installer API is at all or how an
  installer locates installable files.
* Users are unaware that even if the simple API links to a file, the link will
  be counted as unsafe if it does not include a ``#md5=...`` fragment.
* Users are unaware that an installer can look at pages linked from the
  simple API to determine additional links, or that any links found in this
  fashion are considered unsafe.
* Users are unaware and often surprised that PyPI supports hosting your files
  someplace other than PyPI at all.
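
The rules these points refer to can be sketched roughly as below. This is a
deliberately simplified illustration, not the logic of any real installer: the
function name ``classify_link``, the ``scraped`` flag, and the default
``index_host`` value are assumptions made for the example, and only the
``#md5=...`` fragment mentioned above is recognised.

```python
from urllib.parse import urlparse


def classify_link(url, index_host="pypi.python.org", scraped=False):
    """Classify a file link as "internal", "external", or "unsafe".

    ``scraped`` is True when the link was found on a page linked from the
    simple API rather than on the simple API page itself.
    """
    parts = urlparse(url)
    # Files served by the index itself are internal.
    if parts.hostname == index_host:
        return "internal"
    # A directly listed off-index link counts as external only if it
    # carries a non-empty hash fragment that allows verification.
    has_hash = parts.fragment.startswith("md5=") and len(parts.fragment) > 4
    if not scraped and has_hash:
        return "external"
    # Everything else (no hash, or found by following other pages) is unsafe.
    return "unsafe"
```

Even in this reduced form, a user has to know where a link was found and
whether it carries a hash before they can predict which flag an installer will
ask for.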

In addition to that, the information that an installer is able to provide
when an installation fails is pretty minimal. We are able to detect if there
are externally hosted files directly linked from the simple installer API.
However, we cannot detect if there are files hosted on a linked page without
fetching that page, and doing so would cause a massive performance hit just to
see if there might be a file there so that a better error message could be
provided.

Finally, very few projects have properly linked to their external files so that
they can be safely downloaded and verified. At the time of this writing there
are a total of 65 projects which have files that are only available externally
and are safely hosted.

The end result of all of this is that with PEP 438, when a user attempts to
install a file that is not hosted on PyPI, the steps they typically follow are:

1. First, they attempt to install it normally, using ``pip install foobar``.
   This fails because the file is not hosted on PyPI and PEP 438 has us default
   to only files hosted on PyPI. If pip detected any externally hosted files or
   other pages where it *could* have attempted to find other files, it will
   give an error message suggesting that they try ``--allow-external foobar``.
2. They then attempt to install their package using
   ``pip install --allow-external foobar foobar``. If they are lucky foobar is
   one of the packages which is hosted externally and safely and this will
   succeed. If they are unlucky they will get a different error message
   suggesting that they *also* try ``--allow-unverified foobar``.
3. They then attempt to install their package using
   ``pip install --allow-external foobar --allow-unverified foobar foobar``
   and this finally works.

These are the same basic steps that practically everyone goes through every
time they try to install something that is not hosted on PyPI. If they are
lucky it'll only take them two steps, but typically it requires three. Worse,
there is no real indication to these people why one package might install
after two steps while most require three. Even worse, most of them will never
encounter an externally hosted package that does not take three steps, so they
will be increasingly annoyed and frustrated at the intermediate step and will
likely eventually just start skipping it.


External Index Discovery
========================

One of the problems with using an additional index is discovery. Users
will not generally be aware that an additional index is required at all, much
less where that index can be found. Projects can attempt to convey this
information using their description on the PyPI page; however, that excludes
people who discover their project organically through ``pip search``.

To support projects that wish to externally host their files and to enable
users to easily discover what additional indexes are required, PyPI will gain
the ability for projects to register external index URLs, along with an
associated comment for each. These URLs will be made available on the simple
page; however, they will not be linked or provided in a form that would cause
older installers to search them automatically.

When an installer fetches the simple page for a project, if it finds this
additional metadata and it cannot find any files for that project in its
configured URLs, then it should use this data to tell the user how to add one
or more of the additional URLs to search in. This message should include any
comments that the project has included to enable them to communicate to the
user and provide hints as to which URL they might want if some are only
useful or compatible with certain platforms or situations. When an installer
has implemented the auto discovery mechanisms, it should also deprecate any
of the mechanisms added for PEP 438 (such as ``--allow-external``) for removal
at the end of the deprecation period proposed by the PEP.
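
A minimal sketch of how an installer might build such a message follows. The
function name ``discovery_hint`` and the representation of the registered
indexes as ``(url, comment)`` pairs are hypothetical, since the format of the
metadata on the simple page is not defined here. The only real interface
referenced is pip's existing ``--extra-index-url`` option.

```python
def discovery_hint(project, files_found, external_indexes):
    """Return a hint pointing the user at external indexes, or None.

    ``external_indexes`` is a list of ``(url, comment)`` pairs. This shape is
    an assumption made for the example.
    """
    # Only speak up when nothing was found and the project registered indexes.
    if files_found or not external_indexes:
        return None
    lines = [
        "No files were found for {!r} on the configured indexes.".format(project),
        "The project lists these additional indexes:",
    ]
    for url, comment in external_indexes:
        if comment:
            lines.append("  {}  ({})".format(url, comment))
        else:
            lines.append("  " + url)
    lines.append("To search one of them, re-run with: --extra-index-url <URL>")
    return "\n".join(lines)
```

For example, ``discovery_hint("foobar", False, [("https://pypi.example.com/",
"Linux builds")])`` would produce a message naming the URL together with the
project's comment, while a project whose files were found produces no hint.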

This feature *must* be added to PyPI prior to starting the deprecation and
removal process for link spidering.


Deprecation and Removal of Link Spidering
=========================================

A new hosting mode will be added to PyPI. This hosting mode will be called
``pypi-only`` and will be in addition to the three that PEP 438 has already
given us: ``pypi-explicit``, ``pypi-scrape``, and ``pypi-scrape-crawl``.
This new hosting mode will modify a project's simple api page so that it only
lists the files which are directly hosted on PyPI and will not link to anything
else.

Upon acceptance of this PEP and the addition of the ``pypi-only`` mode, all new
projects will default to the ``pypi-only`` mode and they will be locked to
this mode and unable to change this particular setting. ``pypi-only`` projects
will still be able to register external index URLs as described above - the
"pypi-only" refers only to the download links that are published directly on
PyPI.

An email will then be sent out to all of the projects which are hosted only on
PyPI informing them that in one month their project will be automatically
converted to the ``pypi-only`` mode. A month after these emails have been sent,
any of the emailed projects which are still hosted only on PyPI
will have their mode set to ``pypi-only``.

After that switch, an email will be sent to projects which rely on hosting
external to PyPI. This email will warn these projects that externally hosted
files have been deprecated on PyPI and that 6 months from the time of that
email all external links will be removed from the installer APIs. This
email *must* include instructions for converting their projects to be hosted
on PyPI and *must* include links to a script or package that will enable them
to enter their PyPI credentials and package name and have it automatically
download and re-host all of their files on PyPI. This email *must also*
include instructions for setting up their own index page and registering that
with PyPI, including the fact that they can use pythonhosted.org as a host
for an index page without requiring them to host any additional infrastructure
or purchase a TLS certificate. This email must also contain a link to the Terms
of Service for PyPI as many users may have signed up a long time ago and may
not recall what those terms are.

Five months after the initial email, another email must be sent to any projects
still relying on external hosting. This email will include all of the same
information that the first email contained, except that the removal date will
be one month away instead of six.

Finally, a month later, all projects will be switched to the ``pypi-only`` mode
and PyPI will be modified to remove the externally linked files functionality.
At this point in time any installers should finally remove any of the
deprecated PEP 438 functionality such as ``--allow-external`` and
``--allow-unverified`` in pip.
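
The timeline described in this section can be laid out with a small date
calculation. The sketch below is illustrative only: the helper names are made
up, the start date passed in is hypothetical, and it assumes the email to
externally hosted projects goes out on the same day as the first switch,
whereas the text only says it is sent after that switch.

```python
import calendar
from datetime import date


def add_months(day, months):
    """Return ``day`` shifted by ``months``, clamped to the month's last day."""
    index = day.year * 12 + (day.month - 1) + months
    year, month0 = divmod(index, 12)
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return date(year, month0 + 1, min(day.day, last_day))


def deprecation_schedule(first_email):
    """Return (date, event) pairs for the steps described above."""
    switch = add_months(first_email, 1)
    return [
        (first_email, "email projects hosted only on PyPI"),
        (switch, "set those projects to pypi-only; email external projects"),
        (add_months(switch, 5), "reminder email to projects still external"),
        (add_months(switch, 6), "all projects pypi-only; external links removed"),
    ]


for when, event in deprecation_schedule(date(2014, 7, 1)):
    print(when.isoformat(), event)
```

Under these assumptions the whole process takes seven months from the first
email to the removal of external links.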


PIL
---

It's obvious from the numbers below that the vast bulk of the impact comes from
the PIL project. On 2014-05-17 an email was sent to the contact for PIL
inquiring whether or not they would be willing to upload to PyPI. A response
has not been received as of yet (2014-06-05) nor has any change in the hosting
happened. Due to the popularity of PIL, this PEP also proposes that during the
deprecation period PyPI Administrators will set the PIL download URL as
the external index for that project. This allows the users of PIL to take
advantage of the auto discovery mechanisms although the project has seemingly
become unmaintained.


Impact
======

The largest impact of this is going to be projects where the maintainers are
no longer maintaining the project, for one reason or another. For these
projects it's unlikely that a maintainer will arrive to set the external index
metadata which would allow the auto discovery mechanism to find it.

Looking at the numbers, factoring out PIL (which has been special cased above),
the actual impact should be quite low, with it affecting just 6.9% of projects
which host only externally or 2.8% which have their latest version hosted
externally. This represents a mere 3883 unique IP addresses. The breakdown of
this is that of those 3883 addresses, 100% of them installed something that
could not be verified while only 3% installed something which could be.


Projects Which Rely on Externally Hosted files
----------------------------------------------

This is determined by crawling the simple index and looking for installable
files using a detection method similar to the one pip and setuptools use. The "latest"
version is determined using ``pkg_resources.parse_version`` sort order and it
is used to show whether or not the latest version is hosted externally or only
old versions are.

============ ======= ================ =================== =======
\             PyPI    External (old)   External (latest)   Total
============ ======= ================ =================== =======
 **Safe**     38716   31               35                  38782
 **Unsafe**   0       1659             1169                2828
 **Total**    38716   1690             1204                41610
============ ======= ================ =================== =======
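
The totals in the table above can be cross-checked with a few lines of
arithmetic. The sketch below only recomputes figures from the table. Reading
the 6.9% and 2.8% quoted in the Impact section as "any externally hosted
version" and "latest version externally hosted", each as a share of all
projects, is an assumption, and the calculation does not exclude PIL.

```python
# Project counts copied from the table above.
counts = {
    "safe": {"pypi": 38716, "external_old": 31, "external_latest": 35},
    "unsafe": {"pypi": 0, "external_old": 1659, "external_latest": 1169},
}


def column_total(column):
    """Sum one column of the table across the safe and unsafe rows."""
    return sum(row[column] for row in counts.values())


total = sum(
    column_total(c) for c in ("pypi", "external_old", "external_latest")
)
external_any = column_total("external_old") + column_total("external_latest")
external_latest = column_total("external_latest")

# Shares of all projects, in percent.
share_any = 100.0 * external_any / total
share_latest = 100.0 * external_latest / total
print("{:.2f}% external, {:.2f}% latest external".format(share_any, share_latest))
```

Both shares land within a tenth of a percentage point of the quoted figures,
which suggests the quoted values were truncated rather than rounded.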


Top Externally Hosted Projects by Requests
------------------------------------------

This is determined by looking at the number of requests the
``/simple/<project>/`` page had gotten in a single day. The total number of
requests during that day was 17,960,467.

============================== ========
Project                        Requests
============================== ========
PIL                            13470
mysql-connector-python         321
salesforce-python-toolkit      54
pyodbc                         50
elementtree                    44
atfork                         39
RBTools                        29
django-contrib-requestprovider 28
wadofstuff-django-serializers  23
Pygame                         21
============================== ========


Top Externally Hosted Projects by Unique IPs
--------------------------------------------

This is determined by looking at the IP addresses of requests the
``/simple/<project>/`` page had gotten in a single day. The total number of
unique IP addresses during that day was 105,587.

============================== ==========
Project                        Unique IPs
============================== ==========
PIL                            3515
mysql-connector-python         117
pyodbc                         34
elementtree                    21
RBTools                        19
egenix-mx-base                 16
Pygame                         14
salesforce-python-toolkit      13
django-contrib-requestprovider 12
wxPython                       11
python-apt                     10
============================== ==========


Rejected Proposals
==================

Keep the current classification system but adjust the options
-------------------------------------------------------------

This PEP rejects several related proposals which attempt to fix some of the
usability problems with the current system while still keeping the
general gist of PEP 438.

This includes:

* Default to allowing safely externally hosted files, but disallow unsafely
  hosted.
* Default to disallowing safely externally hosted files with only a global
  flag to enable them, but disallow unsafely hosted.

These proposals are rejected because:

* The classification "system" is complex, hard to explain, and requires an
  intimate knowledge of how the simple API works in order to be able to reason
  about which classification is required. This is reflected in the fact that
  the code to implement it is complicated and hard to understand as well.

* People are generally surprised that PyPI allows externally linking to files
  and doesn't require people to host on PyPI. In contrast most of them are
  familiar with the concept of multiple software repositories such as is in
  use by many OSs.

* PyPI is fronted by a globally distributed CDN which has improved the
  reliability and speed for end users. It is unlikely that any particular
  external host has something comparable. This can lead to extremely bad
  performance for end users when the external host is located in different
  parts of the world or does not generally have good connectivity.

  As a data point, many users reported sub-DSL speeds and latency when
  accessing PyPI from parts of Europe and Asia prior to the use of the CDN.

* PyPI has monitoring and an on-call rotation of sysadmins who can respond to
  downtime quickly. Again it is unlikely that any particular external host
  will have this. This can lead to single packages in a dependency chain being
  un-installable. This will often confuse users, who often have no idea that
  this package relies on an external host, and they cannot figure out why PyPI
  appears to be up but the installer cannot find a package.

* PyPI supports mirroring, both for private organizations and public mirrors.
  The legal terms of uploading to PyPI ensure that mirror operators, both
  public and private, have the right to distribute the software found on PyPI.
  However, software that is hosted externally does not have this, causing
  private organizations to need to investigate each package individually and
  manually to determine if the license allows them to mirror it.

  For public mirrors this essentially means that these externally hosted
  packages *cannot* be reasonably mirrored. This is particularly troublesome
  in countries such as China where the bandwidth to outside of China is
  highly congested, often making a mirror within China a massively better
  experience.

* Installers have no method to determine if they should expect any particular
  URL to be available or not. It is not unusual for the simple API to reference
  old packages and URLs which have long since stopped working. This causes
  installers to have to assume that it is OK for any particular URL to not be
  accessible. This causes problems where a URL is temporarily down or
  otherwise unavailable (a common cause of this is using a copy of Python
  linked against a really ancient copy of OpenSSL which is unable to verify
  the SSL certificate on PyPI) but it *should* be expected to be up. In this
  case installers will typically silently ignore this URL and later the user
  will get a confusing error stating that the installer couldn't find any
  versions instead of getting the real error message indicating that the URL
  was unavailable.

* In the long run, global opt-in flags like ``--allow-all-external`` will
  become little annoyances that developers cargo cult around in order to make
  their installer work. When they run into a project that requires it they
  will most likely simply add it to their configuration file for that installer
  and continue on with whatever they were actually trying to do. This will
  continue until they try to install their requirements on another computer
  or attempt to deploy to a server where their install will fail again until
  they add the "make it work" flag in their configuration file.


-----------------
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
