[Python-checkins] peps: PEP 470: use external indexes over link spidering

nick.coghlan python-checkins at python.org
Wed May 14 14:10:36 CEST 2014


http://hg.python.org/peps/rev/12990e9fa4a9
changeset:   5476:12990e9fa4a9
user:        Nick Coghlan <ncoghlan at gmail.com>
date:        Wed May 14 22:10:25 2014 +1000
summary:
  PEP 470: use external indexes over link spidering

files:
  pep-0470.txt |  377 +++++++++++++++++++++++++++++++++++++++
  1 files changed, 377 insertions(+), 0 deletions(-)


diff --git a/pep-0470.txt b/pep-0470.txt
new file mode 100644
--- /dev/null
+++ b/pep-0470.txt
@@ -0,0 +1,377 @@
+PEP: 470
+Title: Using Multi Index Support for External to PyPI Package File Hosting
+Version: $Revision$
+Last-Modified: $Date$
+Author: Donald Stufft <donald at stufft.io>
+BDFL-Delegate: Richard Jones <richard at python.org>
+Discussions-To: distutils-sig at python.org
+Status: Draft
+Type: Process
+Content-Type: text/x-rst
+Created: 12-May-2014
+Post-History: 14-May-2014
+
+
+Abstract
+========
+
+This PEP proposes that the official means of having an installer locate
+package files which are hosted externally to PyPI become the use of multi
+index support instead of the practice of using external links on the simple
+installer API.
+
+It is important to remember that this is **not** about forcing anyone to host
+their files on PyPI. If someone does not wish to do so they will never be under
+any obligation to. They can still list their project in PyPI as an index, and
+the tooling will still allow them to host it elsewhere.
+
+
+Rationale
+=========
+
+There is a long history documented in PEP 438 that explains why externally
+hosted files exist today in the state that they do on PyPI. For the sake of
+brevity I will not duplicate that and instead urge readers to first take a look
+at PEP 438 for background.
+
+There are currently two primary ways for a project to make itself available
+without directly hosting the package files on PyPI. They can either include
+links to the package files in the simple installer API or they can publish
+a custom package index which contains their project.
+
+
+Custom Additional Index
+-----------------------
+
+Each installer which speaks to PyPI offers a mechanism for the user invoking
+that installer to provide additional custom locations to search for files
+during the dependency resolution phase. For pip these locations can be
+configured per invocation, per shell environment, per requirements file, per
+virtual environment, and per user.
+
+The use of additional indexes instead of external links on the simple
+installer API provides a simple clean interface which is consistent with the
+way most Linux package systems work (apt-get, yum, etc). More importantly it
+works the same even for projects which are commercial or otherwise have their
+access restricted in some form (private networks, password, IP ACLs etc)
+while the external links method only realistically works for projects which
+do not have their access restricted.
+
+Compared to the complex rules which a project must be aware of to prevent
+itself from being considered unsafely hosted, setting up an index is fairly
+trivial and in the simplest case does not require anything more than a
+filesystem and a standard web server such as Nginx. Even if using simple
+static hosting without autoindexing support, it is still straightforward
+to generate appropriate index pages as static HTML.
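+
+As a rough illustration of that last point, the following Python sketch
+writes an ``index.html`` for the index root and for each project directory,
+assuming one directory per project with the package files placed directly
+inside it. The names ``write_static_index`` and ``_write_page`` are
+hypothetical and are not part of pip or PyPI.
+
+    ::
+
+        import html
+        from pathlib import Path
+
+
+        def write_static_index(root):
+            """Write index.html for ``root`` and each project directory."""
+            root = Path(root)
+            projects = sorted(p for p in root.iterdir() if p.is_dir())
+            for project in projects:
+                # Link every package file, skipping the page itself.
+                names = sorted(
+                    f.name for f in project.iterdir()
+                    if f.is_file() and f.name != "index.html"
+                )
+                _write_page(project / "index.html", names)
+            # The root page links to each project directory.
+            _write_page(root / "index.html",
+                        [p.name + "/" for p in projects])
+
+
+        def _write_page(path, hrefs):
+            """Write a bare HTML page with one anchor per entry."""
+            links = "\n".join(
+                '<a href="{0}">{0}</a><br>'.format(html.escape(h))
+                for h in hrefs
+            )
+            path.write_text(
+                "<html><body>\n" + links + "\n</body></html>\n"
+            )
+
+Re-running ``write_static_index`` on the root directory after adding new
+files would regenerate the pages.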
+
+Example Index with Nginx
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+1. Create a root directory for your index, for the purposes of the example
+   I'll assume you've chosen ``/var/www/index.example.com/``.
+2. Inside of this root directory, create a directory for each project such
+   as ``mkdir -p /var/www/index.example.com/{foo,bar,other}/``.
+3. Place the package files for each project in their respective folder,
+   creating paths like ``/var/www/index.example.com/foo/foo-1.0.tar.gz``.
+4. Configure Nginx to serve the root directory, ideally with TLS, with the
+   autoindex directive enabled (see below for example configuration).
+
+    ::
+
+        server {
+            listen 443 ssl;
+            server_name index.example.com;
+
+            ssl_certificate     /etc/pki/tls/certs/index.example.com.crt;
+            ssl_certificate_key /etc/pki/tls/certs/index.example.com.key;
+
+            root /var/www/index.example.com;
+
+            autoindex on;
+        }
+
+
+Examples of Additional indexes with pip
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+**Invocation:**
+
+    ::
+
+        $ pip install --extra-index-url https://pypi.example.com/ foobar
+
+**Shell Environment:**
+
+    ::
+
+        $ export PIP_EXTRA_INDEX_URL=https://pypi.example.com/
+        $ pip install foobar
+
+**Requirements File:**
+
+    ::
+
+        $ printf '%s\n' '--extra-index-url https://pypi.example.com/' 'foobar' > requirements.txt
+        $ pip install -r requirements.txt
+
+**Virtual Environment:**
+
+    ::
+
+        $ python -m venv myvenv
+        $ printf '[global]\nextra-index-url = https://pypi.example.com/\n' > myvenv/pip.conf
+        $ myvenv/bin/pip install foobar
+
+**User:**
+
+    ::
+
+        $ printf '[global]\nextra-index-url = https://pypi.example.com/\n' > ~/.pip/pip.conf
+        $ pip install foobar
+
+
+External Links on the Simple Installer API
+------------------------------------------
+
+PEP 438 proposed a system of classifying file links as either internal,
+external, or unsafe. It recommended that by default an installer would only
+install from internal links; however, users could opt into external links on
+either a global or a per package basis. Additionally they could also opt into
+unsafe links on a per package basis.
+
+This system has turned out to be *extremely* unfriendly towards the end users
+and it is the position of this PEP that the situation has become untenable. The
+situation as provided by PEP 438 requires an end user to be aware not only of
+the difference between internal, external, and unsafe, but also to be aware of
+what hosting mode the package they are trying to install is in, what links are
+available on that project's /simple/ page, whether or not those links have
+a properly formatted hash fragment, and what links are available from pages
+linked to from that project's /simple/ page.
+
+There are a number of common confusion/pain points with this system that I
+have witnessed:
+
+* Users unaware what the simple installer API is at all or how an installer
+  locates installable files.
+* Users unaware that even if the simple API links to a file, the link will be
+  counted as unsafe if it does not include a ``#md5=...`` fragment.
+* Users unaware that an installer can look at pages linked from the
+  simple API to determine additional links, or that any links found in this
+  fashion are considered unsafe.
+* Users unaware, and often surprised to learn, that PyPI supports hosting
+  files someplace other than PyPI at all.
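+
+The classification that users are expected to understand can be approximated
+by the following sketch. It is a deliberate simplification of the rules in
+PEP 438; the ``classify_link`` helper, its parameters, and the assumed PyPI
+host name are illustrative only and do not reflect how any installer is
+actually implemented.
+
+    ::
+
+        from urllib.parse import urlparse
+
+
+        def classify_link(url, pypi_host="pypi.python.org", scraped=False):
+            """Return "internal", "external", or "unsafe" for a file link.
+
+            ``scraped`` marks links found on a page linked from the
+            simple API rather than on the simple API page itself.
+            """
+            parts = urlparse(url)
+            if parts.hostname == pypi_host:
+                # Files hosted on PyPI itself (host name is an assumption).
+                return "internal"
+            if not scraped and parts.fragment.startswith("md5="):
+                # Directly linked with a hash fragment.
+                return "external"
+            # No hash fragment, or only discovered by spidering.
+            return "unsafe"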
+
+In addition to that, the information that an installer is able to provide
+when an installation fails is pretty minimal. We are able to detect if there
+are externally hosted files directly linked from the simple installer API.
+However, we cannot detect if there are files hosted on a linked page without
+fetching that page, and doing so would cause a massive performance hit just
+to see if there might be a file there so that a better error message could be
+provided.
+
+Finally very few projects have properly linked to their external files so that
+they can be safely downloaded and verified. At the time of this writing there
+are a total of 65 projects which have files that are only available externally
+and are safely hosted.
+
+The end result of all of this is that with PEP 438, when a user attempts to
+install a file that is not hosted on PyPI, the steps they typically follow
+are:
+
+1. First, they attempt to install it normally, using ``pip install foobar``.
+   This fails because the file is not hosted on PyPI and PEP 438 has us default
+   to only installing files hosted on PyPI. If pip detected any externally
+   hosted files, or other pages where it *could* have looked for more files, it
+   will give an error message suggesting that they try
+   ``--allow-external foobar``.
+2. They then attempt to install their package using
+   ``pip install --allow-external foobar foobar``. If they are lucky foobar is
+   one of the packages which is hosted externally and safely and this will
+   succeed. If they are unlucky they will get a different error message
+   suggesting that they *also* try ``--allow-unverified foobar``.
+3. They then attempt to install their package using
+   ``pip install --allow-external foobar --allow-unverified foobar foobar``
+   and this finally works.
+
+These are the same basic steps that practically everyone goes through every
+time they try to install something that is not hosted on PyPI. If they are
+lucky it'll only take them two steps, but typically it requires three. Worse,
+there is no real indication to these people why one package might install
+after two steps but most require three. Even worse than that, most of them
+will never get an externally hosted package that does not take three steps, so
+they will be increasingly annoyed and frustrated at the intermediate step and
+will likely eventually just start skipping it.
+
+
+External Index Discovery
+========================
+
+One of the problems with using an additional index is discovery. Users will
+not generally be aware that an additional index is required at all, much
+less where that index can be found. Projects can attempt to convey this
+information using their description on the PyPI page; however, that excludes
+people who discover their project organically through ``pip search``.
+
+To support projects that wish to externally host their files and to enable
+users to easily discover what additional indexes are required, PyPI will gain
+the ability for projects to register external index URLs and additionally an
+associated comment for each. These URLs will be made available on the simple
+page; however, they will not be linked or provided in a form that would cause
+older installers to search them automatically.
+
+When an installer fetches the simple page for a project, if it finds this
+additional metadata and it cannot find any files for that project in its
+configured URLs then it should use this data to tell the user how to add one
+or more of the additional URLs to search in. This message should include any
+comments that the project has included to enable them to communicate to the
+user and provide hints as to which URL they might want if some are only
+useful or compatible with certain platforms or situations.
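+
+This PEP does not define how the registered URLs and comments are encoded on
+the simple page, so the following sketch simply assumes an installer has
+already parsed them into ``(url, comment)`` pairs. The function name
+``external_index_hint`` and the wording of the message are hypothetical; only
+the ``--extra-index-url`` option is taken from pip.
+
+    ::
+
+        def external_index_hint(project, indexes):
+            """Build a message for when no files were found for ``project``.
+
+            ``indexes`` is a list of (url, comment) pairs; this shape is
+            an assumption made for the example.
+            """
+            lines = [
+                "No files were found for {0}.".format(project),
+                "The project lists these external indexes:",
+            ]
+            for url, comment in indexes:
+                lines.append("  --extra-index-url {0}".format(url))
+                if comment:
+                    # Pass the project's own hint through to the user.
+                    lines.append("      {0}".format(comment))
+            return "\n".join(lines)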
+
+This feature *must* be added to PyPI prior to starting the deprecation and
+removal process for link spidering.
+
+
+Deprecation and Removal of Link Spidering
+=========================================
+
+A new hosting mode will be added to PyPI. This hosting mode will be called
+``pypi-only`` and will be in addition to the three that PEP 438 has already
+given us, which are ``pypi-explicit``, ``pypi-scrape``, and
+``pypi-scrape-crawl``. This new hosting mode will modify a project's simple
+API page so that it only lists the files which are directly hosted on PyPI
+and will not link to anything else.
+
+Upon acceptance of this PEP and the addition of the ``pypi-only`` mode, all new
+projects will be defaulted to the PyPI only mode and they will be locked to
+this mode and unable to change this particular setting. ``pypi-only`` projects
+will still be able to register external index URLs as described above - the
+"pypi-only" refers only to the download links that are published directly on
+PyPI.
+
+An email will then be sent out to all of the projects which are hosted only on
+PyPI informing them that in one month their project will be automatically
+converted to the ``pypi-only`` mode. A month after these emails have been sent
+any of those projects which were emailed, which still are hosted only on PyPI
+will have their mode set to ``pypi-only``.
+
+After that switch, an email will be sent to projects which rely on hosting
+external to PyPI. This email will warn these projects that externally hosted
+files have been deprecated on PyPI and that 6 months from the time of that
+email all external links will be removed from the installer APIs. This
+email *must* include instructions for converting their projects to be hosted
+on PyPI and *must* include links to a script or package that will enable them
+to enter their PyPI credentials and package name and have it automatically
+download and re-host all of their files on PyPI. This email *must also*
+include instructions for setting up their own index page and registering that
+with PyPI.
+
+Five months after the initial email, another email must be sent to any projects
+still relying on external hosting. This email will include all of the same
+information that the first email contained, except that the removal date will
+be one month away instead of six.
+
+Finally a month later all projects will be switched to the ``pypi-only`` mode
+and PyPI will be modified to remove the externally linked files functionality.
+
+
+Impact
+======
+
+============ ======= ========== =======
+\             PyPI    External   Total
+============ ======= ========== =======
+ **Safe**     37779   65         37844
+ **Unsafe**   0       2974       2974
+ **Total**    37779   3039       40818
+============ ======= ========== =======
+
+
+Rejected Proposals
+==================
+
+Keep the current classification system but adjust the options
+-------------------------------------------------------------
+
+This PEP rejects several related proposals which attempt to fix some of the
+usability problems with the current system while still keeping the general
+gist of PEP 438.
+
+This includes:
+
+* Default to allowing safely externally hosted files, but disallow unsafely
+  hosted.
+* Default to disallowing safely externally hosted files with only a global
+  flag to enable them, but disallow unsafely hosted.
+
+These proposals are rejected because:
+
+* The classification "system" is complex, hard to explain, and requires an
+  intimate knowledge of how the simple API works in order to be able to reason
+  about which classification is required. This is reflected in the fact that
+  the code to implement it is complicated and hard to understand as well.
+
+* People are generally surprised that PyPI allows externally linking to files
+  and doesn't require people to host on PyPI. In contrast most of them are
+  familiar with the concept of multiple software repositories such as is in
+  use by many OSs.
+
+* PyPI is fronted by a globally distributed CDN which has improved the
+  reliability and speed for end users. It is unlikely that any particular
+  external host has something comparable. This can lead to extremely bad
+  performance for end users when the external host is located in different
+  parts of the world or does not generally have good connectivity.
+
+  As a data point, many users reported sub DSL speeds and latency when
+  accessing PyPI from parts of Europe and Asia prior to the use of the CDN.
+
+* PyPI has monitoring and an on-call rotation of sysadmins who can respond to
+  downtime quickly. Again it is unlikely that any particular external host
+  will have this. This can lead to single packages in a dependency chain being
+  un-installable. This will often confuse users, who often have no idea that
+  this package relies on an external host, and they cannot figure out why PyPI
+  appears to be up but the installer cannot find a package.
+
+* PyPI supports mirroring, both for private organizations and public mirrors.
+  The legal terms of uploading to PyPI ensure that mirror operators, both
+  public and private, have the right to distribute the software found on PyPI.
+  However software that is hosted externally does not have this, causing
+  private organizations to need to investigate each package individually and
+  manually to determine if the license allows them to mirror it.
+
+  For public mirrors this essentially means that these externally hosted
+  packages *cannot* be reasonably mirrored. This is particularly troublesome
+  in countries such as China where the bandwidth to outside of China is
+  highly congested, often making a mirror within China a massively better
+  experience.
+
+* Installers have no method to determine if they should expect any particular
+  URL to be available or not. It is not unusual for the simple API to reference
+  old packages and URLs which have long since stopped working. This causes
+  installers to have to assume that it is OK for any particular URL to not be
+  accessible. This causes problems where a URL is temporarily down or
+  otherwise unavailable (a common cause of this is using a copy of Python
+  linked against a really ancient copy of OpenSSL which is unable to verify
+  the SSL certificate on PyPI) but it *should* be expected to be up. In this
+  case installers will typically silently ignore this URL and later the user
+  will get a confusing error stating that the installer couldn't find any
+  versions instead of getting the real error message indicating that the URL
+  was unavailable.
+
+* In the long run, global opt in flags like ``--allow-all-external`` will
+  become little annoyances that developers cargo cult around in order to make
+  their installer work. When they run into a project that requires it they
+  will most likely simply add it to their configuration file for that installer
+  and continue on with whatever they were actually trying to do. This will
+  continue until they try to install their requirements on another computer
+  or attempt to deploy to a server where their install will fail again until
+  they add the "make it work" flag in their configuration file.
+
+
+Copyright
+=========
+
+This document has been placed in the public domain.
+
+
+

+..
+   Local Variables:
+   mode: indented-text
+   indent-tabs-mode: nil
+   sentence-end-double-space: t
+   fill-column: 70
+   coding: utf-8
+   End:

-- 
Repository URL: http://hg.python.org/peps

