[Catalog-sig] V3 PEP-draft for transitioning to pypi-hosting of release files

holger krekel holger at merlinux.eu
Wed Mar 13 12:21:59 CET 2013


Hi all,

after some more discussions and hours spent by Carl Meyer (who is now
co-authoring the PEP) and me, here is a new V3 pre-submit draft.
It is now more ambitious than the previous draft, as should be obvious
from the modified abstract (and from Carl Meyer's and Philip's earlier
interactions on this list).  It also gives more detail on how
the current link-scraping works, among other improvements and incorporations
of feedback from discussions here.

We intend to submit this draft tonight to the PEP editors.  

Feedback now and later remains welcome.  I am sure there are issues to 
be sorted and clarified, among them the versioning-API suggestion by 
Marc-Andre.

Thanks for everybody's support and feedback so far,
holger


PEP: XXX
Title: Transitioning to release-file hosting on PyPI
Version: $Revision$
Last-Modified: $Date$
Author: Holger Krekel <holger at merlinux.eu>, Carl Meyer <carl at oddbird.net>
Discussions-To: catalog-sig at python.org
Status: Draft (PRE-submit V3)
Type: Process
Content-Type: text/x-rst
Created: 10-Mar-2013
Post-History:


Abstract
========

This PEP proposes a backward-compatible two-phase transition process to make
installing from the pypi.python.org (PyPI) package index faster, simpler
and more robust.  To ease the transition and minimize client-side
friction, **no changes to distutils or existing installation tools are
required in order to benefit from the transition phases, which are expected
to result in faster, more reliable installs for most existing packages**.

The first transition phase implements easy and explicit means for
a package maintainer to control which release file links are
served to present-day installation tools.  The first phase also
includes the implementation of analysis tools for present-day packages,
to support communication with package maintainers and the automated
setting of default modes for controlling release file links.

The second transition phase will result in the current PyPI index
serving only PyPI-hosted files by default.  Externally hosted files
will still be automatically discoverable through a second index.
Present-day installation tools will be able to continue working
by specifying this second index.  New versions of installation
tools shall default to installing packages only from PyPI unless
the user explicitly wishes to include non-PyPI sites.



Rationale
=========

.. _history:

History and motivations for external hosting
--------------------------------------------

When PyPI went online, it offered release registration but had no
facility to host release files itself.  When hosting was added, no
automated downloading tool existed yet.  When Philip Eby implemented
automated downloading (through setuptools), he made the choice to
allow people to use download hosts of their choice.  The finding of
externally-hosted packages was implemented as follows:

#. The PyPI ``simple/`` index for a package contains all links found
   anywhere in that package's metadata for any release. Links in the
   "Download-URL" and "Home-page" metadata fields are given
   ``rel=download`` and ``rel=homepage`` attributes, respectively.

#. Any of these links whose target is a file whose name appears to be
   in the form of an installable source or binary distribution, with
   basename in the form "packagename-version.ARCHIVEEXT", is considered 
   a potential installation candidate.

#. Similarly, any links suffixed with an "#egg=packagename-version"
   fragment are considered an installation candidate.

#. Additionally, the ``rel=homepage`` and ``rel=download`` links are
   followed and, if HTML, are themselves scraped for release-file links
   in the above formats.
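
The rules above can be sketched roughly as follows. This is an illustrative approximation, not the setuptools implementation: the function names, the ``(url, rel)`` pair representation and the ``ARCHIVE_EXTS`` list are assumptions made for the example, and name matching is simplified to an exact, case-sensitive prefix check.

```python
import posixpath
from urllib.parse import urlsplit

# Assumed set of archive extensions; the text above only says "ARCHIVEEXT".
ARCHIVE_EXTS = (".tar.gz", ".tgz", ".tar.bz2", ".zip", ".egg")


def is_candidate(url, package):
    """Return True if `url` looks like an installable file for `package`."""
    parts = urlsplit(url)
    # An "#egg=packagename-version" fragment marks a candidate.
    if parts.fragment.startswith("egg=" + package + "-"):
        return True
    # Otherwise the basename must look like "packagename-version.ARCHIVEEXT".
    basename = posixpath.basename(parts.path)
    if not basename.startswith(package + "-"):
        return False
    return basename.endswith(ARCHIVE_EXTS)


def classify(links, package):
    """Split (url, rel) pairs into direct candidates and pages to crawl."""
    candidates = [url for url, rel in links if is_candidate(url, package)]
    # rel=homepage / rel=download pages are fetched and scraped as well.
    to_crawl = [url for url, rel in links
                if rel in ("homepage", "download")
                and not is_candidate(url, package)]
    return candidates, to_crawl
```

Every URL in ``to_crawl`` costs the installer an extra request to a third-party site, whether or not that page turns out to contain release files.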

Today, most packages released on PyPI host their release files on
PyPI, but a small percentage (XXX need updated data) rely on external
hosting.

There are many reasons [2]_ why people have chosen external
hosting. To cite just a few:

- release processes and scripts have been developed already and upload
  to external sites

- it takes too long to upload large files from some places in the
  world

- export restrictions e.g. for crypto-related software

- company policies which require offering open source packages
  through their own sites

- problems with integrating uploading to PyPI into one's release
  process (because of release policies)

- desiring download statistics different from those maintained by PyPI

- perceived bad reliability of PyPI

- not aware that PyPI offers file-hosting

Irrespective of the present-day validity of these reasons, there
is clearly a history behind why people chose to host files externally,
and for some time it was even the only way to do things.


Problem
-------

**Today, Python package installers (pip, easy_install, buildout, and
others) often need to query many non-PyPI URLs even if there are no
externally hosted files**.  Apart from querying pypi.python.org's
simple index pages, an installer also crawls every homepage and
download page ever specified for any release of a package.
The need for installers to crawl external sites slows down
installation and makes for a brittle and unreliable installation
process.  Those sites and packages also don't take part in the
:pep:`381` mirroring infrastructure, further decreasing reliability
and speed of automated installation processes around the world.

Most packages are hosted directly on pypi.python.org [1]_.  Even for
these packages, installers still crawl the homepage(s) of a package.
Many package uploaders are not aware that specifying the "homepage" in
their release process will slow down the installation process for all
users.

Relying on third party sites also opens up more attack vectors for
injecting malicious packages into sites using automated installs.  A
simple attack might just involve getting hold of an old now-unused
homepage domain and placing malicious packages there.  Moreover,
performing a Man-in-The-Middle (MITM) attack between an installation
site and any of the download sites can inject malicious packages on
the installation site.  As many homepages and download locations are
using HTTP and not HTTPS, such attacks are not hard to launch.  Such
MITM attacks can easily happen even for packages which never intended
to host files externally as their homepages are contacted by
installers anyway.

There is currently no way for package maintainers to avoid 3rd party
crawling, other than removing all homepage/download URL metadata for
all historic releases.  While a script [3]_ has been written to
perform this action, it is not a good general solution because it
removes semantic information like the "homepage" specification from
PyPI packages.

Even if the "Home-page" and "Download-URL" links were not scraped for
further links, there is still no way under the current system for a
package owner to link to an installable file from their package
metadata without installation tools automatically considering that
file a candidate for installation.


Solution / two transition phases
================================

The first transition phase starts off by introducing a "hosting-mode"
field for each project on PyPI, allowing explicit control of which
machine-readable release file links are served to present-day
installation tools.  Once individual early adopters have successfully
changed their hosting mode, the first phase will set a
default hosting mode for existing packages, based on automated analysis.
**Maintainers will be notified one month ahead of any such automated
change**.  At completion of the first transition phase, **all
present-day existing release and installation processes and tools are
expected to continue working**.  Any remaining errors or problems are
expected to only relate to installation of individual packages and can
be easily corrected by package maintainers, or by PyPI admins if
maintainers are not reachable.

**The second transition phase will then get PyPI, after a three-month
warning period, to only serve links for PyPI-hosted packages under the
present-day ``simple/`` index**.  At this point, present-day installation
tools will not see externally hosted links anymore, unless they specify
a new ``simple/-with-externals`` index which PyPI MUST offer ahead of
the start of the second transition phase.  This new index contains
the external links as controlled by a package maintainer.  Moreover, PyPI
MUST also provide means to register and control download
links, independently from the current metadata and remote HTML-scraping
methods.  At completion of the second transition phase, all present-day
installation tools will, and all future releases of installation tools
SHALL, default to installing only PyPI-hosted packages unless a user
specifies option(s) to include external links or the external index.  If an
installation tool chooses to use the new ``simple/-with-externals/`` as
a default, it MUST warn the user with a precise message stating which
external links were followed.

Maintainers of packages which currently host release files on non-PyPI
sites shall receive instructions and tools to ease "re-hosting" of
their historic and future package release files.  The implementation
of such a re-hosting tool is expected but NOT REQUIRED to be available 
at the beginning of phase 2.


Implementation
==============

The foundation of both transition phases is the introduction of three
"modes" of PyPI hosting for a package, affecting which links are
generated for the ``simple/`` index in transition phase 1.  These modes
are implemented without requiring changes to installation tools, via changes
to the algorithm for generating the machine-readable ``simple/`` index.

The modes are:

- ``pypi-ext-crawl``: no change from the current situation of generating
  machine-readable links for installation tools, as outlined in the
  history_.

- ``pypi-ext``: for a package in this mode, the "Home-page" and
  "Download-URL" links added to the simple index are given
  ``rel=ext-homepage`` and ``rel=ext-download`` attributes instead of
  ``rel=homepage`` and ``rel=download``. The effect of this (with no
  change in installation tools necessary) is that these links will
  not be followed and scraped for further candidate links. Only installable
  files linked directly from PyPI metadata (wherever they are hosted) will be
  considered for installation.

- ``pypi-only``: for a package in this mode, only links to URLs on
  PyPI itself will be added to the simple index.
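
As a minimal sketch of how the three modes could change what is served, assuming each metadata link is represented as a ``(url, rel)`` pair: the function name ``simple_index_links`` and the ``PYPI_HOST`` constant are hypothetical and do not refer to existing PyPI code.

```python
from urllib.parse import urlsplit

PYPI_HOST = "pypi.python.org"


def simple_index_links(mode, links):
    """Return the (url, rel) pairs served on ``simple/`` for one project.

    `links` holds every (url, rel) pair found in the project's metadata;
    rel is "homepage", "download" or None.
    """
    if mode == "pypi-ext-crawl":
        # Unchanged from the current behaviour.
        return list(links)
    if mode == "pypi-ext":
        # Rename the rel values so present-day tools no longer crawl them.
        return [(url, "ext-" + rel if rel in ("homepage", "download") else rel)
                for url, rel in links]
    if mode == "pypi-only":
        # Keep only links that point at PyPI itself.
        return [(url, rel) for url, rel in links
                if urlsplit(url).hostname == PYPI_HOST]
    raise ValueError("unknown hosting mode: %r" % (mode,))
```

The sketch changes only what the server emits, which matches the text's requirement that the modes work without any change to installation tools.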

At the end of the warning period of transition phase 2, the ``simple/``
index will be restricted to only show links to URLs on PyPI itself.  The
``simple/-with-externals`` index will, during both transition phases, show
links to PyPI and any externals as controlled by the package maintainer
and the hosting-mode.

For a package in ``pypi-only`` mode, external links will no longer be
automatically scraped from metadata and added to the two indexes.
However, PyPI will expose an interface for package maintainers to
explicitly specify any number of URLs to externally hosted installable
files for a given release, and these URLs will be added to the
``simple/-with-ext`` index page for that project but NOT to the basic
``simple/`` index page. Thus the ``-with-ext`` alternative index gives
package owners with good reason to host their packages elsewhere a
means to do so (even under the ``pypi-only`` package mode) and still
have that information reflected on PyPI in machine-readable form, allowing
installation tool users an explicit and easy choice of whether they wish
to read an index that includes externally-hosted packages or one that
does not.
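
For a project in ``pypi-only`` mode, the relationship between the two index pages might look like the following sketch. The function name and the dictionary keys are assumptions for illustration; the index name follows the ``-with-ext`` spelling used in the paragraph above.

```python
def index_pages(pypi_file_urls, registered_external_urls):
    """Return the link lists for both index pages of a ``pypi-only`` project.

    `registered_external_urls` are the URLs a maintainer registered
    explicitly through the (not yet specified) PyPI interface.
    """
    simple = list(pypi_file_urls)
    # The alternative index is a superset: PyPI files plus registered externals.
    with_ext = simple + list(registered_external_urls)
    return {"simple/": simple, "simple/-with-ext": with_ext}
```

A tool that reads only ``simple/`` never sees the registered external URLs, so including them is an explicit choice of index rather than a side effect of metadata.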

The goal of this PEP is that eventually all projects on PyPI can be
migrated to the ``pypi-only`` mode, while preserving the ability to
install release files hosted from third parties in an automated manner.

Deprecation of hosting-modes to eventually only allow the "pypi-only"
mode is NOT REGULATED by this PEP but is expected to become feasible
some time after successful implementation of the two transition phases
described in this PEP.


Implementation and interaction timeline
---------------------------------------

The proposed solution consists of multiple implementation and
communication steps:

#. Implement in PyPI the three modes and the ``-with-ext`` index as
   described above, and an interface for package owners to select the
   mode for each package and register explicit external file URLs for
   the ``-with-ext`` index (for projects in the ``pypi-only`` mode).
   Default all newly-registered packages to ``pypi-only`` mode (but
   package owners can still switch to the other modes as
   desired). Implement in ``pep381client`` the mirroring of the
   ``-with-ext`` index pages.

#. Determine which packages have installable versions available that
   are linked only from homepage/download pages (group B) and which
   packages have all installable files available on PyPI itself (group
   A).

#. Send mail to maintainers of projects in group A that their project
   is going to be automatically configured to ``pypi-ext`` mode in one
   month.  Inform them that this change is not expected to affect
   installability of their project at all, but will result in faster
   and safer installs for their users.  Encourage them to set this
   mode (or ``pypi-only``) themselves earlier to benefit their users.

#. Send mail to maintainers of packages in group B that their package
   hosting mode is ``pypi-ext-crawl``, list the sites which currently
   are crawled, and suggest that they re-host their packages directly
   on PyPI and then switch to ``pypi-only``.  Provide instructions and
   tools to help with this "re-uploading" process.
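
The grouping in the steps above could be approximated as below. This is only one reading of the group definitions, not the analysis tool itself: the input format, the source labels ``"pypi"``, ``"direct"`` and ``"crawled"``, and both function names are assumptions made for the example.

```python
def hosting_group(versions):
    """Return "B" if any version is installable only via crawled pages, else "A".

    `versions` maps a version string to the set of places an installable
    file for it was found: "pypi", "direct" (linked from metadata) or
    "crawled" (found by scraping a homepage/download page).
    """
    for sources in versions.values():
        if sources == {"crawled"}:
            return "B"
    return "A"


def planned_mode(group):
    """Hosting mode a project ends up in after the notification period."""
    # Group A is switched to pypi-ext after the one-month notice;
    # group B stays on pypi-ext-crawl until its maintainers act.
    return {"A": "pypi-ext", "B": "pypi-ext-crawl"}[group]
```

Under this reading a single release that is reachable only by crawling is enough to keep a project in group B, which errs on the side of not breaking existing installs.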

In addition, maintainers of installation tools are asked to release
two updates.  The first one shall provide clear warnings if
externally-hosted packages (that is, packages at a URL whose domain
name differs from the domain name of the index URL in use) are
selected for download.  The warnings shall state exactly for which
projects and URLs this happens, and that in future versions
externally-hosted downloads will be disabled by default.

The second update for installation tools should change the default
mode to allow only installation of package files hosted at the index
domain, and allow installation of externally-hosted packages only when
the user supplies an option (ideally an option specifying exactly
which external domains are to be trusted as download sources). When
download of an externally-hosted package is disallowed, the user
should be notified, with instructions for how to make the install
succeed and warnings about the potential consequences.
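
The behaviour asked of the second tool update can be sketched as follows. The ``allowed_hosts`` parameter stands in for a hypothetical user option listing trusted external domains; it does not correspond to an existing option of pip, easy_install or buildout, and the message wording is invented for the example.

```python
from urllib.parse import urlsplit


def is_external(file_url, index_url):
    """True if `file_url` is on a different domain than the index in use."""
    return urlsplit(file_url).hostname != urlsplit(index_url).hostname


def check_download(file_url, index_url, allowed_hosts=()):
    """Return (allowed, message) for a candidate download."""
    if not is_external(file_url, index_url):
        return True, None
    host = urlsplit(file_url).hostname
    if host in allowed_hosts:
        # Allowed by the user, but still reported.
        return True, "warning: downloading externally hosted file %s" % file_url
    return False, ("refusing externally hosted file %s; "
                   "allow host %r explicitly to install it" % (file_url, host))
```

Comparing host names against the index URL, rather than against a hard-coded PyPI domain, keeps the check meaningful for users of mirrors or private indexes.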

It is expected that tools in this release may choose to change the
default index URL to ``https://pypi.python.org/simple/-with-ext`` in
order to support explicitly-registered external URLs for projects in
``pypi-only`` mode. Tools may choose to do this only when the user
requests installation of externally-hosted packages, or may choose to
do this in all cases so as to be able to notify users when an
externally-hosted file is available.

Specific timelines for deprecation of ``pypi-ext-crawl`` and
``pypi-ext`` modes are not mandated in this PEP; this will depend on
observed behavior of package owners and availability of tooling. It is
expected that ``pypi-ext-crawl`` mode will be an early candidate for
deprecation; it may be necessary to leave ``pypi-ext`` mode in place 
for quite some time, at least for those packages already
depending on it (it may be removed as an option for new packages when
tool support for explicit external URLs and the ``-with-ext`` index is
sufficient).



Open questions
==============

- Should we introduce a third index which maintains the old behaviour
  of providing links irrespective of a maintainer's hosting-mode choice?

- Should we introduce some form of PyPI API versioning in this PEP?
  (It might complicate matters and delay the implementation but is
  often seen as good practice.)


References
==========

.. [1] Donald Stufft, ratio of externally hosted versus pypi-hosted, http://mail.python.org/pipermail/catalog-sig/2013-March/005549.html (XXX need to update this data for all easy_install-supported formats)

.. [2] Marc-Andre Lemburg, reasons for external hosting, http://mail.python.org/pipermail/catalog-sig/2013-March/005626.html

.. [3] Holger Krekel, script to remove homepage/download metadata for all releases, http://mail.python.org/pipermail/catalog-sig/2013-February/005423.html

Acknowledgements
================

Philip Eby for precise information and the basic ideas to implement
the transition via server-side changes only.

Donald Stufft for pushing away from external hosting
and offering to implement both a Pull Request for the necessary PyPI changes
and the analysis tool to drive transition phase 1.

Marc-Andre Lemburg, Nick Coghlan and catalog-sig in general for 
thinking through issues regarding getting rid of "external hosting".

Copyright
=========

This document has been placed in the public domain.



..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End:

