[Catalog-sig] V2 pre-PEP: transitioning to release file hosting on PYPI

Tue Mar 12 18:05:08 CET 2013

Hi Marc-Andre, all,

On Tue, Mar 12, 2013 at 17:06 +0100, M.-A. Lemburg wrote:
> On 12.03.2013 12:38, holger krekel wrote:
> > Hi all,
> > 
> > below is the new PEP pre-submit version (V2) which incorporates the
> > latest suggestions and aims at a rapidly deployable solution.  Thanks in
> > particular to Philip, Donald and Marc-Andre.  I also added a few notes
> > on how installers should behave with respect to non-PYPI crawling.  
> > 
> > I think a PEP like doc is warranted and that we should not silently
> > change things without proper communication to maintainers and pre-planning
> > the implementation/change process.  Arguably, the changes are more
> > invasive than "oh, let's just do a http->https redirect" which didn't
> > work too well either.
> > 
> > Now, if there is some agreement, i can submit this PEP officially tomorrow,
> > and given agreement/refinements from the Pycon folks and the likes of
> > Richard, we may be able to get going very shortly after Pycon.
> > 
> > cheers,
> > holger
> > 
> > 
> > PEP-draft: transitioning to release-file hosting on PYPI
> > ====================================================================
> > 
> > Status
> > -----------
> > 
> > PRE-SUBMIT-v2
> > 
> > Abstract
> > ------------
> > 
> > This PEP proposes a backward-compatible transition process to speed up,
> > simplify and robustify installing from the pypi.python.org (PYPI)
> > package index.  The initial transition will put most packages on PYPI
> > automatically in a configuration mode which will prevent client-side
> > crawling from installers.  To ease automatic transition and minimize
> > client-side friction, **no changes to distutils or installation tools** are
> > required.  Instead, the transition is implemented by modifying PYPI to
> > serve links from ``simple/`` pages in a configurable way, preventing or
> > allowing crawling of non-PYPI sites for detecting release files.
> > Maintainers of all PYPI packages will be notified ahead of those
> > changes.
> > 
> > Maintainers of packages which currently are hosted on non-PYPI sites
> > shall receive instructions and tools to ease "re-hosting" of their
> > historic and future package release files.  The implementation of such
> > tools is NOT required for implementing the initial automatic transition.
> > 
> > Installation tools like pip and easy_install shall warn about crawling
> > non-PYPI sites and later default to disallow it and only allow it with
> > an explicit option.
> > 
> > 
> > History and motivations for external hosting
> > ------------------------------------------------
> > 
> > When PYPI went online, it offered release registration but had no
> > facility to host release files itself.  When hosting was added, no
> > automated downloading tool existed yet.  When Philip Eby implemented
> > automated downloading (through setuptools), he made the choice 
> > to allow people to use download hosts of their choice.  This was
> > implemented by the PYPI ``simple/`` index containing links of type
> > ``rel=homepage`` or ``rel=download`` which are crawled by installation
> > tools to discover package links.  As of March 2013, a substantial part 
> > of packages (estimated to about 10%) make use of this mechanism to host
> > files on github, bitbucket, sourceforge or own hosting sites like 
> > ``mercurial.selenic.com``, to just name a few.
> > 
> > There are many reasons [2]_ why people choose to use external hosting,
> > to cite just a few:
> > 
> > - release processes and scripts have been developed already and 
> >   upload to external sites 
> > 
> > - it takes too long to upload large files from some places in the world
> > 
> > - export restrictions e.g. for crypto-related software
> > 
> > - company policies which prescribe offering open source packages through
> >   own sites
> > 
> > - problems with integrating uploading to PYPI into one's release process
> >   (because of release policies)
> > 
> > - perceived bad reliability of PYPI
> > 
> > - missing knowledge that you can upload files
> > 
> > Irrespective of the present-day validity of these reasons, there clearly
> > is a history why people choose to host files externally and it even was 
> > for some time the only way you could do things.  
> > 
> > 
> > Problem
> > ---------------
> > 
> > **Today, python package installers (pip and easy_install) often need to
> > query non-PYPI sites even if there are no externally hosted files**.
> > Apart from querying pypi.python.org's simple index pages, also all
> > homepages and download pages ever specified with any release of a
> > package are crawled by an installer.  The need for installers to
> > crawl 3rd party sites slows down installation and makes for a brittle
> > unreliable installation process.   Those sites and packages also don't 
> > take part in the :pep:`381` mirroring infrastructure, further decreasing
> > reliability and speed of automated installation processes around the world. 
> > 
> > Roughly 90% of packages are hosted directly on pypi.python.org [1]_.
> > Even for them installers still need to crawl the homepage(s) of a
> > package.  In particular, many package uploaders are not aware that
> > specifying the "homepage" in their release process will slow down 
> > the installation process for all their users.
> > 
> > Relying on third party sites also opens up more attack vectors
> > for injecting malicious packages into sites using automated installs.  
> > A simple attack might just involve getting hold of an old now-unused
> > homepage domain and placing malicious packages there.  Moreover,
> > performing a Man-in-The-Middle (MITM) attack between an installation
> > site and any of the download sites can inject malicious packages on the
> > installation site.  As many homepages and download locations are using
> > HTTP and not proper HTTPS, such attacks are not very hard to launch.
> > Such MITM attacks can happen even for packages which never intended to
> > host files externally as their homepages are contacted by installers
> > anyway.
> > 
> > There is currently no way for package maintainers to avoid 3rd party
> > crawling, other than removing all homepage/download url metadata
> > for all historic releases.  While a script [3]_ has been written to 
> > perform this action, it is not a good general solution because it removes
> > semantic information like the "homepage" specification from PYPI packages.
> > 
> > 
> > Solution
> > -----------
> > 
> > The proposed solution consists of the following implementation and
> > communication steps:
> > 
> > - determine which packages have release files only on PYPI (group A)
> >   and which have externally hosted release files (group B).
> > 
> > - Prepare PYPI implementation to allow a per-project "hosting mode",
> >   effectively enabling or disabling external crawling.  When enabled 
> >   nothing changes from the current situation of producing ``rel=download`` 
> >   and ``rel=homepage`` attributed links on ``simple/`` pages, 
> >   causing installers to crawl those sites.  
> >   When disabled, the attributions of links will change 
> >   to ``rel=newdownload`` and ``rel=newhomepage`` causing installers to
> >   avoid crawling 3rd party sites.  Retaining the meta-information allows
> >   tools to still make use of the semantic information.
> 
> Please start using versioned APIs for these things. The
> old style index should still be available under some
> URL, e.g. /simple-v1/ or /v1/simple/ or /1/simple/

Not sure it is necessary in this case.  I would think it makes
the implementation harder and it would probably break PEP381 (mirroring
infrastructure) as well.
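
To make the "hosting mode" bullet quoted above concrete, here is a minimal
sketch of how a ``simple/`` page could pick the ``rel`` value per project.
It is not PYPI code: the function name, its parameters and the link text
are assumptions for illustration, only the mode names and ``rel`` values
come from the draft.

```python
from html import escape


def render_meta_links(hosting_mode, homepage=None, download_url=None):
    """Return the anchor tags a simple/ page could emit for one release.

    Hypothetical helper: "notset" and "crawl" keep today's rel values,
    "nocrawl" renames them so that installers no longer follow them.
    """
    if hosting_mode not in ("notset", "crawl", "nocrawl"):
        raise ValueError("unknown hosting mode: %r" % (hosting_mode,))
    prefix = "new" if hosting_mode == "nocrawl" else ""
    links = []
    if homepage:
        links.append('<a href="%s" rel="%shomepage">home_page</a>'
                     % (escape(homepage, quote=True), prefix))
    if download_url:
        links.append('<a href="%s" rel="%sdownload">download_url</a>'
                     % (escape(download_url, quote=True), prefix))
    return links
```

The URL stays on the page in both cases, so tools that want the semantic
information can still read it.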

> > - send mail to maintainers of A that their project is going to be 
> >   automatically configured to "disable crawling" in one week
> >   and encourage them to set this mode earlier to help all of 
> >   their users.
> 
> One week ? That's a somewhat unrealistic timeframe.

Assuming we get our initial analysis correct, it's not a super-critical
change.  Also very easy to switch it back on a per-project basis.

I suggest we refine and repeat Donald's script from multiple places in
the world and merge the results to get a consolidated set of
"needs-no-crawling" packages.  If in doubt, we put a project into the
"needs-crawl" category.  Therefore, we can assume our set of
"needs-no-crawling" packages to be safe enough to perform the switching.
The one week is just there as an additional safety net, to give the
authors a chance to act if they think we got it wrong.  I don't think
we will end up with many problems and they will be localized to very very few
packages.  Extending the time frame will not help to significantly
reduce this number.  The main problem will be mails not reaching
a human, i suspect.
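
The merge rule could look roughly like the sketch below.  The input format
(one dict per run, mapping a package name to True when that run found
release files only on PYPI) is an assumption and not the output format of
Donald's script.

```python
def consolidate(runs):
    """Merge per-location results into (needs_no_crawling, needs_crawl).

    A package only counts as "needs-no-crawling" if every run reported
    True for it.  A package missing from a run, or flagged by any run,
    is treated as doubtful and lands in "needs-crawl".
    """
    all_names = set()
    for run in runs:
        all_names.update(run)

    needs_no_crawling, needs_crawl = set(), set()
    for name in all_names:
        if all(run.get(name) is True for run in runs):
            needs_no_crawling.add(name)
        else:
            needs_crawl.add(name)
    return needs_no_crawling, needs_crawl
```

With this rule a flaky network at one location can only move packages
into the "needs-crawl" set, never out of it.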

> I'm also missing some real-life tests to see what the effects
> are on actual users, e.g. set up the new index using a
> URL /simple-v2/ and let users play with it for a month
> before making /simple/ == /simple-v2/.

Preparation time is specified in the PEP by bringing the PYPI changes
online and asking _some_ people to set their hosting-mode.  As of now, 
the changes to PYPI are fairly trivial.

> > - send mail to maintainers of B that their package hosting mode 
> >   is "crawling enabled", and list the sites which currently are crawled,
> >   and suggest that they re-host their packages directly on PYPI and 
> >   then switch the hosting-mode "disable crawling".  Provide instructions 
> >   and at best tools to help with this "re-uploading" process.
> 
> That email should clearly state the PyPI terms to not
> cause surprises among the maintainers.

Can't the PYPI TOS be referenced from that mail?
And include an address they can write to in case of questions?

> I'd wait with this step until we've sorted out the PyPI terms
> issues on the python-legal list, to not cause an uproar
> from people who get to read the terms for the first time ;-)

We could postpone the B packages maintainers mailing if there 
is a legal need.  We can still migrate "A" packages already.

> > In addition, maintainers of installation tools are asked to release
> > two updates.  The first one shall provide clear warnings if external
> > crawling needs to happen, for which projects and URLS exactly 
> > this happens, and that in the future crawling will be disabled by default.  
> > The next update shall change the default to disallow crawling and allow 
> > crawling only with an explicit option like ``--crawl-externals`` and 
> > another option allowing to limit which hosts are allowed to be crawled
> > at all.
> 
> AFAIK, both already exist in easy_install. Not sure about pip.
> They are not enabled by default, though.

Right, i didn't investigate the current cmdline options in detail.  
To keep things simple i'd like to just specify the meta-level of (a)
giving warnings and (b) changing the default.
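
At that meta-level, the decision an installer makes per URL might be
sketched as follows.  This is not pip or easy_install code: the function
and its parameters are hypothetical, ``crawl_externals`` merely stands in
for an option like ``--crawl-externals``, and ``allowed_hosts`` for the
host-limiting option.

```python
import sys
from urllib.parse import urlparse

INDEX_HOST = "pypi.python.org"


def may_fetch(url, crawl_externals=False, allowed_hosts=None):
    """Decide whether a link found on a simple/ page may be fetched.

    crawl_externals=False models the second update (crawling is off
    unless requested); the first update would pass True by default and
    rely on the warning alone.
    """
    host = urlparse(url).hostname or ""
    if host == INDEX_HOST:
        return True                      # the index itself is always fine
    if allowed_hosts is not None and host not in allowed_hosts:
        return False                     # host filter wins over everything
    if not crawl_externals:
        return False                     # external crawling is disabled
    # (a) tell the user exactly which external site is being contacted
    print("warning: crawling external site %s (%s)" % (host, url),
          file=sys.stderr)
    return True
```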

> > Hosting-Mode state transitions
> > ----------------------------------
> > 
> > 1. At the outset, we set hosting-mode to "notset" for all packages.
> >    This will not change any link served via the simple index and thus
> >    no bad effects are expected.  Early adopters and testers may now
> >    change the mode to either "crawl" or "nocrawl" to help with
> >    streamlining issues in the PYPI implementation.
> > 
> > 2. When maintainers of B packages are mailed their mode is directly
> >    set to "crawl".
> > 
> > 3. When maintainers of A are mailed we leave the mode at "notset" to allow
> >    people to change it to "nocrawl" themselves or to set it to "crawl" 
> >    if they think they are wrongly in the "A" group.  After a week 
> >    all "notset" modes are set to "nocrawl".
> > 
> > A week after the mailings all packages will be in "crawl" or "nocrawl"
> > hosting mode.  It is then a matter of good tools and reaching out to
> > maintainers of B packages to increase the A/B ratio.
> > 
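
The three steps above amount to a very small state machine.  A sketch,
with hypothetical function names and following the literal reading of
step 2 (whether an early adopter's own choice should survive the B
mailing is not specified in the draft):

```python
def on_mailing(mode, group):
    """Hosting mode of a project right after its maintainers are mailed."""
    if group == "B":
        return "crawl"    # step 2: B projects are set to "crawl" directly
    return mode           # step 3: A projects keep their mode for now


def after_grace_week(mode):
    """One week later every remaining "notset" becomes "nocrawl"."""
    return "nocrawl" if mode == "notset" else mode
```

An A project whose maintainer switched to "crawl" during the week keeps
that setting, since only "notset" is touched by the final step.
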
> > Open questions
> > ----------------------
> > 
> > - Should the support tools for "rehosting" packages be implemented  on the
> >   server side or on the client side?  Implementing it on the client
> >   side probably is quicker to get right and less fatal in terms of failures.
> 
> Not sure what you mean here.

"Rehosting" tools help to transfer release files to PYPI which
are currently served on non-PYPI sites through the "crawling" algo.  
This could be done via a server-side interface or via client-side tools.  
I prefer the latter because i'd like to keep changes on the PYPI 
server minimal.  I am sure Richard agrees :)

> You are also completely leaving out the idea to only cache
> distribution files on the PyPI CDN, without having to actually
> upload them.

Not sure what you mean.  FWIW, how PYPI hosts packages itself is completely
left out of this PEP on purpose.  PYPI might evolve to offer packages on a CDN
or improve the existing PEP381 infrastructure or introduce simple
"rsync-ability" (like CPAN).  IOW, this "no crawling" PEP is orthogonal 
to this question.

> > - double-check if ``rel=newhomepage`` and ``rel=newdownload`` cause the 
> >   desired behaviour of pip and easy_install (both the distribute and 
> >   setuptools based one) to not crawl those pages.
> 
> Indeed :-)

We might just avoid rel-attributions and point to the XMLRPC/JSON API - 
i am sure this works with easy_install and pip :)  

> Note that it will still be possible to add links to the
> distribution files in the long description of the package.

> Those links also show up on the /simple/ index page and
> will then get used, regardless of whether they have a rel
> attribute set or not.

Yes, this should be noted.

> > - are the "support tools" for re-hosting outside the scope of this PEP?
> 
> As with any PEP proposing an API change or a new API, it
> has to provide a reference implementation.

The re-hosting tools are NOT required for the "transition" part of
the PEP.  The PYPI implementation changes are required, of course.
Donald offered to help with a PYPI PR and the PEP tries to minimize
the necessary changes.

> The current distutils upload command is geared towards
> uploading files at release time. While it is possible
> to trick it into uploading existing distribution files,
> it is not at all obvious how this is done.

Right, but i've written the code for that in another project.  Unless
someone else (probably Donald) beats me to it, i can try to help with
writing such a re-hosting tool.

> > - Think some more about pip/easy_install "allow-hosts" mode etc.
> 
> Note that tools such as zc.buildout provide easy ways
> of adding extra indexes and external URLs to scan for
> distribution files.
> 
> I'm not sure how the above would fit such use cases,
> i.e. if setuptools were to stop crawling external
> links per default, this could mean that user hosted
> PyPI-style indexes stop working with newer releases.
> 
> Here's an example list of indexes used in Plone 4.2:
> 
> # Add additional egg download sources here. dist.plone.org contains archives
> # of Plone packages.
> find-links =
>     http://dist.plone.org
>     http://download.zope.org/ppix/
>     http://download.zope.org/distribution/
>     http://effbot.org/downloads
>     http://dist.plone.org/release/4.2
> 
> None of these seem to use the rel attribute feature, so those
> will likely continue to work fine.

I am not surprised.  I don't know of alternative PYPI implementations
that actually implement "rel" attribution.  Most of them have the purpose
of controlling which packages are installed in company environments and
thus have no need to implement this crawling mechanism but rather always
host files in their database.

cheers,
holger

> -- 
> Marc-Andre Lemburg
> eGenix.com