[Catalog-sig] Deprecate External Links

Aaron Meurer asmeurer at gmail.com
Wed Feb 27 17:14:42 CET 2013


On Wed, Feb 27, 2013 at 8:26 AM, Donald Stufft <donald.stufft at gmail.com> wrote:
> PyPI is now being served with a valid SSL certificate, and the
> tooling has begun to incorporate SSL verification of PyPI into
> the process. This is _excellent_ and the parties involved should
> all be thanked. However, there is still another massive area of
> insecurity within the packaging tool chain.
>
> For those who don't know, when you attempt to install a particular
> package a number of urls are visited. The steps look roughly
> something like this:
>
>     1. Visit http://pypi.python.org/simple/Package/ and attempt to
>         collect any links that look installable (tarballs,
>         #egg=, etc).
>         Note: /simple/Package/ contains download_url, home_page,
>         and any link that is contained in the long_description.
>     2. Visit any link referenced as home_page and attempt to
>         collect any links that look installable.
>     3. Visit any link referenced in dependency_links and attempt
>         to collect any links that look installable.
>     4. Take all of the collected links and determine which one
>         best matches the requirement spec given and download it.
>     5. Rinse and repeat for every dependency in the requirement
>         set.
>
> I propose we deprecate the external links that PyPI has published
> on the /simple/ indexes which exist because of the history of PyPI.
> Ideally in some number of months (1? 2?) we would turn off adding
> these links from new releases, leaving the existing ones intact and
> then a few months later the existing links be removed completely.
>
> Reasoning:
>   1. It is difficult to secure the process of spidering external links
>     for download.
>     1a. The only way I can think of offhand is by requiring uploading
>           a hash of the expected files to PyPI along with the download
>           link and removing all urls except for the download_url. This
>           has the effect that only 1 file can be associated with a
>           particular release.
>   2. External links decrease the expected uptime for a particular set
>       of requirements. PyPI itself has become very stable, however
>       the same cannot be said for all of the linked hosts that the
>       toolchain processes. Each new host is an additional SPOF.
>
>       Ex: I depend on PyPI and 10 other external packages, each
>             service has a 99% uptime so my expected uptime to
>             be able to install all my requirements would be ~89%
>             (0.99 ** 11).
>   3. Breaks the ability for a CDN and/or mirroring infrastructure to provide
>       increased uptime and better latency/throughput across the globe.
>   4. Privacy implications: as a user it is not particularly obvious when
>       I run `pip install Foo` what hosts I will be issuing requests
>       against.
>       It is obvious that I will be contacting PyPI and I will have made the
>       decision to trust PyPI however it is not obvious what other hosts will
>       be able to gather information about me, including what packages I am
>       installing. This becomes even more difficult to determine the deeper
>       my dependency tree goes.

5. This is a serious PITA for package maintainers. If you accidentally
upload a file somewhere else that looks like a newer version, pip will
install that instead of the release you intended.

6. It's a huge security hole.  To get a malicious package installed,
someone just has to gain access to any site that is crawled by pip,
which includes all old download sites.  If a project used to use some
download domain but no longer owns it, it is very easy for someone
else to take it over and upload an arbitrary malicious file with a
slightly newer version number, and pip will happily install that for
everyone.
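
To make points 5 and 6 concrete, here is a rough sketch of the
"highest version wins" selection. This is not pip's actual code: the
Candidate type, the pick_best and version_key functions, the URLs and
the version numbers are all made up for illustration, and version
comparison is simplified to tuples of integers.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    """A hypothetical installable link found while crawling."""
    version: str
    url: str


def version_key(version: str) -> tuple:
    # Simplified assumption: only purely numeric dotted versions ("0.7.2").
    return tuple(int(part) for part in version.split("."))


def pick_best(candidates):
    # The host a link came from plays no role here;
    # the highest version number simply wins.
    return max(candidates, key=lambda c: version_key(c.version))


candidates = [
    # File uploaded to the index by the maintainer (illustrative URL).
    Candidate("0.7.2", "https://pypi.python.org/packages/Foo-0.7.2.tar.gz"),
    # Found by crawling an old download page on a domain the
    # maintainer no longer controls (illustrative URL).
    Candidate("0.7.3", "http://old-downloads.example.com/Foo-0.7.3.tar.gz"),
]

best = pick_best(candidates)
print(best.url)
```

In this sketch the file from the stale external host is the one that
gets chosen, purely because its version number is higher.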

This was discussed at
http://mail.python.org/pipermail/catalog-sig/2012-June/004518.html.
My suggestion was to only download from the explicit external download
link for the latest listed version, and to do so only if no file had
been uploaded to PyPI for that version.
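
A minimal sketch of one reading of that suggestion follows. The
releases mapping, its "uploaded_files" and "download_url" keys, and
the choose_download function are hypothetical names chosen for the
example, not an existing PyPI or pip API, and versions are again
assumed to be purely numeric.

```python
from typing import Optional


def version_key(version: str) -> tuple:
    # Simplified assumption: only purely numeric dotted versions.
    return tuple(int(part) for part in version.split("."))


def choose_download(releases: dict) -> Optional[str]:
    """Pick a single URL for the latest listed version, without crawling."""
    if not releases:
        return None
    latest = max(releases, key=version_key)
    info = releases[latest]
    uploaded = info.get("uploaded_files") or []
    if uploaded:
        # A file uploaded to the index always takes priority.
        return uploaded[0]
    # Otherwise fall back to the explicit download link, and nothing else.
    return info.get("download_url")


releases = {
    "0.7.1": {"uploaded_files": ["https://pypi.python.org/packages/Foo-0.7.1.tar.gz"]},
    "0.7.2": {"uploaded_files": [],
              "download_url": "http://example.com/Foo-0.7.2.tar.gz"},
}
print(choose_download(releases))
```

Under this policy, links found on home pages or old download sites are
never considered at all, so a stray file elsewhere cannot win on
version number.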

At the very least, let package maintainers manually enable this
behavior, so that we don't have to worry about tricking
pip/easy_install into installing the right thing by version number
naming (which is completely broken btw. It's impossible to name
separate Python 2 and Python 3 packages so that both pip and
easy_install will do the right thing in every case. See
https://code.google.com/p/sympy/issues/detail?id=3511).

Aaron Meurer

