[Distutils] What to do about the PyPI mirrors

Donald Stufft donald at stufft.io
Tue Aug 6 08:35:52 CEST 2013


On Aug 6, 2013, at 2:09 AM, Christian Theune <ct at gocept.com> wrote:

> Hi,
> 
> looks like I'm late to the party to figure out that I'm going to be hurt again.
> 
> I'd like to suggest explicitly considering what is going to break due to this and how much work you are forcefully inflicting on others. My whole experience around packaging (distribute/setuptools) and mirroring/CDN this year puts the cost for my company somewhere between 10k-20k EUR just for keeping up with the breakage those changes incur. It might be that we're wonderfully stupid (..enough to contribute) and all of this causes no headaches for anybody else …. Overall, guessing that the packaging infrastructure is used by probably multiple thousands of companies, I'd expect that at least 100 of them might be experiencing problems like us. Juggling arbitrary numbers, I can see that we're inflicting around a million EUR of cost that nobody asked for.
> 
> More specific statements below.
> 
> On 2013-08-04 22:25:01 +0000, Donald Stufft said:
> 
> Here's my PEP for Deprecating and Removing the Official Public Mirrors
> 
> Its source is at: https://github.com/dstufft/peps/blob/master/mirror-removal.rst
> 
> Abstract
> =======
> This PEP provides a path to deprecate and ultimately remove the official
> public mirroring infrastructure for `PyPI`_. It does not propose the removal
> of mirroring support in general.
> 
> -1 - maybe I don't have the right to speak up on CDN usage, but personally I feel it's a bad idea to delegate overall PyPI availability exclusively to a commercial third party. It's OK for me that we're using them to improve PyPI availability, but completely putting our faith in their hands doesn't sound right to me.

Hm. Maybe I wasn't clear here? The mirrors don't go away, the only thing that goes away is the *.pypi.python.org names and the DNS discovery protocol.

> 
> Rationale
> ========
> The PyPI mirroring infrastructure (defined in `PEP381`_) provides a means to
> mirror the content of PyPI used by the automatic installers. It also provides
> a method for autodiscovery of mirrors and a consistent naming scheme.
> 
> There are a number of problems with the official public mirrors:
> 
> * They give control over a \*.python.org domain name to a third party,
>   allowing that third party to set or read cookies on the pypi.python.org and
>   python.org domain name.
> 
> Agreed, that's a problem.
> 
> * The use of a sub domain of pypi.python.org means that the mirror operators
>   will never be able to get a certificate of their own, and giving them
>   one for a python.org domain name is unlikely to happen.
> 
> Agreed.
> 
> * They are often out of date, most often by several hours to a few days, but
>   regularly several days and even months.
> 
> That's something that the mirroring infrastructure should have been constructed for. I completely agree that the way the mirroring was established was way sub-optimal. I think we can do better.

Better mirroring protocol is on my TODO list as well but isn't particularly related to this PEP except that the poor protocol certainly influences how useful the global mirrors can be.

> 
> * With the introduction of the CDN on PyPI the public mirroring infrastructure
>   is not as important as it once was as the CDN is also a globally distributed
>   network of servers which will function even if PyPI is down.
> 
> Well, now we have one breakage point more which keeps annoying me. This argument is not completely true. They may be getting better over time but we have invested heavily to accommodate the breakage - that needs to be balanced with some benefit in the near future.

Can you expand further what you mean here? I don't believe I understand what you're saying.

> 
> * Although there are provisions in place for it, there is currently no known
>   installer which uses the authenticity checks discussed in `PEP381`_ which
>   means that any download from a mirror is subject to attack by a malicious
>   mirror operator, but furthermore due to the lack of TLS it also means that
>   any download from a mirror is also subject to a MITM attack.
> 
> Again, I think that was a mistake during the introduction of the mirroring infrastructure: too few people, too confusing PEP. 

See above about a new protocol being a TODO item for me, will likely be done in Warehouse.

> 
> * They have only ever been implemented by one installer (pip), and its
>   implementation, besides being insecure, has serious issues with performance
>   and is slated for removal with its next release (1.5).
> 
> Only if you consider the mirror auto-discovery protocol. I'm not sure whether using DNS was such a smart move. A simple HTTP request to find mirrors would have been nice. I think we can still do that.
> 
> Also, not everyone wants or needs auto-detection the way that the protocol describes it. I personally just hand-pick a mirror (my own, hah) and keep using that. 

The auto detection is the main thing going away. You'll still be able to hand pick a mirror, it will just have a domain name owned by the mirror operator instead of one owned under *.pypi.python.org.
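
As a rough illustration of hand-picking a mirror, the sketch below renders a pip configuration that pins a single index. pip does read an `index-url` setting from the `[global]` section of its configuration file; the mirror URL and the function name `render_pip_config` are assumptions made up for this example, and where the file lives depends on the platform and pip version.

```python
import configparser
import io


def render_pip_config(index_url):
    """Return the text of a pip configuration file that pins one index URL."""
    # interpolation=None so a '%' in the URL is stored literally.
    config = configparser.ConfigParser(interpolation=None)
    config["global"] = {"index-url": index_url}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


# Hypothetical mirror hosted on a domain owned by its operator.
print(render_pip_config("https://pypi.mirror.example.org/simple/"))
```

The same URL can be passed for a single run with pip's `--index-url` option instead of writing a file.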

> 
> We are also thinking about providing system-level default configuration to hint tools like pip and setuptools to a different default index that is closer from a network perspective. From a customer perspective this should be "PyPI".
> 
> I'd like to avoid breakage. Again, if you don't let me choose where to spend my time, I'd rather invest the time I need for cleaning up the breakage into something constructive.
> 
> The indices are in active use. f.pypi.python.org is seeing between 150-300GB of traffic per month, the patterns widely ranging over the last month. This is traffic that is not used internally from gocept.
> 
> Due to the number of issues, some of them very serious, and the CDN which more
> or less provides much of the same benefits this PEP proposes to first
> deprecate and then remove the public mirroring infrastructure. The ability to
> mirror and the method of mirroring will not be affected and the existing
> public mirrors are encouraged to acquire their own domains to host their
> mirrors on if they wish to continue hosting them.
> 
> The biggest benefit of the mirroring infrastructure is that it is intended to be de-centralized.
> As a community member I can step up and take over responsibility of availability, performance, and security of a mirror.
> 
> As a community member I have to completely submit to whatever the CDN does and rely on contacting another community member who hopefully will be with us for a long time and stay in good contact with the CDN for us. That's centralization and I don't like that a bit.

Just to reiterate, this doesn't remove the concept of mirroring at all, but it moves the mirrors from living under the PSF banner to living under the banner of the mirror operators.

> 
> Plan for Deprecation & Removal
> =============================
> Immediately upon acceptance of this PEP documentation on PyPI will be updated
> to reflect the deprecated nature of the official public mirrors and will
> direct users to external resources like http://www.pypi-mirrors.org/ to
> discover unofficial public mirrors if they wish to use one.
> 
> On October 1st, 2013, roughly 2 months from the date of this PEP, the DNS names
> of the public mirrors ([a-g].pypi.python.org) will be changed to point back to
> PyPI which will be modified to accept requests from those domains. At this
> point in time the public mirrors will be considered deprecated.
> 
> Then, roughly 2 months after the release of the first version of pip to have
> mirroring support removed (currently slated for pip 1.5) the DNS entries for
> [a-g].pypi.python.org and last.pypi.python.org will be removed and PyPI will
> no longer accept requests at those domains.
> 
> Oh great. That means in about 4 months I have to go through *any installation that my company maintains* and sift through whether we're still referencing f.pypi.python.org anywhere.
> 
> Can I write a check? 
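
For operators facing that audit, a minimal sketch of a scan for the deprecated names might look like the following. It assumes the references live in plain-text configuration files; the file names in `CANDIDATE_NAMES` and the function names are illustrative assumptions, not part of any tool mentioned in this thread.

```python
import re
from pathlib import Path

# Matches a.pypi.python.org through g.pypi.python.org and last.pypi.python.org,
# the names listed in the PEP text quoted above.
DEPRECATED_MIRROR = re.compile(r"\b(?:[a-g]|last)\.pypi\.python\.org\b")

# Hypothetical set of file names worth checking; adjust for your deployments.
CANDIDATE_NAMES = {
    "buildout.cfg", "setup.cfg", "pip.conf", "pip.ini",
    ".pydistutils.cfg", "requirements.txt",
}


def find_mirror_references(text):
    """Return (line_number, line) pairs that mention a deprecated mirror name."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if DEPRECATED_MIRROR.search(line)
    ]


def scan_tree(root):
    """Yield (path, line_number, line) for each hit in candidate files under root."""
    for path in Path(root).rglob("*"):
        if path.is_file() and path.name in CANDIDATE_NAMES:
            text = path.read_text(errors="replace")
            for number, line in find_mirror_references(text):
                yield path, number, line
```

Calling `list(scan_tree("/srv/deployments"))` (a made-up path) would collect every hit under that directory; plain `pypi.python.org` URLs are deliberately not matched.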
> 
> Unofficial Public or Private Mirrors
> ===================================
> The mirroring protocol will continue to exist as defined in `PEP381`_ and
> people are encouraged to utilize it to host unofficial public and private mirrors
> if they so desire. For operators of unofficial public or private mirrors the
> recommended mirroring client is `Bandersnatch`_.
> 
> Thanks for the recommendation.
> 
> Instead of this dance breaking many things yet again, I'd love it if we could find a way forward keeping the infrastructure.
> 
> Some ideas:
> 
> - Take control of *.pypi.python.org back
> - Record other public names of the mirrors

It's planned that an externally maintained list (at least for the time being) will be used for recording the public names of the mirrors. Most likely http://pypi-mirrors.org/, which also includes some information to help you select which mirror you'd like to use and is already recommended on the mirroring page.

> - Use 301 redirects to send old installations over to the new mirror names.

If this is done, it would need to be cleared with Infra as far as us serving redirects goes. However, we could lengthen the timeframe to give more time to handle the migration, and as a mirror operator you can handle redirecting from N.pypi.python.org to your new domain until control is taken back.

At some point though the hammer needs to come down on the N.pypi.python.org names because a long term goal is requiring TLS on all of python.org.
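
A minimal sketch of the operator-side redirect, using only the Python standard library, could look like this. The mapping, the new domain, and the names `redirect_target`, `LegacyMirrorRedirect`, and `serve` are assumptions for illustration; a real mirror would more likely configure this in its existing web server.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical mapping from a legacy mirror name to the operator's own domain.
LEGACY_TO_NEW = {"f.pypi.python.org": "https://pypi.mirror.example.org"}


def redirect_target(host, path):
    """Return the new URL for a request to a legacy host, or None if unknown."""
    # Drop any ":port" suffix and normalise case before the lookup.
    hostname = (host or "").split(":")[0].lower()
    base = LEGACY_TO_NEW.get(hostname)
    return None if base is None else base + path


class LegacyMirrorRedirect(BaseHTTPRequestHandler):
    def do_GET(self):
        target = redirect_target(self.headers.get("Host"), self.path)
        if target is None:
            self.send_error(404, "Unknown mirror name")
            return
        # 301 tells clients the resource has moved permanently.
        self.send_response(301)
        self.send_header("Location", target)
        self.end_headers()

    # HEAD requests get the same status and Location header.
    do_HEAD = do_GET


def serve(port=8080):
    """Serve redirects until interrupted (port is an arbitrary example)."""
    HTTPServer(("", port), LegacyMirrorRedirect).serve_forever()
```

Keeping the host-to-URL mapping in a pure function makes it easy to test without opening a socket; whether a given installer follows the redirect is a separate question this sketch does not answer.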

> - Make it easier for community members to help maintain the list of mirrors.

This is part of the goal of offloading mirror listing to something like pypi-mirrors.org (as well as any other site that wants to maintain a list).

> - Make a better (faster) removal policy of mirrors if the owners are not responsive.

It becomes up to the decentralized sites to decide what constitutes a reasonable removal policy.

> - Make it easier for other community members to set up and maintain mirrors. I'm happy to improve bandersnatch where needed.

This is partially handled by the future TODO of a better protocol, as well as pushing the maintenance of lists onto the community itself.

> 
> Lastly, again, and I might be getting on everyone's nerves.
> 
> Why does it seem that other communities have figured this out much more simply, with less hassle, and with no significant changes for years, while we need to keep changing stuff over and over and over and break things over and over and over?

Other communities had the benefit of learning from our mistakes and a lot of the breakages in this area have been closing security holes that still exist in those other communities.

> 
> It's really hard for me to write this mail without cussing - the situation is very frustrating: the community dynamics seem to "want to move forward" where they from my perspective "wander left and right and break stuff like a drunken elephant driving a tank through the Louvre".
> 
> Christian
> _______________________________________________
> Distutils-SIG maillist  -  Distutils-SIG at python.org
> http://mail.python.org/mailman/listinfo/distutils-sig


-----------------
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
