[Distutils] What to do about the PyPI mirrors

Noah Kantrowitz noah at coderanger.net
Tue Aug 6 09:13:01 CEST 2013


On Aug 6, 2013, at 12:01 AM, Nick Coghlan <ncoghlan at gmail.com> wrote:

> On 6 August 2013 16:09, Christian Theune <ct at gocept.com> wrote:
>> Hi,
>> 
>> 
>> looks like I'm late to the party to figure out that I'm going to be hurt
>> again.
> 
> That's why I asked for this to be put through the PEP process: to give
> it more visibility, and provide more opportunity for people
> potentially affected to have a chance to comment and offer
> alternatives. Giving third parties the opportunity to read python.org
> cookies indefinitely isn't an option.
> 
> Everything else is negotiable.
> 
>> I'd like to suggest explicitly considering what is going to break due to
>> this and how much work you are forcefully inflicting on others. My whole
>> experience around the packaging (distribute/setuptools) and mirroring/CDN in
>> this year estimates cost for my company somewhere between 10k-20k EUR just
>> for keeping up with the breakage those changes incur. It might be that
>> we're wonderfully stupid (..enough to contribute) and all of this causes no
>> headaches for anybody else …. Overall, guessing that the packaging
>> infrastructure is used by probably multiple thousands of companies then I'd
>> expect that at least 100 of them might be experiencing problems like us.
>> Juggling arbitrary numbers I can see that we're inflicting around a million
>> EURs of cost that nobody asked for.
>> 
>> 
>> More specific statements below.
>> 
>> 
>> On 2013-08-04 22:25:01 +0000, Donald Stufft said:
>> 
>> 
>> Here's my PEP for Deprecating and Removing the Official Public Mirrors
>> 
>> 
>> Its source is at:
>> https://github.com/dstufft/peps/blob/master/mirror-removal.rst
>> 
>> 
>> Abstract
>> ========
>> 
>> This PEP provides a path to deprecate and ultimately remove the official
>> public mirroring infrastructure for `PyPI`_. It does not propose the removal
>> of mirroring support in general.
>> 
>> 
>> -1 - maybe I don't have the right to speak up on CDN usage, but personally I
>> feel it's a bad idea to delegate overall PyPI availability exclusively to a
>> commercial third party. It's OK for me that we're using them to improve PyPI
>> availability, but completely putting our faith in their hands, doesn't sound
>> right to me.
> 
> Would you be happier if it said "the current incarnation of the public
> mirroring infrastructure"? I have no objections to somebody proposing
> a *new* less broken mirroring process.
> 
>> That's something that the mirroring infrastructure should have been
>> constructed for. I completely agree that the way the mirroring was
>> established was way sub-optimal. I think we can do better.
> 
> As noted above, this PEP is about killing off the *current* public
> mirroring system as being irredeemably broken. If that inspires
> somebody to come up with a more sensible alternative, so much the
> better.
> 
>> * With the introduction of the CDN on PyPI the public mirroring
>>   infrastructure is not as important as it once was as the CDN is also a
>>   globally distributed network of servers which will function even if PyPI
>>   is down.
>> 
>> 
>> Well, now we have one breakage point more which keeps annoying me. This
>> argument is not completely true. They may be getting better over time but we
>> have invested heavily to accommodate the breakage - that needs to be balanced
>> with some benefit in the near future.
> 
> That's why explicit mirror usage is still supported and recommended.
> 
>> * Although there are provisions in place for it, there is currently no known
>>   installer which uses the authenticity checks discussed in `PEP381`_ which
>>   means that any download from a mirror is subject to attack by a malicious
>>   mirror operator, but furthermore due to the lack of TLS it also means that
>>   any download from a mirror is also subject to a MITM attack.
>> 
>> 
>> Again, I think that was a mistake during the introduction of the mirroring
>> infrastructure: too few people, too confusing PEP.
> 
> Which is why *this* incarnation of it needs to go away.
> 
>> * They have only ever been implemented by one installer (pip), and its
>>   implementation, besides being insecure, has serious issues with
>>   performance and is slated for removal with its next release (1.5).
>> 
>> 
>> Only if you consider the mirror auto-discovery protocol. I'm not sure
>> whether using DNS was such a smart move. A simple HTTP request to find
>> mirrors would have been nice. I think we can still do that.
> 
> And can be done regardless of what happens to the current system.
> 
>> Also, not everyone wants or needs auto-detection the way that the protocol
>> describes it. I personally just hand-pick a mirror (my own, hah) and keep
>> using that.
> 
> Which will be unaffected for anyone not relying on a pypi.python.org subdomain.
> 
>> We are also thinking about providing system-level default configuration to
>> hint tools like pip and setuptools to a different default index that is
>> closer from a network perspective. From a customer perspective this should
>> be "PyPI".
>> 
>> I'd like to avoid breakage. Again, if you don't let me choose where to spend
>> my time, I'd rather invest the time I need for cleaning up the breakage into
>> something constructive.
>> 
>> The indices are in active use. f.pypi.python.org is seeing between 150-300GB
>> of traffic per month, the patterns widely ranging over the last month. This
>> is traffic that is not used internally from gocept.
> 
> I think it would be suitable for the PEP to include an escape clause
> for maintainers of a domain to request that the PSF infrastructure
> team keep their subdomain active for longer than the general timeframe
> proposed, with a 301 redirect to a new host. This will need to be
> worked out between the infrastructure team and the maintainers of
> the specific instance.
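
The redirect handling described above could look roughly like the following. This is only a sketch under assumptions, not how the PSF infrastructure does or will do it: the MAINTAINER_TARGETS table, the placeholder host mirror.example.org, and the redirect_for helper are all hypothetical names made up for illustration. Only the legacy host names and the fallback to the main PyPI host come from the thread.

```python
from urllib.parse import urlsplit, urlunsplit

# Legacy names listed in the PEP: [a-g].pypi.python.org and last.pypi.python.org.
LEGACY_HOSTS = {f"{name}.pypi.python.org" for name in (*"abcdefg", "last")}

# Hypothetical table of maintainer-requested targets; the host is a placeholder.
MAINTAINER_TARGETS = {"f.pypi.python.org": "mirror.example.org"}

# Default handling discussed in the thread: send clients to the main PyPI host.
DEFAULT_TARGET = "pypi.python.org"


def redirect_for(url):
    """Return (301, new_url) for a legacy mirror URL, or None for any other host."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host not in LEGACY_HOSTS:
        return None
    target = MAINTAINER_TARGETS.get(host, DEFAULT_TARGET)
    # Keep scheme, path and query so existing installer URLs keep resolving.
    # Any explicit port in the original URL is dropped in this sketch.
    new_url = urlunsplit(
        (parts.scheme, target, parts.path, parts.query, parts.fragment)
    )
    return 301, new_url


if __name__ == "__main__":
    print(redirect_for("http://f.pypi.python.org/simple/bandersnatch/"))
    print(redirect_for("http://b.pypi.python.org/simple/"))
```

A per-host table like this would let one mirror keep its subdomain alive longer than the others without changing the default for everyone else.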
> 
> 
>> The biggest benefit of the mirroring infrastructure is that it is intended
>> to be de-centralized.
>> 
>> As a community member I can step up and take over responsibility of
>> availability, performance, and security of a mirror.
> 
> And, indeed, that is still fully supported. What's going away is the
> delegation of pypi.python.org subdomains and the associated mirror
> auto-discovery system. There is no near term plan to create a
> replacement.
> 
>> As a community member I have to completely submit to whatever the CDN does
>> and contacting another community member who hopefully will be with us for a
>> long time and stay in good contact with the CDN for us. That's
>> centralization and I don't like that a bit.
> 
> Strictly speaking, you're submitting to the PSF infrastructure team,
> who manage the relationship with Fastly. Those interested in joining
> the infrastructure SIG can sign up here:
> http://mail.python.org/mailman/listinfo/infrastructure

All members of the Infra Staff team have access to the Fastly admin panel, and Fastly is aware of all of us as authorized to work on the PSF account. A bus factor of 4 is definitely not perfect, but as the one who is responsible for safeguarding such things, I am okay with it for now.

> 
> 
>> Then, roughly 2 months after the release of the first version of pip to have
>> mirroring support removed (currently slated for pip 1.5) the DNS entries for
>> [a-g].pypi.python.org and last.pypi.python.org will be removed and PyPI will
>> no longer accept requests at those domains.
>> 
>> 
>> Oh great. That means in about 4 months I have to go through *any
>> installation that my company maintains* and sift through whether we're still
>> referencing f.pypi.python.org anywhere.
>> 
>> 
>> Can I write a check?
> 
> I think it makes sense for maintainers of particular mirrors to
> request a stay of execution until their traffic logs show everything
> coming in under an updated FQDN.
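
For the "sift through any installation" part, a scripted check is one option. The following is a minimal sketch, assuming the references live in plain-text configuration files. The function names and the default file patterns are hypothetical and would need adjusting to whatever buildout, pip, or setuptools files a given installation actually uses. The host names matched are the ones the PEP lists for removal.

```python
import re
from pathlib import Path

# Host names the PEP proposes to retire: [a-g].pypi.python.org and
# last.pypi.python.org. The lookbehind avoids matching longer names
# such as "blast.pypi.python.org".
LEGACY_MIRROR = re.compile(
    r"(?<![\w.-])(?:[a-g]|last)\.pypi\.python\.org\b", re.IGNORECASE
)


def find_legacy_mirrors(text):
    """Return (line_number, hostname) pairs for each legacy mirror reference."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in LEGACY_MIRROR.finditer(line):
            hits.append((lineno, match.group(0).lower()))
    return hits


def scan_tree(root, patterns=("*.cfg", "*.ini", "*.conf", "*.txt")):
    """Scan files under root; the default patterns are only a guess."""
    results = {}
    for pattern in patterns:
        for path in Path(root).rglob(pattern):
            if not path.is_file():
                continue
            try:
                text = path.read_text(errors="replace")
            except OSError:
                continue  # unreadable file: skip rather than abort the scan
            hits = find_legacy_mirrors(text)
            if hits:
                results[str(path)] = hits
    return results


if __name__ == "__main__":
    for filename, hits in sorted(scan_tree(".").items()):
        for lineno, host in hits:
            print(f"{filename}:{lineno}: {host}")
```

A scan like this only finds references in files it can read; host names baked into environment variables or deployment tooling would need separate checks.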
> 
>> Some ideas:
>> 
>> - Take control of *.pypi.python.org back
>> 
>> - Record other public names of the mirrors
>> 
>> - Use 301 redirects to send old installations over to the new mirror names.
> 
> I think it makes sense for mirror maintainers to be able to request
> this process over the default handling (redirection to the PyPI CDN)
> 
>> - Make it easier for community members to help maintain the list of mirrors.
>> 
>> - Make a better (faster) removal policy of mirrors if the owners are not
>> responsive.
> 
> For these two points, I think having the PEP cover an addition and
> removal process for http://www.pypi-mirrors.org/ might make sense
> (assuming Ken is amenable to the idea).
> 
>> - Make it easier for other community members to set up and maintain mirrors.
>> I'm happy to improve bandersnatch where needed.
>> 
>> 
>> Lastly, again, and I might be getting on everyone's nerves.
>> 
>> 
>> Why does it seem that other communities have figured this out much simpler,
>> with less hassle, and with no significant changes for years and we need to
>> keep changing stuff over and over and over and break things over and over
>> and over?
> 
> Because the current structure of PyPI is fundamentally flawed, and
> we're still suffering the consequences more than a decade later. A
> software distribution index server should be a static filesystem that
> contains all the necessary metadata (including signatures) and can be
> mirrored with rsync. PyPI is far from being that :P
> 
> Perl gets credit for CPAN, but something I only realised recently is
> that they probably deserve more credit for PAUSE, which is the
> *upload* side of CPAN. Much of the CPAN metadata is derived directly
> from the distributed software by PAUSE rather than relying on client
> side tools. That means CPAN can publish new metadata just by upgrading
> PAUSE - they don't need to worry about how people are doing the
> uploads.
> 
> Also, CPAN, like Linux distro trees, can be mirrored with rsync rather
> than needing a custom client. It's much easier to maintain backwards
> compatibility when the only required server API is the ability to
> serve static files.
> 

I will fight any attempt to do this with every fiber of my being. This kind of "dumb server" API means that any metadata indexing or searching needs to be either precomputed or implemented in a much more intelligent client. This is already somewhat the case with pip, and as someone who has to deal with multiple client implementations, it makes me very sad that I can't just call a REST endpoint to know what will be installed when I do a thing. This is neither here nor there, but I wanted to stake out my ground so I can growl when people get too close :)

> The only things that have changed recently are that:
> - the rubygems.org compromise has made it obvious that sticking our
> heads in the sand and trusting the fact that there are easier targets
> out there to protect us is no longer an adequate answer
> - we've made the decision to try to fix the underlying brokenness
> rather than living with it forever
> - we have people willing to do the work to make that happen

There is also finally much closer coordination between the whole stack of the packaging teams, which means that changes that once took years can now happen in a day or two. This definitely manifests as a possibly frustrating rate of changes compared to previously.

--Noah

