[Python-Dev] pip: cdecimal an externally hosted file and may be unreliable [sic]

Donald Stufft donald at stufft.io
Fri May 9 20:12:22 CEST 2014


On May 9, 2014, at 1:28 PM, R. David Murray <rdmurray at bitdance.com> wrote:

> On Fri, 09 May 2014 11:39:02 -0400, Donald Stufft <donald at stufft.io> wrote:
>> 
>> On May 9, 2014, at 9:58 AM, M.-A. Lemburg <mal at egenix.com> wrote:
>>> On 09.05.2014 13:44, Donald Stufft wrote:
>>>> On May 9, 2014, at 4:12 AM, M.-A. Lemburg <mal at egenix.com> wrote:
>>> I snipped the rest of the discussion about reliability, using
>>> unmaintained packages, and projects using their own mirrors (which
>>> should really be the standard, not an exceptional case),
>>> because it's not really leading anywhere:
>> 
>> Using your own mirror shouldn’t be the standard if all you’re doing
>> is automatically updating that mirror. It’s a hack to get around
>> unreliability, and it should be seen as a sign of a failure to provide
>> a service that people can rely on; that’s how I see it. People
>> depend on this service, and it’s irresponsible not to treat it as a
>> critical piece of infrastructure.
> 
> I don't understand this.  Why is it our responsibility to provide a
> free service for a large project to repeatedly download a set of files
> they need?  Why does it not make more sense for them to download them
> once, and only update their local copies when they change?  That's almost
> completely orthogonal to making the service we do provide reliable.

Well, here’s the thing. The large projects repeatedly downloading the
same set of files act as a canary. If any particular project becomes
uninstallable on PyPI (or if PyPI itself goes down) then nobody can install
it: neither the people installing things over and over every day nor the
people who just happen to be installing it during that downtime. However,
intermittent failures and general instability are going to be noticed much
sooner by the projects that install things over and over again, which makes
them a far better gauge of what the average “uptime” really is.

IOW, if PyPI goes unavailable for 10 minutes 5 times a day, each outage
might hit a handful of “small” installers (i.e. not the big projects), but a
different set each time, and each of them is likely to shrug it off and just
treat it as the norm even though it’s very disruptive to what they’re doing.
The big project, however, is highly likely to hit every single one of those
outages and be in a position to say “wow, PyPI is flaky as hell”.
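
To put rough numbers on that (a back-of-the-envelope sketch; the
independence assumption and the install frequencies are mine, only the
5 x 10-minute outage figure comes from the example above):

    # Who notices 5 x 10-minute outages per day? Assumes install
    # attempts land at independent, uniformly random times.
    DOWNTIME_FRACTION = (5 * 10) / (24 * 60)  # ~3.5% of the day is down

    def p_sees_failure(installs_per_day):
        """Probability of hitting at least one outage in a day."""
        return 1 - (1 - DOWNTIME_FRACTION) ** installs_per_day

    print(p_sees_failure(1))    # ~0.035 -- the occasional installer
    print(p_sees_failure(10))   # ~0.298 -- a busy developer
    print(p_sees_failure(144))  # ~0.994 -- CI installing every 10 minutes

A project whose CI installs every ten minutes is all but guaranteed to see
every bad day; the occasional installer almost never does, and shrugs.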

To expand further on that: if we assume that we want ``pip install <foo>``
to be reliable, rather than working sometimes and failing at other times,
then we’re aiming for as high an uptime as possible. PyPI gets enough traffic
that any single large project is a barely noticeable drop in the bucket, and
because of the way our caching works, people constantly hitting the same
packages actually makes us faster and more reliable: those packages stay in
cache and can be served without ever touching the origin servers.
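
As a toy illustration of that effect (a minimal sketch; the TTL, the
request rates, and the cache model are illustrative assumptions of mine,
not how the actual CDN is configured):

    # Toy model: a TTL cache in front of an origin server. Hot objects
    # stay warm and rarely touch the origin; cold ones miss every time.
    TTL = 60            # seconds an object stays cached after a fetch
    DAY = 24 * 3600

    def origin_hits(request_interval):
        """Origin fetches over one day, one request per interval."""
        hits, cached_until = 0, -1
        for t in range(0, DAY, request_interval):
            if t >= cached_until:   # cache miss: fetch from origin
                hits += 1
                cached_until = t + TTL
        return hits

    print(origin_hits(5), DAY // 5)        # 1440 of 17280 requests (~92% cached)
    print(origin_hits(3600), DAY // 3600)  # 24 of 24 requests (0% cached)

The more often a package is requested, the higher its cache hit rate, which
is the sense in which heavy automated traffic makes PyPI faster for everyone
else rather than slower.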

Just for the record, PyPI gets roughly 350 req/s basically 24/7. In the
month of April we served 71.4 TB of data across 877.4 million requests, of
which 80.5% never made it to the actual servers that run PyPI and were
instead served directly out of the geo-distributed CDN that sits in front of
it. We are vastly better positioned to maintain a reliable infrastructure
than every large project that uses Python is to do the same.
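
Those figures are self-consistent, as a quick sanity check shows (assuming
a 30-day April and decimal terabytes):

    requests = 877.4e6
    seconds = 30 * 24 * 3600             # April

    print(requests / seconds)            # ~338.5 req/s -- "roughly 350 req/s"
    print(71.4e12 / seconds / 1e6)       # ~27.5 MB/s average bandwidth
    print(requests * (1 - 0.805) / 1e6)  # ~171M requests reached the origin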

The reason it’s our responsibility to provide it is that we chose to
provide it. There isn’t a moral imperative to run PyPI, but running PyPI
badly seems like a crummy thing to do.

> 
> For perspective, Gentoo requests that people only do an emerge sync at
> most once a day, and if they have multiple machines to update, that they
> only do one pull, and they update the rest of their infrastructure from
> their local copy.

To be clear, there are other reasons to run a local mirror, but I don’t
think it’s reasonable to expect everyone who wants a reliable install using
pip to stand up their own infrastructure.

Further to this point, I’m currently working on adding caching by default
to pip, so that we minimize how often people hit PyPI, and we do it
automatically, in a way that generally doesn’t require people to think about
it or to stand up their own infrastructure.
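
For the curious, here is a minimal sketch of that kind of transparent
client-side HTTP caching, using the third-party requests and CacheControl
libraries; the library choice, cache directory, and URL are illustrative
assumptions of mine, not a statement of what pip will actually ship:

    import requests
    from cachecontrol import CacheControl
    from cachecontrol.caches import FileCache

    # Responses are written to disk and reused/revalidated according to
    # their Cache-Control headers, so repeated fetches of the same file
    # are served locally instead of hitting PyPI again.
    session = CacheControl(requests.Session(), cache=FileCache(".web_cache"))

    resp = session.get("https://pypi.python.org/simple/")
    print(resp.status_code, len(resp.content))

The win is exactly the one described above: the network sees each unique
file roughly once per cache lifetime, no matter how many times a build
re-runs.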

> 
> As another point of information for comparison, Gentoo downloads files
> from wherever they are hosted first, and only if that fails falls back to
> a Gentoo-provided mirror (if I remember correctly... I think the Gentoo
> mirror copy doesn't always exist?).
> 
> --David


-----------------
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA


