[Catalog-sig] Rewrite PyPI for App Engine?

Ian Bicking ianb at colorstudy.com
Fri Jun 25 18:49:16 CEST 2010


On Fri, Jun 25, 2010 at 3:39 AM, M.-A. Lemburg <mal at egenix.com> wrote:

> Ian Bicking wrote:
> > On Thu, Jun 24, 2010 at 5:16 PM, M.-A. Lemburg <mal at egenix.com> wrote:
> >
> >> Almir Karic wrote:
> >>> i would like to help out with the move.
> >>>
> >>> is anyone actually opposed to moving to GAE (either moving the current
> >>> code base or re-write, whichever seems more appropriate)?
> >>
> >> I don't think people are opposed to having a PyPI clone on GAE,
> >> but moving the existing installation to GAE is something we would
> >> have to discuss separately.
> >>
> >> I for one would not welcome such a change, since we then completely
> >> lose control over service availability.
> >>
> >
> > I don't really understand what this means.  Services become unavailable
> > sometimes.  A computer breaks, a company shuts down, an agreement ends.
> > We don't necessarily have "control" over these situations, but we can
> > respond to them.  If App Engine goes down and the App Engine team is all
> > like "whatever, we'll get around to fixing stuff sometime" then sure
> > it's a problem.  But it's not a plausible problem.  The plausible
> > problem is that App Engine goes down, as it has from time to time, and
> > we have to wait for them to figure out what's wrong and fix it.  *We*
> > don't have to fix it, we only have to *wait for someone else to do it*.
> > I don't see any reason why *we* are any better at fixing issues than
> > the App Engine team would be.  Also presumably when there is a failure
> > we want the failure to be understood and avoided in the future.  The
> > App Engine team does that.  And they do that *for us*.
>
> I hear you, but don't agree that putting the runtime into the
> hands of the GAE team would get us an overall better service :-)
>
> The point is that with GAE you only have control over the code
> that you post there. Everything else is under control of the GAE
> team (and their automatic administration systems), i.e. whether
> your data is available and whether there are
> proper backups, whether the site is reachable or not, whether
> the performance is available and meets your requirements, whether
> the service is accessible, fast enough and has low latency, etc.
>
> So if something breaks, you can only fix it yourself if the problem
> is caused by a bug in the code. For all other situations, you
> have to wait for the GAE team to go in and do whatever is needed.
>
> I'm not saying that the GAE team would be doing a poor job,
> but just sitting there waiting for them to fix it in any
> of the typical problem situations (apart from a bug in the
> code), is asking a bit much, IMHO.
>

If GAE was just another hosting system, then sure -- but it's not.  For
instance, Noah mentioned that if Apache went down (or the equivalent) there's
someone with a pager who will respond to it.  Except GAE isn't actually like
that; application instances can be automatically killed, machines are
monitored automatically and taken out of the pool as necessary.  We're not
replacing our diligence with Google employees; we're replacing it with
machines.

Of course there might be network problems or Google's own problems growing
the service.  But a substantial class of problems (problems that I believe
have actually caused downtime) are simply eliminated from the system.  GAE
has less serviceable parts; that appears like losing control but it's really
the normal progression away from manual interactions.  I would really like
if there was an open source alternative that provided that kind of
infrastructure, but there isn't.

Another advantage to GAE is that if there are application errors, it would
be much easier for anyone to work on them -- anyone can sign up for a free
GAE account and deploy the code with almost no effort, and they will have
hosting that is completely equivalent to anyone else's.  The only
difference would be the data set, and it is possible (maybe even likely)
that some class of problems will only be noticeable with a full dataset.
That's true now as well, as with some UI problems where pages have become
unwieldy, and I think it would be really helpful (regardless of GAE) if PyPI
had a cleaned-up export built into it.

Other cloud service providers provide something very different from GAE, and
I don't think they would give a lot of benefit.  The one advantage I see is
that we (well, anyone) could spin up a new instance in a consistent state.
Everything else is basically the same, including all the same management
issues -- there's no one to kick Apache except us, for instance.  Honestly
if I have any skin in the game it's actually for a system like this, as I've
been working on this sort of infrastructure (http://cloudsilverlining.org)
-- I only propose GAE because I genuinely think it will work best for a
volunteer-run piece of infrastructure like PyPI.

> We have to find a middle ground, where we can still apply the
> necessary hand holding ourselves, if we like to, while leaving
> most of the day-to-day tasks to automatic tools or other service
> providers to deal with.
>
> Since PyPI is becoming a central piece of Python community
> infrastructure, we need to make sure that we can provide a very
> good uptime of the service and fast access to the data,
> esp. for the automatic download tools.
>
> Fortunately, those tools only use static data, so focusing on
> making that highly available will get us a much better service
> uptime with little extra effort.
>
> > In some catastrophic case we could move the site to another server, use
> > TyphoonAE to move the code over (or simply require that there is a
> > sufficient abstraction layer to allow for a more normal environment) and
> > bring the site up.  We control the domain, we can ultimately control
> > where it is hosted.  This kind of failure seems like it would be far
> > more likely given our current situation than on App Engine, but moving
> > to App Engine would not somehow make this kind of move impossible.
>
> True, but do you really want to go through all that trouble
> just because GAE is down or too slow to be usable again?
>

That's the catastrophic case, where Google decides they don't care about App
Engine or something like that.  Right now we'd have to do the same thing if
the server's hard disk dies, which is obviously far more likely.

> If we were to go for a cloud service to deploy the PyPI runtime, I'd
> much rather like to see a standard virtualized server approach
> being used.
>
> With that approach, moving (virtual) servers would take
> at most 5 minutes, if needed at all - you can rather easily setup
> virtual servers as high availability cluster and then have
> them manage the failover all by themselves.
>

Setting up infrastructure for fail-overs is hard, and it would be easy for
us to set it up for the wrong pieces (the ones that aren't breaking).  In
some sense this is why I'm not excited about mirroring, because I don't
think it's fail-over for the pieces likely to break.

I do like the static file proposal, also.  I think just putting more content
into static files could potentially fix most of our problems, along with
maybe a bit of server tweaking (to make sure even if PyPI goes down, it
doesn't take Apache and the static files with it).  I think using a CDN
would be a nice step for speed, but is less important for reliability; I
think generating things with a cron job will reduce reliability because it's
exactly the kind of behind-the-scenes machinery that could break without
someone noticing, and we don't have a dedicated staff paying attention to
things like that.  If a new package registration breaks, I'd far rather it
be rejected immediately (e.g., from setup.py register) than have a broken
cron job keep it from getting into the simple index.
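To make the static-file idea concrete, here's a minimal sketch of what
rendering the /simple/ index to plain HTML files might look like.  The
function name and the package-data shape are my own invention for
illustration, not PyPI's actual schema or code:

```python
# Hypothetical sketch: render a /simple/-style index as static HTML,
# so download tools keep working even if the dynamic app is down.
import os
import html

def write_simple_index(packages, outdir):
    """packages: dict mapping project name -> list of (filename, url) pairs."""
    os.makedirs(outdir, exist_ok=True)
    # Top-level index linking to each project's page.
    with open(os.path.join(outdir, "index.html"), "w") as f:
        f.write("<html><body>\n")
        for name in sorted(packages):
            f.write('<a href="%s/">%s</a><br/>\n'
                    % (html.escape(name), html.escape(name)))
        f.write("</body></html>\n")
    # One page per project, listing its release files.
    for name, files in packages.items():
        pkgdir = os.path.join(outdir, name)
        os.makedirs(pkgdir, exist_ok=True)
        with open(os.path.join(pkgdir, "index.html"), "w") as f:
            f.write("<html><body>\n")
            for filename, url in files:
                f.write('<a href="%s">%s</a><br/>\n'
                        % (html.escape(url), html.escape(filename)))
            f.write("</body></html>\n")
```

Something like this would be regenerated on each registration/upload (rather
than by a cron job) and served straight from Apache, independent of the
application process.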

-- 
Ian Bicking  |  http://blog.ianbicking.org