[Catalog-sig] pre-PEP: transition to release-file hosting at pypi site

Mon Mar 11 11:02:25 CET 2013

Hi Philip,

thanks for your helpful review, almost all of it makes sense to me ...
some more inline comments below.  Up front, i am open to you 
co-authoring the PEP if you like and if you share the goal of finding a minimum
viable approach to speed up and simplify the interactions for installers.

On Sun, Mar 10, 2013 at 15:41 -0400, PJ Eby wrote:
> On Sun, Mar 10, 2013 at 11:07 AM, holger krekel <holger at merlinux.eu> wrote:
> > Philip, Marc-Andre, Richard (Jones), Nick and catalog-sig/distutils-sig:
> > scrutiny and feedback welcome.
> 
> Hi Holger.  I'm having some difficulty interpreting your proposal
> because it is leaving out some things, and in other places
> contradicting what I know of how the tools work.  It is also a bit at
> odds with itself in some places.

Certainly, it was a quick draft to get the process going and to gather
useful feedback, which has worked so far :)

> For instance, at the beginning, the PEP states its proposed solution
> is to host all release files on PyPI, but then the problem section
> describes the problems that arise from crawling external pages:
> problems that can be solved without actually hosting the files on
> PyPI.
>
> To me, it needs a clearer explanation of why the actual hosting part
> also needs to be on PyPI, not just the links.  In the threads to date,
> people have argued about uptime, security, etc., and these points are
> not covered by the PEP or even really touched on for the most part.

Makes sense to clarify this more.

> (Actually, thinking about that makes me wonder....  Donald: did your
> analysis collect any stats on *where* those externally hosted files
> were hosted?  My intuition says that the bulk of the files (by *file
> count*) will come from a handful of highly-available domains, i.e.
> sourceforge, github, that sort of thing, with actual self-hosting
> being relatively rare *for the files themselves*, vs. a much wider
> range of domains for the homepage/download URLs (especially because
> those change from one release to the next.)  If that's true, then most
> complaints about availability are being caused by crawling multiple
> not-highly-available HTML pages, *not* by the downloading of the
> actual files.  If my intuition about the distribution is wrong, OTOH,
> it would provide a stronger argument for moving the files themselves
> to PyPI as well.)
> 
> Digression aside, this is one of the things that needs to be clearer so
> that there's a better explanation for package authors as to why
> they're being asked to change.  And although the base argument is good
> ("specifying the "homepage" will slow down the installation process"),
> it could be amplified further with an example of some project that has
> had multiple homepages over its lifetime, listing all the URLs that
> currently must be crawled before an installer can be sure it has found
> all available versions, platforms, and formats of that project.

Right, an example makes sense.

> Okay, on to the Solution section.  Again, your stated problem is to
> fix crawling, but the solution is all about file hosting.  Regardless
> of which of these three "hosting modes" is selected, it remains an
> option for the developer to host files elsewhere, and provide the
> links in their description...  unless of course you intended to rule
> that out and forgot to mention it.  (Or, I suppose, if you did *not*
> intend to rule it out and intentionally omitted mention of that so the
> rabid anti-externalists would think you were on their side and not
> create further controversy...  in which case I've now spoiled things.
> Darn.  ;-) )

To be honest, while drafting i forgot about the fact that the
long_description can contain package links as well.

> Some technical details are also either incorrect or confusing.  For
> example, you state that "The original homepage/download links are
> added as links without a ``rel`` attribute if they have the ``#egg``
> format".  But if they are added without a rel attribute, it doesn't
> *matter* whether they have an #egg marker or not.  It is quite
> possible for a PyPI package to have a download_url of say,
> "http://sourceforge.net/download/someproject-1.2.tgz".

Right.  I just wanted to clarify that the distutils metadata 
"download_url" can contain an #egg format link and that this link
should still be served (without a rel).

> Thus, I would suggest simply stating that changing hosting mode does
> not actually remove any links from the /simple index, it just removes
> the rel="" attributes from the "Home page" and "Download" links, thus
> preventing them from being crawled in search of additional file links.

That's certainly a better description of what effectively happens 
and avoids the special mentioning of #egg.
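
To make the effect concrete, here is a minimal sketch of how a /simple
page could render its links in either mode.  The function name, the tuple
layout and the link texts are assumptions for illustration only, not the
actual pypi.python.org code; only the rel values "homepage" and "download"
are what the index serves today.

```python
from html import escape


def render_simple_links(links, crawl_external):
    """Render anchors for a /simple page (illustrative only).

    links: list of (url, text, rel) tuples, rel being "homepage",
    "download" or None for plain file links.
    """
    lines = []
    for url, text, rel in links:
        # Every link is kept; only the rel attribute depends on the mode.
        rel_attr = ' rel="%s"' % escape(rel) if (rel and crawl_external) else ""
        lines.append('<a href="%s"%s>%s</a>' % (escape(url), rel_attr, escape(text)))
    return "\n".join(lines)
```

With crawl_external=False the same number of anchors is emitted, so
installers still see the URLs but no longer treat them as pages to crawl.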

> With that out of the way, that brings me to the larger scope issue
> with the modes as presented.  Notice now that with this clarification,
> there is no real difference in *state* between the "pypi-cache" and
> "pypi-only" modes.  There is only a *functional* difference...  and
> that function is underspecified in the PEP.

Agreed.

> What I mean is, in both pypi-cache and pypi-only, the *state* of
> things is that rel="" attributes are gone, and there are links to
> files on PyPI.  The only difference is in *how* the files get there.

Yes.

> And for the pypi-cache mode, this function is *really*
> under-specified.  Arguably, this is the meat of the proposal, but it
> is entirely missing.  There is nothing here about the frequency of
> crawling, the methods used to select or validate files, whether there
> is any expiration...  it is all just magically assumed to happen
> somehow.

I'd like to avoid cache-invalidation issues by only performing cache
updates upon three user actions:

- when a release is registered for a package which is in 
  "pypi-cache" hosting mode.

- when a maintainer chooses to set "pypi-cache" 

- when a maintainer explicitly triggers a "cache" update 

All three actions allow pypi.python.org to provide feedback / error codes
directly to the user, so nothing hidden happens at regular intervals.
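
A rough sketch of what i have in mind, assuming a hypothetical
fetch_external_files callable that does the actual downloading (all names
here are made up for illustration, nothing like this exists on the server
yet):

```python
from enum import Enum


class CacheTrigger(Enum):
    RELEASE_REGISTERED = "release registered while in pypi-cache mode"
    MODE_SET_TO_CACHE = "maintainer set the project to pypi-cache"
    EXPLICIT_UPDATE = "maintainer requested a cache update"


def update_cache(project, trigger, fetch_external_files):
    """Run one cache update and report the outcome to the caller.

    fetch_external_files is a hypothetical callable that returns the list
    of cached file names or raises OSError on failure.
    """
    if not isinstance(trigger, CacheTrigger):
        raise ValueError("cache updates only run on explicit user actions")
    try:
        files = fetch_external_files(project)
    except OSError as exc:
        # The failure goes straight back to the user who caused the update.
        return {"ok": False, "project": project,
                "trigger": trigger.name, "error": str(exc)}
    return {"ok": True, "project": project,
            "trigger": trigger.name, "files": files}
```

The point of the design is that every update has a user waiting for the
result, so errors can be reported immediately instead of being logged by
a background job.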

> My suggestion would be to do two things:
> 
> First, make the state a boolean: crawl external links, with the
> current state yes and the future state no, with "no" simply meaning
> that the rel="" attribute is removed from the links that currently
> have it.
> 
> Second, propose to offer tools in the PyPI interface (and command
> line) to assist authors in making the transition, rather than
> proposing a completely unspecified caching mechanism.  Better to have
> some vaguely specified tools than a completely unspecified caching
> mechanism, and better still to spell out very precisely what those
> tools do.

This structure makes sense to me, except that i see the need to start off with
"pypi-ext", i.e. a third state which encodes the current behaviour.
The thing is that pypi.python.org doesn't have an extensive test 
suite, so we will need to rely on a few early adopters 
using the tools/state-changes before starting phase 2 (mass mailings etc.).
Also, in case of problems we can always switch packages back to the safe
"pypi-ext" mode.  IOW, the motivation for this third state comes from
the actual implementation process.
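
As a sketch of the fallback idea (the mode names are the ones from the
draft, the helper functions and the migration_ok flag are assumptions for
illustration):

```python
from enum import Enum


class HostingMode(Enum):
    PYPI_EXT = "pypi-ext"      # current behaviour: rel links are crawled
    PYPI_CACHE = "pypi-cache"
    PYPI_ONLY = "pypi-only"


def serves_rel_links(mode):
    # Only the legacy mode keeps rel attributes on homepage/download links.
    return mode is HostingMode.PYPI_EXT


def resolve_mode(requested, migration_ok):
    """Return the mode to store; fall back to pypi-ext if migration failed."""
    if requested is not HostingMode.PYPI_EXT and not migration_ok:
        return HostingMode.PYPI_EXT
    return requested
```

From the installer's point of view this is still your boolean (rel links
served or not); the third value only exists so that a failed switch has a
known-good state to return to.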

> Okay, on to the "Phases of transition".  This section gets a lot
> simpler if there are only two stages.  Specifically, we let everyone
> know the change is going to happen, and how long they have, give 'em
> links to migration tools.  Done.  ;-)
> 
> (Okay, so analysis still makes sense: the people who don't have any
> externally hosted files can get a different message, i.e., "Hey, we
> notice that installing your package is slow because you have these
> links that don't go anywhere.  Click here if you'd like PyPI to stop
> sending people on wild goose chases".  The people who have externally
> hosted files will need a more involved message.)
> 
> Whew.  Okay, that ends my critique of the PEP as it sits.  Now for an
> outside-the-box suggestion.
> 
> If you'd like to be able to transition people away from spidered links
> in the fewest possible steps, with the least user action, no legal
> issues, and in a completely automated way, note that this can be done
> with a one-time spidering of the existing links to find the download
> links, then adding those links directly to the /simple index, and
> switching off the rel="" attributes.  This can be done without
> explicit user consent, though they can be given the chance to do it
> manually, sooner.

Right, my mail preceding the "pre-pep" one contained a "linkext" state
which spidered the links and offered them directly.  It's certainly possible
and indeed would likely not have legal issues.  It might have 
cache-invalidation issues and probably makes the pypi-side implementation 
more complex.  Also it goes a bit against the current intention of the
PEP to have pypi.python.org control all hosting of release files.

> To implement this you'd need two project-level (*not* release-level)
> fields: one to indicate whether the project is using rel="" or not,
> and one to contain the list of external download links, which would be
> user-editable.
> 
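
Just to check that i read this correctly, a minimal sketch of the two
project-level fields and the one-time migration.  The class, field and
function names are my own assumptions and not an existing PyPI schema:

```python
from dataclasses import dataclass, field


@dataclass
class ProjectLinkSettings:
    # True: homepage/download links keep rel="" and get crawled (today).
    crawl_rel_links: bool = True
    # User-editable list of external download links served directly.
    external_download_urls: list = field(default_factory=list)


def migrate(settings, spidered_urls):
    """One-time migration: record spidered URLs, then stop crawling."""
    merged = list(settings.external_download_urls)
    for url in spidered_urls:
        if url not in merged:  # keep order, skip duplicates
            merged.append(url)
    return ProjectLinkSettings(crawl_rel_links=False,
                               external_download_urls=merged)
```

migrate() returns a new object instead of mutating the old one, so the
previous settings stay available if a project needs to be switched back.
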
> This overall approach I'm proposing can be extended to also support
> mirroring, since it provides an explicit place to list what it is
> you're mirroring.  (At any rate, it's more explicitly specified than
> any such place in the current PEP.)
> 
> That field can also be fairly easily populated for any given project
> in just a few lines of code:
> 
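>     from pkg_resources import Requirement
>     from setuptools.package_index import PackageIndex
>
>     pr = Requirement.parse('Projectname')
>     # No local search path; python/platform=None accepts all distributions
>     pi = PackageIndex(search_path=[], python=None, platform=None)
>     pi.find_packages(pr)  # scans the index page and follows its rel links
>     # One location (download URL) per distribution found for the project
>     all_urls = [dist.location for dist in pi[pr.key]]
>     external_urls = [url for url in all_urls if '//pypi.python.org' not in url]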
> 
> (Although if you want more information, like what kind of link each
> one is, the dist objects actually know a bit more than just the URL.)
> 
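
One small note on the last line of the snippet: the substring test also
matches URLs that merely mention pypi.python.org somewhere in a path or
query string.  A slightly stricter variant compares the parsed host
instead.  This is only a sketch, and split_external and index_host are
names i made up:

```python
from urllib.parse import urlsplit


def split_external(urls, index_host="pypi.python.org"):
    """Return the URLs whose host is not the index host."""
    external = []
    for url in urls:
        # hostname is already lower-cased; it is None for relative URLs.
        host = urlsplit(url).hostname or ""
        if host != index_host:
            external.append(url)
    return external
```

It works on plain strings, so it can be fed the all_urls list from the
snippet above without touching the network again.
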
> Anyway, I hope you found at least some of all this helpful.  ;-)

Certainly!  Will try to do an update incorporating your suggestions
in the next few days.

best,
holger