[Catalog-sig] pre-PEP: transition to release-file hosting at pypi site

Sun Mar 10 20:41:50 CET 2013

On Sun, Mar 10, 2013 at 11:07 AM, holger krekel <holger at merlinux.eu> wrote:
> Philip, Marc-Andre, Richard (Jones), Nick and catalog-sig/distutils-sig:
> scrutiny and feedback welcome.

Hi Holger.  I'm having some difficulty interpreting your proposal
because it is leaving out some things, and in other places
contradicting what I know of how the tools work.  It is also a bit at
odds with itself in some places.

For instance, at the beginning, the PEP states its proposed solution
is to host all release files on PyPI, but then the problem section
describes the problems that arise from crawling external pages:
problems that can be solved without actually hosting the files on
PyPI.

To me, it needs a clearer explanation of why the actual hosting part
also needs to be on PyPI, not just the links.  In the threads to date,
people have argued about uptime, security, etc., and these points are
not covered by the PEP or even really touched on for the most part.

(Actually, thinking about that makes me wonder....  Donald: did your
analysis collect any stats on *where* those externally hosted files
were hosted?  My intuition says that the bulk of the files (by *file
count*) will come from a handful of highly-available domains, e.g.
sourceforge, github, that sort of thing, with actual self-hosting
being relatively rare *for the files themselves*, vs. a much wider
range of domains for the homepage/download URLs (especially because
those change from one release to the next.)  If that's true, then most
complaints about availability are being caused by crawling multiple
not-highly-available HTML pages, *not* by the downloading of the
actual files.  If my intuition about the distribution is wrong, OTOH,
it would provide a stronger argument for moving the files themselves
to PyPI as well.)

Digression aside, this is one of the things that needs to be clearer so
that there's a better explanation for package authors as to why
they're being asked to change.  And although the base argument is good
("specifying the "homepage" will slow down the installation process"),
it could be amplified further with an example of some project that has
had multiple homepages over its lifetime, listing all the URLs that
currently must be crawled before an installer can be sure it has found
all available versions, platforms, and formats of that project.

Okay, on to the Solution section.  Again, your stated problem is to
fix crawling, but the solution is all about file hosting.  Regardless
of which of these three "hosting modes" is selected, it remains an
option for the developer to host files elsewhere, and provide the
links in their description...  unless of course you intended to rule
that out and forgot to mention it.  (Or, I suppose, if you did *not*
intend to rule it out and intentionally omitted mention of that so the
rabid anti-externalists would think you were on their side and not
create further controversy...  in which case I've now spoiled things.
Darn.  ;-) )

Some technical details are also either incorrect or confusing.  For
example, you state that "The original homepage/download links are
added as links without a ``rel`` attribute if they have the ``#egg``
format".  But if they are added without a rel attribute, it doesn't
*matter* whether they have an #egg marker or not.  It is quite
possible for a PyPI package to have a download_url of, say,
"http://sourceforge.net/download/someproject-1.2.tgz", a direct file
link that is usable with no #egg marker at all.

Thus, I would suggest simply stating that changing hosting mode does
not actually remove any links from the /simple index, it just removes
the rel="" attributes from the "Home page" and "Download" links, thus
preventing them from being crawled in search of additional file links.
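
As a rough sketch of that distinction from the installer's side: the
snippet below splits the links on a /simple page into those that would
be crawled and those that are simply taken as they are.  The class and
function names, the example.com URL, and the sample page content are
made up for illustration; the rel values "homepage" and "download" are
assumed to be the ones attached to the "Home page" and "Download"
links.

```python
from html.parser import HTMLParser

# rel values assumed to mark links that an installer will crawl.
CRAWLED_RELS = {"homepage", "download"}


class SimpleIndexLinks(HTMLParser):
    """Split the <a> links of a /simple page into crawl targets and plain links."""

    def __init__(self):
        super().__init__()
        self.crawl = []  # pages that would be fetched and scanned for more links
        self.plain = []  # links taken at face value, never crawled

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        if href is None:
            return
        rels = set((attrs.get("rel") or "").lower().split())
        target = self.crawl if rels & CRAWLED_RELS else self.plain
        target.append(href)


def classify(html):
    """Return (crawled_links, plain_links) for one index page."""
    parser = SimpleIndexLinks()
    parser.feed(html)
    parser.close()
    return parser.crawl, parser.plain


# Hypothetical page content, before and after the rel attributes are removed.
BEFORE = (
    '<a href="http://example.com/someproject/" rel="homepage">Home page</a>'
    '<a href="http://sourceforge.net/download/someproject-1.2.tgz"'
    ' rel="download">Download</a>'
)
AFTER = BEFORE.replace(' rel="homepage"', "").replace(' rel="download"', "")
```

Calling classify(BEFORE) puts both links in the crawl list, while
classify(AFTER) leaves the same two links in place but crawls neither.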

With that out of the way, that brings me to the larger scope issue
with the modes as presented.  Notice now that with this clarification,
there is no real difference in *state* between the "pypi-cache" and
"pypi-only" modes.  There is only a *functional* difference...  and
that function is underspecified in the PEP.

What I mean is, in both pypi-cache and pypi-only, the *state* of
things is that rel="" attributes are gone, and there are links to
files on PyPI.  The only difference is in *how* the files get there.

And for the pypi-cache mode, this function is *really*
under-specified.  Arguably, this is the meat of the proposal, but it
is entirely missing.  There is nothing here about the frequency of
crawling, the methods used to select or validate files, whether there
is any expiration...  it is all just magically assumed to happen
somehow.

My suggestion would be to do two things:

First, make the state a boolean: crawl external links, with the
current state yes and the future state no, with "no" simply meaning
that the rel="" attribute is removed from the links that currently
have it.

Second, propose to offer tools in the PyPI interface (and command
line) to assist authors in making the transition, rather than
proposing a completely unspecified caching mechanism.  Better to have
some vaguely specified tools than a completely unspecified caching
mechanism, and better still to spell out very precisely what those
tools do.
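
The first of these suggestions, the boolean state, could look roughly
like this on the index side.  This is only a sketch: render_links and
its parameters are hypothetical names, not anything that exists in
PyPI, and the rel values are assumed to be "homepage" and "download".

```python
from html import escape


def render_links(home_page, download_url, crawl_external):
    """Return the "Home page" and "Download" anchors for a /simple page.

    crawl_external is the single boolean: True is the current state
    (the links carry rel attributes and get crawled), False drops the
    rel attributes but keeps the links themselves.
    """
    anchors = []
    for url, rel, label in (
        (home_page, "homepage", "Home page"),
        (download_url, "download", "Download"),
    ):
        if not url:
            continue  # the project did not set this field
        rel_attr = ' rel="%s"' % rel if crawl_external else ""
        anchors.append(
            '<a href="%s"%s>%s</a>' % (escape(url, quote=True), rel_attr, label)
        )
    return anchors
```

Flipping crawl_external from True to False changes nothing about which
links are listed; only the rel attributes disappear.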

Okay, on to the "Phases of transition".  This section gets a lot
simpler if there are only two stages.  Specifically, we let everyone
know the change is going to happen, and how long they have, give 'em
links to migration tools.  Done.  ;-)

(Okay, so analysis still makes sense: the people who don't have any
externally hosted files can get a different message, i.e., "Hey, we
notice that installing your package is slow because you have these
links that don't go anywhere.  Click here if you'd like PyPI to stop
sending people on wild goose chases".  The people who have externally
hosted files will need a more involved message.)

Whew.  Okay, that ends my critique of the PEP as it sits.  Now for an
outside-the-box suggestion.

If you'd like to be able to transition people away from spidered links
in the fewest possible steps, with the least user action, no legal
issues, and in a completely automated way, note that this can be done
with a one-time spidering of the existing links to find the download
links, then adding those links directly to the /simple index, and
switching off the rel="" attributes.  This can be done without
explicit user consent, though they can be given the chance to do it
manually, sooner.

To implement this you'd need two project-level (*not* release-level)
fields: one to indicate whether the project is using rel="" or not,
and one to contain the list of external download links, which would be
user-editable.
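
A minimal sketch of those two fields and the one-time switch, assuming
the spidering step has already produced a list of download links.  The
names ProjectLinkSettings, crawl_rel_links, external_links, and
migrate are all hypothetical; nothing here reflects PyPI's actual
schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProjectLinkSettings:
    """Hypothetical project-level (not release-level) settings."""

    # Whether the project's links still carry rel attributes (current state).
    crawl_rel_links: bool = True
    # User-editable list of external download links.
    external_links: List[str] = field(default_factory=list)


def migrate(settings, spidered_links):
    """One-time transition: record the spidered links, then stop crawling."""
    for url in spidered_links:
        if url not in settings.external_links:  # keep order, skip duplicates
            settings.external_links.append(url)
    settings.crawl_rel_links = False
    return settings
```

An author who wants to move sooner would edit external_links by hand
and flip the flag themselves; the automated pass does the same thing
for everyone else.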

This overall approach I'm proposing can be extended to also support
mirroring, since it provides an explicit place to list what it is
you're mirroring.  (At any rate, it's more explicitly specified than
any such place in the current PEP.)

That field can also be fairly easily populated for any given project
in just a few lines of code:

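    from pkg_resources import Requirement
    from setuptools.package_index import PackageIndex

    pr = Requirement.parse('Projectname')
    # Empty search_path: ignore locally installed distributions;
    # python=None, platform=None: accept any Python version and platform.
    pi = PackageIndex(search_path=[], python=None, platform=None)
    pi.find_packages(pr)  # scan the index for this project's distributions
    all_urls = [dist.location for dist in pi[pr.key]]
    external_urls = [url for url in all_urls if '//pypi.python.org' not in url]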

(Although if you want more information, like what kind of link each
one is, the dist objects actually know a bit more than just the URL.)

Anyway, I hope you found at least some of all this helpful.  ;-)