[Catalog-sig] homepage/download metadata cleaning

PJ Eby pje at telecommunity.com
Fri Mar 1 17:29:17 CET 2013


On Fri, Mar 1, 2013 at 6:17 AM, holger krekel <holger at merlinux.eu> wrote:
> On Fri, Mar 01, 2013 at 06:09 -0500, Donald Stufft wrote:
>> On Friday, March 1, 2013 at 6:04 AM, M.-A. Lemburg wrote:
>> > On 01.03.2013 11:19, holger krekel wrote:
>> > > Hi Richard, all,
>> > >
>> > > Somewhere deep in the threads I mentioned that I wrote a little
>> > > "cleanpypi.py" script which takes a project name as an argument,
>> > > goes to pypi.python.org, and removes all homepage/download
>> > > metadata entries for that project. This sanitizes and speeds up
>> > > installation because pip/easy_install no longer need to crawl
>> > > those links. I just did this for three of my projects (pytest,
>> > > tox and py) and it seems to work fine.
>> > >
>> >
>> >
>> > Does it also clean up the links that PyPI adds to the /simple/
>> > page by parsing the project description for links?
>> >
>> > I think those are far nastier than the homepage and download links,
>> > which can be put to some good use to limit the external lookups
>> > (see http://wiki.python.org/moin/PyPI/DownloadMetaDataProposal)
>> >
>> > See e.g. https://pypi.python.org/simple/zc.buildout/
>> > for a good example of the mess this generates... even mailto links
>> > get listed, and "file:///" links open the installers up to all
>> > kinds of nasty things (unless they explicitly guard against
>> > following them).
>> >
>> >
>>
>> pip at least (and I assume the other tools as well) doesn't spider
>> those links, but it does consider them for download: if a link looks
>> installable it becomes a candidate for installation, but pip won't
>> fetch it and scan it for further links the way it does with
>> download_url/home_page.
>>
>> I believe that's the way it's structured at the moment.
>
> That's right. Even though the links extracted from the long description
> look ugly on a simple/PKGNAME page, neither pip nor easy_install does
> anything with them unless the "href" ends in "#egg=PKGNAME-", in which
> case it is taken as pointing to a development tarball (e.g. at github
> or bitbucket). AFAIK a link like "PKGNAME-VER.tar.gz" will not be
> treated as an installation candidate, just the "#egg=PKGNAME" one.

Both are considered "primary links".  A primary link is a link whose
filename portion matches one of the supported distutils or setuptools
file formats, or which is marked with an #egg tag.  Primary links are
indexed by project name and version, so that if that version/format
is chosen as the best candidate, it will be downloaded and installed.
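
To make that concrete, here is a rough sketch of what "primary link"
detection amounts to.  This is illustrative only, not setuptools'
actual code; the extension list and the is_primary() helper are my
own inventions:

    import posixpath
    from urllib.parse import urlparse

    # Illustrative subset of the file formats the tools understand;
    # not the definitive list.
    DIST_EXTS = ('.tar.gz', '.tar.bz2', '.tgz', '.zip', '.egg', '.exe')

    def is_primary(href):
        """Does this link point directly at an installable distribution?"""
        url = urlparse(href)
        if 'egg=' in url.fragment:
            # Explicit #egg=PKGNAME[-VER] tag, e.g. a github tarball URL.
            return True
        # Otherwise the filename itself must look like NAME-VER.<ext>.
        filename = posixpath.basename(url.path)
        return filename.endswith(DIST_EXTS) and '-' in filename

    print(is_primary('https://example.com/pytest-2.3.4.tar.gz'))   # True
    print(is_primary('https://github.com/u/p/tarball/master#egg=pytest-dev'))  # True
    print(is_primary('mailto:someone@example.com'))                # False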

Links marked with rel="homepage" or rel="download" are "secondary
links".  Secondary links are actively retrieved and scanned to look
for more primary links; secondary links found on those pages are not
themselves scanned or followed.  (Details of all of this can be found
at:
http://peak.telecommunity.com/DevCenter/setuptools#making-your-package-available-for-easyinstall
)
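
For example (illustrative markup only; the URLs are made up), the
anchors on a /simple/ page look roughly like:

    <a href="https://example.com/dist/PKG-1.0.tar.gz">PKG-1.0.tar.gz</a>       <- primary
    <a href="https://example.com/PKG/" rel="homepage">home page</a>            <- secondary
    <a href="https://example.com/PKG/downloads/" rel="download">downloads</a>  <- secondary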

This basically means that MAL's proposal for a download.html file is
actually a bit moot: you can just stick direct "primary" download URLs
in your PyPI description field, and the tools will pick them up.  They
can even include #md5 info.  (See
http://peak.telecommunity.com/DevCenter/EasyInstall#package-index-api
- item 4 mentions the description part.)
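
For instance, something along these lines would be enough for the
tools to find the sdist without any external crawling.  The project
name, URL, and hash here are all hypothetical:

    from setuptools import setup

    setup(
        name='MyPackage',      # hypothetical example project
        version='1.0',
        # A direct "primary" link in the description: the index renders
        # it on /simple/MyPackage/, and the tools index it by project
        # name and version.  The #md5 fragment lets them verify the
        # downloaded file.
        long_description='''
    Download:
    https://example.com/dist/MyPackage-1.0.tar.gz#md5=0123456789abcdef0123456789abcdef
    ''',
    )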

This means, by the way, that you could build an external link cleaner
which spiders the external pages and pulls the candidate links into
the description for that release, thereby keeping the useful primary
links while getting rid of the secondary links used to fetch them.
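
Here is a minimal sketch of that cleaner idea, assuming Python 3's
stdlib and the same crude primary-link heuristic as the earlier
sketch; the class and function names are made up, and this is not an
existing tool:

    import re
    import urllib.request
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkCollector(HTMLParser):
        """Collect (rel, href) pairs for every <a> tag on a page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                d = dict(attrs)
                if d.get('href'):
                    self.links.append((d.get('rel', ''), d['href']))

    def links_on(url):
        html = urllib.request.urlopen(url).read().decode('utf-8', 'replace')
        collector = LinkCollector()
        collector.feed(html)
        # Resolve relative hrefs against the page they came from.
        return [(rel, urljoin(url, href)) for rel, href in collector.links]

    # Crude "does this look like a primary link?" test.
    PRIMARY = re.compile(r'(#egg=|\.(tar\.gz|tar\.bz2|tgz|zip|egg)($|#))')

    simple = 'https://pypi.python.org/simple/pytest/'   # example project
    secondary = [href for rel, href in links_on(simple)
                 if rel in ('homepage', 'download')]
    for page in secondary:
        for rel, href in links_on(page):
            if PRIMARY.search(href):
                print(href)   # candidate to paste into the description

A real cleaner would then de-duplicate these and write them back via
PyPI's edit interface, which this sketch deliberately leaves out.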

