[Catalog-sig] homepage/download metadata cleaning

M.-A. Lemburg mal at egenix.com
Fri Mar 1 21:27:38 CET 2013


Thanks for the feedback, Holger and Phillip. I'll bake this into
a version 0.2 of the proposal over the weekend.

On 01.03.2013 17:29, PJ Eby wrote:
> On Fri, Mar 1, 2013 at 6:17 AM, holger krekel <holger at merlinux.eu> wrote:
>> On Fri, Mar 01, 2013 at 06:09 -0500, Donald Stufft wrote:
>>> On Friday, March 1, 2013 at 6:04 AM, M.-A. Lemburg wrote:
>>>> On 01.03.2013 11:19, holger krekel wrote:
>>>>> Hi Richard, all,
>>>>>
>>>>> somewhere deep in the threads i mentioned i wrote a little "cleanpypi.py"
>>>>> script which takes a project name as an argument and then goes to
>>>>> pypi.python.org and removes all homepage/download metadata entries for
>>>>> this project. This sanitizes/speeds up installation because
>>>>> pip/easy_install don't need to crawl them anymore. I just did this for
>>>>> three of my projects, (pytest, tox and py) and it seems to work fine.
>>>>>
>>>>
>>>>
>>>> Does it also clean up the links that PyPI adds to the /simple/ page by
>>>> parsing the project description for links?
>>>>
>>>> I think those are far nastier than the homepage and download links,
>>>> which can be put to some good use to limit the external lookups
>>>> (see http://wiki.python.org/moin/PyPI/DownloadMetaDataProposal)
>>>>
>>>> See e.g. https://pypi.python.org/simple/zc.buildout/
>>>> for a good example of the mess this generates... even mailto links
>>>> get listed and "file:///" links open up the installers for all
>>>> kinds of nasty things (unless they explicitly protect against
>>>> following these).
>>>>
>>>>
>>>
>>> pip at least (and, I assume, the other tools) doesn't spider those links, but
>>> it does consider them for download: if a link looks installable it becomes
>>> a candidate for installing, but pip won't fetch it and look for more links
>>> the way it does for download_url/home_page.
>>>
>>> I believe that's the way it's structured atm.
>>
>> That's right. Even though the long-description extracted links
>> look ugly on a simple/PKGNAME page, neither pip nor easy_install do anything
>> with them except if the "href" ends in "#egg=PKGNAME-" in which case they are
>> taken as pointing to a development tarball (e.g. at github or bitbucket).
>> AFAIK a link like "PKGNAME-VER.tar.gz" will not be treated as
>> an installation candidate, just the "#egg=PKGNAME" one.
> 
> Both are considered "primary links".  A primary link is a link whose
> filename portion matches one of the supported distutils or setuptools
> file formats, or is marked with an #egg tag.  Primary links are
> indexed as to project name and version, so that if that version/format
> is chosen as the best candidate, it will be downloaded and installed.
> 
> Links marked with rel="homepage" or rel="download" are "secondary
> links".  Secondary links are actively retrieved and scanned to look
> for more primary links.  No further secondary links are scanned or
> followed.  (Details of all of this can be found at:
> http://peak.telecommunity.com/DevCenter/setuptools#making-your-package-available-for-easyinstall
> )
> 
> This basically means that MAL's proposal for a download.html file is
> actually a bit moot: you can just stick direct "primary" download URLs
> in your PyPI description field, and the tools will pick them up.  They
> can even include #md5 info.  (See
> http://peak.telecommunity.com/DevCenter/EasyInstall#package-index-api
> - item 4 mentions the description part.)
> 
> This means, by the way, that you could make an external link cleaner
> which spiders the external pages and pulls the candidates onto the
> description for that release, thereby keeping useful primary links and
> getting rid of the secondary links used to fetch them.
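
The primary/secondary distinction PJ Eby describes above can be sketched
in Python. This is only a minimal illustration, not the code pip or
easy_install actually run: the names classify_link, classify_page,
LinkCollector and ARCHIVE_SUFFIXES are hypothetical, and the suffix list
is an assumed subset of the formats the tools recognize.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumption: an illustrative set of archive suffixes; the real tools'
# list of supported distutils/setuptools formats may differ.
ARCHIVE_SUFFIXES = (".tar.gz", ".tar.bz2", ".tar", ".tgz", ".zip", ".egg", ".exe")


def classify_link(href, rel=""):
    """Return 'primary', 'secondary' or 'ignored' for one <a> element."""
    parts = urlparse(href)
    filename = parts.path.rsplit("/", 1)[-1].lower()
    # Primary: the filename looks like a distribution, or an #egg tag is present.
    if parts.fragment.startswith("egg=") or filename.endswith(ARCHIVE_SUFFIXES):
        return "primary"
    # Secondary: rel="homepage" or rel="download"; these pages get scanned once.
    if set(rel.split()) & {"homepage", "download"}:
        return "secondary"
    # Everything else (mailto:, plain description links, ...) is not followed.
    return "ignored"


class LinkCollector(HTMLParser):
    """Collect (href, classification) pairs from an index page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        if href:
            self.links.append((href, classify_link(href, attrs.get("rel") or "")))


def classify_page(html):
    """Classify every link found in the HTML text of a /simple/ page."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links
```

Feeding the HTML of a /simple/PKGNAME page to classify_page() would show
which links a cleaner could keep (primary) and which ones it could drop
once their targets have been harvested (secondary).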

-- 
Marc-Andre Lemburg
eGenix.com


