[Catalog-sig] homepage/download metadata cleaning

Donald Stufft donald.stufft at gmail.com
Fri Mar 1 23:39:03 CET 2013


On Friday, March 1, 2013 at 2:31 PM, M.-A. Lemburg wrote:
> On 01.03.2013 12:17, holger krekel wrote:
> > On Fri, Mar 01, 2013 at 06:09 -0500, Donald Stufft wrote:
> > > On Friday, March 1, 2013 at 6:04 AM, M.-A. Lemburg wrote:
> > > > On 01.03.2013 11:19, holger krekel wrote:
> > > > > Hi Richard, all,
> > > > > 
> > > > > somewhere deep in the threads i mentioned i wrote a little "cleanpypi.py"
> > > > > script which takes a project name as an argument and then goes to 
> > > > > pypi.python.org and removes all homepage/download metadata entries for 
> > > > > this project. This sanitizes/speeds up installation because
> > > > > pip/easy_install don't need to crawl them anymore. I just did this for
> > > > > three of my projects, (pytest, tox and py) and it seems to work fine.
> > > > > 
> > > > 
> > > > 
> > > > 
> > > > Does it also clean up the links that PyPI adds to the /simple/ pages by
> > > > parsing the project description for links?
> > > > 
> > > > I think those are far nastier than the homepage and download links,
> > > > which can be put to some good use to limit the external lookups
> > > > (see http://wiki.python.org/moin/PyPI/DownloadMetaDataProposal)
> > > > 
> > > > See e.g. https://pypi.python.org/simple/zc.buildout/
> > > > for a good example of the mess this generates... even mailto links
> > > > get listed and "file:///" links open up the installers for all
> > > > kinds of nasty things (unless they explicitly protect against
> > > > following these).
> > > > 
> > > 
> > > 
> > > pip at least (and, I assume, the other tools) doesn't spider those links,
> > > but it does consider them for download: if a link looks installable it
> > > becomes a candidate for installing, but pip won't fetch it and look for
> > > more links the way it does for download_url/home_page.
> > > 
> > > I believe that's the way it's structured atm.
> > 
> > That's right. Even though the long-description extracted links 
> > look ugly on a simple/PKGNAME page, neither pip nor easy_install do anything
> > with them except if the "href" ends in "#egg=PKGNAME-" in which case they are
> > taken as pointing to a development tarball (e.g. at github or bitbucket).
> > AFAIK a link like "PKGNAME-VER.tar.gz" will not be treated as
> > an installation candidate, just the "#egg=PKGNAME" one.
> > 
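A minimal sketch of the "#egg=PKGNAME" check described above, to make the
rule concrete. This is not pip's or easy_install's actual code; the function
name is_egg_link and the exact matching rule (fragment is "egg=PKGNAME" or
starts with "egg=PKGNAME-", compared case-insensitively) are assumptions
for illustration.

```python
from urllib.parse import urlsplit


def is_egg_link(href, project):
    """Return True if the href fragment names a development egg for project.

    Hypothetical helper: accepts "#egg=PKGNAME" and "#egg=PKGNAME-<anything>".
    """
    fragment = urlsplit(href).fragment.lower()
    if not fragment.startswith("egg="):
        return False
    name = fragment[len("egg="):]
    wanted = project.lower()
    # Require an exact name or a "-" separator so "pytestfoo" != "pytest".
    return name == wanted or name.startswith(wanted + "-")
```

With this rule, a link ending in "#egg=pytest-dev" counts as a candidate for
"pytest", while a plain "pytest-2.3.tar.gz" link with no fragment does not.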
> 
> 
> Hmm, then why not remove links that don't match the above from
> the /simple/ index pages ?
> 
> Note that it's easily possible to make e.g. file:/// links
> have a fragment that matches what you described, so I guess the
> filters would have to be more careful about what to allow
> (e.g. only http/ftp schemes, perhaps even only https schemes)
> and what not.
> 
> BTW: Are those links also shown as-is on the description page ?
> People could do nasty stuff by adding "javascript:" links which look
> like normal links to the descriptions.
> 
> 

The descriptions don't allow javascript: urls anymore (I reported that
ages ago and Richard fixed it). home_page and probably download_url
do though.
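
As a rough sketch of the kind of scheme check being discussed for home_page
and download_url (and for the file:/// case above): only allow an explicit
list of schemes and require a host. The names ALLOWED_SCHEMES and is_safe_url
are made up here, and the allowed set is an assumption taken from the
"only http/ftp, perhaps only https" suggestion; this is not PyPI's code.

```python
from urllib.parse import urlsplit

# Assumed allowlist; tighten to {"https"} if only https should pass.
ALLOWED_SCHEMES = {"http", "https", "ftp"}


def is_safe_url(url):
    """Return True if url uses an allowed scheme and names a host."""
    try:
        parts = urlsplit(url.strip())
    except ValueError:
        # urlsplit rejects some malformed URLs (e.g. a broken IPv6 host).
        return False
    # urlsplit lowercases the scheme, so "JavaScript:" is caught as well.
    return parts.scheme in ALLOWED_SCHEMES and bool(parts.netloc)
```

An allowlist is used rather than blocking "javascript:" by name, so that
file:, mailto:, data: and anything else unexpected is rejected too.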
> 
> -- 
> Marc-Andre Lemburg

