[Catalog-sig] Migrating away from scanning home pages

M.-A. Lemburg mal at egenix.com
Thu Feb 28 17:49:58 CET 2013


I've added the proposal to the wiki to keep collecting comments
and updates:

http://wiki.python.org/moin/PyPI/DownloadMetaDataProposal

On 28.02.2013 12:55, M.-A. Lemburg wrote:
> On 28.02.2013 12:45, Donald Stufft wrote:
>> On Thursday, February 28, 2013 at 5:55 AM, M.-A. Lemburg wrote:
>>> I think we all agree that scanning arbitrary HTML pages
>>> for download links is not a good idea and we need to
>>> transition away from this towards a more reliable system.
>>>
>>> Here's an approach that would work to start the transition
>>> while not breaking old tools (sketching here to describe the
>>> basic idea):
>>>
>>> Limiting scans to download_url
>>> ------------------------------
>>>
>>> Installers and similar tools should preferably no longer scan
>>> all the links on the /simple/ index, but instead only look at
>>> the download links (which can be defined in the package
>>> meta data) for packages that don't host files on PyPI.
>>>
>>> Going only one level deep
>>> -------------------------
>>>
>>> If the download links point to a meta-file named
>>> "<packagename>-<version>-downloads.html#<sha256-hashvalue>",
>>> the installers download that file, check whether the
>>> hash value matches and if it does, scan the file in
>>> the same way they would parse the /simple/ index page of
>>> the package - think of the downloads.html file as a symlink
>>> to extend the search to an external location, but in a
>>> predefined and safe way.
>>>
>>> Comments
>>> --------
>>>
>>> * The creation of the downloads.html file is left to the
>>> package owner (we could have a tool to easily create it).
>>>
>>> * Since the file would use the same format as the PyPI
>>> /simple/ index directory listing, installers would be
>>> able to verify the embedded hash values (and later
>>> GPG signatures) just as they do for files hosted directly
>>> on PyPI.
>>>
>>> * The URL of the downloads.html file, together with the
>>> hash fragment, would be placed into the setup.py
>>> download_url variable. This is supported by all recent
>>> and not so recent Python versions.
>>>
>>> * No changes to older Python versions of distutils are
>>> necessary to make this work, since the download_url
>>> field is a free form field.
>>>
>>> * No changes to existing distutils meta data formats are
>>> necessary, since the download_url field has always
>>> been meant for download URLs.
>>>
>>> * Installers would not need to learn about a new meta
>>> data format, because they already know how to parse
>>> PyPI style index listings.
>>>
>>> * Installers would prefer the above approach for downloads,
>>> and warn users if they have to fall back to the old
>>> method of scanning all links.
>>>
>>> * Installers could impose extra security requirements,
>>> such as only following HTTPS links and verifying
>>> all certificates.
>>>
>>> * In a later phase of the transition we could have
>>> PyPI cache the referenced distribution files locally
>>> to improve reliability. This would turn the push
>>> strategy for uploading files to PyPI into a pull
>>> strategy for those packages and make things a lot
>>> easier to handle for package maintainers.
>>>
>> I don't have time to respond to the rest right now, but this isn't doable
>> I don't think. The purpose of that legalese you pointed out is to make
>> it possible for PyPI to serve those files legally. We don't know if those
>> files are something PyPI is allowed to distribute so PyPI can't cache them.
> 
> Thanks for the note.
> 
> The legalese could be adapted to make this work (if needed)
> or we could add a flag to the downloads.html file which makes
> the choice explicit on a per package basis - the latter might
> be the better option to address packages that are subject to
> export control or other restrictions.
> 
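To make the "one level deep" step more concrete, here is a rough
sketch of what an installer could do with a download_url of the form
"<packagename>-<version>-downloads.html#<sha256-hashvalue>". It is only
an illustration, not a reference implementation: the function and
class names are made up for this example, the hash fragment is assumed
to be a plain hex digest, and fetching the file (with whatever HTTPS
and certificate checks the installer wants) is left to the caller.

```python
import hashlib
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


def split_meta_url(download_url):
    """Split '<url>#<sha256-hex>' into (url, expected_hex_digest).

    Assumption: the fragment is the bare hex digest of the meta-file.
    """
    url, fragment = urldefrag(download_url)
    return url, fragment.lower()


def verify_meta_file(content, expected_hex):
    """Return True if the SHA-256 of the raw bytes matches the fragment."""
    return hashlib.sha256(content).hexdigest() == expected_hex


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Fragments such as '#md5=...' are preserved by urljoin.
                self.links.append(urljoin(self.base_url, href))


def links_from_meta_file(download_url, content):
    """Verify the meta-file, then return the links it lists.

    The links are returned as-is and not followed any further, which is
    what keeps the scan limited to one level.
    """
    url, expected = split_meta_url(download_url)
    if not verify_meta_file(content, expected):
        raise ValueError("hash mismatch for " + url)
    collector = LinkCollector(url)
    collector.feed(content.decode("utf-8"))
    return collector.links
```

The returned links still carry their own hash fragments, so an
installer could check the distribution files the same way it does for
files listed on the /simple/ index page.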

-- 
Marc-Andre Lemburg
eGenix.com