[Catalog-sig] PyPI mirrors are all up to date

Tarek Ziadé tarek at ziade.org
Tue Apr 17 23:59:55 CEST 2012


On 4/17/12 12:54 PM, Donald Stufft wrote:
>
>>
>> If there's interest I can write a multiprocess-based script that keeps a
>> md5 database up-to-date
> I'd be interested ;) Although i'd prefer sha256 personally.
>>
Well, this is not really for security, and I don't think a collision can 
happen that often with md5 :D

Here's a raw script: http://tarek.pastebin.mozilla.org/1575563

The grand digest is built like a derived secret: I loop over the per-file 
hashes and chain each one with the next:

    grand hash = hash(n & n+1) for n in hashes
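
A minimal sketch of one way to read the line above, assuming "n & n+1"
means "hash the running value together with the next entry". The function
name grand_digest is hypothetical and this is not necessarily what the
pastebin script does:

```python
import hashlib


def grand_digest(hex_digests):
    """Fold an ordered sequence of per-file md5 hex digests into one digest.

    Assumption: each step hashes the running value concatenated with the
    next entry, so the result depends on every entry and on their order.
    """
    current = ""
    for digest in hex_digests:
        current = hashlib.md5((current + digest).encode("ascii")).hexdigest()
    return current
```

Since the result depends on the order, two mirrors would have to feed the
hashes in the same order (for example sorted by file path) to get
comparable digests.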

I've run it against the 111,196 files I currently have in my mirror

- First *full* run from scratch - 15m32s  (not sure why I don't get 
better numbers here, maybe Python's md5 is slower than md5deep)

- Second *full* run, md5 database filled - 2m33s - it scans the mirror, 
adds missing md5s and builds the grand digest.

- Just the digest, against a synced MD5 DB - 1m1s  (I just commented out 
the first part that builds/updates the md5 db)

In a real mirror, once the first full run is done, the md5 db would be 
updated continuously every time a new file is added
to the mirror, so the only extra load is recalculating the digest.
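
To illustrate the incremental update, here is a rough single-process
sketch. It assumes the md5 db is just a dict mapping relative paths to hex
digests; file_md5 and update_md5_db are hypothetical names, not taken from
the script above:

```python
import hashlib
import os


def file_md5(path, chunk_size=1 << 20):
    """Return the md5 hex digest of a file, read in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def update_md5_db(root, db):
    """Hash only the files under root that are not in db yet.

    db is assumed to be a dict of {relative path: md5 hex digest}.
    Returns the number of files that had to be hashed.
    """
    added = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            if rel not in db:
                db[rel] = file_md5(full)
                added += 1
    return added
```

Note that this only picks up new files: a file whose content changed under
the same path would keep its old md5 in the db.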

So it would take around a minute each time, not a few seconds as I said 
previously. But that seems OK: if a mirror is updated every 5 minutes, 
for example, 4 minutes can be spent syncing the files and 1 minute 
computing the checksum, I guess.




>> Cheers
>> Tarek
>>
>>>
>>> Regards,
>>> Martin
>>
