[Distutils] [tuf] Re: Automation for creating, updating and destroying a TUF-secured PyPI mirror

Justin Cappos jcappos at poly.edu
Tue Apr 9 07:17:31 CEST 2013


His 29MB and 58MB numbers assume that every developer has their own key
right now.   We don't think this is likely to happen and propose initially
signing everything that the developers don't sign with a single PyPI key.

It also assumes there are no abandoned packages / devel accounts.   I also
think many devels won't go back and sign all old versions of their
software.   So my number is definitely a back-of-the-envelope calculation
using Trishank's data.   Trishank's calculations are much more detailed,
but are the "worst case" size.

Thanks,
Justin




On Tue, Apr 9, 2013 at 12:18 AM, Nick Coghlan <ncoghlan at gmail.com> wrote:

> On Tue, Apr 9, 2013 at 9:58 AM, Justin Cappos <jcappos at poly.edu> wrote:
> > FYI: For anyone who wants the executive summary, we think the TUF metadata
> > will be under 1MB and even with very broad / rapid adoption of TUF in the
> > next year or two will stay <3MB or so.
>
> Is that after compression? Or did Trishank miscount the number of
> digits for the initial email?
>
> Cheers,
> Nick.
>
>
> >
> > Note that this cost is only paid upon the initial run of the client tool.
> > Everything after that just downloads diffs (or at least will once we fix
> > an open ticket).
> >
> > Thanks,
> > Justin
> >
> >
> >
> > On Mon, Apr 8, 2013 at 2:41 PM, Trishank Karthik Kuppusamy
> > <tk47 at students.poly.edu> wrote:
> >>
> >> Hello everyone,
> >>
> >> I have been testing and refining the pypi.updateframework.com automation
> >> over the past week, and looking at how much TUF metadata is generated for
> >> PyPI.
> >>
> >> In this email, I am going to focus only on the PyPI data under /simple;
> >> let us call that "simple data".
> >>
> >> Now, if we assume that every developer will have her own key to sign the
> >> simple data for her package, then this is what the TUF metadata could look
> >> like:
> >>
> >> metadata/targets.txt
> >> ====================
> >> Delegation from the targets to the targets/simple role, with the former
> >> role being responsible for no target data because it has none of its own.
> >>
> >> metadata/targets/simple.txt
> >> ===========================
> >> Delegation from targets/simple to the targets/simple/packageI role, with
> >> the former role being responsible for one target datum: simple/index.html.
> >>
> >> metadata/targets/simple/packageI.txt
> >> ====================================
> >> The targets/simple/packageI role is responsible only for the simple data
> >> at simple/packageI/index.html.
> >>
> >> In this upper bound case, where every developer is responsible for signing
> >> her own package, one can estimate the metadata size to be like so:
> >>
> >> - metadata/targets.txt is at most a few KB, and can be safely ignored.
> >> - metadata/targets/simple/packageI.txt is about 1KB.
> >> - metadata/targets/simple.txt is about the sum of all
> >> metadata/targets/simple/packageI.txt files. (This is a very rough
> >> estimate!)
> >>
> >> Therefore, if we have 30,000 developer packages on PyPI (roughly the
> >> current number of packages), then we would have about 29 MB of
> >> metadata/targets/simple/packageI.txt, and another 29 MB of
> >> metadata/targets/simple.txt, for a rough total of 58 MB. If PyPI has 45 GB
> >> of total data (roughly what I saw from my last mirror), then the simple
> >> metadata is about 0.13% of total data size.
> >>
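The worst-case arithmetic quoted above can be reproduced with a short sketch. It takes the figures from the quoted message (30,000 packages, about 1KB per packageI.txt, 45GB of total data) and assumes binary units (1 MB = 1024 KB), which is what makes 30,000 KB come out at about 29 MB. The function and constant names are made up for illustration.

```python
# Back-of-the-envelope worst case: one key and one role file per package.
NUM_PACKAGES = 30_000         # rough PyPI package count from the quoted message
PER_PACKAGE_KB = 1            # approximate size of one packageI.txt
TOTAL_PYPI_DATA_GB = 45       # rough size of a full mirror from the quoted message


def worst_case_metadata_mb(num_packages=NUM_PACKAGES, per_package_kb=PER_PACKAGE_KB):
    """Return (packageI.txt files, simple.txt, total) sizes in MB."""
    package_files_mb = num_packages * per_package_kb / 1024
    # The message estimates simple.txt as roughly the sum of the package files.
    simple_txt_mb = package_files_mb
    return package_files_mb, simple_txt_mb, package_files_mb + simple_txt_mb


def metadata_share_percent(total_mb, total_data_gb=TOTAL_PYPI_DATA_GB):
    """Metadata size as a percentage of all PyPI data."""
    return 100 * total_mb / (total_data_gb * 1024)


package_files_mb, simple_txt_mb, total_mb = worst_case_metadata_mb()
print(f"packageI.txt files: {package_files_mb:.1f} MB")
print(f"simple.txt:         {simple_txt_mb:.1f} MB")
print(f"total:              {total_mb:.1f} MB")
print(f"share of all data:  {metadata_share_percent(total_mb):.2f}%")
```

Changing NUM_PACKAGES or PER_PACKAGE_KB shows how sensitive the total is to either guess.
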
> >> This may seem like a lot of metadata, but let us remember a few important
> >> things:
> >>
> >> - So far, the metadata is simply uncompressed JSON. We are considering
> >> metadata compression or difference schemes.
> >> - This assumes the upper bound case, where every package developer is
> >> responsible for her own package, so that means that we have to talk about
> >> a lot of keys (random data).
> >> - This is a one-time initial download cost. An update to PyPI is unlikely
> >> to change all the simple data; therefore, updates to the simple metadata
> >> will be cheap, because a TUF client would only download updated metadata.
> >> We could amortize the initial simple metadata download cost by
> >> distributing it with PyPI installers (e.g. pip).
> >>
> >> Could we do better? Yes!
> >>
> >> As Nick Coghlan has suggested, PyPI could begin adopting TUF by signing
> >> for all of the developer packages itself. This means that we could reuse a
> >> key for multiple developer packages instead of dedicating a key per
> >> package. The tradeoff here is that if one such "shared key" is compromised,
> >> then multiple packages (but not all of them) could be compromised.
> >>
> >> In this case, where a shared key signs up to, say, 1,000 developer
> >> packages, we would have the following simple metadata size.
> >> First, let us define some terms:
> >>
> >> NP = # of developer packages
> >> NPK = # of developer packages signed by a key
> >> NR = # of roles (each responsible for NPK packages) = math.ceil(NP/NPK)
> >> K = average key metadata size
> >> D = average delegated role metadata size given one target path
> >> P = average target path length
> >> T = average simple target (index.html) metadata size
> >>
> >> metadata/targets/simple.txt
> >> ===========================
> >> Most of the metadata here deals with all of the keys, and the roles, used
> >> to sign simple data. Therefore, the size of the keys and roles metadata
> >> will dominate this file.
> >>
> >> key metadata size = NR*K
> >> role metadata size = NR*(D+NPK*P)
> >>
> >> Takeaway: the lower the NPK (the number of developer packages signed by a
> >> key), the higher the NR, and the larger the metadata. We would save
> >> metadata by setting NPK to, say, 1,000, because then one key could describe
> >> 1,000 packages.
> >>
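As a rough illustration of the two formulas for metadata/targets/simple.txt, the sketch below evaluates them for different NPK values. The byte values chosen for K, D and P are placeholders, not measurements; only the formulas and the use of math.ceil come from the quoted message.

```python
import math


def simple_txt_size(np_, npk, k, d, p):
    """Evaluate the quoted formulas for metadata/targets/simple.txt.

    np_ = number of developer packages (NP)
    npk = packages signed per key (NPK)
    k, d, p = average key, delegated-role and target-path sizes in bytes
    Returns (NR, key metadata size, role metadata size).
    """
    nr = math.ceil(np_ / npk)
    key_size = nr * k
    role_size = nr * (d + npk * p)
    return nr, key_size, role_size


# Placeholder averages in bytes; assumptions for illustration only.
K, D, P = 500, 200, 30

for npk in (1, 100, 1000):
    nr, key_size, role_size = simple_txt_size(30_000, npk, K, D, P)
    print(f"NPK={npk:>4}  NR={nr:>5}  keys={key_size:>10} B  roles={role_size:>9} B")
```

With these placeholder values the key term shrinks in proportion to NR, while the role term levels off near NP*P, because every target path still has to be listed once.
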
> >> metadata/targets/simple/roleI.txt
> >> =================================
> >> When NPK=1, then this file would be equivalent to
> >> metadata/targets/simple/packageI.txt.
> >>
> >> It is a small metadata file if we assume that it only talks about the
> >> simple data (index.html) for one package. Most of the metadata talks about
> >> key signatures, and target metadata. If we increase NPK, then clearly the
> >> target metadata would increase in size:
> >>
> >> target metadata size = NPK*T < NPK*1KB
> >>
> >> Takeaway: the target metadata would increase in size, but it certainly
> >> will not increase as much as it would have if we had signed each developer
> >> package with a separate key.
> >>
> >> Finally, the question is how the savings in metadata/targets/simple.txt
> >> would compare to the "growth" of the metadata/targets/simple/roleI.txt
> >> files. Ultimately, the higher the NPK (and thus the lower the NR), the
> >> less we would be talking about keys (random data). Everything else would
> >> remain the same, because there would still be the same number of targets,
> >> and thus the same amount of target metadata. So, we would have net
> >> savings.
> >>
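The claim that the total target metadata stays the same however packages are grouped can be checked with a small sketch. It splits a list of made-up package names into roles of at most NPK packages and sums the per-role target metadata (NPK*T). The value of T is a placeholder under the 1KB bound mentioned above, and grouping consecutive names is only one possible assignment.

```python
def split_into_roles(packages, npk):
    """Group package names into consecutive roles of at most npk packages."""
    return [packages[i:i + npk] for i in range(0, len(packages), npk)]


def target_metadata_sizes(packages, npk, t):
    """Target metadata size per roleI.txt file: (packages in role) * T."""
    return [len(role) * t for role in split_into_roles(packages, npk)]


# Hypothetical inputs: made-up package names and a placeholder T in bytes.
packages = [f"package{i}" for i in range(30_000)]
T = 800

for npk in (1, 100, 1000):
    sizes = target_metadata_sizes(packages, npk, T)
    print(f"NPK={npk:>4}  roles={len(sizes):>5}  "
          f"largest role file={max(sizes):>7} B  total={sum(sizes)} B")
```

Each roleI.txt grows with NPK, but the sum over all role files stays at NP*T, so any net saving has to come from the smaller number of keys and roles.
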
> >> I hope this clears up some questions about metadata size. If there was
> >> something confusing because I did not explain it well enough or I got
> >> something wrong, please be sure to let me know. My machine is nearly done
> >> generating all the simple metadata, so we can make better estimates then.
> >>
> >> -Trishank
> >>
> >
> >
> > _______________________________________________
> > Distutils-SIG maillist  -  Distutils-SIG at python.org
> > http://mail.python.org/mailman/listinfo/distutils-sig
> >
>
>
>
> --
> Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
>