[Python-Dev] PEP 552: deterministic pycs

Nick Coghlan ncoghlan at gmail.com
Thu Sep 7 21:47:20 EDT 2017


On 7 September 2017 at 16:58, Gregory P. Smith <greg at krypto.org> wrote:
> +1 on this PEP.
>
> The TL;DR summary of this PEP:
>   The pyc date+length metadata check was a convenient hack.  It still works
> well for many people and use cases, it isn't going away.
>   PEP 552 proposes a new alternate hack that relies on file contents instead
> of os and filesystem date metadata.
>     Assumption: The hash function is significantly faster than re-parsing
> the source.  (guaranteed to be true)
>
> Questions:
>
> Input from OS package distributors would be interesting.  Would they use
> this?  Which way would it impact their startup time (loading the .py file vs
> just statting it.  does that even matter?  source files are often eventually
> loaded for linecache use in tracebacks anyways)?

Christian and I asked some of our security folks for their personal
wishlists recently, and one of the items that came up was "The
recompile is based on a timestamp. How do you know the pyc file on
disk really is related to the py file that is human readable? Can it
be based on a hash or something like that?"

This is a restating of the reproducible build use case: for a given
version of Python, a given source file should always give the same
source hash and marshaled code object, and once it does, it's easier
to do an independent compilation from the source file and check you
get the same answer.
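A minimal sketch of that check, assuming the header layout PEP 552 ended up specifying (4-byte magic, 4-byte little-endian flags word with bit 0 meaning "hash-based", then an 8-byte source hash) and the `importlib.util.source_hash` helper available from Python 3.7 onwards. The function name `pyc_matches_source` is made up for illustration:

```python
import importlib.util


def pyc_matches_source(pyc_path, source_path):
    """Return True if a hash-based pyc was built from exactly this source."""
    with open(pyc_path, "rb") as f:
        header = f.read(16)
    # Reject truncated files and pycs written by a different Python version.
    if len(header) < 16 or header[:4] != importlib.util.MAGIC_NUMBER:
        return False
    flags = int.from_bytes(header[4:8], "little")
    if not flags & 0b1:
        # Timestamp-based pyc: there is no source hash to compare against.
        return False
    with open(source_path, "rb") as f:
        source = f.read()
    # Bytes 8..16 hold the hash of the source the pyc was compiled from.
    return header[8:16] == importlib.util.source_hash(source)
```

This only ties the pyc header to the source text. A full audit would also recompile the source and compare the marshaled code object that follows the header.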

While you can implement that for timestamp-based formats by adjusting
input file metadata (and that's exactly what distros do with
SOURCE_DATE_EPOCH), it's still pretty annoying, and not particularly
build cache friendly, since the same file in different source
artifacts may produce different build outputs.
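To illustrate the kind of metadata adjustment meant here, a rough sketch that clamps a source file's mtime to the `SOURCE_DATE_EPOCH` environment variable before compiling. The helper name `clamp_mtime` is hypothetical, and real distro tooling differs in detail:

```python
import os


def clamp_mtime(path, epoch=None):
    """Clamp a file's mtime to `epoch` (default: $SOURCE_DATE_EPOCH)."""
    if epoch is None:
        # Assumes the variable is set, as it would be in a reproducible build.
        epoch = int(os.environ["SOURCE_DATE_EPOCH"])
    st = os.stat(path)
    if st.st_mtime > epoch:
        # Keep the access time, pull the modification time back to the epoch.
        os.utime(path, (st.st_atime, epoch))
    return int(os.stat(path).st_mtime)
```

The timestamp recorded in the pyc then depends on the build's chosen epoch rather than on the file contents, which is why the same file can still yield different outputs in different source artifacts.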

> Would they benefit from a pyc that can contain _both_ timestamp+length, and
> the source_hash?  if both were present, I assume that only one would be
> checked at startup.  i'm not sure what would make the decision of what to
> check.  one fails, check the other?  i personally do not have a use for this
> case so i'd omit the complexity without a demonstrated need.

I don't see any way we'd benefit from having both items present.

However, I do wonder whether we could encode *all* the mode settings
into the magic number, such that we did something like reserving the
top 3 bits for format flags:

* number & 0x1FFF -> the traditional magic number
* number & 0x8000 -> timestamp or hash?
* number & 0x4000 -> checked or not?
* number & 0x2000 -> reserved for future format changes

By default we'd still produce the checked-timestamp format, but
managed build systems (including Linux distros) could opt in to the
unchecked-hash format.
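As a rough sketch of that bit layout, treating the magic number as a 16-bit value. The names below are hypothetical, and the polarity of the flags (set = hash, set = unchecked, so that a magic number with no flags set stays checked-timestamp) is an assumption rather than something spelled out above:

```python
from typing import NamedTuple

MAGIC_MASK = 0x1FFF      # low 13 bits: the traditional magic number
HASH_FLAG = 0x8000       # assumed: set -> source hash, clear -> timestamp
UNCHECKED_FLAG = 0x4000  # assumed: set -> unchecked, clear -> checked
RESERVED_FLAG = 0x2000   # reserved for future format changes


class PycMode(NamedTuple):
    magic: int
    hash_based: bool
    checked: bool


def encode_magic(magic, *, hash_based=False, checked=True):
    """Fold the mode settings into the top bits of the magic number."""
    if magic & ~MAGIC_MASK:
        raise ValueError("magic number does not fit in 13 bits")
    number = magic
    if hash_based:
        number |= HASH_FLAG
    if not checked:
        number |= UNCHECKED_FLAG
    return number


def decode_magic(number):
    """Split a combined value back into magic number and mode settings."""
    if number & RESERVED_FLAG:
        raise ValueError("reserved format bit is set")
    return PycMode(
        magic=number & MAGIC_MASK,
        hash_based=bool(number & HASH_FLAG),
        checked=not (number & UNCHECKED_FLAG),
    )
```

With this polarity, `encode_magic(m)` returns `m` unchanged for the default checked-timestamp format, so existing magic numbers would keep their meaning.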

> Something to also state in the PEP:
>
> This is intentionally not a "secure" hash.  Security is explicitly a
> non-goal.

I don't think it's so much that security is a non-goal, as that the
(admittedly minor) security improvement comes from making it easier to
reproduce the expected machine-readable output from a given
human-readable input, rather than from the nature of the hashing
function used.

> Rationale behind my support:

+1 from me as well, for the reasons Greg gives (while Fedora doesn't
currently do any per-file build artifact caching, I hope we will in
the future, and output formats based on input artifact hashes will
make that much easier than formats based on input timestamps).

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

