[Python-Dev] PEP 3147 ready for pronouncement and merging

Barry Warsaw barry at python.org
Sat Apr 17 01:00:30 CEST 2010


On Apr 15, 2010, at 08:01 PM, Guido van Rossum wrote:

>> Byte code files contain two 32-bit numbers followed by the marshaled
>
>big-endian

Done.

>> [2]_ code object.  The 32-bit numbers represent a magic number and a
>> timestamp.  The magic number changes whenever Python changes the byte
>> code format, e.g. by adding new byte codes to its virtual machine.
>> This ensures that pyc files built for previous versions of the VM
>> won't cause problems.  The timestamp is used to make sure that the pyc
>> file is not older than the py file that was used to create it.  When
>
>is not older than -> matches
>
>(Obscure fact: the timestamp in the pyc file must match the source's
>mtime exactly.)

Done.
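
To make the exact-match rule concrete, here is a minimal sketch.  The helper
name `pyc_is_current` is hypothetical, and it assumes the 32-bit timestamp has
already been decoded from the pyc header; it is not the importer's actual code.

```python
def pyc_is_current(pyc_timestamp, source_mtime):
    """Return True only if the stored timestamp equals the source mtime.

    Hypothetical helper: assumes pyc_timestamp was already decoded from
    the 32-bit timestamp field in the pyc header.
    """
    # Truncate the source mtime to whole seconds and to 32 bits so that it
    # is comparable with the 32-bit field stored in the pyc file.
    return pyc_timestamp == (int(source_mtime) & 0xFFFFFFFF)


# Same second: the pyc is usable.
same_second = pyc_is_current(1271458830, 1271458830.7)

# A pyc stamped *later* than the source is still stale, because the check
# is equality rather than "not older than".
stamped_later = pyc_is_current(1271458831, 1271458830.0)
```

The second call is the interesting one: a "not older than" test would accept
it, while the exact match does not.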

>> Rationale
>> =========
>>
>> Linux distributions such as Ubuntu [4]_ and Debian [5]_ provide more
>> than one Python version at the same time to their users.  For example,
>> Ubuntu 9.10 Karmic Koala users can install Python 2.5, 2.6, and 3.1,
>> with Python 2.6 being the default.
>>
>> This causes a conflict for Python source files installed by the
>> system (including third party packages), because you cannot compile a
>
>I'd say only 3rd party packages, right? (And code written by the distro,
>which from Python's POV is also 3rd party.) At least ought to clarify
>that the stdlib is unaffected by this conflict, because multiple
>versions of the stdlib *are* installed.

Yes, good point.  Clarified.

>> single Python source file for more than one Python version at a time.
>> Thus if your system wanted to install a `/usr/share/python/foo.py`, it
>> could not create a `/usr/share/python/foo.pyc` file usable across all
>> installed Python versions.
>
>Note that (due to the magic#) Python doesn't crash, it just falls back
>on the slower approach of compiling from source.
>
>Perhaps more important is that different Python versions (if the user
>has write permission) will fight over the pyc file and rewrite it each
>time the source is compiled. Worse, even though the magic# is
>initially written as zero and then rewritten with the correct value,
>concurrent processes running different Python versions can actually
>end up reading corrupt bytecode. (Alex Martelli diagnosed this at
>Google years ago.)

Good point; I've made this more clear.

>> Furthermore, in order to ease the burden on operating system packagers
>> for these distributions, the distribution packages do not contain
>> Python version numbers [6]_; they are shared across all Python
>> versions installed on the system.  Putting Python version numbers in
>> the packages would be a maintenance nightmare, since all the packages
>> - *and their dependencies* - would have to be updated every time a new
>> Python release was added to or removed from the distribution.  Because of
>> the sheer number of packages available, this amount of work is
>> infeasible.
>>
>> C extensions can be source compatible across multiple versions of
>> Python.  Compiled extension modules are usually not compatible though,
>
>Actually we typically make every effort to support backwards
>compatibility for compiled modules, and the module initialization API
>contains a version# check. This is a different version# than the
>import magic# and historically has changed much less frequently.

I've rewritten this paragraph a bit.  It's not particularly relevant to this
PEP. (I'll be looking at PEP 384 soon.)

>> and PEP 384 [7]_ has been proposed to address this by defining a
>> stable ABI for extension modules.
>>

>> Proposal
>> ========
>>
>> Python's import machinery is extended to write and search for byte
>> code cache files in a single directory inside every Python package
>> directory.  This directory will be called `__pycache__`.
>> Further, pyc files will contain a magic string that differentiates the
>
>Clarify that the magic string is in the filename, not in the file contents.

Yep.

>> Python version they were compiled for.  This allows multiple byte
>> compiled cache files to co-exist for a single Python source file.
>>
>> This scheme has the added benefit of reducing the clutter in a Python
>> package directory.
>>
>> When a Python source file is imported for the first time, a
>> `__pycache__` directory will be created in the package directory, if
>
>Is this still true? ISTR there was a lot of discussion about the
>auto-creation and possible security concerns.

It is still true.  I think we determined it will usually not be an issue
because the umask will not be altered, and because normal installation
procedures typically involve byte compilation (and thus __pycache__ creation)
during installation time via tools like compileall.  This really is describing
what happens when you run Python over pure Python source code for the first
time, and it's no different from what happens now with the automatic creation
of pyc files.
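
As a rough illustration of the install-time case, the sketch below byte
compiles a throwaway directory with compileall and then checks where the cache
file landed.  It assumes an interpreter that already implements this PEP, and
it uses `importlib.util.cache_from_source`, which is today's spelling of the
path helper and is not something discussed in this thread.

```python
import compileall
import importlib.util
import os
import tempfile

with tempfile.TemporaryDirectory() as pkg_dir:
    # A stand-in for a freshly installed pure-Python source file.
    source = os.path.join(pkg_dir, "foo.py")
    with open(source, "w") as f:
        f.write("x = 1\n")

    # What an installer would typically do: byte compile everything up front.
    ok = compileall.compile_dir(pkg_dir, quiet=1)

    # Ask the interpreter where the cache file for this source belongs,
    # then confirm that compileall actually wrote it there.
    cached = importlib.util.cache_from_source(source)
    cache_written = os.path.exists(cached)
    cache_name = os.path.basename(cached)
```

After this runs, the cache directory already exists, so a later import by an
unprivileged user has nothing to create.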

>> one does not already exist.  The pyc file for the imported source will
>> be written to the `__pycache__` directory, using the magic-tag
>
>By now the magic-tag format should have been defined (or a "see below"
>inserted).

Based on this and your following comment, I've moved the description of the
magic tag format to here, and rewritten it to fit in context.  The section
discussing the hexadecimal representation is moved to the (rejected)
"Alternatives" section.

>> Case 1: The first import
>> ------------------------
>>
>> When Python is asked to import module `foo`, it searches for a
>> `foo.py` file (or `foo` package, but that's not important for this
>> discussion) along its `sys.path`.  When Python locates the `foo.py`
>> file it will look for a `__pycache__` directory in the directory where
>> it found the `foo.py`.  If the `__pycache__` directory is missing,
>> Python will create it.  Then it will parse and byte compile the
>> `foo.py` file and save the byte code in `__pycache__/foo.<magic>.pyc`,
>> where <magic> is defined by the Python implementation, but will be a
>> human readable string such as `cpython-32`.
>
>(Aside: at first I read this as a description of the full algorithm.
>But there is a step missing -- the __pycache__/foo.<magic>.pyc file is
>searched and not found.)

I added a Case 0 for the "steady state" which should clarify this.
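
A minimal sketch of the lookup, to show where the missing step fits.  Both
helper names are hypothetical, a plain set of paths stands in for the
filesystem, and the timestamp check is left out to keep it short.

```python
import posixpath


def cache_path(source_path, magic_tag):
    """Hypothetical helper: map dir/foo.py to dir/__pycache__/foo.<magic>.pyc."""
    directory, filename = posixpath.split(source_path)
    stem, _ext = posixpath.splitext(filename)
    cache_name = "{}.{}.pyc".format(stem, magic_tag)
    return posixpath.join(directory, "__pycache__", cache_name)


def import_case(source_path, magic_tag, existing_files):
    """Say which case applies; existing_files is a set standing in for disk."""
    # The step that was missing from Case 1: look for the cache file first.
    if cache_path(source_path, magic_tag) in existing_files:
        return "case 0: load cached byte code"
    # Not found, so create __pycache__ if needed, compile, and write the file.
    return "case 1: compile and write cache"


source = "/usr/share/python/foo.py"
first = import_case(source, "cpython-32", {source})
steady = import_case(source, "cpython-32", {source, cache_path(source, "cpython-32")})
```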

>> Magic identifiers
>> =================
>>
>> pyc files inside of the `__pycache__` directories contain a magic
>> identifier in their file names.  These are mnemonic tags for the
>> actual magic numbers used by the importer.  For example, in Python
>> 3.2, we could use the hexlified [10]_ magic number as a unique
>
>(Aside: when you search Wikipedia for "hexlify" it says "did you mean:
>heavily?" :-)

:) Emacs is where I first encountered this term, e.g. M-x hexlify-buffer.  It
got carried over to the binascii module.  But in this case "hexadecimal
representation of the binary magic number" is probably a better term to use.

>
>> identifier::
>>
>>    >>> from binascii import hexlify
>>    >>> from imp import get_magic
>>    >>> 'foo.{}.pyc'.format(hexlify(get_magic()).decode('ascii'))
>>    'foo.580c0d0a.pyc'
>>
>> This isn't particularly human friendly though.  Instead, this PEP
>
>This section reads a bit weird -- first it describes the solution we
>*didn't* pick. I'd move that to an "Alternatives Considered and
>Rejected" section or some such.

Agreed; see above.

>> proposes a *magic tag* that uniquely defines `.pyc` files for the
>> current version of Python.  Whenever the magic number is bumped, a new
>> magic tag is defined which is unique among all versions and
>> implementations of Python.  The actual content of the magic tag is
>> left up to the implementation, although it is recommended that the tag
>> include the implementation name and a version shorthand.  In general,
>> magic numbers never change between Python micro releases, but the
>> convention can be extended to handle magic number changes between
>> pre-release development versions.
>>
>> For example, CPython 3.2 would have a magic tag of `cpython-32` and
>> write pyc files like this: `foo.cpython-32.pyc`.  When the `-O` flag
>> is used, it would write `foo.cpython-32.pyo`.  For backports of this
>> feature to Python 2, when the `-U` flag is used, a file such as
>> `foo.cpython-27u.pyc` can be written.
>
>Does all of this match the implementation?

Yes.  Well, except for the -U part, since I haven't backported this to Python
2... yet :).
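
For reference, the naming convention in that paragraph can be sketched as
below.  The helper names and the `unicode_flag` parameter are hypothetical,
and the `u` suffix only reflects the PEP's suggestion for a Python 2
backport, which does not exist yet.

```python
def magic_tag(implementation, major, minor, unicode_flag=False):
    """Hypothetical helper: build a tag such as 'cpython-32'."""
    tag = "{}-{}{}".format(implementation, major, minor)
    # The trailing 'u' mirrors the suggested naming for Python 2 with -U.
    return (tag + "u") if unicode_flag else tag


def cache_filename(stem, tag, optimized=False):
    """Hypothetical helper: .pyo when -O is in effect, .pyc otherwise."""
    extension = "pyo" if optimized else "pyc"
    return "{}.{}.{}".format(stem, tag, extension)


tag = magic_tag("cpython", 3, 2)
normal = cache_filename("foo", tag)
optimized = cache_filename("foo", tag, optimized=True)
backport = cache_filename("foo", magic_tag("cpython", 2, 7, unicode_flag=True))
```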

>> Implementation strategy
>> =======================
>>
>> This feature is targeted for Python 3.2, solving the problem for that
>> and all future versions.  It may be back-ported to Python 2.7.
>
>Is there time given that 2.7b1 was released?

See my previous response.

>> This PEP proposes the addition of an `__cached__` attribute to
>> modules, which will always point to the actual `pyc` file that was
>> read or written.  When the environment variable
>> `$PYTHONDONTWRITEBYTECODE` is set, or the `-B` option is given, or if
>> the source lives on a read-only filesystem, then the `__cached__`
>> attribute will point to the location that the `pyc` file *would* have
>> been written to if it didn't exist.  This location of course includes
>> the `__pycache__` subdirectory in its path.
>
>Hm. I wish there was a way to find out whether the bytecode (or
>whatever) actually *was* read from this file. __file__ in Python 2
>supports this (though not in Python 3).

Do you have a use case for that?  It might be interesting to know, but I can't
think of a good way to infer that from __file__ and __cached__, or of a good
way to expose that on module objects.   Of course, it would be totally Python
implementation dependent too.
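
To illustrate the "would have been written" behavior, here is a small sketch.
It assumes a current CPython and reads the `cached` attribute of the module
spec, which is a later API that is not part of this thread; the point is only
that the recorded path can name a file that does not exist.

```python
import importlib.util
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "foo.py")
    with open(source, "w") as f:
        f.write("x = 1\n")

    # Build the import metadata for foo.py without importing or compiling it.
    spec = importlib.util.spec_from_file_location("foo", source)

    # The cache location is known even though nothing has been written yet.
    cached = spec.cached
    cache_exists = os.path.exists(cached)
    expected = importlib.util.cache_from_source(source)
```

So the path by itself says nothing about whether byte code was actually read
from that file, which is the gap being asked about.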

>> Backports
>> ---------
>>
>> For versions of Python earlier than 3.2 (and possibly 2.7), it is
>> possible to backport this PEP.  However, in Python 3.2 (and possibly
>> 2.7), this behavior will be turned on by default, and in fact, it will
>> replace the old behavior.  Backports will need to support the old
>> layout by default.  We suggest supporting PEP 3147 through the use of
>> an environment variable called `$PYTHONENABLECACHEDIR` or the command
>> line switch `-Xenablecachedir` to enable the feature.
>
>I would be okay if a distro decided to turn it on by default, as long
>as there was a way to opt out.

For Python 2.6, even for a distro-specific backport, I think I'd want to
enable it only with a switch.  It might be weird for example if Python 2.6 in
Ubuntu 10.04 produced traditional pyc files, but Python 2.6 in Ubuntu 10.10
produced PEP 3147 file names.  For a backport to Python 2.7 though (which
e.g. would be new in Ubuntu 10.10), it might make sense to enable it by
default.

Either way, we're really talking about the effects on user code only.  We'll
definitely enable it as part of the package installation tools.

Thanks again Guido.  I think this hits all your feedback.  Now to land the
code!

-Barry