[Python-ideas] add a hash to .pyc to avoid mismatches between .py and .pyc

Xavier Combelle xavier.combelle at gmail.com
Sun Aug 14 22:35:32 EDT 2016


On 15/08/2016 02:45, Wes Turner wrote:
>
> You can add a `make clean` build step:
>
>   pyclean:
>       find . -name '*.pyc' -delete
>
> You can delete all .pyc files
>
> - $ find . -name '*.pyc' -delete
> - http://manpages.ubuntu.com/manpages/precise/man1/pyclean.1.html
> #.pyc, .pyo
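>
> A cross-platform Python equivalent of the find command above (a
> minimal sketch using pathlib):
>
>   import pathlib
>
>   # Recursively remove compiled bytecode (.pyc and .pyo), mirroring
>   # `find . -name '*.py[co]' -delete`.
>   for path in pathlib.Path('.').rglob('*.py[co]'):
>       path.unlink()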
>
> You can rebuild all .pyc files (for a given directory):
>
> - $ python -m compileall -h
> - https://docs.python.org/2/library/compileall.html
> - https://docs.python.org/3/library/compileall.html
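>
> For example, forcing a rebuild from Python with the stdlib API (a
> sketch; force=True recompiles even when timestamps look current):
>
>   import compileall
>
>   # Walk the current directory and recompile every .py to .pyc.
>   compileall.compile_dir('.', force=True)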
>
>
>
> You can build .pyo instead of .pyc (see the sketch below):
>
> - https://docs.python.org/2/using/cmdline.html#envvar-PYTHONOPTIMIZE
> - https://docs.python.org/2/using/cmdline.html#cmdoption-O
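>
> A sketch of driving this from Python (under Python 2, the -O flag
> makes compileall emit .pyo instead of .pyc):
>
>   import subprocess
>   import sys
>
>   # Re-run compileall under -O so optimized bytecode is written.
>   subprocess.check_call([sys.executable, '-O', '-m', 'compileall', '.'])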
>
> You can avoid writing .pyc or .pyo at all with PYTHONDONTWRITEBYTECODE / -B (see the sketch below):
>
> - https://docs.python.org/2/using/cmdline.html#envvar-PYTHONDONTWRITEBYTECODE
> - https://docs.python.org/2/using/cmdline.html#cmdoption-B
> - If the files already exist, though, the import system may still
>   load them:
>   - https://docs.python.org/3/reference/import.html
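>
> The same switch is available at runtime (a sketch; note it only stops
> *writing* bytecode, existing .pyc files are still loaded):
>
>   import sys
>
>   # Equivalent to passing -B or setting PYTHONDONTWRITEBYTECODE.
>   sys.dont_write_bytecode = True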
>
> You can build a PEX (which rebuilds .pyc files) and test/deploy that:
>
> - https://github.com/pantsbuild/pex#integrating-pex-into-your-workflow
> - https://pantsbuild.github.io/python-readme.html#more-about-python-tests
>
> How .pyc files currently work:
>
> - http://nedbatchelder.com/blog/200804/the_structure_of_pyc_files.html
> - https://www.python.org/dev/peps/pep-3147/#flow-chart (*.pyc -> ./__pycache__)
> - http://raulcd.com/how-python-caches-compiled-bytecode.html
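>
> To make the current staleness check concrete, here is a rough sketch
> of reading the CPython 3.3-3.6 .pyc header (4-byte magic, 4-byte
> source mtime, 4-byte source size); pyc_is_stale is a hypothetical
> helper, the real logic lives in importlib:
>
>   import importlib.util
>   import os
>   import struct
>
>   def pyc_is_stale(py_path, pyc_path):
>       with open(pyc_path, 'rb') as f:
>           magic = f.read(4)
>           mtime, size = struct.unpack('<II', f.read(8))
>       if magic != importlib.util.MAGIC_NUMBER:
>           return True  # built by a different interpreter version
>       st = os.stat(py_path)
>       return (mtime != int(st.st_mtime) & 0xFFFFFFFF
>               or size != st.st_size & 0xFFFFFFFF)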
>
> You could add a hash of the .py source file in the header of the
> .pyc/.pyo object (as proposed)
>
> - The overhead of this hashing would be a significant performance
> regression
> - Instead, today, the build step can just pyclean or build a
> .zip/.WHL/.PEX which is expected to be a fresh build
>
The problem is not a lack of options to work around it; the simplest
is to delete the .pyc file, and that is easy to do once you spot the
problem. The problem is that it happens randomly in the middle of a
normal workflow.

To get an idea of the overhead of the whole hashing procedure, I ran
the following script:

from time import time
from zlib import adler32

# Time the import of the module itself.
t0 = time()
import decimal
import_time = time() - t0
print(decimal.__file__)

# Time a fast non-cryptographic checksum of the module's source.
t0 = time()
with open(decimal.__file__, 'rb') as f:
    checksum = adler32(f.read())
hash_time = time() - t0

print(hash_time, import_time, hash_time / import_time)

decimal was chosen because it is the biggest file in the standard library.
Over 20 runs, the overhead was always between 1% and 1.5%.
So yes, the overhead on the import process is measurable, but it is
very small; consequently, I would not call it significant. Moreover,
importing is only one part (and not the biggest one) of a program's
whole run time.

Unlike in my first mail, I now consider only a non-cryptographic
hash/checksum, as the only aim is to prevent an accidental mismatch
between the .pyc and the .py file.
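
To illustrate, a rough sketch of such a check (the stored_checksum
argument stands in for a value that the .pyc header would carry under
this proposal; adler32 is just one candidate):

from zlib import adler32

def pyc_matches_source(py_path, stored_checksum):
    # Recompute the source checksum and compare it against the value
    # that the .pyc header would store under this proposal.
    with open(py_path, 'rb') as f:
        return adler32(f.read()) == stored_checksum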


