[Python-checkins] (no subject)
Lumír 'Frenzy' Balhar
webhook-mailer at python.org
Thu May 14 10:17:31 EDT 2020
To: python-checkins at python.org
Subject:
bpo-40495: compileall option to hardlink duplicate pyc files (GH-19901)
https://github.com/python/cpython/commit/e77d428856fbd339faee44ff47214eda5fb51d57
commit: e77d428856fbd339faee44ff47214eda5fb51d57
branch: master
author: Lumír 'Frenzy' Balhar <lbalhar at redhat.com>
committer: GitHub <noreply at github.com>
date: 2020-05-14T16:17:22+02:00
summary:
bpo-40495: compileall option to hardlink duplicate pyc files (GH-19901)
compileall is now able to use hardlinks to prevent duplicates in a
case when .pyc files for different optimization levels have the same content.
Co-authored-by: Miro Hrončok <miro at hroncok.cz>
Co-authored-by: Victor Stinner <vstinner at python.org>
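A minimal sketch of how the new parameter might be exercised from Python 3.9 or later, assuming a filesystem that supports hard links. The helper name compile_with_hardlinks and the throwaway script.py are illustrative only and are not part of the commit:

```python
import compileall
import importlib.util
import os
import tempfile


def compile_with_hardlinks(source_text="a = 0\n"):
    """Compile a throwaway module at levels 0, 1 and 2; return the pyc inodes.

    Hypothetical helper for illustration; not part of compileall.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = os.path.join(tmp, "script.py")
        with open(script, "w") as f:
            f.write(source_text)

        # hardlink_dupes only makes sense with more than one optimization
        # level; with a single level compileall raises ValueError.
        compileall.compile_dir(tmp, quiet=1, optimize=[0, 1, 2],
                               hardlink_dupes=True)

        # Level 0 maps to the empty optimization tag in the pyc file name.
        pycs = [importlib.util.cache_from_source(script, optimization=opt)
                for opt in ("", 1, 2)]
        return [os.stat(pyc).st_ino for pyc in pycs]


if __name__ == "__main__":
    print(compile_with_hardlinks())
```

For a module with no docstring and no assert statement, the three inode numbers are expected to be equal, which mirrors what the commit's own test_hardlink checks.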
files:
A Misc/NEWS.d/next/Library/2020-05-04-11-20-49.bpo-40495.TyTc2O.rst
M Doc/library/compileall.rst
M Doc/whatsnew/3.9.rst
M Lib/compileall.py
M Lib/test/test_compileall.py
M Misc/ACKS
diff --git a/Doc/library/compileall.rst b/Doc/library/compileall.rst
index b1ae9d60e8ae1..a511c7eda265b 100644
--- a/Doc/library/compileall.rst
+++ b/Doc/library/compileall.rst
@@ -113,6 +113,11 @@ compile Python sources.

Ignore symlinks pointing outside the given directory.

+.. cmdoption:: --hardlink-dupes
+
+ If two ``.pyc`` files with different optimization level have
+ the same content, use hard links to consolidate duplicate files.
+
.. versionchanged:: 3.2
Added the ``-i``, ``-b`` and ``-h`` options.

@@ -125,7 +130,7 @@ compile Python sources.
Added the ``--invalidation-mode`` option.

.. versionchanged:: 3.9
- Added the ``-s``, ``-p``, ``-e`` options.
+ Added the ``-s``, ``-p``, ``-e`` and ``--hardlink-dupes`` options.
Raised the default recursion limit from 10 to
:py:func:`sys.getrecursionlimit()`.
Added the possibility to specify the ``-o`` option multiple times.
@@ -143,7 +148,7 @@ runtime.
Public functions
----------------

-.. function:: compile_dir(dir, maxlevels=sys.getrecursionlimit(), ddir=None, force=False, rx=None, quiet=0, legacy=False, optimize=-1, workers=1, invalidation_mode=None, \*, stripdir=None, prependdir=None, limit_sl_dest=None)
+.. function:: compile_dir(dir, maxlevels=sys.getrecursionlimit(), ddir=None, force=False, rx=None, quiet=0, legacy=False, optimize=-1, workers=1, invalidation_mode=None, \*, stripdir=None, prependdir=None, limit_sl_dest=None, hardlink_dupes=False)

Recursively descend the directory tree named by *dir*, compiling all :file:`.py`
files along the way. Return a true value if all the files compiled successfully,
@@ -193,6 +198,9 @@ Public functions
the ``-s``, ``-p`` and ``-e`` options described above.
They may be specified as ``str``, ``bytes`` or :py:class:`os.PathLike`.

+ If *hardlink_dupes* is true and two ``.pyc`` files with different optimization
+ level have the same content, use hard links to consolidate duplicate files.
+
.. versionchanged:: 3.2
Added the *legacy* and *optimize* parameter.

@@ -219,9 +227,9 @@ Public functions
Setting *workers* to 0 now chooses the optimal number of cores.

.. versionchanged:: 3.9
- Added *stripdir*, *prependdir* and *limit_sl_dest* arguments.
+ Added *stripdir*, *prependdir*, *limit_sl_dest* and *hardlink_dupes* arguments.

-.. function:: compile_file(fullname, ddir=None, force=False, rx=None, quiet=0, legacy=False, optimize=-1, invalidation_mode=None, \*, stripdir=None, prependdir=None, limit_sl_dest=None)
+.. function:: compile_file(fullname, ddir=None, force=False, rx=None, quiet=0, legacy=False, optimize=-1, invalidation_mode=None, \*, stripdir=None, prependdir=None, limit_sl_dest=None, hardlink_dupes=False)

Compile the file with path *fullname*. Return a true value if the file
compiled successfully, and a false value otherwise.
@@ -257,6 +265,9 @@ Public functions
the ``-s``, ``-p`` and ``-e`` options described above.
They may be specified as ``str``, ``bytes`` or :py:class:`os.PathLike`.

+ If *hardlink_dupes* is true and two ``.pyc`` files with different optimization
+ level have the same content, use hard links to consolidate duplicate files.
+
.. versionadded:: 3.2

.. versionchanged:: 3.5
@@ -273,7 +284,7 @@ Public functions
The *invalidation_mode* parameter's default value is updated to None.

.. versionchanged:: 3.9
- Added *stripdir*, *prependdir* and *limit_sl_dest* arguments.
+ Added *stripdir*, *prependdir*, *limit_sl_dest* and *hardlink_dupes* arguments.

.. function:: compile_path(skip_curdir=True, maxlevels=0, force=False, quiet=0, legacy=False, optimize=-1, invalidation_mode=None)

diff --git a/Doc/whatsnew/3.9.rst b/Doc/whatsnew/3.9.rst
index 2fec790fe3a63..fbad0fba20f4b 100644
--- a/Doc/whatsnew/3.9.rst
+++ b/Doc/whatsnew/3.9.rst
@@ -245,6 +245,16 @@ that schedules a shutdown for the default executor that waits on the
Added :class:`asyncio.PidfdChildWatcher`, a Linux-specific child watcher
implementation that polls process file descriptors. (:issue:`38692`)

+compileall
+----------
+
+Added new possibility to use hardlinks for duplicated ``.pyc`` files: *hardlink_dupes* parameter and --hardlink-dupes command line option.
+(Contributed by Lumír 'Frenzy' Balhar in :issue:`40495`.)
+
+Added new options for path manipulation in resulting ``.pyc`` files: *stripdir*, *prependdir*, *limit_sl_dest* parameters and -s, -p, -e command line options.
+Added the possibility to specify the option for an optimization level multiple times.
+(Contributed by Lumír 'Frenzy' Balhar in :issue:`38112`.)
+
concurrent.futures
------------------

diff --git a/Lib/compileall.py b/Lib/compileall.py
index abe6cffce59c5..fe7f450c55e1c 100644
--- a/Lib/compileall.py
+++ b/Lib/compileall.py
@@ -15,6 +15,7 @@
import importlib.util
import py_compile
import struct
+import filecmp

from functools import partial
from pathlib import Path
@@ -47,7 +48,7 @@ def _walk_dir(dir, maxlevels, quiet=0):
def compile_dir(dir, maxlevels=None, ddir=None, force=False,
rx=None, quiet=0, legacy=False, optimize=-1, workers=1,
invalidation_mode=None, *, stripdir=None,
- prependdir=None, limit_sl_dest=None):
+ prependdir=None, limit_sl_dest=None, hardlink_dupes=False):
"""Byte-compile all modules in the given directory tree.

Arguments (only dir is required):
@@ -70,6 +71,7 @@ def compile_dir(dir, maxlevels=None, ddir=None, force=False,
after stripdir
limit_sl_dest: ignore symlinks if they are pointing outside of
the defined path
+ hardlink_dupes: hardlink duplicated pyc files
"""
ProcessPoolExecutor = None
if ddir is not None and (stripdir is not None or prependdir is not None):
@@ -104,7 +106,8 @@ def compile_dir(dir, maxlevels=None, ddir=None, force=False,
invalidation_mode=invalidation_mode,
stripdir=stripdir,
prependdir=prependdir,
- limit_sl_dest=limit_sl_dest),
+ limit_sl_dest=limit_sl_dest,
+ hardlink_dupes=hardlink_dupes),
files)
success = min(results, default=True)
else:
@@ -112,14 +115,15 @@ def compile_dir(dir, maxlevels=None, ddir=None, force=False,
if not compile_file(file, ddir, force, rx, quiet,
legacy, optimize, invalidation_mode,
stripdir=stripdir, prependdir=prependdir,
- limit_sl_dest=limit_sl_dest):
+ limit_sl_dest=limit_sl_dest,
+ hardlink_dupes=hardlink_dupes):
success = False
return success

def compile_file(fullname, ddir=None, force=False, rx=None, quiet=0,
legacy=False, optimize=-1,
invalidation_mode=None, *, stripdir=None, prependdir=None,
- limit_sl_dest=None):
+ limit_sl_dest=None, hardlink_dupes=False):
"""Byte-compile one file.

Arguments (only fullname is required):
@@ -140,6 +144,7 @@ def compile_file(fullname, ddir=None, force=False, rx=None, quiet=0,
after stripdir
limit_sl_dest: ignore symlinks if they are pointing outside of
the defined path.
+ hardlink_dupes: hardlink duplicated pyc files
"""

if ddir is not None and (stripdir is not None or prependdir is not None):
@@ -176,6 +181,14 @@ def compile_file(fullname, ddir=None, force=False, rx=None, quiet=0,
if isinstance(optimize, int):
optimize = [optimize]

+ # Use set() to remove duplicates.
+ # Use sorted() to create pyc files in a deterministic order.
+ optimize = sorted(set(optimize))
+
+ if hardlink_dupes and len(optimize) < 2:
+ raise ValueError("Hardlinking of duplicated bytecode makes sense "
+ "only for more than one optimization level")
+
if rx is not None:
mo = rx.search(fullname)
if mo:
@@ -220,10 +233,16 @@ def compile_file(fullname, ddir=None, force=False, rx=None, quiet=0,
if not quiet:
print('Compiling {!r}...'.format(fullname))
try:
- for opt_level, cfile in opt_cfiles.items():
+ for index, opt_level in enumerate(optimize):
+ cfile = opt_cfiles[opt_level]
ok = py_compile.compile(fullname, cfile, dfile, True,
optimize=opt_level,
invalidation_mode=invalidation_mode)
+ if index > 0 and hardlink_dupes:
+ previous_cfile = opt_cfiles[optimize[index - 1]]
+ if filecmp.cmp(cfile, previous_cfile, shallow=False):
+ os.unlink(cfile)
+ os.link(previous_cfile, cfile)
except py_compile.PyCompileError as err:
success = False
if quiet >= 2:
@@ -352,6 +371,9 @@ def main():
'Python interpreter itself (specified by -O).'))
parser.add_argument('-e', metavar='DIR', dest='limit_sl_dest',
help='Ignore symlinks pointing outsite of the DIR')
+ parser.add_argument('--hardlink-dupes', action='store_true',
+ dest='hardlink_dupes',
+ help='Hardlink duplicated pyc files')

args = parser.parse_args()
compile_dests = args.compile_dest
@@ -371,6 +393,10 @@ def main():
if args.opt_levels is None:
args.opt_levels = [-1]

+ if len(args.opt_levels) == 1 and args.hardlink_dupes:
+ parser.error(("Hardlinking of duplicated bytecode makes sense "
+ "only for more than one optimization level."))
+
if args.ddir is not None and (
args.stripdir is not None or args.prependdir is not None
):
@@ -404,7 +430,8 @@ def main():
stripdir=args.stripdir,
prependdir=args.prependdir,
optimize=args.opt_levels,
- limit_sl_dest=args.limit_sl_dest):
+ limit_sl_dest=args.limit_sl_dest,
+ hardlink_dupes=args.hardlink_dupes):
success = False
else:
if not compile_dir(dest, maxlevels, args.ddir,
@@ -414,7 +441,8 @@ def main():
stripdir=args.stripdir,
prependdir=args.prependdir,
optimize=args.opt_levels,
- limit_sl_dest=args.limit_sl_dest):
+ limit_sl_dest=args.limit_sl_dest,
+ hardlink_dupes=args.hardlink_dupes):
success = False
return success
else:
diff --git a/Lib/test/test_compileall.py b/Lib/test/test_compileall.py
index 72678945089f2..b4061b79357b8 100644
--- a/Lib/test/test_compileall.py
+++ b/Lib/test/test_compileall.py
@@ -1,16 +1,19 @@
-import sys
import compileall
+import contextlib
+import filecmp
import importlib.util
-import test.test_importlib.util
+import io
+import itertools
import os
import pathlib
import py_compile
import shutil
import struct
+import sys
import tempfile
+import test.test_importlib.util
import time
import unittest
-import io

from unittest import mock, skipUnless
try:
@@ -26,6 +29,24 @@
from .test_py_compile import SourceDateEpochTestMeta


+def get_pyc(script, opt):
+ if not opt:
+ # Replace None and 0 with ''
+ opt = ''
+ return importlib.util.cache_from_source(script, optimization=opt)
+
+
+def get_pycs(script):
+ return [get_pyc(script, opt) for opt in (0, 1, 2)]
+
+
+def is_hardlink(filename1, filename2):
+ """Returns True if two files have the same inode (hardlink)"""
+ inode1 = os.stat(filename1).st_ino
+ inode2 = os.stat(filename2).st_ino
+ return inode1 == inode2
+
+
class CompileallTestsBase:

def setUp(self):
@@ -825,6 +846,32 @@ def test_ignore_symlink_destination(self):
self.assertTrue(os.path.isfile(allowed_bc))
self.assertFalse(os.path.isfile(prohibited_bc))

+ def test_hardlink_bad_args(self):
+ # Bad arguments combination, hardlink deduplication make sense
+ # only for more than one optimization level
+ self.assertRunNotOK(self.directory, "-o 1", "--hardlink-dupes")
+
+ def test_hardlink(self):
+ # 'a = 0' code produces the same bytecode for the 3 optimization
+ # levels. All three .pyc files must have the same inode (hardlinks).
+ #
+ # If deduplication is disabled, all pyc files must have different
+ # inodes.
+ for dedup in (True, False):
+ with tempfile.TemporaryDirectory() as path:
+ with self.subTest(dedup=dedup):
+ script = script_helper.make_script(path, "script", "a = 0")
+ pycs = get_pycs(script)
+
+ args = ["-q", "-o 0", "-o 1", "-o 2"]
+ if dedup:
+ args.append("--hardlink-dupes")
+ self.assertRunOK(path, *args)
+
+ self.assertEqual(is_hardlink(pycs[0], pycs[1]), dedup)
+ self.assertEqual(is_hardlink(pycs[1], pycs[2]), dedup)
+ self.assertEqual(is_hardlink(pycs[0], pycs[2]), dedup)
+

class CommandLineTestsWithSourceEpoch(CommandLineTestsBase,
unittest.TestCase,
@@ -841,5 +888,176 @@ class CommandLineTestsNoSourceEpoch(CommandLineTestsBase,



+class HardlinkDedupTestsBase:
+ # Test hardlink_dupes parameter of compileall.compile_dir()
+
+ def setUp(self):
+ self.path = None
+
+ @contextlib.contextmanager
+ def temporary_directory(self):
+ with tempfile.TemporaryDirectory() as path:
+ self.path = path
+ yield path
+ self.path = None
+
+ def make_script(self, code, name="script"):
+ return script_helper.make_script(self.path, name, code)
+
+ def compile_dir(self, *, dedup=True, optimize=(0, 1, 2), force=False):
+ compileall.compile_dir(self.path, quiet=True, optimize=optimize,
+ hardlink_dupes=dedup, force=force)
+
+ def test_bad_args(self):
+ # Bad arguments combination, hardlink deduplication make sense
+ # only for more than one optimization level
+ with self.temporary_directory():
+ self.make_script("pass")
+ with self.assertRaises(ValueError):
+ compileall.compile_dir(self.path, quiet=True, optimize=0,
+ hardlink_dupes=True)
+ with self.assertRaises(ValueError):
+ # same optimization level specified twice:
+ # compile_dir() removes duplicates
+ compileall.compile_dir(self.path, quiet=True, optimize=[0, 0],
+ hardlink_dupes=True)
+
+ def create_code(self, docstring=False, assertion=False):
+ lines = []
+ if docstring:
+ lines.append("'module docstring'")
+ lines.append('x = 1')
+ if assertion:
+ lines.append("assert x == 1")
+ return '\n'.join(lines)
+
+ def iter_codes(self):
+ for docstring in (False, True):
+ for assertion in (False, True):
+ code = self.create_code(docstring=docstring, assertion=assertion)
+ yield (code, docstring, assertion)
+
+ def test_disabled(self):
+ # Deduplication disabled, no hardlinks
+ for code, docstring, assertion in self.iter_codes():
+ with self.subTest(docstring=docstring, assertion=assertion):
+ with self.temporary_directory():
+ script = self.make_script(code)
+ pycs = get_pycs(script)
+ self.compile_dir(dedup=False)
+ self.assertFalse(is_hardlink(pycs[0], pycs[1]))
+ self.assertFalse(is_hardlink(pycs[0], pycs[2]))
+ self.assertFalse(is_hardlink(pycs[1], pycs[2]))
+
+ def check_hardlinks(self, script, docstring=False, assertion=False):
+ pycs = get_pycs(script)
+ self.assertEqual(is_hardlink(pycs[0], pycs[1]),
+ not assertion)
+ self.assertEqual(is_hardlink(pycs[0], pycs[2]),
+ not assertion and not docstring)
+ self.assertEqual(is_hardlink(pycs[1], pycs[2]),
+ not docstring)
+
+ def test_hardlink(self):
+ # Test deduplication on all combinations
+ for code, docstring, assertion in self.iter_codes():
+ with self.subTest(docstring=docstring, assertion=assertion):
+ with self.temporary_directory():
+ script = self.make_script(code)
+ self.compile_dir()
+ self.check_hardlinks(script, docstring, assertion)
+
+ def test_only_two_levels(self):
+ # Don't build the 3 optimization levels, but only 2
+ for opts in ((0, 1), (1, 2), (0, 2)):
+ with self.subTest(opts=opts):
+ with self.temporary_directory():
+ # code with no dostring and no assertion:
+ # same bytecode for all optimization levels
+ script = self.make_script(self.create_code())
+ self.compile_dir(optimize=opts)
+ pyc1 = get_pyc(script, opts[0])
+ pyc2 = get_pyc(script, opts[1])
+ self.assertTrue(is_hardlink(pyc1, pyc2))
+
+ def test_duplicated_levels(self):
+ # compile_dir() must not fail if optimize contains duplicated
+ # optimization levels and/or if optimization levels are not sorted.
+ with self.temporary_directory():
+ # code with no dostring and no assertion:
+ # same bytecode for all optimization levels
+ script = self.make_script(self.create_code())
+ self.compile_dir(optimize=[1, 0, 1, 0])
+ pyc1 = get_pyc(script, 0)
+ pyc2 = get_pyc(script, 1)
+ self.assertTrue(is_hardlink(pyc1, pyc2))
+
+ def test_recompilation(self):
+ # Test compile_dir() when pyc files already exists and the script
+ # content changed
+ with self.temporary_directory():
+ script = self.make_script("a = 0")
+ self.compile_dir()
+ # All three levels have the same inode
+ self.check_hardlinks(script)
+
+ pycs = get_pycs(script)
+ inode = os.stat(pycs[0]).st_ino
+
+ # Change of the module content
+ script = self.make_script("print(0)")
+
+ # Recompilation without -o 1
+ self.compile_dir(optimize=[0, 2], force=True)
+
+ # opt-1.pyc should have the same inode as before and others should not
+ self.assertEqual(inode, os.stat(pycs[1]).st_ino)
+ self.assertTrue(is_hardlink(pycs[0], pycs[2]))
+ self.assertNotEqual(inode, os.stat(pycs[2]).st_ino)
+ # opt-1.pyc and opt-2.pyc have different content
+ self.assertFalse(filecmp.cmp(pycs[1], pycs[2], shallow=True))
+
+ def test_import(self):
+ # Test that import updates a single pyc file when pyc files already
+ # exists and the script content changed
+ with self.temporary_directory():
+ script = self.make_script(self.create_code(), name="module")
+ self.compile_dir()
+ # All three levels have the same inode
+ self.check_hardlinks(script)
+
+ pycs = get_pycs(script)
+ inode = os.stat(pycs[0]).st_ino
+
+ # Change of the module content
+ script = self.make_script("print(0)", name="module")
+
+ # Import the module in Python with -O (optimization level 1)
+ script_helper.assert_python_ok(
+ "-O", "-c", "import module", __isolated=False, PYTHONPATH=self.path
+ )
+
+ # Only opt-1.pyc is changed
+ self.assertEqual(inode, os.stat(pycs[0]).st_ino)
+ self.assertEqual(inode, os.stat(pycs[2]).st_ino)
+ self.assertFalse(is_hardlink(pycs[1], pycs[2]))
+ # opt-1.pyc and opt-2.pyc have different content
+ self.assertFalse(filecmp.cmp(pycs[1], pycs[2], shallow=True))
+
+
+class HardlinkDedupTestsWithSourceEpoch(HardlinkDedupTestsBase,
+ unittest.TestCase,
+ metaclass=SourceDateEpochTestMeta,
+ source_date_epoch=True):
+ pass
+
+
+class HardlinkDedupTestsNoSourceEpoch(HardlinkDedupTestsBase,
+ unittest.TestCase,
+ metaclass=SourceDateEpochTestMeta,
+ source_date_epoch=False):
+ pass
+
+
if __name__ == "__main__":
unittest.main()
diff --git a/Misc/ACKS b/Misc/ACKS
index f744de6b1f66d..b479aa5d807f5 100644
--- a/Misc/ACKS
+++ b/Misc/ACKS
@@ -86,6 +86,7 @@ Marcin Bachry
Alfonso Baciero
Dwayne Bailey
Stig Bakken
+Lumír Balhar
Aleksandr Balezin
Greg Ball
Lewis Ball
diff --git a/Misc/NEWS.d/next/Library/2020-05-04-11-20-49.bpo-40495.TyTc2O.rst b/Misc/NEWS.d/next/Library/2020-05-04-11-20-49.bpo-40495.TyTc2O.rst
new file mode 100644
index 0000000000000..d3049b05a78b6
--- /dev/null
+++ b/Misc/NEWS.d/next/Library/2020-05-04-11-20-49.bpo-40495.TyTc2O.rst
@@ -0,0 +1,2 @@
+:mod:`compileall` is now able to use hardlinks to prevent duplicates in a
+case when ``.pyc`` files for different optimization levels have the same content.
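
The deduplication step in the Lib/compileall.py hunk above amounts to a byte-for-byte comparison followed by an unlink and a link. A standalone sketch of that idea on arbitrary files, assuming a filesystem that supports hard links; the names link_if_identical and demo are hypothetical and do not appear in the commit:

```python
import filecmp
import os
import tempfile


def link_if_identical(previous_path, current_path):
    """Replace current_path with a hard link to previous_path if bytes match.

    Returns True when the link was created, False when the contents differ.
    """
    # shallow=False forces a content comparison rather than trusting
    # os.stat() signatures (size and modification time).
    if filecmp.cmp(current_path, previous_path, shallow=False):
        os.unlink(current_path)
        os.link(previous_path, current_path)
        return True
    return False


def demo():
    """Write three small files and try to deduplicate two pairs."""
    with tempfile.TemporaryDirectory() as tmp:
        a = os.path.join(tmp, "a.bin")
        b = os.path.join(tmp, "b.bin")
        c = os.path.join(tmp, "c.bin")
        for path, payload in ((a, b"same"), (b, b"same"), (c, b"other")):
            with open(path, "wb") as f:
                f.write(payload)

        linked_ab = link_if_identical(a, b)
        linked_ac = link_if_identical(a, c)
        same_inode_ab = os.stat(a).st_ino == os.stat(b).st_ino
        same_inode_ac = os.stat(a).st_ino == os.stat(c).st_ino
        return linked_ab, linked_ac, same_inode_ab, same_inode_ac


if __name__ == "__main__":
    print(demo())
```

Because the patch sorts the optimization levels and compares each file only with the one from the previous level, a chain of identical files ends up sharing a single inode.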