[Python-checkins] bpo-34043: Optimize tarfile uncompress performance (GH-8089)

INADA Naoki webhook-mailer at python.org
Fri Jul 6 01:06:04 EDT 2018


https://github.com/python/cpython/commit/8d130913cb9359c01de412178f9942419e921170
commit: 8d130913cb9359c01de412178f9942419e921170
branch: master
author: INADA Naoki <methane at users.noreply.github.com>
committer: GitHub <noreply at github.com>
date: 2018-07-06T14:06:00+09:00
summary:

bpo-34043: Optimize tarfile uncompress performance (GH-8089)

tarfile._Stream has two buffers, one for compressed data and one for
uncompressed data. These buffers are not aligned, so unnecessary bytes
slicing happens for every chunk that is read.

This commit bypasses the compressed-data buffering.

In this benchmark [1], user time dropped from 300ms to 250ms.

[1]: https://bugs.python.org/msg320763
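
The difference between the two read paths can be illustrated with a
simplified model. This is only a sketch, not the actual tarfile code:
the function names, BUFSIZE, and the use of a bare zlib stream are
assumptions made for illustration. The first function pushes compressed
bytes through an intermediate buffer and slices it on every iteration;
the second consumes any leftover bytes once and then feeds each block
from the file straight to the decompressor.

```python
import io
import zlib

BUFSIZE = 4096  # hypothetical block size for this sketch


def decompress_double_buffered(fileobj, leftover=b""):
    """Model of the old path: compressed bytes go through an extra buffer."""
    decomp = zlib.decompressobj()
    pending = leftover
    out = []
    while True:
        # Refill the intermediate buffer until it holds at least BUFSIZE bytes.
        while len(pending) < BUFSIZE:
            block = fileobj.read(BUFSIZE)
            if not block:
                break
            pending += block
        if not pending:
            break
        # When `pending` is not a multiple of BUFSIZE, both slices copy bytes
        # on every iteration.
        chunk, pending = pending[:BUFSIZE], pending[BUFSIZE:]
        out.append(decomp.decompress(chunk))
    out.append(decomp.flush())
    return b"".join(out)


def decompress_direct(fileobj, leftover=b""):
    """Model of the new path: use leftover bytes once, then read directly."""
    decomp = zlib.decompressobj()
    pending = leftover
    out = []
    while True:
        if pending:
            chunk, pending = pending, b""
        else:
            chunk = fileobj.read(BUFSIZE)
            if not chunk:
                break
        out.append(decomp.decompress(chunk))
    out.append(decomp.flush())
    return b"".join(out)


# Sample data: a few leading bytes are treated as "already buffered" to
# mimic a misaligned starting state.
payload = b"".join(str(i).encode() for i in range(20000))
compressed = zlib.compress(payload)
head, tail = compressed[:10], compressed[10:]

old_result = decompress_double_buffered(io.BytesIO(tail), head)
new_result = decompress_direct(io.BytesIO(tail), head)
```

Both functions produce the same decompressed bytes; the direct variant
simply avoids the per-chunk slicing of the intermediate buffer.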

files:
A Misc/NEWS.d/next/Library/2018-07-04-21-14-35.bpo-34043.0YJNq9.rst
M Lib/tarfile.py

diff --git a/Lib/tarfile.py b/Lib/tarfile.py
index 59f044cc5a00..ba3e95f281df 100755
--- a/Lib/tarfile.py
+++ b/Lib/tarfile.py
@@ -513,21 +513,10 @@ def seek(self, pos=0):
             raise StreamError("seeking backwards is not allowed")
         return self.pos
 
-    def read(self, size=None):
-        """Return the next size number of bytes from the stream.
-           If size is not defined, return all bytes of the stream
-           up to EOF.
-        """
-        if size is None:
-            t = []
-            while True:
-                buf = self._read(self.bufsize)
-                if not buf:
-                    break
-                t.append(buf)
-            buf = b"".join(t)
-        else:
-            buf = self._read(size)
+    def read(self, size):
+        """Return the next size number of bytes from the stream."""
+        assert size is not None
+        buf = self._read(size)
         self.pos += len(buf)
         return buf
 
@@ -540,9 +529,14 @@ def _read(self, size):
         c = len(self.dbuf)
         t = [self.dbuf]
         while c < size:
-            buf = self.__read(self.bufsize)
-            if not buf:
-                break
+            # Skip underlying buffer to avoid unaligned double buffering.
+            if self.buf:
+                buf = self.buf
+                self.buf = b""
+            else:
+                buf = self.fileobj.read(self.bufsize)
+                if not buf:
+                    break
             try:
                 buf = self.cmp.decompress(buf)
             except self.exception:
diff --git a/Misc/NEWS.d/next/Library/2018-07-04-21-14-35.bpo-34043.0YJNq9.rst b/Misc/NEWS.d/next/Library/2018-07-04-21-14-35.bpo-34043.0YJNq9.rst
new file mode 100644
index 000000000000..c035ba7275f8
--- /dev/null
+++ b/Misc/NEWS.d/next/Library/2018-07-04-21-14-35.bpo-34043.0YJNq9.rst
@@ -0,0 +1 @@
+Optimize tarfile uncompress performance about 15% when gzip is used.


