[New-bugs-announce] [issue34010] tarfile stream read performance

Sat Jun 30 05:27:00 EDT 2018

New submission from hajoscher <hajoscher at gmail.com>:

Buffer read of large files in a compressed tarfile stream performs poorly.

The buffered read in tarfile _Stream is extending a bytes object. 
It is much more efficient to use a list followed by a join. 
Using a list can mean seconds instead of minutes. 

This performance regression was introduced in b506dc32c1a. 

How to test:
# create random tarfile 50Mb
dd if=/dev/urandom of=test.bin count=50 bs=1M
tar czvf test.tgz test.bin

# read with tarfile as stream (note pipe symbol in 'r|gz')
import tarfile
tfile = tarfile.open("test.tgz", 'r|gz')
for t in tfile:
    file = tfile.extractfile(t)
    if file:
        print(len(file.read()))

----------
components: Library (Lib)
messages: 320763
nosy: hajoscher
priority: normal
severity: normal
status: open
title: tarfile stream read performance
type: performance
versions: Python 3.4, Python 3.5, Python 3.6, Python 3.7, Python 3.8

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue34010>
_______________________________________