[issue7471] GZipFile.readline too slow

Nir <report at bugs.python.org>
Sun Dec 13 10:38:57 CET 2009


Nir <nir at winpdb.org> added the comment:

First patch, so please forgive the long comment :)

I am submitting a small patch which speeds up readline() on my data
set - a 74MB (5MB .gz) log file with 600K lines.

The speedup is 350%.

The source of the slowness is that the (~20KB) extrabuf string is 
allocated and deallocated in read() and _unread() with each call to 
readline().
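
Roughly, the pre-patch flow looks like the following sketch (a 
hypothetical paraphrase, not gzip.py verbatim, with decompression 
replaced by a stub); each of the three marked statements builds a 
brand-new string about the size of extrabuf:

class PrePatchSketch:
    def __init__(self):
        self.extrabuf = ''
        self.extrasize = 0

    def _read(self, n):
        # Stub: stands in for decompressing n more bytes from the file.
        self.extrabuf += 'x' * n
        self.extrasize += n

    def read(self, size):
        while size > self.extrasize:
            self._read(1024)
        chunk = self.extrabuf[:size]            # new ~20KB string
        self.extrabuf = self.extrabuf[size:]    # new ~20KB string
        self.extrasize -= size
        return chunk

    def _unread(self, buf):
        # readline() pushes back whatever follows the newline.
        self.extrabuf = buf + self.extrabuf     # new ~20KB string
        self.extrasize += len(buf)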

In the patch read() returns a slice from extrabuf and defers 
manipulation of extrabuf to _read().
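
One way to realize this is sketched below (again a hypothetical 
paraphrase rather than the patch itself; extrastart is an assumed 
attribute recording the stream position of extrabuf[0]): read() hands 
out one small slice and leaves extrabuf alone, and only _read() ever 
rebuilds the buffer:

class PatchedSketch:
    def __init__(self):
        self.extrabuf = ''
        self.extrasize = 0
        self.extrastart = 0   # stream position of extrabuf[0]
        self.offset = 0       # stream position of the read cursor

    def _read(self, n):
        # Stub for decompression; drops consumed data, appends new data.
        offset = self.offset - self.extrastart
        self.extrabuf = self.extrabuf[offset:] + 'x' * n
        self.extrastart = self.offset
        self.extrasize += n

    def read(self, size):
        while size > self.extrasize:
            self._read(1024)
        offset = self.offset - self.extrastart
        chunk = self.extrabuf[offset: offset + size]   # one small slice
        self.extrasize -= size
        self.offset += size
        return chunk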

In the following measurements, the first timeit() corresponds to 
returning a slice of extrabuf, while the second corresponds to read() 
and _unread() as they are done today - the head slice, the tail slice, 
and the _unread() concatenation:

>>> timeit.Timer("x[10000: 10100]", "x = 'x' * 20000").timeit()
0.25299811363220215

>>> timeit.Timer("x[: 100]; x[100:]; x[100:] + x[: 100]", "x = 'x' * 
10000").timeit()
5.843876838684082

Another speedup comes from adding a small shortcut to readline() for 
the typical case in which the entire line is already in extrabuf.
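
As a method on the patched sketch above, the shortcut could look 
roughly like this (a sketch of the idea, not the patch verbatim): if 
extrabuf already contains a newline, slice the line out directly and 
skip the generic chunked-read machinery:

    def readline(self, size=-1):
        if size < 0:
            offset = self.offset - self.extrastart
            i = self.extrabuf.find('\n', offset) + 1
            if i > 0:                      # whole line already buffered
                self.extrasize -= i - offset
                self.offset += i - offset
                return self.extrabuf[offset: i]
        # Otherwise fall back to the generic chunked-read loop.
        ...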

The patch only addresses the typical case of calling readline() with no 
arguments. It does not address other problems in readline()'s logic. In 
particular, the current 512-byte chunk size is not a sweet spot: 
regardless of the size argument passed to readline(), read() will 
continue to decompress just 1024 bytes with each call, because the size 
of extrabuf swings around the target size argument as a result of the 
interaction between _unread() and read().
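
The swing is easy to reproduce with a toy model (pure Python; the 
100-byte line length and the stand-in _read() are made up for 
illustration). Each _unread() refills extrabuf to just under the 
target, so the deficit on the next read() is always smaller than the 
initial 1024-byte readsize and the doubling never kicks in:

class SwingSketch:
    def __init__(self):
        self.extrabuf = ''
        self.decompressed = []          # size of each _read() call

    def _read(self, n):
        self.decompressed.append(n)     # stands in for decompression
        self.extrabuf += 'x' * n

    def read(self, size):
        readsize = 1024
        while size > len(self.extrabuf):
            self._read(readsize)
            readsize *= 2               # doubling that never gets used
        chunk, self.extrabuf = self.extrabuf[:size], self.extrabuf[size:]
        return chunk

    def _unread(self, buf):
        self.extrabuf = buf + self.extrabuf

s = SwingSketch()
for _ in range(1000):                   # 1000 simulated readline() calls
    c = s.read(512)                     # readline() requests a 512B chunk
    s._unread(c[100:])                  # ...but consumes only a 100B line
print(set(s.decompressed))              # {1024}: always 1KB at a time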

----------
keywords: +patch
nosy: +nirai
Added file: http://bugs.python.org/file15536/gzip_7471_patch.diff

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue7471>
_______________________________________

