[Patches] [ python-Patches-1281707 ] Speed up gzip.readline (~40%)
SourceForge.net
noreply at sourceforge.net
Mon May 22 17:54:41 CEST 2006
Patches item #1281707, was opened at 2005-09-04 13:53
Message generated for change (Comment added) made by etrepum
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=1281707&group_id=5470
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Library (Lib)
Group: None
>Status: Pending
Resolution: None
Priority: 5
Submitted By: April King (marumari)
Assigned to: Bob Ippolito (etrepum)
Summary: Speed up gzip.readline (~40%)
Initial Comment:
See bug 849046 for history. This patch passes both the
regression test and the standard test. Hopefully the
extra information below won't be too difficult to read.
I can attach this info to the bug, if need be.
Fixed:
- Add self.min_readsize to __init__.
Follows the principal that lines are likely to be
the same length in size,
and doesn't start over at a minimum length string
every call to readline()
- Rewriting of assignment for readsize and size at
the beginning of function.
Eliminates almost all calls to min()
- Change bufs to a string, and not an array. No
point in using an array when
all you do with it is "".join(bufs). Uses string
addition instead.
- Remove extra assignments to bufs (in return())
- Changes readline() to be much more readable (loop
reordering, more comments)
Recommendations:
- Delete _unread() function. It is used _only_ by
readline(), and moving its
functionality into readline() itself saves the
function call overhead.
_unread() is only 3 lines long. Testing shows that
removing it speeds
readline() up by about 3%. Backwards compatibility
concerns?
Testing results:
test_append (__main__.TestGzip) ... ok
test_many_append (__main__.TestGzip) ... ok
test_mode (__main__.TestGzip) ... ok
test_read (__main__.TestGzip) ... ok
test_readline (__main__.TestGzip) ... ok
test_readlines (__main__.TestGzip) ... ok
test_seek_read (__main__.TestGzip) ... ok
test_seek_write (__main__.TestGzip) ... ok
test_write (__main__.TestGzip) ... ok
----------------------------------------------------------------------
Ran 9 tests in 0.331s
Regression tests:
python regrtest.py -g test_gzip.py
test_gzip
1 test OK.
---
Profiling Results (performed on a common compressed log
file - 200748 lines).
With patch...
1213961 function calls in 12.188 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall
filename:lineno(function)
1 0.000 0.000 0.000 0.000 :0(close)
1159 0.020 0.000 0.020 0.000 :0(crc32)
1158 0.100 0.000 0.100 0.000
:0(decompress)
1 0.000 0.000 0.000 0.000
:0(decompressobj)
200774 0.812 0.000 0.812 0.000 :0(find)
403865 0.902 0.000 0.902 0.000 :0(len)
1183 0.000 0.000 0.000 0.000 :0(min)
2 0.000 0.000 0.000 0.000 :0(ord)
1173 0.000 0.000 0.000 0.000 :0(read)
12 0.000 0.000 0.000 0.000 :0(seek)
1 0.000 0.000 0.000 0.000
:0(setprofile)
18 0.000 0.000 0.000 0.000 :0(tell)
2 0.000 0.000 0.000 0.000 :0(unpack)
1 0.000 0.000 12.188 12.188 <string>:1(?)
1 0.000 0.000 0.000 0.000
gzip_new.py:156(_init_read)
1 0.000 0.000 0.000 0.000
gzip_new.py:160(_read_gzip_header)
3 0.000 0.000 0.000 0.000
gzip_new.py:18(U32)
200774 2.453 0.000 2.593 0.000
gzip_new.py:207(read)
200749 2.894 0.000 3.796 0.000
gzip_new.py:239(_unread)
1166 0.010 0.000 0.140 0.000
gzip_new.py:244(_read)
1 0.000 0.000 0.000 0.000
gzip_new.py:27(LOWU32)
1158 0.010 0.000 0.030 0.000
gzip_new.py:294(_add_read_data)
1 0.000 0.000 0.000 0.000
gzip_new.py:300(_read_eof)
1 0.000 0.000 0.000 0.000
gzip_new.py:314(close)
1 0.000 0.000 0.000 0.000
gzip_new.py:327(__del__)
200749 3.916 0.000 11.117 0.000
gzip_new.py:384(readline)
2 0.000 0.000 0.000 0.000
gzip_new.py:39(read32)
1 0.000 0.000 0.000 0.000
gzip_new.py:42(open)
1 0.000 0.000 0.000 0.000
gzip_new.py:60(__init__)
1 0.000 0.000 12.188 12.188
profile:0(gunzip_gzip_new_open())
0 0.000 0.000
profile:0(profiler)
1 1.071 1.071 12.188 12.188
test_gzip_speed.py:14(gunzip_gzip_new_open)
Without patch...
2073328 function calls in 18.597 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall
filename:lineno(function)
243820 0.735 0.000 0.735 0.000 :0(append)
1 0.000 0.000 0.000 0.000 :0(close)
1159 0.040 0.000 0.040 0.000 :0(crc32)
1158 0.100 0.000 0.100 0.000
:0(decompress)
1 0.000 0.000 0.000 0.000
:0(decompressobj)
243820 0.960 0.000 0.960 0.000 :0(find)
200749 0.801 0.000 0.801 0.000 :0(join)
489958 1.330 0.000 1.330 0.000 :0(len)
243820 0.791 0.000 0.791 0.000 :0(min)
2 0.000 0.000 0.000 0.000 :0(ord)
1173 0.030 0.000 0.030 0.000 :0(read)
6 0.000 0.000 0.000 0.000 :0(seek)
1 0.000 0.000 0.000 0.000
:0(setprofile)
6 0.000 0.000 0.000 0.000 :0(tell)
2 0.000 0.000 0.000 0.000 :0(unpack)
1 0.000 0.000 18.597 18.597 <string>:1(?)
1 0.000 0.000 0.000 0.000
gzip.py:154(_init_read)
1 0.000 0.000 0.000 0.000
gzip.py:158(_read_gzip_header)
3 0.000 0.000 0.000 0.000
gzip.py:18(U32)
243820 2.711 0.000 2.921 0.000
gzip.py:205(read)
200749 3.083 0.000 4.143 0.000
gzip.py:237(_unread)
1160 0.010 0.000 0.210 0.000
gzip.py:242(_read)
1 0.000 0.000 0.000 0.000
gzip.py:27(LOWU32)
1158 0.030 0.000 0.070 0.000
gzip.py:292(_add_read_data)
1 0.000 0.000 0.000 0.000
gzip.py:298(_read_eof)
1 0.000 0.000 0.000 0.000
gzip.py:312(close)
1 0.000 0.000 0.000 0.000
gzip.py:325(__del__)
200749 6.934 0.000 17.555 0.000
gzip.py:379(readline)
2 0.000 0.000 0.000 0.000
gzip.py:39(read32)
1 0.000 0.000 0.000 0.000
gzip.py:42(open)
1 0.000 0.000 0.000 0.000
gzip.py:59(__init__)
1 0.000 0.000 18.597 18.597
profile:0(gunzip_gzip_open())
0 0.000 0.000
profile:0(profiler)
1 1.042 1.042 18.597 18.597
test_gzip_speed.py:7(gunzip_gzip_open)
Using popen + gunzip -c...
200754 function calls in 4.338 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall
filename:lineno(function)
1 0.000 0.000 0.000 0.000 :0(popen)
200749 3.578 0.000 3.578 0.000 :0(readline)
1 0.000 0.000 0.000 0.000
:0(setprofile)
1 0.240 0.240 4.338 4.338 <string>:1(?)
1 0.000 0.000 4.338 4.338
profile:0(gunzip_popen())
0 0.000 0.000
profile:0(profiler)
1 0.520 0.520 4.098 4.098
test_gzip_speed.py:21(gunzip_popen)
----------------------------------------------------------------------
>Comment By: Bob Ippolito (etrepum)
Date: 2006-05-22 11:54
Message:
Logged In: YES
user_id=139309
The attached patch uses a strategy that provides the same 30%-ish performance
boost for the benchmark, and also provides a small performance boost (about
9% or so) for the very strange log file.
The key is to use lists for buffering, and to never allow the default buffer size to
grow too large (512-ish starting point seems to be a sweet spot). This defends
against working with large strings more often than necessary.
----------------------------------------------------------------------
Comment By: April King (marumari)
Date: 2006-05-22 11:39
Message:
Logged In: YES
user_id=747439
Actually, the slow speed in that specific circumstance has
nothing to do with the fact that it uses a string style
buffer, which should always be faster than what it used
before (an array of strings that was constantly appended to.)
The problem with that particular file is how the
gzip.readline function auto-optimizes it's read size.
if readsize > self.min_readsize:
self.min_readsize = readsize
So, it optimizes it's read size to the length of the largest
line that it has seen so far. The assumption is that
gzipped files are generally going to be a bunch of lines of
similar length, and not wildly differing length.
----------------------------------------------------------------------
Comment By: Bob Ippolito (etrepum)
Date: 2006-05-22 11:29
Message:
Logged In: YES
user_id=139309
It turns out the performance difference was due to some.. interesting
characteristics for that particular log file.
>>> import gzip
>>> lengths = [len(line) for line in gzip.GzipFile('TEST.LOG')]
>>> sum(lengths) / float(len(lengths))
45.60349675165147
>>> max(lengths)
117989
>>> min(lengths)
1
The str style buffer in this particular example is going to fail miserably
reading that one long line.
----------------------------------------------------------------------
Comment By: Bob Ippolito (etrepum)
Date: 2006-05-22 11:12
Message:
Logged In: YES
user_id=139309
I'm reopening this patch -- it seems that these changes have made parsing
Apache style log files MUCH slower (4x on some samples).
----------------------------------------------------------------------
Comment By: Bob Ippolito (etrepum)
Date: 2006-05-22 10:32
Message:
Logged In: YES
user_id=139309
Applied in revision 46070
----------------------------------------------------------------------
Comment By: Bob Ippolito (etrepum)
Date: 2006-05-22 10:11
Message:
Logged In: YES
user_id=139309
This patch is about a 30% win on Mac OS X i386 using this benchmark:
http://svn.python.org/view/sandbox/trunk/gzipbench/gzipbench.py
I'm going to look to see if there's any other low hanging fruit in there before I
commit.
----------------------------------------------------------------------
Comment By: April King (marumari)
Date: 2005-09-04 13:57
Message:
Logged In: YES
user_id=747439
See attached text file for the detailed description (that's
much more readable).
----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=1281707&group_id=5470
More information about the Patches
mailing list