[Patches] [ python-Patches-1281707 ] Speed up gzip.readline (~40%)

Mon May 22 17:54:41 CEST 2006

Patches item #1281707, was opened at 2005-09-04 13:53
Message generated for change (Comment added) made by etrepum
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=1281707&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Library (Lib)
Group: None
>Status: Pending
Resolution: None
Priority: 5
Submitted By: April King (marumari)
Assigned to: Bob Ippolito (etrepum)
Summary: Speed up gzip.readline (~40%)

Initial Comment:
See bug 849046 for history.  This patch passes both the
regression test and the standard test.  Hopefully the
extra information below won't be too difficult to read.
 I can attach this info to the bug, if need be.

Fixed:
  - Add self.min_readsize to __init__.
    Follows the principal that lines are likely to be
the same length in size,
    and doesn't start over at a minimum length string
every call to readline()
  - Rewriting of assignment for readsize and size at
the beginning of function.
    Eliminates almost all calls to min()
  - Change bufs to a string, and not an array.  No
point in using an array when
    all you do with it is "".join(bufs).  Uses string
addition instead.
  - Remove extra assignments to bufs (in return())
  - Changes readline() to be much more readable (loop
reordering, more comments)

Recommendations:
  - Delete _unread() function.  It is used _only_ by
readline(), and moving its
    functionality into readline() itself saves the
function call overhead.
    _unread() is only 3 lines long.  Testing shows that
removing it speeds
    readline() up by about 3%.  Backwards compatibility
concerns?

Testing results:
test_append (__main__.TestGzip) ... ok
test_many_append (__main__.TestGzip) ... ok
test_mode (__main__.TestGzip) ... ok
test_read (__main__.TestGzip) ... ok
test_readline (__main__.TestGzip) ... ok
test_readlines (__main__.TestGzip) ... ok
test_seek_read (__main__.TestGzip) ... ok
test_seek_write (__main__.TestGzip) ... ok
test_write (__main__.TestGzip) ... ok

----------------------------------------------------------------------
Ran 9 tests in 0.331s

Regression tests:
python regrtest.py -g test_gzip.py
test_gzip
1 test OK.

---

Profiling Results (performed on a common compressed log
file - 200748 lines).

With patch...

         1213961 function calls in 12.188 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall
filename:lineno(function)
        1    0.000    0.000    0.000    0.000 :0(close)
     1159    0.020    0.000    0.020    0.000 :0(crc32)
     1158    0.100    0.000    0.100    0.000
:0(decompress)
        1    0.000    0.000    0.000    0.000
:0(decompressobj)
   200774    0.812    0.000    0.812    0.000 :0(find)
   403865    0.902    0.000    0.902    0.000 :0(len)
     1183    0.000    0.000    0.000    0.000 :0(min)
        2    0.000    0.000    0.000    0.000 :0(ord)
     1173    0.000    0.000    0.000    0.000 :0(read)
       12    0.000    0.000    0.000    0.000 :0(seek)
        1    0.000    0.000    0.000    0.000
:0(setprofile)
       18    0.000    0.000    0.000    0.000 :0(tell)
        2    0.000    0.000    0.000    0.000 :0(unpack)
        1    0.000    0.000   12.188   12.188 <string>:1(?)
        1    0.000    0.000    0.000    0.000
gzip_new.py:156(_init_read)
        1    0.000    0.000    0.000    0.000
gzip_new.py:160(_read_gzip_header)
        3    0.000    0.000    0.000    0.000
gzip_new.py:18(U32)
   200774    2.453    0.000    2.593    0.000
gzip_new.py:207(read)
   200749    2.894    0.000    3.796    0.000
gzip_new.py:239(_unread)
     1166    0.010    0.000    0.140    0.000
gzip_new.py:244(_read)
        1    0.000    0.000    0.000    0.000
gzip_new.py:27(LOWU32)
     1158    0.010    0.000    0.030    0.000
gzip_new.py:294(_add_read_data)
        1    0.000    0.000    0.000    0.000
gzip_new.py:300(_read_eof)
        1    0.000    0.000    0.000    0.000
gzip_new.py:314(close)
        1    0.000    0.000    0.000    0.000
gzip_new.py:327(__del__)
   200749    3.916    0.000   11.117    0.000
gzip_new.py:384(readline)
        2    0.000    0.000    0.000    0.000
gzip_new.py:39(read32)
        1    0.000    0.000    0.000    0.000
gzip_new.py:42(open)
        1    0.000    0.000    0.000    0.000
gzip_new.py:60(__init__)
        1    0.000    0.000   12.188   12.188
profile:0(gunzip_gzip_new_open())
        0    0.000             0.000         
profile:0(profiler)
        1    1.071    1.071   12.188   12.188
test_gzip_speed.py:14(gunzip_gzip_new_open)

Without patch...

         2073328 function calls in 18.597 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall
filename:lineno(function)
   243820    0.735    0.000    0.735    0.000 :0(append)
        1    0.000    0.000    0.000    0.000 :0(close)
     1159    0.040    0.000    0.040    0.000 :0(crc32)
     1158    0.100    0.000    0.100    0.000
:0(decompress)
        1    0.000    0.000    0.000    0.000
:0(decompressobj)
   243820    0.960    0.000    0.960    0.000 :0(find)
   200749    0.801    0.000    0.801    0.000 :0(join)
   489958    1.330    0.000    1.330    0.000 :0(len)
   243820    0.791    0.000    0.791    0.000 :0(min)
        2    0.000    0.000    0.000    0.000 :0(ord)
     1173    0.030    0.000    0.030    0.000 :0(read)
        6    0.000    0.000    0.000    0.000 :0(seek)
        1    0.000    0.000    0.000    0.000
:0(setprofile)
        6    0.000    0.000    0.000    0.000 :0(tell)
        2    0.000    0.000    0.000    0.000 :0(unpack)
        1    0.000    0.000   18.597   18.597 <string>:1(?)
        1    0.000    0.000    0.000    0.000
gzip.py:154(_init_read)
        1    0.000    0.000    0.000    0.000
gzip.py:158(_read_gzip_header)
        3    0.000    0.000    0.000    0.000
gzip.py:18(U32)
   243820    2.711    0.000    2.921    0.000
gzip.py:205(read)
   200749    3.083    0.000    4.143    0.000
gzip.py:237(_unread)
     1160    0.010    0.000    0.210    0.000
gzip.py:242(_read)
        1    0.000    0.000    0.000    0.000
gzip.py:27(LOWU32)
     1158    0.030    0.000    0.070    0.000
gzip.py:292(_add_read_data)
        1    0.000    0.000    0.000    0.000
gzip.py:298(_read_eof)
        1    0.000    0.000    0.000    0.000
gzip.py:312(close)
        1    0.000    0.000    0.000    0.000
gzip.py:325(__del__)
   200749    6.934    0.000   17.555    0.000
gzip.py:379(readline)
        2    0.000    0.000    0.000    0.000
gzip.py:39(read32)
        1    0.000    0.000    0.000    0.000
gzip.py:42(open)
        1    0.000    0.000    0.000    0.000
gzip.py:59(__init__)
        1    0.000    0.000   18.597   18.597
profile:0(gunzip_gzip_open())
        0    0.000             0.000         
profile:0(profiler)
        1    1.042    1.042   18.597   18.597
test_gzip_speed.py:7(gunzip_gzip_open)

Using popen + gunzip -c...

         200754 function calls in 4.338 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall
filename:lineno(function)
        1    0.000    0.000    0.000    0.000 :0(popen)
   200749    3.578    0.000    3.578    0.000 :0(readline)
        1    0.000    0.000    0.000    0.000
:0(setprofile)
        1    0.240    0.240    4.338    4.338 <string>:1(?)
        1    0.000    0.000    4.338    4.338
profile:0(gunzip_popen())
        0    0.000             0.000         
profile:0(profiler)
        1    0.520    0.520    4.098    4.098
test_gzip_speed.py:21(gunzip_popen)

----------------------------------------------------------------------

>Comment By: Bob Ippolito (etrepum)
Date: 2006-05-22 11:54

Message:
Logged In: YES 
user_id=139309

The attached patch uses a strategy that provides the same 30%-ish performance 
boost for the benchmark, and also provides a small performance boost (about 
9% or so) for the very strange log file.

The key is to use lists for buffering, and to never allow the default buffer size to 
grow too large (512-ish starting point seems to be a sweet spot). This defends 
against working with large strings more often than necessary.

----------------------------------------------------------------------

Comment By: April King (marumari)
Date: 2006-05-22 11:39

Message:
Logged In: YES 
user_id=747439

Actually, the slow speed in that specific circumstance has
nothing to do with the fact that it uses a string style
buffer, which should always be faster than what it used
before (an array of strings that was constantly appended to.)

The problem with that particular file is how the
gzip.readline function auto-optimizes it's read size.

if readsize > self.min_readsize:
  self.min_readsize = readsize

So, it optimizes it's read size to the length of the largest
line that it has seen so far.  The assumption is that
gzipped files are generally going to be a bunch of lines of
similar length, and not wildly differing length.

----------------------------------------------------------------------

Comment By: Bob Ippolito (etrepum)
Date: 2006-05-22 11:29

Message:
Logged In: YES 
user_id=139309

It turns out the performance difference was due to some.. interesting 
characteristics for that particular log file.

>>> import gzip
>>> lengths = [len(line) for line in gzip.GzipFile('TEST.LOG')]
>>> sum(lengths) / float(len(lengths))
45.60349675165147
>>> max(lengths)
117989
>>> min(lengths)
1

The str style buffer in this particular example is going to fail miserably 
reading that one long line.

----------------------------------------------------------------------

Comment By: Bob Ippolito (etrepum)
Date: 2006-05-22 11:12

Message:
Logged In: YES 
user_id=139309

I'm reopening this patch -- it seems that these changes have made parsing 
Apache style log files MUCH slower (4x on some samples).

----------------------------------------------------------------------

Comment By: Bob Ippolito (etrepum)
Date: 2006-05-22 10:32

Message:
Logged In: YES 
user_id=139309

Applied in revision 46070

----------------------------------------------------------------------

Comment By: Bob Ippolito (etrepum)
Date: 2006-05-22 10:11

Message:
Logged In: YES 
user_id=139309

This patch is about a 30% win on Mac OS X i386 using this benchmark:
http://svn.python.org/view/sandbox/trunk/gzipbench/gzipbench.py

I'm going to look to see if there's any other low hanging fruit in there before I 
commit.

----------------------------------------------------------------------

Comment By: April King (marumari)
Date: 2005-09-04 13:57

Message:
Logged In: YES 
user_id=747439

See attached text file for the detailed description (that's
much more readable).

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=1281707&group_id=5470