[ python-Bugs-849046 ] gzip.GzipFile is slow
SourceForge.net
noreply at sourceforge.net
Tue Dec 23 12:10:14 EST 2003
Bugs item #849046, was opened at 2003-11-25 10:45
Message generated for change (Comment added) made by akuchling
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=849046&group_id=5470
Category: Python Library
Group: Python 2.4
Status: Open
Resolution: None
Priority: 3
Submitted By: Ronald Oussoren (ronaldoussoren)
Assigned to: Nobody/Anonymous (nobody)
Summary: gzip.GzipFile is slow
Initial Comment:
gzip.GzipFile is significantly (an order of magnitude)
slower than using the gzip binary. I've been bitten by this
several times, and have replaced "fd = gzip.open('somefile',
'r')" with "fd = os.popen('gzcat somefile', 'r')" on several
occasions.
Would a patch that implemented GzipFile in C have any
chance of being accepted?
----------------------------------------------------------------------
>Comment By: A.M. Kuchling (akuchling)
Date: 2003-12-23 12:10
Message:
Logged In: YES
user_id=11375
It should be simple to check if the string operations are responsible
-- comment out the 'self.extrabuf = self.extrabuf + data'
in _add_read_data. If that makes a big difference, then _read
should probably be building a list instead of modifying a string.
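For reference, a minimal sketch (not the gzip.py code itself) of the
difference between the two buffering strategies:

def concat_buffer(chunks):
    # mirrors 'self.extrabuf = self.extrabuf + data': every append
    # copies the whole buffer built so far, so the total cost grows
    # quadratically with the amount of data read
    buf = ''
    for data in chunks:
        buf = buf + data
    return buf

def list_buffer(chunks):
    # collect the pieces and join once at the end; no repeated copying
    parts = []
    for data in chunks:
        parts.append(data)
    return ''.join(parts)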
----------------------------------------------------------------------
Comment By: Brett Cannon (bcannon)
Date: 2003-12-04 14:51
Message:
Logged In: YES
user_id=357491
Looking at GzipFile.read and ._read, I think a large chunk of time
is burned decompressing small chunks of data. It initially
reads and decompresses 1024 bytes, and then, if that read did not
hit EOF, it doubles the read size and continues until EOF is
reached, then finishes up.
The problem is that for each read a call to _read is made that sets
up a bunch of objects. I would not be surprised if the object
creation and teardown are hurting performance. I would also
not be surprised if the reading of small chunks of data is part of
the problem as well. This is all guesswork, though, since I did not
run the profiler on this.
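A simplified sketch of the doubling strategy described above (not the
actual GzipFile code):

def read_all(fileobj, readsize=1024):
    # start with a small request and double it after every successful
    # read, so the whole stream is never requested in a single call
    parts = []
    while 1:
        chunk = fileobj.read(readsize)
        if not chunk:               # EOF
            break
        parts.append(chunk)
        readsize = readsize * 2
    return ''.join(parts)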
*But*, there might be a good reason for reading small chunks. If
you are decompressing a large file, you might run out of memory
very quickly by reading the file into memory *and* decompressing
at the same time. Reading it in successively larger chunks means
you don't hold the file's entire contents in memory at any one
time.
So the question becomes whether overloading memory and causing major
thrashing on your swap space is worth the performance increase. There
is also the option of inlining _read into 'read', but since 'read'
calls it from two places that seems like poor abstraction and would
most likely not be accepted as a solution. It might be better to keep
the objects that are used on every call to _read in a temporary
attribute and delete the attribute once the reading is done. Or maybe
allow an optional argument to 'read' that lets one specify the
initial read size (that might also be a good way to see whether any
of these ideas are reasonable; just modify the code to read the whole
thing at once and measure from there).
I am in no position to make any of these calls, though, since I never
use gzip. If someone cares to write up a patch that tries to fix any
of this, it will be considered.
----------------------------------------------------------------------
Comment By: Jim Jewett (jimjjewett)
Date: 2003-11-25 17:05
Message:
Logged In: YES
user_id=764593
In the library, I see a fair amount of work that doesn't really
do anything except make sure you're getting exactly a line at
a time.
Would it be an option to just read the whole file in at once, split it
on newlines, and then loop over the list? (Or read it into a
cStringIO, I suppose.)
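A sketch of that workaround (it assumes the decompressed file fits in
memory):

import sys
import gzip

fd = gzip.open(sys.argv[1], 'r')
data = fd.read()                    # one big decompression call
fd.close()

for ln in data.splitlines(True):    # True keeps the newline characters
    sys.stdout.write(ln)

# or, if a file-like object is needed:
#   import cStringIO
#   buf = cStringIO.StringIO(data)
#   ln = buf.readline()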
----------------------------------------------------------------------
Comment By: Ronald Oussoren (ronaldoussoren)
Date: 2003-11-25 16:12
Message:
Logged In: YES
user_id=580910
To be more precise:
$ ls -l gzippedfile
-rw-r--r-- 1 ronald admin 354581 18 Nov 10:21 gzippedfile
$ gzip -l gzippedfile
compressed  uncompr.  ratio  uncompressed_name
    354581   1403838  74.7%  gzippedfile
The file contains about 45K lines of text (about 40 characters/line)
$ time gzip -dc gzippedfile > /dev/null
real 0m0.100s
user 0m0.060s
sys 0m0.000s
$ time python read.py gzippedfile > /dev/null
real 0m3.222s
user 0m3.020s
sys 0m0.070s
$ cat read.py
#!/usr/bin/env python
import sys
import gzip

fd = gzip.open(sys.argv[1], 'r')
ln = fd.readline()
while ln:
    sys.stdout.write(ln)
    ln = fd.readline()
The difference is also significant for larger files (i.e. the
difference is not caused by different startup times).
----------------------------------------------------------------------
Comment By: Ronald Oussoren (ronaldoussoren)
Date: 2003-11-25 16:03
Message:
Logged In: YES
user_id=580910
The files are created using GzipFile. That speed is acceptable
because it happens in a batch job; reading back is the problem,
because that happens on demand and a user is waiting for the
results.
gzcat is a *decompression* utility (specifically, it is "gzip -dc"),
so the compression level is irrelevant for this discussion.
The Python code seems to do quite a lot of string manipulation;
maybe that is causing the slowdown (I'm using fd.readline() in a
fairly tight loop). I'll do some profiling to check what is taking
so much time.
BTW, I'm doing this on Unix systems (Sun Solaris and Mac OS X).
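One way to collect that profile with the stdlib profiler (a sketch
only; it assumes the readline loop from read.py above is wrapped in a
function called main()):

import profile
import pstats

profile.run('main()', 'gzip_read.prof')
stats = pstats.Stats('gzip_read.prof')
stats.sort_stats('cumulative').print_stats(20)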
----------------------------------------------------------------------
Comment By: Jim Jewett (jimjjewett)
Date: 2003-11-25 12:35
Message:
Logged In: YES
user_id=764593
Which compression level are you using?
It looks like most of the work is already done by zlib (which is in
C), but GzipFile defaults to compression level 9. Many other zips
(including your gzcat?) default to a lower (but much faster)
compression level.
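For reference, the level is an optional third argument when the file
is written (the filename and payload below are placeholders):

import gzip

data = 'some text\n' * 10000             # stand-in payload

out = gzip.open('somefile.gz', 'wb', 6)  # level 6 instead of the default 9
out.write(data)
out.close()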
----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=849046&group_id=5470