really slow gzip decompress, why?

Jeff McNeil jeff at jmcneil.net
Mon Jan 26 11:02:55 EST 2009


On Jan 26, 10:51 am, Jeff McNeil <j... at jmcneil.net> wrote:
> On Jan 26, 10:22 am, redbaron <ivanov.ma... at gmail.com> wrote:
>
> > I have one big (6.9 GB) .gz file with text inside it.
> > zcat bigfile.gz > /dev/null does the job in 4 minutes 50 seconds.
>
> > Python code has been doing the same job for 25 minutes and still
> > hasn't finished =( The code is the simplest I could imagine:
>
> > def main():
> >   fh = gzip.open(sys.argv[1])
> >   all(fh)
>
> > As far as I understand, most of the time it executes C code, so no
> > Python overhead should be noticeable. Why is it so slow?
>
> Look at what's happening in both operations. The zcat operation is
> simply uncompressing your data and dumping it directly to /dev/null.
> Nothing is done with the data as it's uncompressed.
>
> On the other hand, when you call 'all(fh)', you're iterating through
> every line in bigfile.gz.  In other words, you're reading the
> file and scanning it for newlines versus simply running the
> decompression operation.
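
To make that concrete, here is a rough sketch of what 'all(fh)' amounts to. It is not the actual implementation of all() or of the gzip module. The helper name consume_like_all is made up for illustration, it assumes Python 3, and the sample data is generated in memory rather than read from bigfile.gz.

```python
import gzip
import io

def consume_like_all(fh):
    """Roughly what all(fh) does: pull one line at a time and stop at
    the first falsy one. Returns (result, number_of_lines_pulled)."""
    lines_seen = 0
    for line in fh:        # each step asks the gzip file object for the next line
        lines_seen += 1
        if not line:       # never true here: every line holds at least a newline
            return False, lines_seen
    return True, lines_seen

# Build a small gzip stream in memory (hypothetical sample data).
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    gz.write(b"some text\n" * 1000)
buf.seek(0)

with gzip.GzipFile(fileobj=buf, mode="rb") as fh:
    result, lines_seen = consume_like_all(fh)
```

Since a line read from a file is never an empty string, all() cannot stop early, so every line in the file is split out and checked before it returns.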

The File:
----------------------------------------------------
[jeff@marvin ~]$ ls -alh junk.gz
-rw-rw-r-- 1 jeff jeff 113M 2009-01-26 10:42 junk.gz
[jeff@marvin ~]$

The 'zcat' time:
----------------------------------------------------
[jeff@marvin ~]$ time zcat junk.gz > /dev/null

real    0m2.390s
user    0m2.296s
sys     0m0.093s
[jeff@marvin ~]$


Test Script #1:
----------------------------------------------------
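import sys
import gzip

# Python 2: read() returns a str, so it can go straight to stdout.
fs = gzip.open('junk.gz')
# Decompress in 8 KB blocks and pass each block through unchanged.
data = fs.read(8192)
while data:
    sys.stdout.write(data)
    data = fs.read(8192)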


Test Script #1 Time:
----------------------------------------------------
[jeff@marvin ~]$ time python test9.py >/dev/null

real    0m3.681s
user    0m3.201s
sys     0m0.478s
[jeff@marvin ~]$


Test Script #2:
----------------------------------------------------
import sys
import gzip

fs = gzip.open('junk.gz')
# all() iterates over fs line by line until the file is exhausted.
all(fs)


Test Script #2 Time:
----------------------------------------------------
[jeff@marvin ~]$ time python test10.py

real    1m51.764s
user    1m51.475s
sys     0m0.245s
[jeff@marvin ~]$
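
For anyone who wants to repeat the comparison on their own machine, here is a minimal sketch that times both read styles in one script. It assumes Python 3, generates its own small test file instead of using junk.gz, and the function names are made up for this example. The timings it prints depend on the Python version and its gzip implementation, so no expected numbers are shown.

```python
import gzip
import os
import tempfile
import time

def read_in_chunks(path, chunk_size=8192):
    """Decompress in fixed-size blocks, as in Test Script #1.
    Returns the number of decompressed bytes."""
    total = 0
    with gzip.open(path, "rb") as fs:
        data = fs.read(chunk_size)
        while data:
            total += len(data)
            data = fs.read(chunk_size)
    return total

def read_by_lines(path):
    """Iterate line by line, the same access pattern as all(fs)
    in Test Script #2. Returns the number of decompressed bytes."""
    total = 0
    with gzip.open(path, "rb") as fs:
        for line in fs:
            total += len(line)
    return total

def timed(func, path):
    """Run func(path) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(path)
    return result, time.perf_counter() - start

# Hypothetical sample data: a temporary .gz file full of short lines.
sample_line = b"a line of sample text\n"
fd, sample_path = tempfile.mkstemp(suffix=".gz")
os.close(fd)
with gzip.open(sample_path, "wb") as out:
    out.write(sample_line * 200000)

chunk_bytes, chunk_secs = timed(read_in_chunks, sample_path)
line_bytes, line_secs = timed(read_by_lines, sample_path)
os.remove(sample_path)

print("chunked reads:  %d bytes in %.3fs" % (chunk_bytes, chunk_secs))
print("line iteration: %d bytes in %.3fs" % (line_bytes, line_secs))
```

Both functions count the bytes they see, so the two runs can be checked to have decompressed the same amount of data before the timings are compared.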



