[SciPy-user] read/write compressed files

Thu Jun 21 06:57:02 EDT 2007

Hi,

I preferred bz2 over zlib because of its higher compression, despite the
slower performance. This common belief usually matched my own experience.
However, a simple test with fresh data from this morning, shown below,
clearly undermines this thinking:

> du -hsc test9*.dat

428M    total

> time gzip test9*.dat

real    0m31.663s
user    0m28.946s
sys     0m1.612s

> du -hsc test9*.dat.gz

215M    total

> time gunzip test9*.dat.gz

real    0m7.447s
user    0m6.036s
sys     0m1.264s

> time bzip2 test9*.dat

real    2m1.696s
user    1m54.527s
sys     0m4.008s

> du -hsc test9*.dat.bz2

219M    total

> time bunzip2 test9*.dat.bz2

real    0m43.252s
user    0m39.926s
sys     0m2.792s

I am surprised, as I clearly remember cases where bz2 gained me 20%. Here
it actually produced slightly larger output (219M vs. 215M) while being
roughly 4x slower to compress and 6x slower to decompress, so you have
convinced me to use zlib over bz2.
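
The same comparison can be repeated from within Python. What follows is
only a minimal sketch using the standard zlib and bz2 modules with their
default compression levels; the function name compare_codecs is made up
for illustration, and in-memory timings will not match the gzip/bzip2
command-line numbers above exactly.

```python
import bz2
import time
import zlib


def compare_codecs(payload: bytes) -> dict:
    """Compress and decompress `payload` with zlib and bz2.

    Returns, per codec, the compressed size in bytes and the wall-clock
    time spent compressing and decompressing (hypothetical helper, not
    part of any library).
    """
    results = {}
    for name, codec in (("zlib", zlib), ("bz2", bz2)):
        start = time.perf_counter()
        packed = codec.compress(payload)      # default compression level
        compress_s = time.perf_counter() - start

        start = time.perf_counter()
        unpacked = codec.decompress(packed)
        decompress_s = time.perf_counter() - start

        # Make sure the round trip is lossless before trusting the numbers.
        if unpacked != payload:
            raise ValueError(f"{name} round trip changed the data")

        results[name] = {
            "size": len(packed),
            "compress_s": compress_s,
            "decompress_s": decompress_s,
        }
    return results
```

Calling compare_codecs(open(path, "rb").read()) on one of your own data
files should show whether the size gain of bz2, if any, is worth the extra
time for that kind of data.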

thanks for forcing me to do this test,
- Dominik

Francesc Altet wrote:
> On Wed, 20 Jun 2007 at 21:01 +0200, Dominik Szczerba wrote:
>> PyTables is great (and big) while I just need to read in a sequence of
>> values.
> 
> Ok, that's fine. In any case, I'm interested in knowing why you are
> using bzip2 instead of zlib.  Have you found some data pattern where
> you get significantly more compression than with zlib, for example?
> 
> I'm asking this because, in my experience with numerical data, I was
> unable to find significant differences in compression ratio between
> bzip2 and zlib. See:
> 
> http://www.pytables.org/docs/manual/ch05.html#compressionIssues
> 
> for some experiments in that regard.
> 
> I'd appreciate any input on this subject (bzip2 vs zlib).
> 

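For the task mentioned in the quoted message (just reading in a sequence
of values from a compressed file), the standard gzip module may already
be enough. This is only a sketch: it assumes the values are stored as raw
native-endian 8-byte doubles, and write_values / read_values are
hypothetical helper names.

```python
import gzip
from array import array


def write_values(path, values):
    """Store floats as raw 8-byte doubles in a gzip-compressed file."""
    with gzip.open(path, "wb") as fh:
        fh.write(array("d", values).tobytes())


def read_values(path):
    """Read back all 8-byte doubles from a gzip-compressed file."""
    with gzip.open(path, "rb") as fh:
        raw = fh.read()
    values = array("d")
    values.frombytes(raw)  # raises ValueError if len(raw) % 8 != 0
    return list(values)
```

Swapping gzip.open for bz2.open gives the bz2 variant with the same
interface, so the choice of codec stays a one-line change.
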
-- 
Dominik Szczerba, Ph.D.
Computer Vision Lab CH-8092 Zurich
http://www.vision.ee.ethz.ch/~domi