[SciPy-user] Fast saving/loading of huge matrices

Francesc Altet faltet at carabos.com
Thu Apr 19 15:30:32 EDT 2007


On Thu, 19 Apr 2007 at 09:23 -0500, Robert Kern wrote:
> Gael Varoquaux wrote:
> > I have a huge matrix (I don't know how big it is; it hasn't finished
> > loading yet, but the ASCII file weighs 381 MB). I was wondering which
> > format has the best speed for saving/loading huge files. I don't mind
> > using HDF5 even if it is not included in scipy itself.
> 
> I think we've found that a simple pickle using protocol 2 works the
> fastest. At the time (a year or so ago) this was faster than PyTables
> for loading an entire array about 1 GB in size. PyTables might be
> better now, possibly because of the new numpy support.
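
For reference, dumping and loading with protocol 2 is just this (a
minimal sketch; the file name is my own choice):

import cPickle
import numpy

a = numpy.random.rand(1000, 1000)     # any ndarray

# Protocol 2 is binary and serializes the array buffer directly;
# the default protocol 0 is ASCII-based and far slower.
f = open('data.pickle', 'wb')
cPickle.dump(a, f, 2)
f.close()

f = open('data.pickle', 'rb')
b = cPickle.load(f)
f.close()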

I was curious as well whether PyTables 2.0 is somewhat faster than the
1.4 series (although I already knew that, for this sort of thing, the
room for improvement should be rather small).

To find out, I've made a small benchmark (see attachments) and compared
the performance of PyTables 1.4 and 2.0 against pickle (protocol 2). In
the benchmark, a NumPy array of around 1 GB is created and the times for
writing it to and reading it from disk are written to stdout. You can
see the outputs of the runs in the attachments as well.
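
The measurement itself is nothing more than wall-clock timing around
each write and read call; schematically, it works like this (the helper
below is my own simplification; the attached iobench scripts are the
authoritative version):

import time
import numpy

def timed(label, func, *args):
    # Wall-clock a single call and report the elapsed time, roughly
    # in the same format as the benchmark output below.
    t0 = time.time()
    result = func(*args)
    print "%s: %.3fs" % (label, time.time() - t0)
    return result

# Example: time the creation of the ~953 MB benchmark array.
data = timed("Array creation", numpy.random.rand, 1000, 125000)

The speed-up figures in the outputs below are simply the cPickle time
divided by the PyTables time for the same operation.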

From there, some conclusions can be drawn:

1. The difference in performance between PyTables 1.4 and 2.0 for this
specific task is almost negligible. This was expected because, although
1.4 used numarray at its core, the use of the array protocol made copies
of the arrays unnecessary (and hence the overhead relative to 2.0, with
NumPy at its core, is negligible).

2. For writing, the EArray (Extensible Array) object of PyTables is
roughly as fast as pickle (about 15% faster, in fact, but that is not
much). However, for reading, the speed-up of PyTables over pickle is
more than 2x (up to 2.35x for 2.0), which is something to consider. See
the code sketch after this list for what these variants look like.

3. For compressed EArrays, writing times are relatively bad: between
0.06x (zlib, PyTables 1.4) and 0.15x (lzo, PyTables 2.0) of the pickle
speed. However, for reading, the ratios are much better: between 0.57x
(zlib, PyTables 1.4) and 1.45x (lzo, PyTables 2.0). In general, one
would expect better performance from compressed data, but I've chosen
completely random data here, so the compressors couldn't achieve even
decent compression ratios, and that hurts I/O performance quite a bit.

4. The best performance is achieved by the Array object, which is
simple (it can be neither enlarged nor compressed) but rather effective
in terms of I/O. It can be up to 1.74x faster than pickle for writing
(using PyTables 2.0) and up to 3.56x faster for reading (using PyTables
1.4), which is quite a lot (more than 500 MB/s) in terms of I/O speed.
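
To make the variants above concrete, here is a rough sketch using the
PyTables 2.0 API (file and node names and the complevel setting are my
own choices, not necessarily those of the attached scripts; in the 1.4
API the shape is declared inside the Atom instead):

import numpy
import tables

a = numpy.random.rand(1000, 125000)

# Plain Array: neither enlargeable nor compressible, but the fastest.
f = tables.openFile('array.h5', 'w')
f.createArray(f.root, 'data', a)
f.close()

# EArray, dumped row by row; pass a Filters instance to compress.
f = tables.openFile('earray.h5', 'w')
filters = tables.Filters(complevel=1, complib='zlib')  # or complib='lzo'
ea = f.createEArray(f.root, 'data', tables.Float64Atom(),
                    shape=(0, a.shape[1]), filters=filters,
                    expectedrows=a.shape[0])
for row in a:
    ea.append(row.reshape(1, a.shape[1]))
# A "complete dump" is simply ea.append(a) instead of the loop.
f.close()

# Reading back is a single call in every case.
f = tables.openFile('array.h5', 'r')
b = f.root.data.read()
f.close()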

I will warn the reader that these times were taken *without* taking
into account the time to flush data to disk when writing. When that
time is included, the gap between PyTables and pickle narrows
significantly (but not when using compression, where PyTables will
remain rather slower in comparison). So you should take the above
figures as *peak* throughputs (achievable when the dataset fits
comfortably in main memory, thanks to the filesystem cache).

For reading, when the files don't fit in the filesystem cache or are
read for the first time, one should expect a significant degradation of
all the figures presented here. However, when using compression on real
data (where compression ratios of 2x or more are realistic), the
compressed EArray should be up to 2x faster for reading than the other
solutions (I've noticed this many times in other contexts). This is
because less data has to be read from disk and, moreover, today's CPUs
are exceedingly fast at decompressing.

The above benchmarks were run on a machine running SuSE Linux with an
AMD Opteron @ 2 GHz, 8 GB of main memory, and a 7200 rpm IDE disk.

Cheers,

-- 
Francesc Altet    |  Be careful about using the following code --
Carabos Coop. V.  |  I've only proven that it works, 
www.carabos.com   |  I haven't tested it. -- Donald Knuth
-------------- next part --------------
A non-text attachment was scrubbed...
Name: iobench-2.0.py
Type: text/x-python
Size: 3679 bytes
Desc: 
URL: <http://mail.scipy.org/pipermail/scipy-user/attachments/20070419/96f7af0a/attachment.py>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: iobench-1.4.py
Type: text/x-python
Size: 3670 bytes
Desc: 
URL: <http://mail.scipy.org/pipermail/scipy-user/attachments/20070419/96f7af0a/attachment-0001.py>
-------------- next part --------------
Python version:    2.4.4 (#1, Nov  6 2006, 12:24:47) 
[GCC 4.0.2 20050901 (prerelease) (SUSE Linux)]
NumPy version:     1.0.1
PyTables version:  1.4
Checking with a 1000x125000 matrix of float64 elements (953.674 MB)
***** cPickle (protocol 2) *****
Time for writing: 3.992s
File size: 955M
Time for reading: 6.222s
***** PyTables EArray (dump row to row) *****
Time for writing: 3.745s.   Speed-up over cPickle: 1.07x
File size: 955M
Time for reading: 2.73s.   Speed-up over cPickle: 2.28x
File size: 955M
***** PyTables EArray (dump row to row, compressed with zlib) *****
Time for writing: 68.575s.   Speed-up over cPickle: 0.06x
File size: 810M
Time for reading: 10.956s.   Speed-up over cPickle: 0.57x
File size: 810M
***** PyTables EArray (dump row to row, compressed with lzo) *****
Time for writing: 33.865s.   Speed-up over cPickle: 0.12x
File size: 840M
Time for reading: 7.694s.   Speed-up over cPickle: 0.81x
File size: 840M
***** PyTables EArray (complete dump) *****
Time for writing: 3.389s.   Speed-up over cPickle: 1.18x
File size: 955M
Time for reading: 2.758s.   Speed-up over cPickle: 2.26x
File size: 955M
***** PyTables Array *****
Time for writing: 2.659s.   Speed-up over cPickle: 1.5x
File size: 955M
Time for reading: 1.746s.   Speed-up over cPickle: 3.56x
File size: 955M
-------------- next part --------------
Python version:    2.5 (r25:51908, Nov  3 2006, 12:01:01) 
[GCC 4.0.2 20050901 (prerelease) (SUSE Linux)]
NumPy version:     1.0.2.dev3640
PyTables version:  2.0b2pro
Checking with a 1000x125000 matrix of float64 elements (953.674 MB)
***** cPickle (protocol 2) *****
Time for writing: 4.674s
File size: 955M
Time for reading: 6.254s
***** PyTables EArray (dump row to row) *****
Time for writing: 3.844s.   Speed-up over cPickle: 1.22x
File size: 972M
Time for reading: 2.663s.   Speed-up over cPickle: 2.35x
File size: 972M
***** PyTables EArray (dump row to row, compressed with zlib) *****
Time for writing: 48.956s.   Speed-up over cPickle: 0.1x
File size: 831M
Time for reading: 8.597s.   Speed-up over cPickle: 0.73x
File size: 831M
***** PyTables EArray (dump row to row, compressed with lzo) *****
Time for writing: 30.643s.   Speed-up over cPickle: 0.15x
File size: 842M
Time for reading: 4.302s.   Speed-up over cPickle: 1.45x
File size: 842M
***** PyTables EArray (complete dump) *****
Time for writing: 4.071s.   Speed-up over cPickle: 1.15x
File size: 972M
Time for reading: 2.701s.   Speed-up over cPickle: 2.32x
File size: 972M
***** PyTables Array *****
Time for writing: 2.693s.   Speed-up over cPickle: 1.74x
File size: 955M
Time for reading: 1.81s.   Speed-up over cPickle: 3.46x
File size: 955M

