[SciPy-user] Fast saving/loading of huge matrices

Vincent Nijs v-nijs at kellogg.northwestern.edu
Thu Apr 19 18:26:44 EDT 2007


PyTables looks very interesting and clearly has a ton of features. However,
if I am just trying to read in a CSV file, can it figure out the correct data
types on its own (e.g., dates, floats, strings)? Read: "I am too lazy to
type in variable names and types myself if the names are already in the
file" :)

Similarly, can you just dump a dictionary or rec-array into a PyTables file
with one 'save' command and have PyTables figure out the variable names and
types? This seems relevant since you wouldn't have to do any of that with
cPickle, which saves user time if not computer time.

Sorry if this is too off-topic.

Vincent




On 4/19/07 2:30 PM, "Francesc Altet" <faltet at carabos.com> wrote:

> On Thu, 19 Apr 2007 at 09:23 -0500, Robert Kern wrote:
>> Gael Varoquaux wrote:
>>> I have a huge matrix (I don't know how big it is, it hasn't finished
>>> loading yet, but the ascii file weighs 381 MB). I was wondering which
>>> format is most efficient for saving/loading huge files. I don't mind
>>> using HDF5 even if it is not included in scipy itself.
>> 
>> I think we've found that a simple pickle using protocol 2 works the fastest.
>> At the time (a year or so ago) this was faster than PyTables for loading the
>> entire array of about 1 GB size. PyTables might be better now, possibly
>> because of the new numpy support.
> 
> I was curious as well whether PyTables 2.0 is somewhat faster than the
> 1.4 series (although I already knew that, for this sort of thing, the room
> for improvement is rather small).
> 
> For that, I've made a small benchmark (see attachments) and compared the
> performance of PyTables 1.4 and 2.0 against pickle (protocol 2). In the
> benchmark, a NumPy array of around 1 GB is created and the time for
> writing and reading it from disk is written to stdout. You can see the
> outputs for the runs in the attachments as well.
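[Not the attached benchmark script -- just a rough Python 2 sketch of the
pickle side of such a timing run, with an illustrative array size:]

    import time
    import numpy
    import cPickle

    N = 128 * 1024 * 1024                 # 128M float64 values ~ 1 GB
    a = numpy.random.rand(N)

    t0 = time.time()
    f = open('a.pkl', 'wb')
    cPickle.dump(a, f, protocol=2)        # protocol 2 writes the raw buffer
    f.close()
    print 'pickle write: %.2f s' % (time.time() - t0)

    t0 = time.time()
    f = open('a.pkl', 'rb')
    a2 = cPickle.load(f)
    f.close()
    print 'pickle read:  %.2f s' % (time.time() - t0)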
> 
> From there, some conclusions can be drawn:
> 
> 1. The difference in performance between PyTables 1.4 and 2.0 for this
> specific task is almost negligible. This was to be expected because,
> although 1.4 uses numarray at its core, the use of the array protocol
> makes copies of the arrays unnecessary (and hence the overhead relative to
> 2.0, with NumPy at the core, is negligible).
> 
> 2. For writing, the EArray (Extensible Array) object of PyTables has
> roughly the same speed as NumPy (about 15% faster in fact, but that is not
> much). However, for reading, the speed-up of PyTables over pickle is more
> than 2x (up to 2.35x for 2.0), which is something to consider.
> 
> 3. For compressed EArrays, writing times are relatively bad: between
> 0.06x (zlib and PyTables 1.4) and 0.15x (lzo and PyTables 2.0). However,
> for reading, the ratios are quite good: between 0.57x (zlib and PyTables
> 1.4) and 1.45x (lzo and PyTables 2.0). In general, one would expect
> better performance from compressed data, but I've chosen completely
> random data here, so the compressors weren't able to achieve even decent
> compression ratios, and that hurts I/O performance quite a bit.
> 
> 4. The best performance is achieved by the simple Array object (it cannot
> be enlarged or compressed), which is nevertheless rather effective in
> terms of I/O. For writing, it can be up to 1.74x faster than pickle (using
> PyTables 2.0), and for reading up to 3.56x faster (using PyTables 1.4),
> which is quite a lot (more than 500 MB/s) in terms of I/O speed.
> 
> I will warn the reader that these times are taken *without* taking into
> account the time to flush to disk when writing. When that time is
> included, the gap between PyTables and pickle shrinks significantly (but
> not when using compression, where PyTables will remain rather slower in
> comparison). So you should take the above figures as *peak* throughputs
> (achievable when the dataset fits comfortably in main memory thanks to the
> filesystem cache).
> 
> For reading, when the files don't fit in the filesystem cache or are read
> for the first time, one should expect a significant degradation of all the
> figures presented here. However, when using compression on real data
> (where compression ratios of 2x or more are realistic), a compressed
> EArray should be up to 2x faster for reading than the other solutions
> (I've noticed this many times in other contexts). This is so because one
> has to read less data from disk and, moreover, today's CPUs are
> exceedingly fast at decompressing.
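[And a sketch of a compressed, enlargeable EArray; Filters, Float64Atom and
createEArray follow the PyTables 2.0-style API as I understand it, and
complib='lzo' assumes the LZO compressor is installed:]

    import numpy
    import tables

    a = numpy.random.rand(1024, 1024)

    f = tables.openFile('a_comp.h5', mode='w')
    filters = tables.Filters(complevel=1, complib='lzo')   # or 'zlib'
    # enlargeable along the first (0-length) dimension, compressed on the fly
    ea = f.createEArray(f.root, 'a', tables.Float64Atom(),
                        shape=(0, 1024), filters=filters)
    ea.append(a)              # chunks are compressed as they are written
    f.close()

    f = tables.openFile('a_comp.h5', mode='r')
    a2 = f.root.a.read()      # decompressed transparently on read
    f.close()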
> 
> The above benchmarks were run on a machine running SuSE Linux with an
> AMD Opteron @ 2 GHz, 8 GB of main memory and a 7200 rpm IDE disk.
> 
> Cheers,
