[Numpy-discussion] Pickle, pytables, and sqlite - loading and saving recarray's

Fri Jul 20 07:17:59 EDT 2007

On Friday 20 July 2007 04:42, Vincent Nijs wrote:
> I am interested in using sqlite (or pytables) to store data for scientific
> research. I wrote the attached test program to save and load a simulated
> 11x500,000 recarray. Average save and load times are given below (timeit
> with 20 repetitions). The save time for sqlite is not really fair because I
> have to delete the data table each time before I create the new one. It is
> still pretty slow in comparison. Loading the recarray from sqlite is
> significantly slower than pytables or cPickle. I am hoping there may be
> more efficient ways to save and load recarray's from/to sqlite than what I
> am now doing. Note that I infer the variable names and types from the data
> rather than specifying them manually.
>
> I'd luv to hear from people using sqlite, pytables, and cPickle about their
> experiences.
>
> saving recarray with cPickle:       1.448568 sec/pass
> saving recarray with pytable:      3.437228 sec/pass
> saving recarray with sqlite:         193.286204 sec/pass
>
> loading recarray using cPickle:    0.471365 sec/pass
> loading recarray with pytable:     0.692838 sec/pass
> loading recarray with sqlite:        15.977018 sec/pass

For a fairer comparison, and for large amounts of data, you should tell 
PyTables the expected number of rows (see [1]) that you will end up 
feeding into the table, so that it can choose the best chunk size for I/O 
purposes.
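
A minimal sketch of passing the expected row count when saving a recarray. 
It assumes the snake_case API of PyTables 3.x (open_file, create_table with 
the obj and expectedrows arguments) rather than the 2.0 names used in this 
thread. The helper names, column names and file layout are made up for 
illustration and are not taken from the attached script.

```python
import numpy as np


def make_data(nrows):
    """Build a small two-column recarray (column names are illustrative)."""
    ids = np.arange(nrows)
    values = np.random.rand(nrows)
    return np.rec.fromarrays([ids, values], names="id,value")


def save_recarray(path, data, name="data"):
    """Write `data` to an HDF5 table, telling PyTables how many rows to expect."""
    import tables  # PyTables; imported here so the module loads without it

    with tables.open_file(path, mode="w") as h5:
        # expectedrows lets PyTables pick a chunk size suited to the final
        # table size instead of one tuned for the default row count.
        table = h5.create_table("/", name, obj=data, expectedrows=len(data))
        table.flush()


def load_recarray(path, name="data"):
    """Read the whole table back as a NumPy structured array."""
    import tables

    with tables.open_file(path, mode="r") as h5:
        return h5.get_node("/", name).read()
```

For example, save_recarray("data.h5", make_data(500000)) followed by 
load_recarray("data.h5") should round-trip the array.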

I've redone the benchmarks (the new script is attached) with 
this 'optimization' on and here are my numbers:

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
PyTables version:  2.0
HDF5 version:      1.6.5
NumPy version:     1.0.3
Zlib version:      1.2.3
LZO version:       2.01 (Jun 27 2005)
Python version:    2.5 (r25:51908, Nov  3 2006, 12:01:01)
[GCC 4.0.2 20050901 (prerelease) (SUSE Linux)]
Platform:          linux2-x86_64
Byte-ordering:     little
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Test saving recarray using cPickle: 0.197113 sec/pass
Test saving recarray with pytables: 0.234442 sec/pass
Test saving recarray with pytables (with zlib): 1.973649 sec/pass
Test saving recarray with pytables (with lzo): 0.925558 sec/pass

Test loading recarray using cPickle: 0.151379 sec/pass
Test loading recarray with pytables: 0.165399 sec/pass
Test loading recarray with pytables (with zlib): 0.553251 sec/pass
Test loading recarray with pytables (with lzo): 0.264417 sec/pass

As you can see, the differences between raw cPickle and PyTables are much 
smaller than when PyTables is not told the total number of rows.  In fact, an 
automatic optimization could easily be added to PyTables: when the user passes 
a recarray, its length would be compared with the default number of expected 
rows (currently 10000), and if the former is larger, the length of the 
recarray would be used instead.
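
The rule described above can be sketched in a few lines. This is only an 
illustration of the proposed heuristic, not PyTables code: the function name 
and the constant name are hypothetical, and the default of 10000 is the value 
quoted in this message.

```python
DEFAULT_EXPECTED_ROWS = 10000  # default quoted in the message above


def choose_expected_rows(data, default=DEFAULT_EXPECTED_ROWS):
    """Return the row count hint: the data length if it exceeds the default."""
    return max(len(data), default)
```

With a 500,000-row recarray this returns 500000; for anything shorter than 
the default it falls back to 10000.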

I have also added the times when using compression, in case you are 
interested in using it.  Here are the final file sizes:

$ ls -sh data
total 132M
24M data-lzo.h5  43M data-None.h5  43M data.pickle  25M data-zlib.h5

Of course, this is using completely random data; with real data the 
compression ratios are expected to be higher than this.
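
To illustrate why the kind of data matters for compression, here is a small 
sketch using only the standard-library zlib module, not PyTables. The two 
payloads are artificial extremes chosen for the example: pure random bytes, 
and a highly repetitive pattern standing in for regular "real" data. The 
benchmark's recarray is neither of these, so the ratios will not match the 
file sizes listed above.

```python
import os
import zlib


def compression_ratio(payload, level=6):
    """Uncompressed size divided by zlib-compressed size."""
    return len(payload) / len(zlib.compress(payload, level))


# Pure random bytes: essentially incompressible.
random_bytes = os.urandom(1000000)

# A repeating pattern: an extreme case of regular, redundant data.
patterned = bytes(range(256)) * 4000

print("random:    %.2f" % compression_ratio(random_bytes))
print("patterned: %.2f" % compression_ratio(patterned))
```

The random payload should stay at a ratio of about 1, while the patterned 
one compresses many times over.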

[1] http://www.pytables.org/docs/manual/ch05.html#expectedRowsOptim

Cheers,

-- 
>0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"
-------------- next part --------------
A non-text attachment was scrubbed...
Name: load_tables_test.py
Type: application/x-python
Size: 4964 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20070720/1dec785c/attachment.bin>