[Numpy-discussion] Pickle, pytables, and sqlite - loading and saving recarray's
Francesc Altet
faltet at carabos.com
Fri Jul 20 11:53:09 EDT 2007
Vincent,
On Friday 20 July 2007 15:35, Vincent Nijs wrote:
> Still curious however ... does no one on this list use (and like) sqlite?
First of all, while I'm not a heavy user of relational databases, I have used
them as references for benchmarking purposes. Based on that experience, I'd
say that, for writing, relational databases take a lot of safety measures to
ensure that all the data written to disk is safe and that the data
relationships don't get broken, and that takes time (a lot of time, in
fact). I'm not sure whether some of these safety measures can be relaxed,
but even if some relational databases allow this, my feeling (beware, I can
be wrong) is that you won't be able to reach cPickle/PyTables speed
(cPickle/PyTables do not take such safety measures because they are not
designed for these tasks).
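
As a rough illustration of what relaxing those safety measures can look like
in sqlite, here is a minimal sketch using Python's standard sqlite3 module.
The table name "samples", its columns and the bulk_insert() helper are
made up for the example; "PRAGMA synchronous = OFF" is a real sqlite setting
that trades durability for speed. Nothing here was timed, so it says nothing
about how close this gets to cPickle/PyTables.

```python
import sqlite3


def bulk_insert(conn, rows, relaxed=True):
    """Insert (id, value) rows into a hypothetical 'samples' table.

    All rows go in as a single transaction, so sqlite commits once
    instead of once per row.
    """
    if relaxed:
        # Assumption for this sketch: losing data on a crash is acceptable.
        # With synchronous = OFF sqlite does not wait for the disk flush.
        conn.execute("PRAGMA synchronous = OFF")
    conn.execute("CREATE TABLE IF NOT EXISTS samples (id INTEGER, value REAL)")
    # 'with conn' commits on success and rolls back on an exception.
    with conn:
        conn.executemany("INSERT INTO samples VALUES (?, ?)", rows)
    return conn.execute("SELECT COUNT(*) FROM samples").fetchone()[0]


conn = sqlite3.connect(":memory:")
n_rows = bulk_insert(conn, [(i, i * 0.5) for i in range(1000)])
```

Grouping the inserts into one transaction is usually the first thing to try;
the pragma only matters for databases that live on disk.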
In this sense, the best writing speed that I was able to achieve with
Postgres (I don't know whether sqlite supports this) came from presenting
the data as if it were a file stream and using the "cursor.copy_from()"
method. Using this approach I was able to speed up insertion by about 10x
(if I remember correctly), but even with this, PyTables can be another 10x
faster. You can see an example of usage in the Postgres backend [1] used for
the benchmarks comparing PyTables and Postgres speeds.
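
For readers who have not seen the file-stream trick, the idea can be sketched
as follows. This is not the code from [1]; the helper names rows_to_stream()
and bulk_load() are invented for the example, and the cursor is assumed to be
a psycopg2 cursor (whose copy_from() takes a file-like object, a table name
and a separator). The sketch does no escaping, so it assumes the values
contain no tabs, newlines or backslashes.

```python
import io


def rows_to_stream(rows, sep="\t"):
    """Serialize rows as delimited text in an in-memory file-like object."""
    buf = io.StringIO()
    for row in rows:
        buf.write(sep.join(str(v) for v in row) + "\n")
    buf.seek(0)  # rewind so the consumer reads from the beginning
    return buf


def bulk_load(cursor, table, rows, sep="\t"):
    """Send all rows through a psycopg2-style cursor.copy_from() call.

    'cursor' is assumed to come from a psycopg2 connection; the table
    must already exist with columns matching the row layout.
    """
    stream = rows_to_stream(rows, sep)
    cursor.copy_from(stream, table, sep=sep)


# Building the stream needs no database connection.
stream = rows_to_stream([(1, 0.5), (2, 1.5)])
```

With a live connection one would call something like
bulk_load(conn.cursor(), "samples", rows) and then conn.commit().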
Regarding reading speed, my digging [2] seems to indicate that the bottleneck
here is not related to safety, but to the need of the relational databases'
Python APIs to wrap *every* element retrieved from the database in a Python
object (int, float, string...). In contrast, PyTables creates an empty
recarray as the container for all the retrieved data, and that's very fast
compared with the former approach. To somewhat quantify this effect as a
function of the size of the dataset retrieved, see figure 14 of [3] (as you
can see, the larger the dataset retrieved, the larger the difference in
speed). Incidentally, and as is said there, I'm hoping that NumPy containers
will eventually be discovered by the makers of relational database wrappers,
so that these wrapping times can be removed completely, but I'm currently not
aware of any package taking this approach.
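
To make the difference between the two containers concrete, here is a small
sketch using sqlite3 and NumPy. The table and column names are made up for
the example. Note that it still goes through the DB-API, so the per-element
wrapping cost is paid anyway; it only shows the kind of typed container that
a NumPy-aware database wrapper could fill directly.

```python
import sqlite3

import numpy as np

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (id INTEGER, value REAL)")
conn.executemany("INSERT INTO samples VALUES (?, ?)",
                 [(i, i * 0.5) for i in range(5)])
conn.commit()

# DB-API path: a list of tuples in which every field of every row is a
# separate Python object (int, float, ...).
rows = conn.execute("SELECT id, value FROM samples ORDER BY id").fetchall()

# Array path: one contiguous, typed buffer holding all the rows.
dtype = np.dtype([("id", "i8"), ("value", "f8")])
table = np.array(rows, dtype=dtype)

# View it as a recarray to get attribute access (rec.id, rec.value).
rec = table.view(np.recarray)
```

Here rows holds len(rows) * 2 individual Python objects, while rec holds the
same data in a single buffer of rec.nbytes bytes.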
[1] http://www.pytables.org/trac/browser/trunk/bench/postgres_backend.py
[2] http://thread.gmane.org/gmane.comp.python.numeric.general/9704
[3] http://www.carabos.com/docs/OPSI-indexes.pdf
Cheers,
--
>0,0< Francesc Altet http://www.carabos.com/
V V Cárabos Coop. V. Enjoy Data
"-"