Why does shelve make such large files?

Jean-Claude Wippler Jean-Claude.Wippler at p98.f112.n480.z2.fidonet.org
Fri Jul 2 14:41:25 EDT 1999


From: Jean-Claude Wippler <jcw at equi4.com>

Johan Wouters wrote:
> 
> Gerrit Holl wrote:
> 
> > Ah, I understand.
> > So pickle is useful for very small datasets, but when they're really
> > huge, one should use shelve. Is that right?
> 
> Pickle will serialise your data into a certain format. This way you
> can "store" objects like lists as a whole. The serialising has the
> benefit that you can e.g. put your object (be it a list or something
> else) through a pipe or socket and depickle it at the other side. This
> way you can have python processes that exchange data in a sort of
> native python format.
> 
> Shelve gives you a lot more than just storing the data as a long
> sequence. Imagine the extra space being the administration that is
> needed for all the extra functionality (like tagging what data items
> are valid etc.)
> 
> The overhead induced by a database management system can be enormous
> compared to the bare data you want to access, but this is mostly a
> space/time tradeoff: the more space you use the faster you can do
> things like searching and sorting etc etc.

Ahem, with all due respect to everyone... this is humbug.

Pickle serializes your data in one sweep to/from disk.  It's compact.
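
To make "one sweep" concrete, here is a minimal sketch, assuming a current
Python 3 and its standard pickle module.  The sample data is made up for
illustration and is not from this thread.

```python
import pickle

# Hypothetical sample data, invented for this illustration.
records = {"user%d" % i: list(range(i)) for i in range(100)}

# One sweep out: the whole object becomes a single byte string.
blob = pickle.dumps(records)

# One sweep back in: the whole object is rebuilt at once.
restored = pickle.loads(blob)

print(len(blob), "bytes for", len(records), "entries")
```

Note that there is no way to fetch or change a single entry here without
reading or writing the whole thing.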

Shelve stores serialized pieces in a keyed-access form, and uses file
space allocation to support add/modify/delete.  It starts out with more
empty space, but as the amount of stored data grows, that space does not
expand linearly.  As a rule of thumb, expect about half of a shelve's
file space to be in use on *average* (give or take a factor of 2, perhaps).
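
The difference can be measured with a rough sketch like the one below,
assuming a current Python 3.  The data, the file names "flat.pkl" and
"keyed", and the total_size helper are all hypothetical.  The sizes you get
depend on which dbm backend shelve picks on your system, so no particular
numbers are claimed here.

```python
import os
import pickle
import shelve
import tempfile

# Hypothetical sample data, invented for this illustration.
records = {"key%03d" % i: "x" * 50 for i in range(200)}


def total_size(directory, prefix):
    """Sum the sizes of all files in directory whose name starts with prefix.

    Some dbm backends create several files (or add an extension), so the
    shelve size has to be collected this way.
    """
    return sum(
        os.path.getsize(os.path.join(directory, name))
        for name in os.listdir(directory)
        if name.startswith(prefix)
    )


with tempfile.TemporaryDirectory() as tmp:
    # Pickle: everything written in one sweep.
    with open(os.path.join(tmp, "flat.pkl"), "wb") as f:
        pickle.dump(records, f)

    # Shelve: each value is pickled separately and stored under its key.
    with shelve.open(os.path.join(tmp, "keyed")) as shelf:
        for key, value in records.items():
            shelf[key] = value
        del shelf["key000"]          # delete a single entry
        shelf["key001"] = "changed"  # modify a single entry
        one_value = shelf["key002"]  # read a single entry by key
        remaining = len(shelf)

    pickle_bytes = total_size(tmp, "flat")
    shelf_bytes = total_size(tmp, "keyed")

print("pickle:", pickle_bytes, "bytes")
print("shelve:", shelf_bytes, "bytes")
```

The add/modify/delete operations on single keys are what the extra file
space buys you.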

But the last claim is horrendously misleading, I'm afraid: data storage
is not a space/time tradeoff.  It's about throughput (I/O bottlenecks)
and overhead in managing the supported data and indexing schemes.  There
are order-of-magnitude performance differences in how several solutions
work, because of this.

Shudder.  The notion that a large database package, or a large datafile,
is faster, is so far from reality that it has to be corrected, even in
this Python-oriented newsgroup.  My apologies for the S/N ratio drop.

-- Jean-Claude

P.S.  Don't believe everything you read, Gerrit.