key/value store optimized for disk storage

Thu May 3 02:33:56 EDT 2012

Steve Howell <showell30 at yahoo.com> writes:
> Doesn't cdb do at least one disk seek as well?  In the diagram on this
> page, it seems you would need to do a seek based on the value of the
> initial pointer (from the 256 possible values):

Yes, of course it has to seek if there is too much data to fit in
memory.  All I'm saying is that if you're spending milliseconds on the
seek, that may dominate the lookup time even if you scan the 90k.

Actually, looking at the spec more closely, there are 256 hash tables in
the file, but each table can be of any size.  So there can be far more
than 2**16 hash slots.  Uncharacteristically for Bernstein, the document
is pretty unclear, so maybe you have to look at the code to be sure of
the layout.  Note also that empty hash buckets have just 2 words of
overhead.  So with 3M keys and 75% load factor, you get 4M buckets and
relatively few collisions.  The extra 1M buckets in a 64 bit
implementation is just 16MB in the file, which isn't much at all even
considering that you want it to stay resident in memory to avoid some
seeks, assuming you're on a PC and not some smaller device like a phone.
(A phone will have a solid state disk eliminating most seek time, so
you're ok in that situation too).

> Yup, I don't think I want to incur the extra overhead.  Do you have
> any first hand experience pushing dbm to the scale of 6Gb or so?  My
> take on dbm is that its niche is more in the 10,000-record range.

There are a bunch of different variants.  I'm trying to remember what
I've personally done with it and I'm sure I've used much more than 10k
records, but maybe not millions.  Unix dbm was originally designed to
handle millions of records back when that was a lot.  I'd expect gdbm,
bsddb and so forth can handle it easily.  The naive Python dbm module
might not be able to.  

The amount of data you're talking about (a few million keys, a few gb of
data) is fairly modest by today's standards, so I would think fancy
methods aren't really needed.