Dictionary or Database—Please advise

mk mrkafk at gmail.com
Fri Feb 26 13:20:19 EST 2010


Jeremy wrote:

> Shelve looks like an interesting option, but what might pose an issue
> is that I'm reading the data from a disk instead of memory.  I didn't
> mention this in my original post, but I was hoping that by using a
> database it would be more memory efficient in storing data in RAM so I
> wouldn't have to read from (or swap to/from) disk.  Would using the
> shelve package make reading/writing data from disk faster since it is
> in a binary format?

Read the docs:

"class shelve.BsdDbShelf(dict[, protocol=None[, writeback=False]])¶
     A subclass of Shelf which exposes first(), next(), previous(), 
last() and set_location() which are available in the bsddb module but 
not in other database modules. The dict object passed to the constructor 
must support those methods. This is generally accomplished by calling 
one of bsddb.hashopen(), bsddb.btopen() or bsddb.rnopen(). The optional 
protocol and writeback parameters have the same interpretation as for 
the Shelf class."

Apparently using shelve internally gives you option of using bsddb, 
which is good news: bsddb is B-tree DB, which is highly efficient for 
finding keys. I would recommend bsddb.btopen(), as it creates B-tree DB 
(perhaps other implementations, like anydb or hash db are good as well, 
but I personally didn't test them out).

I can't say for Berkeley DB implementation, but in general B-tree 
algorithm has O(log2 n) complexity for finding keys, which roughly means 
that if you need to find particular key in a db of 1 million keys, 
you'll probably need ~20 disk accesses (or even less if some keys looked 
at in the process of search happen to be in the same disk sectors). So 
yes, it's highly efficient.

Having said that, remember that disk is many orders of magnitude slower 
than RAM, so it's no free lunch.. Nothing will beat memory-based data 
structure when it comes to speed (well new flash or hybrid disks perhaps 
could significantly improve in comparison to current mainstream 
mechanical-platter disks? there are some hyper-fast storage hardware 
companies out there, although they tend to charge arm and leg for their 
stuff for now).

Caveat: Berkeley DB is dual-licensed -- if you're using it for 
commercial work, it might be that you'd need to buy a license for it. 
Although I have had no experience with this really, if someone here did 
perhaps they will shed some light on it?

Regards,
mk






More information about the Python-list mailing list