Is shelve/dbm supposed to be this inefficient?

Thomas Jollans thomas at jollans.com
Wed Aug 1 20:08:50 EDT 2007


On Thursday 02 August 2007, Joshua J. Kugler wrote:
> I am using shelve to store some data since it is probably the best solution
> to my "data formats, number of columns, etc can change at any time"
> problem.  However, I seem to be dealing with bloat.
>
> My original data is 33MB.  When each row is converted to python lists, and
> inserted into a shelve DB, it balloons to 69MB.  Now, there is some
> additional data in there namely a list of all the keys containing data (vs.
> the keys that contain version/file/config information), BUT if I copy all
> the data over to a dict and dump the dict to a file using cPickle, that
> file is only 49MB.  I'm using pickle protocol 2 in both cases.
>
> Is this expected? Is there really that much overhead to using shelve and
> dbm files?  Are there any similar solutions that are more space efficient? 
> I'd use straight pickle.dump, but loading requires pulling the entire thing
> into memory, and I don't want to have to do that every time.
>
> [Note, for those that might suggest a standard DB.  Yes, I'd like to use a
> regular DB, but I have a domain where the number of data points in a sample
> may change at any time, so a timestamp-keyed dict is arguably the best
> solution, thus my use of shelve.]
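
To see how much of that gap shows up on your own data, it may be worth measuring both layouts side by side. The following is only a rough sketch in Python 3 (where cPickle has become pickle); `dir_size`, `compare_sizes` and the sample rows are made-up names and data, not anything from your setup. The dbm backend that shelve picks varies by platform and may create one file or several, so the sketch sums everything in a scratch directory.

```python
import os
import pickle
import shelve
import tempfile


def dir_size(path):
    """Total size in bytes of all regular files directly inside path."""
    return sum(
        os.path.getsize(os.path.join(path, name))
        for name in os.listdir(path)
        if os.path.isfile(os.path.join(path, name))
    )


def compare_sizes(rows, protocol=2):
    """Return (shelve_bytes, pickle_bytes) for the same mapping of rows."""
    with tempfile.TemporaryDirectory() as shelf_dir, \
            tempfile.TemporaryDirectory() as pickle_dir:
        # shelve pickles each value separately and hands it to a dbm backend.
        db = shelve.open(os.path.join(shelf_dir, "data"), protocol=protocol)
        try:
            for key, row in rows.items():
                db[key] = row
        finally:
            db.close()

        # One pickle of the whole dict, as in the comparison quoted above.
        with open(os.path.join(pickle_dir, "data.pkl"), "wb") as f:
            pickle.dump(rows, f, protocol=protocol)

        return dir_size(shelf_dir), dir_size(pickle_dir)


# Made-up sample data: timestamp-like string keys mapped to short lists.
sample = {"2007080100%04d" % i: [i, i * 0.5, "sensor-%d" % i] for i in range(1000)}
shelve_bytes, pickle_bytes = compare_sizes(sample)
print("shelve: %d bytes, single pickle: %d bytes" % (shelve_bytes, pickle_bytes))
```

The numbers depend on the dbm backend and on the shape of the rows, so treat the result as a measurement of one configuration rather than a general ratio.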

Have you considered a directory full of pickle files? In effect, this replaces
the dbm with the file system: each key becomes a file name and each value gets
its own pickle file. Something like this (untested):

import os
import cPickle

class DirShelf(dict):
    def __init__(self, dirname):
        self.dir = dirname
        self.__repl_dict = {}

    def __contains__(self, key):
        assert isinstance(key, str)
        assert key.isalnum() # or similar portable check for is-name-ok
        return os.path.exists(os.path.join(self.dir, key))

    def has_key(self, key):
        return key in self

    def __getitem__(self, key):
        try:
            if key not in self.__repl_dict:
                # load() detects the pickle protocol itself and takes no
                # protocol argument
                f = open(os.path.join(self.dir, key), 'rb')
                try:
                    self.__repl_dict[key] = cPickle.load(f)
                finally:
                    f.close()
            return self.__repl_dict[key]
        except IOError, e:
            raise KeyError(e)

    def __setitem__(self, key, val):
        assert isinstance(key, str)
        assert key.isalnum() # or similar portable check for is-name-ok
        self.__repl_dict[key] = val
        self.flush()

    def flush(self):
        # rewrites every cached entry; close each file so the data hits disk
        for k, v in self.__repl_dict.iteritems():
            f = open(os.path.join(self.dir, k), 'wb')
            try:
                cPickle.dump(v, f, protocol=2)
            finally:
                f.close()

    def __del__(self):
        self.flush()
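
For anyone on Python 3, the same idea might look roughly like the sketch below. It is an adaptation, not a drop-in port: it subclasses `collections.abc.MutableMapping` instead of `dict`, drops the in-memory cache, and rewrites only the file for the key being set. The `protocol` parameter and the `_path` helper are additions for illustration. Without the cache, a value that is loaded and then mutated in place is not saved until it is assigned back, which is the same behaviour as shelve with `writeback=False`.

```python
import os
import pickle
import tempfile
from collections.abc import MutableMapping


class DirShelf(MutableMapping):
    """Dict-like store keeping each value in its own pickle file."""

    def __init__(self, dirname, protocol=2):
        self.dir = dirname
        self.protocol = protocol
        os.makedirs(dirname, exist_ok=True)

    def _path(self, key):
        # Same restriction as the class above: keys double as file names.
        if not (isinstance(key, str) and key.isalnum()):
            raise ValueError("key must be an alphanumeric string: %r" % (key,))
        return os.path.join(self.dir, key)

    def __getitem__(self, key):
        try:
            with open(self._path(key), "rb") as f:
                return pickle.load(f)
        except (ValueError, FileNotFoundError):
            # An invalid key can never have been stored, so it is simply absent.
            raise KeyError(key) from None

    def __setitem__(self, key, value):
        # Only the file for this key is rewritten.
        with open(self._path(key), "wb") as f:
            pickle.dump(value, f, protocol=self.protocol)

    def __delitem__(self, key):
        try:
            os.remove(self._path(key))
        except (ValueError, FileNotFoundError):
            raise KeyError(key) from None

    def __iter__(self):
        return iter(sorted(os.listdir(self.dir)))

    def __len__(self):
        return len(os.listdir(self.dir))


# Usage with a scratch directory and a made-up timestamp key.
with tempfile.TemporaryDirectory() as tmp:
    shelf = DirShelf(os.path.join(tmp, "store"))
    shelf["20070801200850"] = [1, 2.5, "three"]
    print(shelf["20070801200850"])
    print("missing" in shelf)
```

Because `MutableMapping` supplies `__contains__`, `get`, `items` and friends on top of the five methods defined here, `len()` and iteration reflect what is actually on disk.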

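The `key.isalnum()` check in the class above rejects keys containing spaces, dashes or colons, which a formatted timestamp would have. One possible "is-name-ok" workaround, sketched here in Python 3, is to hex-encode the key before using it as a file name. `key_to_filename` and `filename_to_key` are hypothetical helper names, and the timestamp format is only an assumed example. Hex names are twice as long as the key, so very long keys could run into the file system's name length limit.

```python
def key_to_filename(key):
    """Encode an arbitrary non-empty string key as a hex file name."""
    if not isinstance(key, str) or not key:
        raise ValueError("key must be a non-empty string")
    return key.encode("utf-8").hex()


def filename_to_key(name):
    """Invert key_to_filename."""
    return bytes.fromhex(name).decode("utf-8")


# Assumed example key; the real timestamp format may differ.
stamp = "2007-08-01 20:08:50"
name = key_to_filename(stamp)
print(name.isalnum(), filename_to_key(name) == stamp)
```

The encoded name contains only the characters 0-9 and a-f, so it also avoids collisions on case-insensitive file systems.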

-- 
      Regards,                       Thomas Jollans