Is shelve/dbm supposed to be this inefficient?
Thomas Jollans
thomas at jollans.com
Wed Aug 1 20:08:50 EDT 2007
On Thursday 02 August 2007, Joshua J. Kugler wrote:
> I am using shelve to store some data since it is probably the best solution
> to my "data formats, number of columns, etc can change at any time"
> problem. However, I seem to be dealing with bloat.
>
> My original data is 33MB. When each row is converted to python lists, and
> inserted into a shelve DB, it balloons to 69MB. Now, there is some
> additional data in there namely a list of all the keys containing data (vs.
> the keys that contain version/file/config information), BUT if I copy all
> the data over to a dict and dump the dict to a file using cPickle, that
> file is only 49MB. I'm using pickle protocol 2 in both cases.
>
> Is this expected? Is there really that much overhead to using shelve and
> dbm files? Are there any similar solutions that are more space efficient?
> I'd use straight pickle.dump, but loading requires pulling the entire thing
> into memory, and I don't want to have to do that every time.
>
> [Note, for those that might suggest a standard DB. Yes, I'd like to use a
> regular DB, but I have a domain where the number of data points in a sample
> may change at any time, so a timestamp-keyed dict is arguably the best
> solution, thus my use of shelve.]
Have you considered a directory full of pickle files? In effect, that
replaces the dbm with the file system. Something like this (untested):
import os
import cPickle

class DirShelf(dict):
    def __init__(self, dirname):
        self.dir = dirname
        self.__repl_dict = {}  # cache of values loaded or set so far

    def __contains__(self, key):
        assert isinstance(key, str)
        assert key.isalnum() # or similar portable check for is-name-ok
        return os.path.exists(os.path.join(self.dir, key))

    def has_key(self, key):
        return key in self

    def __getitem__(self, key):
        try:
            if key not in self.__repl_dict:
                # cPickle.load() takes no protocol argument; the
                # protocol is detected from the pickle data itself.
                f = open(os.path.join(self.dir, key), 'rb')
                try:
                    self.__repl_dict[key] = cPickle.load(f)
                finally:
                    f.close()
            return self.__repl_dict[key]
        except IOError, e:
            raise KeyError(e)

    def __setitem__(self, key, val):
        assert isinstance(key, str)
        assert key.isalnum() # or similar portable check for is-name-ok
        self.__repl_dict[key] = val
        self.flush()

    def flush(self):
        # Rewrites every cached entry, not only the ones that changed.
        for k, v in self.__repl_dict.iteritems():
            f = open(os.path.join(self.dir, k), 'wb')
            try:
                cPickle.dump(v, f, 2)  # pickle protocol 2
            finally:
                f.close()

    def __del__(self):
        self.flush()
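
For readers on Python 3, the same one-pickle-file-per-key idea might be
sketched roughly as below. This is not the code from the post: the class
name PickleDir, the helper _path, the use of collections.abc.MutableMapping,
and the decision to write only the key being assigned (with no in-memory
cache) are all assumptions made for illustration.

```python
import os
import pickle
import tempfile
from collections.abc import MutableMapping


class PickleDir(MutableMapping):
    """Hypothetical dict-like store: one pickle file per key in a directory."""

    def __init__(self, dirname, protocol=2):
        self.dirname = dirname
        self.protocol = protocol  # protocol 2, as in the original question
        os.makedirs(dirname, exist_ok=True)

    def _path(self, key):
        # Keys double as file names, so keep the same alphanumeric restriction.
        if not (isinstance(key, str) and key.isalnum()):
            raise ValueError("key must be an alphanumeric str: %r" % (key,))
        return os.path.join(self.dirname, key)

    def __getitem__(self, key):
        try:
            with open(self._path(key), "rb") as f:
                return pickle.load(f)  # unpickles on every access, no cache
        except (ValueError, FileNotFoundError):
            raise KeyError(key) from None

    def __setitem__(self, key, value):
        # Only the file for this key is written.
        with open(self._path(key), "wb") as f:
            pickle.dump(value, f, protocol=self.protocol)

    def __delitem__(self, key):
        try:
            os.remove(self._path(key))
        except (ValueError, FileNotFoundError):
            raise KeyError(key) from None

    def __iter__(self):
        return iter(os.listdir(self.dirname))

    def __len__(self):
        return len(os.listdir(self.dirname))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        store = PickleDir(d)
        store["20070801T2008"] = [1.5, 2.5, "ok"]
        print(store["20070801T2008"])
```

Because nothing is cached, mutating a value returned by a lookup does not
change what is on disk; the value has to be assigned back under its key.
On case-insensitive file systems, keys that differ only in case would map
to the same file.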
--
Regards, Thomas Jollans