ZODB memory problems (was: processing a Very Large file)

Steve M sjmaster at gmail.com
Sun May 22 02:57:19 EDT 2005


 from persistent import Persistent
 from sets import Set

 class ExtendedTupleTable(Persistent):
    def __init__(self):
        # ObjectInterning is a helper class defined elsewhere in the
        # original post.
        self.interning = ObjectInterning()

        # This Set stores all generated ExtendedTuple objects.
        self.ets = Set() # et(s): ExtendedTuple object(s)
        # This dictionary maps each element to the Set of ExtendedTuples
        # that contain it.
        # e.g.: self.el2ets[3] = Set([(1,2,3), (3,4,5), (1,3,9)])
        #       self.el2ets[4] = Set([(3,4,5), (2,4,9)])
        self.el2ets = {}  # el: element of an ExtendedTuple object

#######

Note: I might be wrong. I say this here instead of qualifying every
assertion below. Thank you.

If you want more fine-grained swapping out to disk, you might want to
look at the classes provided by the BTrees modules that come with ZODB.
Built-in container classes like set and dict are effectively opaque to
ZODB: they have to be loaded into memory and written out to disk as one
whole unit, container and contents together. This is true for the
Persistent versions of the containers (e.g. PersistentMapping) as well;
those are special mostly because they automatically detect when they
are modified.
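
Here is a minimal sketch of that distinction (the class and attribute
names are just illustrative, and nothing needs to be committed to run
it):

 from persistent import Persistent
 from persistent.mapping import PersistentMapping

 class Table(Persistent):
     def __init__(self):
         self.plain = {}                  # ordinary dict: one opaque pickle
         self.auto = PersistentMapping()  # still one pickle, but self-tracking

 table = Table()

 # Mutating the plain dict does NOT mark `table` as changed; without
 # this flag the mutation can be silently lost at the next commit:
 table.plain[3] = 'three'
 table._p_changed = True

 # PersistentMapping notices its own mutation, so no flag is needed --
 # but the entire mapping is still loaded and stored as a single unit:
 table.auto[3] = 'three'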

In order to have some contents of a container pickled out to disk and
others available in memory, you should use BTrees:

>>> import transaction
>>> from sets import Set
>>> from BTrees import IOBTree
>>> root = get_zodb_root_container()
>>> root['el2ets'] = el2ets = IOBTree.IOBTree()
>>> transaction.commit()
>>> el2ets[3] = Set([(1,2,3), (3,4,5), (1,3,9)])
>>> transaction.commit()

IOBTree means that it's designed to have integer keys and arbitrary
object values. OOBTree means you can use arbitrary objects (e.g.
tuples) as keys. I have read that you should avoid using instances of
Persistent subclasses as keys in BTrees unless you are very careful
about implementing __cmp__(); instead, confine your keys to objects
built from immutable Python types, e.g., strings, tuples, tuples of
strings, ...
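
For instance, a sketch with tuple keys (the values here are plain
lists, purely for illustration); a nice side effect is that BTree keys
are kept sorted, so range searches are cheap:

 from BTrees.OOBTree import OOBTree

 el2ets = OOBTree()
 el2ets[('a', 3)] = [(1, 2, 3), (3, 4, 5)]   # key: tuple of immutables
 el2ets[('b', 4)] = [(3, 4, 5), (2, 4, 9)]

 # keys(min, max) walks only the keys in that range:
 for key in el2ets.keys(('a', 0), ('a', 99)):
     print key, el2ets[key]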

If you break the persistent data down into small enough pieces and use
transaction commit and abort appropriately (that takes some
experimenting - e.g., on a read-only loop through every element of a
large BTree, I was running out of memory until I called
transaction.abort() on every iteration), you should max out your memory
usage at some reasonable amount (determined by the cache size) no
matter how big your BTree grows.
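
Here is a sketch of that read-only loop (the data.fs path, the
cache_size value, and the 'el2ets' key are assumptions for
illustration; cache_size is the per-connection object cache limit,
counted in objects):

 import transaction
 from ZODB import FileStorage, DB

 storage = FileStorage.FileStorage('data.fs')
 db = DB(storage, cache_size=400)  # cap the per-connection object cache
 conn = db.open()
 el2ets = conn.root()['el2ets']

 total = 0
 for el, ets in el2ets.items():
     total += len(ets)
     # Ending the (read-only) transaction lets the connection ghostify
     # loaded objects, keeping memory near the cache_size bound:
     transaction.abort()
 print total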



