Request for comments on a design

Peter Otten __peter__ at web.de
Sat Oct 23 04:50:53 EDT 2010


TomF wrote:

> I have a program that manipulates lots of very large indices, which I
> implement as bit vectors (via the bitarray module).   These are too
> large to keep all of them in memory so I have to come up with a way to
> cache and load them from disk as necessary.  I've been reading about
> weak references and it looks like they may be what I want.
> 
> My idea is to use a WeakValueDictionary to hold references to these
> bitarrays, so Python can decide when to garbage collect them.  I then
> keep a key-value database of them (via bsddb) on disk and load them
> when necessary.  The basic idea for accessing one of these indexes is:
> 
> _idx_to_bitvector_dict = weakref.WeakValueDictionary()

In a well-written script that cache will be almost empty: a 
WeakValueDictionary keeps an entry only as long as some other part of the 
program still holds a strong reference to the value, so each bitarray is 
dropped from the cache as soon as you let go of your last reference to it. 
You should compare the weakref approach against a least-recently-used 
caching strategy. In newer Pythons you can use collections.OrderedDict to 
implement an LRU cache, or use the functools.lru_cache decorator.
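
A rough sketch of the OrderedDict variant, reusing bitvector_from_db and 
cPickle from your code (retrieve_index_lru, _lru_cache and MAX_ENTRIES are 
made-up names; pick a limit that fits your memory budget):

import collections

MAX_ENTRIES = 100  # assumed limit, tune to taste

_lru_cache = collections.OrderedDict()

def retrieve_index_lru(idx):
    try:
        bv = _lru_cache.pop(idx)       # cache hit; remove so we can re-add as newest
    except KeyError:                   # cache miss; load and deserialize from bsddb
        bv = cPickle.loads(bitvector_from_db[idx])
        if len(_lru_cache) >= MAX_ENTRIES:
            _lru_cache.popitem(last=False)   # evict the least recently used entry
    _lru_cache[idx] = bv               # (re-)insert as most recently used
    return bv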

> def retrieve_index(idx):
>     if idx in _idx_to_bitvector_dict and _idx_to_bitvector_dict[idx] is not None:
>         return _idx_to_bitvector_dict[idx]

In a multi-threaded environment the above may still return None or raise a 
KeyError, because the entry can be garbage-collected between the membership 
test and the subsequent lookups.
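
One way to narrow that window is a single get() instead of the separate 
membership test. Roughly, reusing your names (just a sketch, not tested):

def retrieve_index(idx):
    bv = _idx_to_bitvector_dict.get(idx)    # one lookup instead of 'in' plus indexing
    if bv is None:                          # never cached, or already collected
        bv = cPickle.loads(bitvector_from_db[idx])   # reload from bsddb
        _idx_to_bitvector_dict[idx] = bv
    return bv

Two threads can still load the same index concurrently and briefly end up 
with two copies of the bitvector; if that matters, guard the whole function 
with a threading.Lock.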

>     else:  # it's been gc'd
>         bv_str = bitvector_from_db[idx]   # Load from bsddb
>         bv = cPickle.loads(bv_str)        # Deserialize the string
>         _idx_to_bitvector_dict[idx] = bv  # Re-initialize the weak dict element
>         return bv
> 
> Hopefully that's not too confusing.  Comments on this approach?  I'm
> wondering whether the weakref stuff isn't duplicating some of the
> caching that bsddb might be doing.

Even if bsddb caches the raw string somewhere, keeping the deserialised 
bitarray around saves you the cPickle.loads() overhead.

Peter



