Optimizing size of very large dictionaries

Raymond Hettinger python at rcn.com
Thu Jul 31 03:21:24 EDT 2008


> > Are there any techniques I can use to strip a dictionary data
> > structure down to the smallest memory overhead possible?

Sure.  You can build your own version of a dict using
UserDict.DictMixin.  The underlying structure can be as space
efficient as you want.
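
For instance, here's a bare-bones sketch (the class name and the
parallel-list storage are purely illustrative): DictMixin fills in
the rest of the mapping API (get, __contains__, items, and so on)
once these four methods are defined:

  from UserDict import DictMixin

  class LeanDict(DictMixin):
      # Illustrative only: parallel lists instead of a hash table.
      # Lookups are O(n), so this trades speed for a smaller
      # memory footprint.
      def __init__(self):
          self._keys = []
          self._values = []
      def __getitem__(self, key):
          try:
              return self._values[self._keys.index(key)]
          except ValueError:
              raise KeyError(key)
      def __setitem__(self, key, value):
          try:
              self._values[self._keys.index(key)] = value
          except ValueError:
              self._keys.append(key)
              self._values.append(value)
      def __delitem__(self, key):
          try:
              i = self._keys.index(key)
          except ValueError:
              raise KeyError(key)
          del self._keys[i]
          del self._values[i]
      def keys(self):
          return list(self._keys)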

FWIW, dictionaries automatically become more space efficient at
larger sizes (50,000+ records).  A dict resizes whenever it becomes
two-thirds full; below 50,000 entries the table quadruples in size,
and above that it falls back to just doubling.
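
To watch this happen, something along these lines (CPython-specific,
and sys.getsizeof needs 2.6 or later; exact numbers vary by build)
prints the dict's own size at each resize:

  import sys

  # Rough illustration: report the container's size every time
  # the underlying table grows.
  d = {}
  last = sys.getsizeof(d)
  for i in xrange(200000):
      d[i] = None
      size = sys.getsizeof(d)
      if size != last:
          print len(d), size
          last = size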

> > Background: I'm trying to identify duplicate records in very
> > large text based transaction logs. I'm detecting duplicate
> > records by creating a SHA1 checksum of each record and using this
> > checksum as a dictionary key. This works great except for several
> > files whose size is such that their associated checksum
> > dictionaries are too big for my workstation's 2G of RAM.

Tons of memory can be saved by not storing the contents of the
record.  Just make an initial pass to identify the digest values of
possible duplicates.  Then use a set to identify actual dups, but only
store the records for those whose digest is a possible duplicate:

  import collections
  from hashlib import sha1

  # First pass: count occurrences of each digest (logs must be
  # re-iterable, e.g. a list or a file that gets reopened).
  bag = collections.defaultdict(int)
  for record in logs:
      bag[sha1(record).digest()] += 1

  # Keep only the digests that occurred more than once; popitem()
  # empties the counts as we go.
  possible_dups = set()
  while bag:
      hashval, cnt = bag.popitem()
      if cnt > 1:
          possible_dups.add(hashval)

  # Second pass: store full records only for candidate digests.
  seen = set()
  for record in logs:
      if record in seen:
          print 'Duplicate:', record
      elif sha1(record).digest() in possible_dups:
          seen.add(record)

Raymond

P.S.  If the log entries are one-liners, maybe it would be better to
use the operating system's sort/uniq filters.
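
From Python, that boils down to something like this (the filename is
made up, and it assumes the usual Unix sort and uniq are on the path):

  import subprocess

  # Equivalent of:  sort transactions.log | uniq -d
  # "uniq -d" prints only the lines that occur more than once in
  # sorted input.
  p1 = subprocess.Popen(['sort', 'transactions.log'],
                        stdout=subprocess.PIPE)
  p2 = subprocess.Popen(['uniq', '-d'],
                        stdin=p1.stdout, stdout=subprocess.PIPE)
  p1.stdout.close()
  duplicates, _ = p2.communicate()
  print duplicates,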


