Optimizing size of very large dictionaries

Terry Reedy tjreedy at udel.edu
Fri Aug 1 00:57:33 EDT 2008



Raymond Hettinger wrote:

>>> Background: I'm trying to identify duplicate records in very
>>> large text-based transaction logs. I'm detecting duplicate
>>> records by creating a SHA1 checksum of each record and using this
>>> checksum as a dictionary key. This works great except for several
>>> files whose size is such that their associated checksum
>>> dictionaries are too big for my workstation's 2G of RAM.
> 
> Tons of memory can be saved by not storing the contents of the
> record.  Just make an initial pass to identify the digest values of
> possible duplicates.  Then use a set to identify actual dups but only
> store the records for those whose digest is a possible duplicate:
> 
>   bag = collections.defaultdict(int)
>   for record in logs:
>       bag[sha1(record).digest()] += 1
>   possible_dups = set()
>   while bag:
>       hashval, cnt = bag.popitem()
>       if cnt > 1:
>           possible_dups.add(hashval)

Since actual counts above 1 are not needed, I believe a bit more memory
could be saved by computing possible_dups incrementally.  The logic is a
bit trickier, but having the version above to compare with helps.

from hashlib import sha1

once, dups = set(), set()
for record in logs:
    d = sha1(record).digest()
    if d in once:           # second sighting: promote to dups
        once.remove(d)
        dups.add(d)
    elif d not in dups:     # first sighting
        once.add(d)
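
Raymond's second pass below then needs only dups in place of
possible_dups.  Continuing from the loop above, a rough sketch (assuming
logs can be iterated a second time):

seen = set()
for record in logs:
    if record in seen:
        print 'Duplicate:', record
    elif sha1(record).digest() in dups:
        seen.add(record)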

>   seen = set()
>   for record in logs:
>       if record in seen:
>           print 'Duplicate:', record
>       elif sha1(record).digest() in possible_dups:
>           seen.add(record)
> 
> Raymond
> 
> P.S.  If the log entries are one-liners, maybe it would be better to
> use the operating system's sort/uniq filters.
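
For one-line records, that filter pipeline is essentially
sort logfile | uniq -d, which prints each repeated line once.  A rough
sketch of driving it from Python (the filename below is just a stand-in):

import subprocess

# equivalent of:  sort transactions.log | uniq -d
# ('transactions.log' is a placeholder name)
sortp = subprocess.Popen(['sort', 'transactions.log'],
                         stdout=subprocess.PIPE)
uniqp = subprocess.Popen(['uniq', '-d'],
                         stdin=sortp.stdout, stdout=subprocess.PIPE)
sortp.stdout.close()        # let sort see SIGPIPE if uniq exits early
dup_lines, _ = uniqp.communicate()
print dup_lines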