Optimizing size of very large dictionaries

Gabriel Genellina gagsl-py2 at yahoo.com.ar
Wed Jul 30 22:40:21 EDT 2008


On Wed, 30 Jul 2008 21:29:39 -0300, <python at bdurham.com> wrote:

> Are there any techniques I can use to strip a dictionary data
> structure down to the smallest memory overhead possible?
>
> I'm working on a project where my available RAM is limited to 2G
> and I would like to use very large dictionaries vs. a traditional
> database.
>
> Background: I'm trying to identify duplicate records in very
> large text based transaction logs. I'm detecting duplicate
> records by creating a SHA1 checksum of each record and using this
> checksum as a dictionary key. This works great except for several
> files whose size is such that their associated checksum
> dictionaries are too big for my workstation's 2G of RAM.
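
For reference, the approach you describe might look roughly like the
sketch below (purely illustrative: it assumes one record per line, and
the function name is made up). Using the 20-byte binary digest() rather
than the 40-character hexdigest() already keeps the keys smaller.

import hashlib

def duplicate_lines(path):
    # Sketch of the SHA1-keyed approach: one record per line,
    # dictionary keyed on the 20-byte binary digest.
    seen = {}
    duplicates = []
    with open(path, 'rb') as f:
        for line in f:
            digest = hashlib.sha1(line).digest()
            if digest in seen:
                duplicates.append(line)
            else:
                seen[digest] = True
    return duplicates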

You could use a different hash algorithm that yields a smaller value  
(crc32, for example, fits in a plain integer), at the expense of more  
collisions and the extra processing time needed to verify the possible  
duplicates.
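
A sketch of that variant (again only illustrative, assuming one record
per line): the dictionary keys are crc32 integers, and on a checksum
match the earlier record is re-read from disk, so collisions do not
produce false positives.

import zlib

def find_duplicate_offsets(path):
    # crc32 value -> list of file offsets of records seen with that checksum
    seen = {}
    duplicates = []
    with open(path, 'rb') as f, open(path, 'rb') as reread:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            key = zlib.crc32(line)        # fits in a plain integer
            for earlier in seen.setdefault(key, []):
                reread.seek(earlier)
                if reread.readline() == line:
                    duplicates.append(offset)   # true duplicate
                    break
            else:
                seen[key].append(offset)        # new record or crc collision
    return duplicates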

-- 
Gabriel Genellina



