Optimizing size of very large dictionaries

M.-A. Lemburg mal at egenix.com
Thu Jul 31 06:24:57 EDT 2008


On 2008-07-31 02:29, python at bdurham.com wrote:
> Are there any techniques I can use to strip a dictionary data
> structure down to the smallest memory overhead possible?
> 
> I'm working on a project where my available RAM is limited to 2G
> and I would like to use very large dictionaries vs. a traditional
> database.
> 
> Background: I'm trying to identify duplicate records in very
> large text-based transaction logs. I'm detecting duplicate
> records by creating a SHA1 checksum of each record and using this
> checksum as a dictionary key. This works great except for several
> files that are so large that their associated checksum
> dictionaries don't fit in my workstation's 2G of RAM.
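
(For reference, the approach you describe boils down to something like
the sketch below. I'm assuming one record per line, since you didn't
post your record-splitting code, and the function name is made up.)

import hashlib

def find_duplicates(path):
    # Map each record's SHA1 digest to the line number of its first
    # occurrence; any later record with the same digest is a duplicate.
    seen = {}
    duplicates = []
    with open(path, 'rb') as log:
        for lineno, record in enumerate(log, 1):
            digest = hashlib.sha1(record).digest()
            if digest in seen:
                duplicates.append((lineno, seen[digest]))
            else:
                seen[digest] = lineno
    return duplicates

It's that in-memory "seen" dictionary which outgrows the 2G of RAM for
your larger files.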

If you don't have a problem with taking a small performance hit,
then I'd suggest having a look at mxBeeBase, which is an on-disk
dictionary implementation:

     http://www.egenix.com/products/python/mxBase/mxBeeBase/

Of course, you could also use a database table for this. Together
with a proper index, that should work as well (though it's likely
to be slower than mxBeeBase).
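
A rough sketch of the database variant, using sqlite3 from the
standard library (table and column names are just placeholders;
adapt the record splitting to your actual log format):

import hashlib
import sqlite3

def find_duplicates_db(path, dbpath='checksums.db'):
    # Same duplicate check, but the checksums live in an on-disk
    # SQLite table instead of an in-memory dictionary.  The PRIMARY
    # KEY on the checksum column provides the index for the lookups.
    conn = sqlite3.connect(dbpath)
    conn.execute('CREATE TABLE IF NOT EXISTS seen '
                 '(checksum BLOB PRIMARY KEY, lineno INTEGER)')
    duplicates = []
    with open(path, 'rb') as log:
        for lineno, record in enumerate(log, 1):
            digest = hashlib.sha1(record).digest()
            row = conn.execute('SELECT lineno FROM seen WHERE checksum = ?',
                               (digest,)).fetchone()
            if row is not None:
                duplicates.append((lineno, row[0]))
            else:
                conn.execute('INSERT INTO seen VALUES (?, ?)',
                             (digest, lineno))
    conn.commit()
    conn.close()
    return duplicates

Memory use stays flat regardless of the log size; the price is a disk
hit per record, which is the performance trade-off mentioned above.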

-- 
Marc-Andre Lemburg
eGenix.com


