Strategy for determining difference between 2 very large dictionaries

Malcolm Greene mgreene at bdurham.com
Wed Dec 24 03:04:57 EST 2008


Hi Roger,

By very large dictionary, I mean about 25M items per dictionary. Each
value is a simple integer that will never exceed 2^15.

I populate these dictionaries by parsing very large ASCII text files
containing detailed manufacturing events. From each line in my log file
I construct one or more keys and increment the numeric values associated
with these keys using timing info also extracted from each line.
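
Roughly speaking, the population step looks like the sketch below. I've
simplified the key construction, field positions, and delimiter for the
example; they are placeholders rather than the real log format.

from collections import defaultdict

def build_timing_dict(log_path):
    # Accumulate timing counts per key; missing keys start at 0.
    totals = defaultdict(int)
    with open(log_path) as log:
        for line in log:
            fields = line.rstrip('\n').split('\t')
            # Key construction simplified for the example: in the real
            # script, one or more keys are built from fields on the line.
            key = (fields[0], fields[1])
            timing = int(fields[2])   # timing info from the same line
            totals[key] += timing
    return totals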

Some of our log files are generated by separate monitoring equipment
measuring the same process. In theory, these log files should be
identical, but of course they are not. I'm looking for a way to
determine the differences between the 2 dictionaries I will create from
so-called matching sets of log files.
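
To be concrete about what I mean by "differences", something along these
lines is what I'm after (a rough sketch only): keys present in only one
dictionary, plus keys present in both but with different accumulated
values.

def diff_dicts(d1, d2):
    # Keys present in only one of the two dictionaries.
    keys1, keys2 = set(d1), set(d2)
    only_in_d1 = keys1 - keys2
    only_in_d2 = keys2 - keys1
    # Keys present in both, but whose accumulated values differ.
    changed = {k: (d1[k], d2[k]) for k in keys1 & keys2 if d1[k] != d2[k]}
    return only_in_d1, only_in_d2, changed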

At this point I don't have concerns about memory, as I'm running my
scripts on a dedicated 64-bit server with 32 GB of RAM (and with budget
approval to raise our total RAM to 64 GB if necessary).

My main concern is whether I'm applying a reasonably Pythonic approach
to my problem, e.g. am I using appropriate Python techniques and data
structures? I am also interested in reasonable techniques that will
give me the fastest execution time.

Thank you for sharing your thoughts with me.

Regards,
Malcolm


----- Original message -----
From: "Roger Binns" <rogerb at rogerbinns.com>
To: python-list at python.org
Date: Tue, 23 Dec 2008 23:26:49 -0800
Subject: Re: Strategy for determining difference between 2 very large dictionaries

python at bdurham.com wrote:
> Feedback on my proposed strategies (or better strategies) would be
> greatly appreciated.

Both strategies will work, but I'd recommend the second approach since
it uses already-tested code written by other people; the chances of it
being wrong are far lower than with new code.

You also neglected to mention what your concerns are, or even what "very
large" means.  Example concerns are memory consumption, CPU consumption,
testability, and utility of output (e.g. as a generator yielding each
result on demand, or a single list with complete results).  Some people
will think a few hundred entries is large.  My idea of large is a working
set larger than my workstation's 6 GB of memory :-)

In general the Pythonic approach is:

1 - Get the correct result
2 - Simple code (developer time is precious)
3 - Optimise for your data and environment

Step 3 is usually not needed.

Roger



