Strategy for determing difference between 2 very large dictionaries

Wed Dec 24 20:48:43 EST 2008

On Dec 24, 7:04 pm, "Malcolm Greene" <mgre... at bdurham.com> wrote:
> Hi Roger,
>
> By very large dictionary, I mean about 25M items per dictionary. Each
> item is a simple integer whose value will never exceed 2^15.

In Python-speak about dictionaries, an "item" is a tuple (key, value).
I presume from the gross difference between 25M and 2**15
-- more pedantry: 2^15 is 13 :-) -- that you mean that the values
satisfy 0 <= value <= 2**15.

>
> I populate these dictionaries by parsing very large ASCII text files
> containing detailed manufacturing events. From each line in my log file
> I construct one or more keys and increment the numeric values associated
> with these keys using timing info also extracted from each line.
>
> Some of our log files are generated by separate monitoring equipment
> measuring the same process. In theory, these log files should be
> identical, but of course they are not. I'm looking for a way to
> determine the differences between the 2 dictionaries I will create from
> so-called matching sets of log files.

You seem to refer to the dictionaries as part of your requirements,
not as part of the solution. Do you *need* the two dictionaries after
you have obtained the differences?

Let's start from the beginning: You have two log files, each of which
can be abstracted as containing lots of (k, v) data, where each k
describes an event and each v is a compressed 15-bit timestamp. You
want to know for each datum whether it is (a) in both files (b) only
in file1 (c) only in file2. Is that a fair summary?

If so, there's an alternative that doesn't need to use dictionaries.
Each file can be represented by a *list* of 32768 sets. Each set
contains the ks that happened at the corresponding time... sets[fileno]
[v].add(k). Once you've built that, you trundle through the pairs of
sets doing set differences etc.

Bear in mind that Python dicts and sets grow as required by doubling
(IIRC) in size and rehashing from old to new. Consequently you get
periodical disturbances which are not usually noticed but may be
noticeable with large amounts of data.

HTH,
John