Orders of magnitude

Christian Tismer tismer at stackless.com
Mon Mar 29 21:01:48 EST 2004


Robert Brewer wrote:

> Christian Tismer wrote:
> 
>>Robert Brewer wrote:
>>
>>>I'm dedup'ing a 10-million-record dataset, trying different approaches
>>>for building indexes. The in-memory dicts are clearly faster, but I get
>>>Memory Errors (Win2k, 512 MB RAM, 4 G virtual). Any recommendations on
>>>other ways to build a large index without slowing down by a factor of 25?
>>
>>So, here we go.
>>The attached script processes a gigabyte of data, one
>>million records, in about a minute on my machine, and
>>finds the single duplicate.
> 
> 
> Thanks, Christian! Looks like someone's been having fun. ;)
> 
> But I don't see how your solution avoids the problem (some code
> snipped):
> 
>     def process_duplicates(self):
>         for i in range(len(self.files)):
>             seen = {}
>             has = seen.has_key
>             for k in range(self.flushes):
>                 bin = pickle.load(self.files[i])
>                 for key, value in bin:
>                     if has(key):
>                         self.duplicates += 1
>                     else:
>                         seen[key] = value
> 
> ...that is, the 'seen' dictionary is what is growing too large on my
> dataset; it ends up being about 7 million items. Actually, I have three
> such dictionaries (3 different fields which I'm checking for dupes for
> each record). The hashing might cut these sizes down a bit, but I think
> I'm still going to end up having to cache 'seen' (which I call 'indices'
> in my Directory.produce() method) somehow.

Once again going over it, to make things crystal clear:
you don't keep three dicts at all.
What you pass my algorithm is a complete representation of your
record, as simple-minded as possible, as long as it covers the
full contents in a unique way.

The algorithm then swallows that into a 20-byte hash key plus a
record number. That is all you will ever need, no matter how
complicated your records are. As long as you can produce a complete
string representation, you are done.
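
A minimal sketch of that reduction step, assuming current Python
(hashlib.sha1 for the 20-byte digest; the 2004 script presumably
used the old sha module, and the helper name here is mine):

    import hashlib

    def record_key(record_number, record_fields):
        # record_fields: a sequence of strings that together cover
        # the full record contents in a unique way.
        flat = "\x00".join(record_fields).encode("utf-8")
        # A 20-byte SHA-1 digest plus the record number is all
        # that needs to be kept per record.
        return hashlib.sha1(flat).digest(), record_number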

The algorithm then partitions this set into 256 bins, which are
disjoint by construction, so examining each bin individually
solves your problem, as I said in the prior email.
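
Roughly, the binning and the per-bin check look like this (an
illustrative sketch with names of my own choosing; it builds the
bins in memory for brevity, whereas the attached script flushes
them to 256 pickle files and reloads them one at a time):

    def find_duplicates(keyed_records):
        # keyed_records: iterable of (20-byte digest, record number)
        # pairs as produced above.
        bins = [[] for _ in range(256)]
        for digest, recno in keyed_records:
            # The first byte of the digest selects one of 256 bins;
            # identical records always land in the same bin.
            bins[digest[0]].append((digest, recno))
        duplicates = []
        for bin_records in bins:
            seen = {}
            for digest, recno in bin_records:
                if digest in seen:
                    duplicates.append((seen[digest], recno))
                else:
                    seen[digest] = recno
        return duplicates

Each bin's 'seen' dict then only has to hold about 1/256th of the
keys at a time, which is what keeps the memory use bounded once the
bins live on disk.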

In terms of patterns, this is plain "divide and conquer":
simple and effective, with very few competitors.
Yes, I do take contracts.

ciao - chris

-- 
Christian Tismer             :^)   <mailto:tismer at stackless.com>
Mission Impossible 5oftware  :     Have a break! Take a ride on Python's
Johannes-Niemeyer-Weg 9a     :    *Starship* http://starship.python.net/
14109 Berlin                 :     PGP key -> http://wwwkeys.pgp.net/
work +49 30 89 09 53 34  home +49 30 802 86 56  mobile +49 173 24 18 776
PGP 0x57F3BF04       9064 F4E1 D754 C2FF 1619  305B C09C 5A3B 57F3 BF04
      whom do you want to sponsor today?   http://www.stackless.com/




