Orders of magnitude

Christian Tismer tismer at stackless.com
Mon Mar 29 14:36:38 EST 2004


Robert Brewer wrote:

> I wrote:
> 
> I'm dedup'ing a 10-million-record dataset, trying different approaches
> for building indexes. The in-memory dicts are clearly faster, but I get
> Memory Errors (Win2k, 512 MB RAM, 4 G virtual). Any recommendations on
> other ways to build a large index without slowing down by a factor of
> 25?
> 
> ...and got replies:

Here is another one.
If you believe in statistics, then you might consider
to build a dictionary of SHA digests, only. The chance
that you get a hash clash is very, very small, if you use
the sha 160 bit algorith as in the sha module.

I just did some tests on my 2 GHz machine:
Taking a million of sha digests on a 1000 char record
takes about 25 seconds.

A rough guess of memory consumption:
a dict takes 3 words per slot and wastes aproximately
two slots per entry, this is 6 words (compact),
every sha string is 20 bytes plus 5 words overhead for
the string object, which probably aligns up to 12 words.
So one entry is 18 words if you store None or a hit count as the
value. So we end up with 72 bytes per entry.

Too bad, this is about 7 million records, not 10.

While it would easily fit with a special string type, which in this
case could be extremely efficient:
- No length, since we store sha hashes, only
- no hash, because these keys are more random than anything
- we could allocate chunks of them

The dict would store no value at all, so we would have
4 words per slot with indexes into a flat pre-allocated
area of string space for the sha keys.
So we are at 9 words per entry, and you are set. :-)

It would cost me a day to implement that special dict for you.
Maybe it is at the same cost to double your main menory :-)

Of course it would be easiest to make dict and string table
the right size from the beginning, to avoid the duplication
process.

I believe the whole process is more reliable than your
computer hardware, but to be sure, you might write out
duplicates and their positions during the process, for further
testing.

cheers - chris

-- 
Christian Tismer             :^)   <mailto:tismer at stackless.com>
Mission Impossible 5oftware  :     Have a break! Take a ride on Python's
Johannes-Niemeyer-Weg 9a     :    *Starship* http://starship.python.net/
14109 Berlin                 :     PGP key -> http://wwwkeys.pgp.net/
work +49 30 89 09 53 34  home +49 30 802 86 56  mobile +49 173 24 18 776
PGP 0x57F3BF04       9064 F4E1 D754 C2FF 1619  305B C09C 5A3B 57F3 BF04
      whom do you want to sponsor today?   http://www.stackless.com/





More information about the Python-list mailing list