Orders of magnitude

Mon Mar 29 17:35:24 EST 2004

Robert Brewer wrote:

> I wrote:
> 
> I'm dedup'ing a 10-million-record dataset, trying different approaches
> for building indexes. The in-memory dicts are clearly faster, but I get
> Memory Errors (Win2k, 512 MB RAM, 4 G virtual). Any recommendations on
> other ways to build a large index without slowing down by a factor of
> 25?
> 
> ...and got replies:

And here is my last one, hopefully:

I changed the program to use just one file, with 10 million records
of about 300 bytes each.
There was one duplicate. The program identified it.
The whole process took about six minutes, using only a 300 MB
temp file of pickled bins.
It will work for you, regardless of record size.
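
For readers without the attachment, the general idea of binning record
digests to disk can be sketched roughly as below. This is not the attached
unduplicator.py. The function name find_duplicates, the parameters n_bins
and flush_at, the one-record-per-line layout and the choice of a 16-byte
BLAKE2 digest are all assumptions made for illustration.

```python
import hashlib
import pickle
import tempfile
from collections import defaultdict


def find_duplicates(path, n_bins=64, flush_at=100_000):
    """Return sorted (first_line, duplicate_line) pairs for repeated lines.

    Hypothetical sketch: assumes one record per line and n_bins <= 256.
    """
    buffers = defaultdict(list)        # bin id -> pending (digest, line_no)
    chunk_offsets = defaultdict(list)  # bin id -> offsets of pickled chunks
    duplicates = []

    with tempfile.TemporaryFile() as tmp:

        def flush(bin_id):
            # Remember where this chunk starts, then pickle it to disk.
            chunk_offsets[bin_id].append(tmp.tell())
            pickle.dump(buffers[bin_id], tmp, pickle.HIGHEST_PROTOCOL)
            buffers[bin_id] = []

        # Pass 1: digest every record and spill per-bin chunks to the temp file.
        with open(path, "rb") as f:
            for line_no, record in enumerate(f):
                record = record.rstrip(b"\r\n")
                digest = hashlib.blake2b(record, digest_size=16).digest()
                bin_id = digest[0] % n_bins
                buffers[bin_id].append((digest, line_no))
                if len(buffers[bin_id]) >= flush_at:
                    flush(bin_id)
        for bin_id in list(buffers):
            if buffers[bin_id]:
                flush(bin_id)

        # Pass 2: load one bin at a time, so only that bin's digests
        # are held in memory while looking for repeats.
        for bin_id, offsets in chunk_offsets.items():
            seen = {}
            for offset in offsets:
                tmp.seek(offset)
                for digest, line_no in pickle.load(tmp):
                    if digest in seen:
                        # Equal digests are treated as duplicates here; a
                        # careful version would re-read and compare records.
                        duplicates.append((seen[digest], line_no))
                    else:
                        seen[digest] = line_no
    return sorted(duplicates)
```

Because only fixed-size digests and line numbers go into the bins, the
temp file grows with the number of records rather than with their size.
Raising n_bins lowers the peak memory of the second pass.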

:-))

have fun -- chris

-- 
Christian Tismer             :^)   <mailto:tismer at stackless.com>
Mission Impossible 5oftware  :     Have a break! Take a ride on Python's
Johannes-Niemeyer-Weg 9a     :    *Starship* http://starship.python.net/
14109 Berlin                 :     PGP key -> http://wwwkeys.pgp.net/
work +49 30 89 09 53 34  home +49 30 802 86 56  mobile +49 173 24 18 776
PGP 0x57F3BF04       9064 F4E1 D754 C2FF 1619  305B C09C 5A3B 57F3 BF04
      whom do you want to sponsor today?   http://www.stackless.com/

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: unduplicator.py
URL: <http://mail.python.org/pipermail/python-list/attachments/20040330/156587d3/attachment.ksh>