Orders of magnitude
Christian Tismer
tismer at stackless.com
Mon Mar 29 16:42:41 EST 2004
Robert Brewer wrote:
> I wrote:
>
> I'm dedup'ing a 10-million-record dataset, trying different approaches
> for building indexes. The in-memory dicts are clearly faster, but I get
> Memory Errors (Win2k, 512 MB RAM, 4 G virtual). Any recommendations on
> other ways to build a large index without slowing down by a factor of
> 25?
So, here we go.
The attached script processes a gigabyte of data, one
million records, in about a minute on my machine, and
finds the single duplicate.
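
The attachment itself was scrubbed from the archive, so what follows is
only a rough sketch of one way to get this kind of result in bounded
memory. It is not necessarily how unduplicator.py works. The idea: scatter
the records into bucket files on disk by a checksum of their content, so
identical records always land in the same bucket, then dedup each bucket
with an ordinary in-memory set. The name find_duplicates, the n_buckets
parameter, and the assumption that records are bytes without embedded
newlines are all made up for illustration.

```python
import os
import tempfile
import zlib


def find_duplicates(records, n_buckets=64):
    """Yield each record that occurs more than once.

    Assumption: `records` is an iterable of bytes objects that contain
    no newline characters. Only one bucket is held in memory at a time.
    """
    with tempfile.TemporaryDirectory() as tmp:
        paths = [os.path.join(tmp, "bucket%03d" % i) for i in range(n_buckets)]
        files = [open(p, "wb") for p in paths]
        try:
            # Pass 1: identical records have identical checksums,
            # so they always end up in the same bucket file.
            for rec in records:
                files[zlib.crc32(rec) % n_buckets].write(rec + b"\n")
        finally:
            for f in files:
                f.close()

        # Pass 2: each bucket is small enough for a plain set.
        for path in paths:
            seen = set()
            reported = set()
            with open(path, "rb") as f:
                for line in f:
                    rec = line.rstrip(b"\n")
                    if rec in seen:
                        if rec not in reported:
                            reported.add(rec)
                            yield rec
                    else:
                        seen.add(rec)


if __name__ == "__main__":
    data = [b"record-%d" % i for i in range(100000)] + [b"record-4711"]
    for dup in find_duplicates(data):
        print(dup)
```

With n_buckets chosen so that one bucket fits comfortably in RAM, peak
memory is roughly the size of the largest bucket rather than the whole
dataset, at the cost of writing and re-reading the data once.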
ciao - chris
--
Christian Tismer :^) <mailto:tismer at stackless.com>
Mission Impossible 5oftware : Have a break! Take a ride on Python's
Johannes-Niemeyer-Weg 9a : *Starship* http://starship.python.net/
14109 Berlin : PGP key -> http://wwwkeys.pgp.net/
work +49 30 89 09 53 34 home +49 30 802 86 56 mobile +49 173 24 18 776
PGP 0x57F3BF04 9064 F4E1 D754 C2FF 1619 305B C09C 5A3B 57F3 BF04
whom do you want to sponsor today? http://www.stackless.com/
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: unduplicator.py
URL: <http://mail.python.org/pipermail/python-list/attachments/20040329/f8610a3d/attachment.ksh>