Orders of magnitude
Christian Tismer
tismer at stackless.com
Mon Mar 29 20:52:19 EST 2004
Buck Nuggets wrote:
> "Robert Brewer" <fumanchu at amor.org> wrote in message news:<mailman.38.1080542935.20120.python-list at python.org>...
>
>>I'm dedup'ing a 10-million-record dataset, trying different approaches
>>for building indexes. The in-memory dicts are clearly faster, but I get
>>Memory Errors (Win2k, 512 MB RAM, 4 G virtual). Any recommendations on
>>other ways to build a large index without slowing down by a factor of
>>25?
>
>
> In case you are interested in alternative approaches... here's how I
> typically do this:
>
> step 1: sort the file using a separate sort utility (unix sort, cygwin
> sort, etc)
>
> step 2: have a python program read in rows,
> compare each row to the prior,
> write out only one row for each set
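
[A minimal sketch of step 2 above, assuming the file has already been sorted in step 1 so that equal rows sit next to each other. The function name dedup_sorted is hypothetical, not something from the thread.]

```python
def dedup_sorted(rows):
    """Yield each distinct row once.

    Assumes the input is already sorted, so duplicates are adjacent.
    """
    previous = object()  # sentinel that never compares equal to a real row
    for row in rows:
        if row != previous:  # first row of a new run of equal rows
            yield row
            previous = row


if __name__ == "__main__":
    for row in dedup_sorted(["a", "a", "b", "c", "c", "c"]):
        print(row)
```

[Only one previous row is held in memory at a time, so memory use does not grow with the size of the file; the cost is paid in the external sort.]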
Good solution, but wayyyy too much effort.
You probably know it:
If you are looking for duplicates, and doing it by
complete ordering, then you are throwing lots of information
away, since you are not looking for neighborship, right?
That clearly means: it must be inefficient.
No offense, just trying to get you on the right track!
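
[The message does not name a concrete alternative. One possible reading, sketched here as an assumption, is to group rows by hash instead of sorting them: equal rows always land in the same bucket, and each bucket can then be deduplicated with a small in-memory set. The function name, the bucket count, and the use of zlib.crc32 are illustrative choices, and rows are assumed to be single-line strings.]

```python
import os
import tempfile
import zlib


def dedup_by_hash_partition(rows, num_buckets=64):
    """Yield distinct rows without sorting; output order is not input order."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = [os.path.join(tmp, f"bucket{i}.txt") for i in range(num_buckets)]
        buckets = [open(p, "w", encoding="utf-8", newline="\n") for p in paths]
        try:
            # Pass 1: scatter rows to bucket files by hash value.
            for row in rows:
                row = row.rstrip("\n")
                index = zlib.crc32(row.encode("utf-8")) % num_buckets
                buckets[index].write(row + "\n")
        finally:
            for f in buckets:
                f.close()

        # Pass 2: each bucket is small enough for an in-memory set.
        for path in paths:
            seen = set()
            with open(path, encoding="utf-8", newline="\n") as f:
                for line in f:
                    row = line.rstrip("\n")
                    if row not in seen:
                        seen.add(row)
                        yield row


if __name__ == "__main__":
    for row in dedup_by_hash_partition(["b", "a", "b", "c", "a"], num_buckets=4):
        print(row)
```

[Both passes are linear in the number of rows, and only one bucket's worth of distinct rows is in memory at any time. Whether this beats an external sort in practice depends on the data and the disk, so it is worth measuring.]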
ciao - chris
--
Christian Tismer :^) <mailto:tismer at stackless.com>
Mission Impossible 5oftware : Have a break! Take a ride on Python's
Johannes-Niemeyer-Weg 9a : *Starship* http://starship.python.net/
14109 Berlin : PGP key -> http://wwwkeys.pgp.net/
work +49 30 89 09 53 34 home +49 30 802 86 56 mobile +49 173 24 18 776
PGP 0x57F3BF04 9064 F4E1 D754 C2FF 1619 305B C09C 5A3B 57F3 BF04
whom do you want to sponsor today? http://www.stackless.com/