Orders of magnitude

Mon Mar 29 18:39:50 EST 2004

"Robert Brewer" <fumanchu at amor.org> wrote in message news:<mailman.38.1080542935.20120.python-list at python.org>...
> I'm dedup'ing a 10-million-record dataset, trying different approaches
> for building indexes. The in-memory dicts are clearly faster, but I get
> Memory Errors (Win2k, 512 MB RAM, 4 G virtual). Any recommendations on
> other ways to build a large index without slowing down by a factor of
> 25?

In case you are interested in alternative approaches, here's how I
typically do this:

step 1: sort the file using a separate sort utility (unix sort, cygwin
sort, etc)

step 2: have a python program read in the sorted rows,
        compare each row to the prior one,
        write out only one row for each set of duplicates
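
A rough sketch of step 2, in case it helps. It assumes the file has
already been sorted so that duplicates sit next to each other, which
means only the prior row's key has to stay in memory. The names
dedupe_sorted and dedupe_file, the file paths, and the optional key
function are made up for illustration; adjust them to your data.

```python
def dedupe_sorted(rows, key=lambda row: row):
    """Yield the first row of each run of adjacent rows with equal keys.

    Assumes `rows` is already sorted on `key`, so duplicates are adjacent.
    """
    prior = object()  # sentinel that never equals a real key
    for row in rows:
        current = key(row)
        if current != prior:
            yield row
            prior = current


def dedupe_file(src_path, dst_path, key=lambda row: row):
    """Stream a sorted text file to dst_path, dropping adjacent duplicates.

    Hypothetical helper: both paths are placeholders for your own files.
    """
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(dedupe_sorted(src, key))


if __name__ == "__main__":
    sample = ["apple\n", "apple\n", "banana\n", "cherry\n", "cherry\n"]
    print(list(dedupe_sorted(sample)))
```

If whole lines are the unit of comparison, "sort -u" may cover both
steps at once; the python pass earns its keep when duplicates are
defined by only part of the row, in which case the sort in step 1 has
to use that same key.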

ks