Sorting in huge files

Paul Rubin http
Thu Dec 9 06:01:59 EST 2004


"Paul" <paulolivier at gmail.com> writes:
> If you really want to know, my entries are elliptic curves and my
> hashing function is an attempt at mapping them to their Serre resdual
> representation modulo a given prime p.
> 
> Now, for you to tell me something relevant about the data that I don't
> already know from some big-shot conjecture/theorem would probably
> require that you read a few grad school books on number theory. Happy?

If you know that much math, you shouldn't have any trouble reading
Knuth "The Art Of Computer Programming" vol 3 "Sorting and Searching",
for info on how to do external sorts.  External sorting means sorting
files that are too large to fit in memory.  Basically you read in
chunks of the file that will fit in memory, sort the chunks in memory
and write the sorted chunks to disk.  Then you merge the sorted
chunks.  Since in the old days a huge amount of computer resources
went into doing these operations, whole books (like Knuth's) have been
written about how to handle all the details efficiently.

If you only want to do this sort once, your simplest approach (if
you're using Unix/Linux) is to use the Unix/Linux "sort" command,
which does an external sort.  You'll have to first crunch your data
into a format that the "sort" command can handle, then run "sort",
then uncrunch if appropriate.



More information about the Python-list mailing list