Sorting in huge files

Scott David Daniels Scott.Daniels at Acm.Org
Tue Dec 7 17:19:34 EST 2004


Paul wrote:
> I have a large database of 15GB, consisting of 10^8 entries of
> approximately 100 bytes each. 
That works out to about 10 gigabytes of data (10**8 * 100 bytes).
> A few thoughts on this:
> - Space is not going to be an issue. I have a Tb available.
I presume this is disk space, not memory.  If you do have a Tb of RAM
and you are using CrayPython, just do it.  From the above, I'd figure
(very hand-wavishly) that you want about 20 GB of RAM (maybe as little
as 15).  That's a lot to me.
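
(Rough arithmetic behind that guess; the per-record overhead figure
below is an assumption, not a measurement:)

     raw_bytes = 10**8 * 100         # ~10 GB of record data on disk
     # In-memory Python strings/tuples add per-object overhead
     # (call it 50-100 bytes apiece), so fully loaded:
     in_ram = 10**8 * (100 + 75)     # roughly 17-18 GB of RAM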

> Any comments?
I'd draw a random sample of the data to work with.  The Python Cookbook
has some good recipes for drawing an unbiased sample.  If you work with
about 10,000 elements, you can probably still see patterns, but can
try all kinds of weird stuff experimentally.  Once you have a feel for
key distributions, cut the distribution into key ranges that "guarantee"
(not really, but in a statistical sense) the data will fit comfortably
in memory (say a third of your available RAM), and process each range
separately.  The sort should then be duck soup:

     ordered_dest = open('ordered_by_key1', 'w')
     for interval in ranges_in_order():
         data = extract_data(interval)
         data.sort(key=key1)
         for entry in data:
             ordered_dest.write(entry)
         data = []  # drop the old interval's data before the next extract
     ordered_dest.close()
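
For the sampling step above, here is a minimal one-pass reservoir
sampling sketch; the fixed 100-byte record size and the file name are
illustrative assumptions, not details from Paul's post:

     import random

     def draw_sample(filename, record_size=100, sample_size=10000):
         # Reservoir sampling (Algorithm R): one pass over the file,
         # constant memory, every record equally likely to be kept.
         sample = []
         seen = 0
         source = open(filename, 'rb')
         while True:
             record = source.read(record_size)
             if not record:
                 break
             seen += 1
             if len(sample) < sample_size:
                 sample.append(record)
             else:
                 slot = random.randrange(seen)
                 if slot < sample_size:
                     sample[slot] = record
         source.close()
         return sample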

> How long should I hope this sort will take? It will sound weird, but I
> actually have 12 different key maps and I want to sort this with
> respect to each map, so I will have to sort 12 times.

Well, good sorting is O(N * log(N)), so you should be able to extrapolate
the total time from a sample timing (once you get past cache effects).
If you use a variant of the above, simply time a few of the intervals
you've chosen and do a nice linear extrapolation.
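
A rough sketch of that extrapolation (it reuses the hypothetical
extract_data / key1 / ranges_in_order helpers from the loop above, and
ignores the time spent writing the output):

     import time

     def seconds_per_record(sample_intervals):
         # Time a few representative intervals end to end and
         # report a per-record rate to extrapolate from.
         records = 0
         elapsed = 0.0
         for interval in sample_intervals:
             start = time.time()
             data = extract_data(interval)
             data.sort(key=key1)
             elapsed += time.time() - start
             records += len(data)
         return elapsed / records

     # e.g. seconds_per_record(list(ranges_in_order())[:3]) * 10**8
     # is a ballpark for one full pass; multiply by 12 key maps.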

--Scott David Daniels
Scott.Daniels at Acm.Org


