random writing access to a file in Python

Paul Rubin
Sat Aug 26 20:34:29 EDT 2006


Claudio Grondi <claudio.grondi at freenet.de> writes:
> Does it mean, that in case of very large files:
>    the size of available memory for the sorting operation (making it
> possible to work on larger chunks of data in memory) has less impact
> on the actual sorting speed than
>    the speed of the data transfer from/to storage device(s)

Transfer speed and parallelism matter most, if the CPU can keep up.
Parallelism means you want multiple drives that you can do i/o to
simultaneously without having to seek.  Random access helps simulate
this, but only somewhat.  Large database installations always want
lots of parallel disks for similar reasons.

The basic method of sorting large files has traditionally been the
following (a rough sketch in Python follows step 3):

1) Read the file in pieces p(1),p(2),...,p(n) where each piece is as
big as will fit in memory.  Sort each piece with your favorite
in-memory sort, and write out the sorted pieces, which we'll designate
r(1,1),...,r(1,n).  The r(a,b)'s are called "runs".  Use multiple
output devices (tape or disk drives) to write out the runs, i.e. each
output tape will contain its own bunch of runs.

2) Read back a bunch of the runs in parallel, i.e. say you have
written r(1,1),r(1,2),...,r(1,5) to five separate tape drives.  Read
them back simultaneously and merge them (this requires very little
main memory) into a new "second level" run called r(2,1).  Similarly
merge r(1,6),...,r(1,10) into the second level run r(2,2), and so forth.

3) Merge the second level runs into third level runs, and so forth
recursively until there's only one giant run.  That is the sorted
output.
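Here's a minimal Python sketch of those three steps, assuming the input
is a text file of newline-terminated records to be sorted
lexicographically.  The helper names (make_runs, merge_runs), the chunk
size, the merge order, and the file names are made up for illustration;
it uses temporary files on one disk rather than separate drives, so it
shows the logic but none of the parallelism, and it leans on the
streaming k-way merge in the standard heapq module:

    import heapq
    import itertools
    import os
    import tempfile

    def make_runs(path, max_lines=1_000_000):
        """Step 1: sort memory-sized chunks and write each out as a run file."""
        runs = []
        with open(path) as f:
            while True:
                chunk = list(itertools.islice(f, max_lines))
                if not chunk:
                    break
                chunk.sort()
                run = tempfile.NamedTemporaryFile("w", delete=False, suffix=".run")
                run.writelines(chunk)
                run.close()
                runs.append(run.name)
        return runs

    def merge_runs(runs, out_path, order=8):
        """Steps 2 and 3: merge up to `order` runs at a time until one is left."""
        while len(runs) > 1:
            next_level = []
            for i in range(0, len(runs), order):
                group = runs[i:i + order]
                inputs = [open(name) for name in group]
                out = tempfile.NamedTemporaryFile("w", delete=False, suffix=".run")
                # heapq.merge streams the k-way merge, so memory use stays tiny
                out.writelines(heapq.merge(*inputs))
                out.close()
                for f in inputs:
                    f.close()
                for name in group:
                    os.remove(name)
                next_level.append(out.name)
            runs = next_level
        os.replace(runs[0], out_path)

    runs = make_runs("big_input.txt")        # hypothetical input file name
    merge_runs(runs, "sorted_output.txt")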

The amount of memory determines the size of the initial "first level"
runs, which determines how many recursive merges are needed, so more
memory helps.

The number of levels of recursion also depends on the "merge order",
i.e. the number of runs merged at each merge step.  If the merge order
is M, you really want each merge to read from M physical input devices
that are separate from the output device(s).
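To put rough numbers on that (the figures below are invented purely for
illustration): each merge level divides the number of runs by the merge
order, so the number of levels is about the logarithm, base merge
order, of the number of first-level runs:

    # Back-of-the-envelope pass count: a 100 GB file, 1 GB of sort
    # memory, merging 8 runs at a time.  All numbers are illustrative.
    import math

    file_size   = 100 * 2**30   # bytes to sort
    memory      = 1 * 2**30     # bytes available for the in-memory sort
    merge_order = 8             # runs merged at each merge step

    first_level_runs = math.ceil(file_size / memory)            # step 1 output
    merge_levels = math.ceil(math.log(first_level_runs, merge_order))
    print(first_level_runs, merge_levels)   # -> 100 runs, 3 merge levels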

There are obviously a lot of optimizations that the above description
is missing, and there's a lot of strategy about how to organize the
merges, e.g. if you have 8 drives, do you use 4 for input and 4 for
output, or 5 in and 3 out, or what?  

Is this some kind of production deployment problem you're working on,
or do you just have a big pile of data you need to sort once?  If you
need to deploy across multiple machines (i.e. you can't just buy a
giant machine if you need to do this sorting frequently), then I'd
suggest reading up on the subject, starting with Knuth vol 3.  


