random writing access to a file in Python

Claudio Grondi claudio.grondi at freenet.de
Sun Aug 27 03:44:30 EDT 2006


Paul Rubin wrote:
> Claudio Grondi <claudio.grondi at freenet.de> writes:
> 
>>Does it mean, that in case of very large files:
>>   the size of available memory for the sorting operation (making it
>>possible to work on larger chunks of data in memory) has less impact
>>on the actual sorting speed than
>>   the speed of the data transfer from/to storage device(s)
> 
> 
> Transfer speed and parallelism matters most, if the cpu can keep up.
> Parallelism means you want multiple drives that you can do i/o to
> simultaneously without having to seek.  Random access helps simulate
> this, but only somewhat.  Large database installations always want
> lots of parallel disks for similar reasons to this.
> 
> The basic method of sorting large files has traditionally been:
> 
> 1) Read the file in pieces p(1),p(2),...,p(n) where each piece is as
> big as will fit in memory.  Sort each piece with your favorite
> in-memory sort, and write out sorted pieces that we'll designate
> r(1,1),...r(1,n).  The r(a,b)'s are called "runs".  Use multiple
> output devices (tape or disk drives) to write out the runs, i.e. each
> output tape will contain its own bunch of runs.
> 
> 2) Read back a bunch of the runs in parallel, i.e. say you have
> written r(1,1),r(1,2),...,r(1,5) to five separate tape drives.  Read
> them back simultaneously and merge them (requires very little external
> memory) into a new "second level" run called r(2,1).  Similarly merge
> r(1,6),...,r(1,10) to the second level run r(2,2), and so forth.
> 
> 3) Merge the second level runs into third level runs, and so forth
> recursively until there's only one giant run.  That is the sorted
> output.
> 
> The amount of memory determines the size of the initial "first level"
> runs, which determines how many recursive merges are needed, so more
> memory helps.
> 
> The number of levels of recursion also depends on the "merge order",
> i.e. the number of runs merged at each merge step.  You really want
> each merge to read from M physical input devices that are separate
> from the output device(s).  
> 
> There are obviously a lot of optimizations that the above description
> is missing, and there's a lot of strategy about how to organize the
> merges, e.g. if you have 8 drives, do you use 4 for input and 4 for
> output, or 5 in and 3 out, or what?  
The Windows XP SP2 command-line 'sort' (sorting four gigabytes of 20-byte 
records took 12 CPU hours and 18 wall-clock hours) has, from what I could 
observe in the Task Manager, done the job in only two 'copying' passes:
   1. first to a temporary file,
   2. then to the final file.
In the second pass the chunks processed between reads and writes were up to 
tens of megabytes in size, whereas in the first pass they were up to hundreds 
of megabytes. I suppose that the procedure behind it was roughly as follows 
(a sketch of the merge pass follows the list):
1. decide on the size of the chunks the file is split into and on the amount 
of memory to use
2. sort each chunk in memory and write it to the temporary file
3. decide on the size of the read buffers for merge-sorting the chunks into 
the final file, so that they all fit into the 300 MByte of memory used
4. open as many 'pipes' as there are chunks, filling up all of the pipe 
buffers whenever one of them runs out of data
5. continuously switch between reading from and writing to the hard disk: 
write the results of the merge sort to the final file until one of the 
buffers runs out of data, then fill up all of the buffers for the next cycle 
(concluded from the observed scattered reading and smooth writing)

> 
> Is this some kind of production deployment problem you're working on,
> or do you just have a big pile of data you need to sort once?  
The latter.
One of the intermediate reasons for doing it is the attempt to get a better 
intuitive understanding of the hardware and software limits of brute-force 
approaches to solving problems in the areas of AI, language processing and 
data compression.

> If you
> need to deploy across multiple machines (i.e. you can't just buy a
> giant machine if you need to do this sorting frequently), then I'd
> suggest reading up on the subject, starting with Knuth vol 3.  

Thank you very much for your detailed response.

Claudio Grondi


