Sorting Large File (Code/Performance)

Nicko usenet at nicko.org
Fri Jan 25 13:45:02 EST 2008


On Jan 24, 9:26 pm, Ira.Ko... at gmail.com wrote:
> > If you really have a 2GB file and only 2GB of RAM, I suggest that you don't hold your breath.
>
> I am limited with resources. Unfortunately.

As long as you have at least as much spare disc space as you need to
hold a copy of the file, this is not too hard.  Split the file into
chunks that are small enough to fit in memory, sort each chunk, write
it to a temporary file, and then merge the sorted chunks back
together.  Below is a cheap and cheesy outline of code to do this,
from which you can start.

For files which are hugely larger than your available memory you can
do this recursively, but for files which are 10 to 100 times too big
the single merge pass in the code below will probably work just fine.
The complexity is technically O(n * (log(c) + n/c)), where n is the
size of the input and c is the chunk size; once n/c (the number of
chunks) exceeds log(c), the cost of merging the chunks starts to
dominate, though a recursive version would be slowed by needing a lot
more disc access.
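
To put rough numbers on that: with the 100,000-line chunks used below,
log2(c) is about 17, so the merge term only starts to dominate once the
input runs past about 17 chunks, i.e. somewhere around 1.7 million
lines.  Beyond that you may want the heap-based merge sketched after
the code, which keeps the per-line merge cost at the log of the number
of chunks rather than linear in it.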

#!/usr/bin/env python
from itertools import islice
from tempfile import TemporaryFile
import sys

# Tweak this number to fill your memory
lines_per_chunk = 100000

chunkfiles = []    # one open temporary file per sorted chunk
mergechunks = []   # (next line, chunk index) pairs for unfinished chunks

while True:
    # Read the next chunk of lines; islice gives back [] at end of input
    chunk = list(islice(sys.stdin, lines_per_chunk))
    if not chunk:
        break
    chunk.sort()
    # Spill the sorted chunk to a temporary file and rewind it for merging;
    # text mode so the string lines also round-trip on Python 3
    f = TemporaryFile(mode="w+")
    f.writelines(chunk)
    f.seek(0)
    # Prime the merge list with the chunk's first line.  Using readline()
    # here (rather than chunk[0]) advances the file past that line, so it
    # is not written out a second time by the merge loop below.
    mergechunks.append((f.readline(), len(chunkfiles)))
    chunkfiles.append(f)

while mergechunks:
    # Pick the smallest pending line; min() is O(n) in the number of chunks
    (line, fileindex) = min(mergechunks)
    mergechunks.remove((line, fileindex))
    sys.stdout.write(line)
    # Refill from the chunk file that supplied the winning line
    nextline = chunkfiles[fileindex].readline()
    if nextline == "":
        # This chunk is exhausted
        chunkfiles[fileindex].close()
    else:
        mergechunks.append((nextline, fileindex))
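
If you end up with a lot of chunks, the min()/remove() pair above makes
each output line cost O(k) in the number of chunks k.  Something like
the following does the same merge with the standard heapq module, which
keeps the per-line cost at O(log k); it's only a sketch, reusing the
mergechunks and chunkfiles lists built above, and it assumes you add
"import heapq" at the top of the script:

heapq.heapify(mergechunks)      # one-off O(k) heap construction
while mergechunks:
    # The smallest pending line comes off the heap in O(log k)
    (line, fileindex) = heapq.heappop(mergechunks)
    sys.stdout.write(line)
    # Refill from the chunk file that supplied the winning line
    nextline = chunkfiles[fileindex].readline()
    if nextline == "":
        chunkfiles[fileindex].close()
    else:
        heapq.heappush(mergechunks, (nextline, fileindex))

Either version reads stdin and writes stdout, so something like
"python sortbig.py < big.txt > big.sorted" will drive it (the file
names here are just placeholders).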



