Sorting Large File (Code/Performance)

John Nagle nagle at animats.com
Thu Jan 24 19:13:30 EST 2008


Ira.Kovac at gmail.com wrote:
> Thanks to all who replied. It's very appreciated.
> 
> Yes, I had to double check line counts and the number of lines is ~16
> million (instead of stated 1.6B).

    OK, that's not bad at all.

    You have a few options:

    - Get enough memory to do the sort with an in-memory sort, like UNIX "sort"
	or Python's built-in "sorted" function.
    - Thrash; in-memory sorts do very badly with virtual memory, but eventually
	they finish.  Might take many hours.
    - Get a serious disk-to-disk sort program. (See "http://www.ordinal.com/".
	There's a free 30-day trial.  It can probably sort your data
	in about a minute.)
    - Load the data into a database like MySQL and let it do the work.
	This is slow if done wrong, but OK if done right.
    - Write a distribution sort yourself.  Fan out the incoming file into
	one file for each first letter, sort each subfile, merge the
	results.
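
The last option above can be sketched in a few lines of Python.  This is
a minimal illustration, not production code: it buckets by the first
character only, sorts lines as raw strings, and the file names are
made up for the example.  Because every line in bucket "a" sorts before
every line in bucket "b", concatenating the sorted buckets in key order
is the whole "merge" step.

```python
# Sketch of the distribution (fan-out) sort described above.
# Assumes plain-text lines compared as strings; paths are illustrative.
import os
import tempfile

def distribution_sort(in_path, out_path):
    # Pass 1: fan the input out into one temp file per first character.
    buckets = {}
    tmpdir = tempfile.mkdtemp()
    with open(in_path) as infile:
        for line in infile:
            key = line[:1]
            if key not in buckets:
                name = os.path.join(tmpdir, "bucket%d" % ord(key))
                buckets[key] = open(name, "w+")
            buckets[key].write(line)
    # Pass 2: sort each bucket in memory; since the buckets cover
    # disjoint key ranges, appending them in key order yields the
    # fully sorted output.
    with open(out_path, "w") as outfile:
        for key in sorted(buckets):
            bucket = buckets[key]
            bucket.seek(0)
            outfile.writelines(sorted(bucket))
            bucket.close()
```

Only one bucket's worth of data is ever in memory at a time, which is
the point of the technique.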

With DRAM at $64 for 4GB, I'd suggest just getting more memory and using
a standard in-memory sort.
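
For reference, the whole in-memory approach is only a few lines; this is
a minimal sketch with illustrative file names, assuming the file fits
in RAM and that plain string order is the sort order you want:

```python
# Read all lines, sort in memory, write them back out.
def sort_file(in_path, out_path):
    with open(in_path) as f:
        lines = f.readlines()
    lines.sort()  # Python's built-in Timsort, O(n log n)
    with open(out_path, "w") as f:
        f.writelines(lines)
```

At ~16 million lines this stays comfortably within a few GB of RAM for
typical line lengths.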

				John Nagle



More information about the Python-list mailing list