Sorting Large File (Code/Performance)
John Nagle
nagle at animats.com
Thu Jan 24 19:13:30 EST 2008
Ira.Kovac at gmail.com wrote:
> Thanks to all who replied. It's very appreciated.
>
> Yes, I had to double check line counts and the number of lines is ~16
> million (instead of stated 1.6B).
OK, that's not bad at all.
You have a few options:
- Get enough memory to do the sort with an in-memory sort, like UNIX "sort"
  or Python's built-in list.sort().
- Thrash: if the data doesn't fit in RAM, an in-memory sort pages heavily
  against virtual memory. It will eventually finish, but it might take
  many hours.
- Get a serious disk-to-disk sort program. (See "http://www.ordinal.com/".
There's a free 30-day trial. It can probably sort your data
in about a minute.)
- Load the data into a database like MySQL and let it do the work.
This is slow if done wrong, but OK if done right.
- Write a distribution sort yourself. Fan out the incoming file into
one file for each first letter, sort each subfile, merge the
results.
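A distribution sort along those lines can be sketched in a few dozen lines of Python. This is just a sketch, not production code: the file names are placeholders, it buckets on the first byte only, and it assumes every line ends with a newline.

```python
import os
import tempfile

def distribution_sort(infile, outfile):
    """Fan the input out into one temp file per first byte, sort each
    subfile in memory, then write the subfiles out in key order."""
    tmpdir = tempfile.mkdtemp()
    buckets = {}                          # first byte -> open subfile
    with open(infile) as f:
        for line in f:
            key = line[:1]
            if key not in buckets:
                path = os.path.join(tmpdir, "%d.tmp" % ord(key))
                buckets[key] = open(path, "w+")
            buckets[key].write(line)
    with open(outfile, "w") as out:
        for key in sorted(buckets):       # buckets in key order
            sub = buckets[key]
            sub.seek(0)
            out.writelines(sorted(sub))   # each subfile sorts in memory
            sub.close()
            os.remove(sub.name)
```

Since no subfile is larger than the whole input, each per-bucket sort fits in memory even when the full file wouldn't; a skewed key distribution (most lines starting with the same byte) defeats this, in which case you'd fan out on the first two bytes instead.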
With DRAM at $64 for 4GB, I'd suggest just getting more memory and using
a standard in-memory sort.
John Nagle