Sorting Large File (Code/Performance)

Albert van der Horst albert at spenarnc.xs4all.nl
Sat Feb 2 09:40:59 EST 2008


In article <4799285d$0$36346$742ec2ed at news.sonic.net>,
John Nagle  <nagle at animats.com> wrote:
>Ira.Kovac at gmail.com wrote:
>> Thanks to all who replied. It's very appreciated.
>>
>> Yes, I had to double check line counts and the number of lines is ~16
>> million (instead of stated 1.6B).
>
>    OK, that's not bad at all.
>
>    You have a few options:
>
>    - Get enough memory to do the sort with an in-memory sort, like UNIX "sort"
>       or Python's "sort" function.

There is no Unix sort besides the executable binary /usr/bin/sort.
There is qsort() in the standard C library,
but you would have to write a C program around it.
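For the in-memory option John mentions, Python itself suffices; no C wrapper
needed. A minimal sketch (the function name and file paths are my own, made up
for illustration):

```python
def sort_file(in_path, out_path):
    # Read the whole file, sort with Python's built-in Timsort,
    # write it back out. The entire file must fit in memory.
    with open(in_path) as f:
        lines = f.readlines()
    lines.sort()
    with open(out_path, "w") as f:
        f.writelines(lines)
```

At ~16 million lines that is well within reach of an ordinary machine.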

>    - Thrash; in-memory sorts do very badly with virtual memory, but eventually
>       they finish.  Might take many hours.

For a long time I thought that too, but of course it doesn't apply to merge sorts.
Then I heard reputable sources say that quicksort has good "locality".
Thinking about it, I find that more than plausible.
Now I wonder whether even thrashing to disk (page faults) would really tank
a quicksort-based application. Why would it be worse than a batch sort?

>    - Get a serious disk-to-disk sort program. (See "http://www.ordinal.com/".
>       There's a free 30-day trial.  It can probably sort your data
>       in about a minute.)

This is the way to go, but you don't need or want "trial" software for it.

>    - Load the data into a database like MySQL and let it do the work.
>       This is slow if done wrong, but OK if done right.
>    - Write a distribution sort yourself.  Fan out the incoming file into
>       one file for each first letter, sort each subfile, merge the
>       results.
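That last suggestion is easy enough to sketch in Python. A rough illustration,
assuming single-byte first characters and that all lines sharing a first byte
fit in memory (the function name and bucket-file naming are mine):

```python
import os

def distribution_sort(in_path, out_path, tmp_dir):
    # Pass 1: fan the input out into one bucket file per first byte.
    handles = {}
    with open(in_path) as src:
        for line in src:
            key = line[:1]
            if key not in handles:
                path = os.path.join(tmp_dir, "bucket_%02x" % ord(key))
                handles[key] = open(path, "w")
            handles[key].write(line)
    for h in handles.values():
        h.close()
    # Pass 2: sort each (now small) bucket in memory and
    # concatenate the buckets in key order.
    with open(out_path, "w") as dst:
        for key in sorted(handles):
            path = os.path.join(tmp_dir, "bucket_%02x" % ord(key))
            with open(path) as f:
                dst.writelines(sorted(f))
```

Only one bucket has to fit in memory at a time, so this scales far beyond RAM.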


>
>With DRAM at $64 for 4GB, I'd suggest just getting more memory and using
>a standard in-memory sort.

You give the mistaken impression that normal
sort programs can only sort what can be held in memory. This is
only true for MS-DOS's sort.exe, which was an abomination even in 1983.

All sort programs on Unixes, way back in the 1980s, could handle
files limited only by disk size.
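They do it with an external merge sort: sort runs in memory, spill them to
disk, then merge. A minimal sketch of the same idea in Python (function name
and chunk size are my own choices, not from any of the tools above):

```python
import heapq
import os
import tempfile

def external_sort(in_path, out_path, chunk_lines=100000):
    # Phase 1: sort fixed-size runs in memory, spill each to a temp file.
    runs = []
    with open(in_path) as src:
        while True:
            chunk = [line for _, line in zip(range(chunk_lines), src)]
            if not chunk:
                break
            chunk.sort()
            tmp = tempfile.NamedTemporaryFile("w+", delete=False)
            tmp.writelines(chunk)
            tmp.seek(0)
            runs.append(tmp)
    # Phase 2: k-way merge of the sorted runs. Memory use is one
    # buffered line per run, regardless of total file size.
    with open(out_path, "w") as dst:
        dst.writelines(heapq.merge(*runs))
    for tmp in runs:
        tmp.close()
        os.unlink(tmp.name)
```

Disk, not RAM, is the limit, which is exactly the 1980s Unix sort behaviour.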

With respect to Microsoft OSes:
Despite its age and using only 1 Mbyte of internal memory,
my ssort.exe can handle the OP's problem with ease.

It sorted a 12 Mbyte file in minutes even in 1993.

http://home.hccnet.nl/a.w.m.van.der.horst/ssort.html

This is a plain executable, so no hassles and no trial.
It is GPL, so you can even hack the source, if you fancy C++.

There are a lot of Unix-compatible toolkits available for
Windows nowadays. Install one; you can spare a dozen megabytes,
can't you?

>
>                               John Nagle

Groetjes Albert

--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- like all pyramid schemes -- ultimately falters.
albert at spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst



