Sorting in huge files

François Pinard pinard at iro.umontreal.ca
Thu Dec 9 18:24:08 EST 2004


[Paul]

> Thanks! I definitely didn't want to go into any elaborate programming
> for this, and the Unix sort is perfect for this.  It sorted a tenth of
> my data in about 8 min, which is entirely satisfactory to me (assuming
> it will take ~ 20 times more to do the whole thing).  Your answer
> greatly helped!  Paul

I was going to reply a bit more elaborately, but if you are happy with
`sort', that's quite nice: you do have a solution. :-)

One of my old cases was a bit more difficult, in that the comparison
algorithm was not _simple_ enough to translate easily into a key
computed once and saved into each record.  The comparison had to stay
live throughout the sort, and the whole sort was somehow co-routined
with the application.  I wrote a dual-tournament sort aimed at a
polyphase merge, and ported the algorithm across machines and languages.
Not so long ago, I finally saved the whole algorithm in Python for the
record, but found it too slow to be practical for huge tasks.
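The distinction can be sketched in modern Python (the records and the
ordering below are hypothetical stand-ins, not the actual application):
when the ordering reduces to a key computed once per record, `sorted'
with key= is cheap; when it does not, the comparison must stay live
during the sort, e.g. through functools.cmp_to_key.

```python
from functools import cmp_to_key

records = ["pear", "Apple", "banana"]

# Case 1: the ordering reduces to a key computed once per record
# and effectively saved with it for the duration of the sort.
by_key = sorted(records, key=str.lower)

# Case 2: the comparison stays live throughout the sort.  This toy
# function stands in for a comparison too complex to precompute as
# a single key.
def live_compare(a, b):
    ka, kb = a.lower(), b.lower()
    return (ka > kb) - (ka < kb)

by_cmp = sorted(records, key=cmp_to_key(live_compare))

assert by_key == by_cmp == ["Apple", "banana", "pear"]
```

The key= form calls the key function once per record; the cmp_to_key
form calls the comparison on every pair the sort examines, which is why
a precomputable key is preferable whenever one exists.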

Nowadays, given the same kind of problem, the size and speed of machines
is such that I would dare to try Timsort (the standard Python sort) over
a mmap-ed file, right from within Python.  My guess is that it could
work quite reasonably, while requiring almost no development time.
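A minimal sketch of that idea, assuming a plain text file of
newline-separated records (the function name and file layout are my
own, not from the original post):

```python
import mmap

def sort_file_lines(path):
    """Memory-map a text file and sort its lines with Python's
    built-in sort (Timsort).  Returns the sorted lines as bytes."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            lines = mm.read().splitlines()
    lines.sort()  # Timsort, comparing raw bytes
    return lines
```

For a file that fits in address space this keeps the development time
close to zero; records would still need a decorate/sort step if the
desired order is not the raw byte order.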

-- 
François Pinard   http://pinard.progiciels-bpi.ca
