shuffle the lines of a large file

Nick Craig-Wood nick at
Tue Mar 8 06:30:02 EST 2005

Raymond Hettinger <vze4rx4y at> wrote:
> >>> from random import random
> >>> out = open('corpus.decorated', 'w')
> >>> for line in open('corpus.uniq'):
>          print >> out, '%.14f %s' % (random(), line),
> >>> out.close()
>  sort corpus.decorated | cut -c 18- > corpus.randomized

Very good solution!

Sort is truly excellent at very large datasets.  If you give it a file
bigger than memory then it divides it up into temporary files of
memory size, sorts each one, then merges all the temporary files back
together.
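
For anyone curious what that split/sort/merge approach looks like, here
is a rough sketch in Python.  It is only an illustration of the general
technique, not how sort itself is implemented: the function name
external_sort and the lines_per_chunk and tmp_dir parameters are made up
for this example, chunks are sized by line count rather than by memory,
and lines are compared by code point rather than by locale.

```python
import heapq
import os
import tempfile
from itertools import islice


def external_sort(src_path, dst_path, lines_per_chunk=100_000, tmp_dir=None):
    """Sort the lines of src_path into dst_path using bounded memory.

    Hypothetical helper for illustration; not part of any library.
    """
    run_paths = []
    try:
        # Pass 1: read a bounded number of lines, sort them in memory,
        # and write each sorted run to its own temporary file.
        with open(src_path) as src:
            while True:
                chunk = list(islice(src, lines_per_chunk))
                if not chunk:
                    break
                # Make sure the last line ends with a newline so that
                # lines from different runs never get glued together.
                if not chunk[-1].endswith("\n"):
                    chunk[-1] += "\n"
                chunk.sort()
                fd, path = tempfile.mkstemp(suffix=".run", dir=tmp_dir, text=True)
                with os.fdopen(fd, "w") as run:
                    run.writelines(chunk)
                run_paths.append(path)

        # Pass 2: merge the sorted runs.  heapq.merge is lazy, so only
        # one line per run is held in memory at a time.  All runs are
        # opened at once here; a real tool would merge in batches to
        # stay under the open-file limit.
        runs = [open(path) for path in run_paths]
        try:
            with open(dst_path, "w") as dst:
                dst.writelines(heapq.merge(*runs))
        finally:
            for run in runs:
                run.close()
    finally:
        # Always clean up the temporary run files.
        for path in run_paths:
            os.remove(path)
```

Something like external_sort('corpus.decorated', 'corpus.sorted') would
then stand in for the sort step of the pipeline above, though in
practice sort is the better tool for the job.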

You can tune the memory sort uses for in-memory sorting with
--buffer-size.  It's pretty good at auto-tuning though.
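
If you are driving sort from a Python script, a small wrapper along
these lines might do.  It is a sketch that assumes GNU sort is on the
PATH; the helper names build_sort_command and run_sort, and the example
size and directory below, are invented for illustration.

```python
import subprocess


def build_sort_command(src_path, dst_path, buffer_size=None, tmp_dir=None):
    """Return an argv list for GNU sort; nothing is executed here.

    Hypothetical helper for illustration; not part of any library.
    """
    cmd = ["sort"]
    if buffer_size is not None:
        # e.g. "512M"; GNU sort accepts size suffixes such as K, M and G.
        cmd.append(f"--buffer-size={buffer_size}")
    if tmp_dir is not None:
        # Directory in which sort writes its temporary files.
        cmd.append(f"--temporary-directory={tmp_dir}")
    # "--" stops option parsing so an odd file name is not read as a flag.
    cmd += [f"--output={dst_path}", "--", src_path]
    return cmd


def run_sort(src_path, dst_path, **options):
    """Run sort and raise CalledProcessError if it exits non-zero."""
    subprocess.run(build_sort_command(src_path, dst_path, **options), check=True)
```

For example, run_sort('corpus.decorated', 'corpus.sorted',
buffer_size='512M', tmp_dir='/var/tmp') would sort the decorated file
with a 512 MB buffer, where both values are placeholders to adjust for
your machine.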

You may want to set --temporary-directory also to save filling up your

In a previous job I did a lot of stuff with Usenet news and was
forever blowing up the server with scripts which used too much memory.
sort was always the solution!

Nick Craig-Wood <nick at> --
