shuffle the lines of a large file

Tue Mar 8 06:30:02 EST 2005

Raymond Hettinger <vze4rx4y at verizon.net> wrote:
> >>> from random import random
> >>> out = open('corpus.decorated', 'w')
> >>> for line in open('corpus.uniq'):
>          print >> out, '%.14f %s' % (random(), line),
> 
> >>> out.close()
> 
>  sort corpus.decorated | cut -c 18- > corpus.randomized

Very good solution!

Sort is truly excellent at very large datasets.  If you give it a file
bigger than memory, it splits the input into memory-sized chunks,
sorts each chunk into a temporary file, then merges all the temporary
files back together.
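
As a rough illustration of that split, sort and merge strategy, here
is a minimal sketch in modern Python 3.  It is not how sort itself is
implemented; the function name external_sort and its parameters
(lines_per_chunk, tmp_dir) are made up for the example, and it
compares lines as plain Python strings rather than using sort's
locale-aware collation.

```python
import heapq
import os
import tempfile
from itertools import islice


def external_sort(src_path, dst_path, lines_per_chunk=100000, tmp_dir=None):
    """Sort the lines of src_path into dst_path using bounded memory.

    Hypothetical helper for illustration only.
    """
    chunk_paths = []
    try:
        with open(src_path) as src:
            while True:
                # Read at most lines_per_chunk lines: one memory-sized piece.
                chunk = list(islice(src, lines_per_chunk))
                if not chunk:
                    break
                # A final line without a newline would otherwise fuse with
                # the following line once the runs are merged.
                if not chunk[-1].endswith("\n"):
                    chunk[-1] += "\n"
                chunk.sort()
                # Write the sorted run to its own temporary file.
                fd, path = tempfile.mkstemp(dir=tmp_dir, text=True)
                with os.fdopen(fd, "w") as tmp:
                    tmp.writelines(chunk)
                chunk_paths.append(path)

        # Merge the sorted runs lazily: heapq.merge keeps only one
        # pending line per run in memory at a time.
        runs = [open(p) for p in chunk_paths]
        try:
            with open(dst_path, "w") as dst:
                dst.writelines(heapq.merge(*runs))
        finally:
            for f in runs:
                f.close()
    finally:
        # Always clean up the temporary run files.
        for p in chunk_paths:
            os.remove(p)
```

Note that the sketch opens every run at once during the merge, so a
very small lines_per_chunk on a huge file could run into the
open-file limit.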

You can tune how much memory sort uses for its in-memory sorting with
--buffer-size.  It's pretty good at auto-tuning though.

You may also want to set --temporary-directory to avoid filling up
your /tmp.
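
If you drive sort from a script, both options can be passed
explicitly.  The following is only a sketch, assuming GNU sort and
Python 3: the helper names sort_command and run_sort, the 512M
buffer and the /var/tmp directory are illustrative choices, not
recommendations.

```python
import subprocess


def sort_command(src, dst, buffer_size="512M", tmp_dir="/var/tmp"):
    """Build a GNU sort command line with explicit memory and temp settings.

    Hypothetical helper; the default values are placeholders.
    """
    return [
        "sort",
        "--buffer-size=" + buffer_size,      # memory used for in-memory sorting
        "--temporary-directory=" + tmp_dir,  # where the temporary files go
        "--output=" + dst,
        src,
    ]


def run_sort(src, dst, **options):
    """Run sort and raise CalledProcessError if it exits with an error."""
    subprocess.run(sort_command(src, dst, **options), check=True)
```

For the quoted recipe that would be something like
run_sort("corpus.decorated", "corpus.sorted"), followed by
cut -c 18- corpus.sorted > corpus.randomized as before.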

In a previous job I did a lot of stuff with usenet news and was
forever blowing up the server with scripts which used too much memory.
sort was always the solution!

-- 
Nick Craig-Wood <nick at craig-wood.com> -- http://www.craig-wood.com/nick