shuffle the lines of a large file

Kent Johnson kent37 at tds.net
Mon Mar 7 09:11:01 EST 2005


Joerg Schuster wrote:
> Hello,
> 
> I am looking for a method to "shuffle" the lines of a large file.
> 
> I have a corpus of sorted and "uniqed" English sentences that has been
> produced with (1):
> 
> (1) sort corpus | uniq > corpus.uniq
> 
> corpus.uniq is 80G in size. The fact that every sentence appears only
> once in corpus.uniq is important for the processes
> that I run on my corpus.  Yet, the alphabetical order is an
> unwanted side effect of (1): Very often, I do not want (or rather, I
> do not have the computational capacities) to apply a program to all of
> corpus.uniq. Yet, any series of lines of corpus.uniq is obviously a
> very lopsided set of English sentences.
> 
> So, it would be very useful to do one of the following things:
> 
> - produce corpus.uniq in such a way that it is not sorted in any way
> - shuffle corpus.uniq > corpus.uniq.shuffled
> 
> Unfortunately, none of the machines that I may use has 80G RAM.
> So, using a dictionary will not help.

There was a thread a while ago about choosing random lines from a file without reading the whole
file into memory. Would that help? Instead of shuffling the file, shuffle the users: let each
program that reads the corpus pick its lines at random. I can't find the thread though...
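
For what it's worth, one well-known way to pick random lines in a single pass without holding the
file in memory is reservoir sampling. Below is a minimal sketch, not necessarily what that thread
proposed; the function name sample_lines and its parameters are made up for illustration.

```python
import random

def sample_lines(lines, k, rng=None):
    """Return up to k lines chosen uniformly at random from an iterable.

    Reservoir sampling: one pass over the input, only k lines kept in memory.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k lines.
            reservoir.append(line)
        else:
            # Keep line i with probability k / (i + 1), replacing a random slot.
            j = rng.randint(0, i)  # randint is inclusive at both ends
            if j < k:
                reservoir[j] = line
    return reservoir
```

With the 80G file this would be something like sample_lines(open("corpus.uniq"), 100000). Memory use
grows with k, not with the file size, though the whole file is still read once. The sample does not
come back in random order, so call random.shuffle on it if the order matters.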

Kent


