shuffle the lines of a large file
Eddie Corns
eddie at holyrood.ed.ac.uk
Mon Mar 7 09:18:29 EST 2005
"Joerg Schuster" <joerg.schuster.REMOVETHIS at gmail.com> writes:
>Hello,
>I am looking for a method to "shuffle" the lines of a large file.
>I have a corpus of sorted and "uniqed" English sentences that has been
>produced with (1):
>(1) sort corpus | uniq > corpus.uniq
>corpus.uniq is 80G large. The fact that every sentence appears only
>once in corpus.uniq is important for the processes I run on the
>corpus. Yet, the alphabetical order is an
>unwanted side effect of (1): Very often, I do not want (or rather, I
>do not have the computational capacities) to apply a program to all of
>corpus.uniq. Yet, any series of lines of corpus.uniq is obviously a
>very lopsided set of English sentences.
>So, it would be very useful to do one of the following things:
>- produce corpus.uniq in such a way that it is not sorted in any way
>- shuffle corpus.uniq > corpus.uniq.shuffled
>Unfortunately, none of the machines that I may use has 80G RAM.
>So, using a dictionary will not help.
>Any ideas?
Instead of shuffling the file itself, maybe you could index it (with dbm, for
instance) and select random lines by picking random indexes whenever you need
a sample.
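
A minimal sketch of that idea, assuming a modern Python 3 where the module is
called dbm (in 2005 it would have been anydbm). The function names, the
"count" key and the index layout are made up for illustration. One pass over
the corpus records the byte offset of each line under its line number; a
sample is then drawn by picking random line numbers and seeking to the stored
offsets, so neither the corpus nor the index has to fit in RAM.

```python
import dbm
import random


def build_index(corpus_path, index_path):
    """Record the byte offset of every line, keyed by its line number.

    Hypothetical layout: key "0", "1", ... -> offset, plus "count" -> total.
    Returns the number of lines indexed.
    """
    count = 0
    with open(corpus_path, "rb") as corpus, dbm.open(index_path, "n") as index:
        offset = corpus.tell()
        line = corpus.readline()
        while line:
            index[str(count)] = str(offset)
            count += 1
            offset = corpus.tell()
            line = corpus.readline()
        index["count"] = str(count)
    return count


def sample_lines(corpus_path, index_path, k, rng=random):
    """Return up to k distinct lines chosen uniformly at random."""
    with open(corpus_path, "rb") as corpus, dbm.open(index_path, "r") as index:
        count = int(index["count"])
        # random.sample on a range does not materialise the whole range.
        picks = rng.sample(range(count), min(k, count))
        result = []
        for n in picks:
            corpus.seek(int(index[str(n)]))
            result.append(corpus.readline().decode("utf-8").rstrip("\n"))
        return result
```

Usage would be something like build_index("corpus.uniq", "corpus.idx") once,
then sample_lines("corpus.uniq", "corpus.idx", 1000) whenever a sample is
needed. Because the line numbers are drawn without replacement, a sample
keeps the "every sentence only once" property of corpus.uniq. The index
itself still takes disk space proportional to the number of lines.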
Eddie