shuffle the lines of a large file

gry at ll.mit.edu gry at ll.mit.edu
Mon Mar 7 09:38:49 EST 2005


As far as I can tell, what you ultimately want is to be able to extract
a random ("representative?") subset of sentences.  Given the huge size
of data, I would suggest not randomizing the file, but randomizing
accesses to the file.  E.g. (sorry for off-the-cuff pseudo python):
[adjust 8192 == 2**13 to your disk block size]
    import os, random
    f = open(filename, 'rb')  # filename: placeholder for your data file
    length_of_file = os.path.getsize(filename)
    while True:
        byteno = random.randrange(length_of_file)
        # align to disk block to avoid unnecessary IO
        byteno = (byteno >> 13) << 13  # zero out the bottom 13 bits
        f.seek(byteno)  # set the file pointer to a random position
        data = f.read(8192)  # read one block
        # omit the first and last lines, which may be partial
        sentences = data.splitlines()[1:-1]
        do_something(sentences)

If you only need 1000 sentences, use only one sentence from each block;
if you need 1M, then use them all.
[I hope I understood your problem]
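
A slightly fuller sketch of the same idea, in case it helps. It picks
random block-aligned offsets, keeps only the complete lines in each block,
and optionally keeps just a few lines per block. The names sample_lines,
n_blocks and per_block are made up for illustration, and the 8192-byte
block size is an assumption to adjust for your disk.

```python
import os
import random

BLOCK_SIZE = 8192  # 2**13; assumed block size, adjust to your disk


def sample_lines(path, n_blocks, per_block=None,
                 block_size=BLOCK_SIZE, rng=random):
    """Return complete lines (as bytes) from n_blocks random blocks.

    per_block=None keeps every complete line in each block;
    per_block=k keeps at most k randomly chosen lines per block.
    """
    size = os.path.getsize(path)
    # Number of blocks in the file, counting a partial last block.
    total_blocks = max(1, (size + block_size - 1) // block_size)
    picked = []
    with open(path, "rb") as f:
        for _ in range(n_blocks):
            # Choosing a block number keeps every seek block-aligned.
            block_no = rng.randrange(total_blocks)
            f.seek(block_no * block_size)
            chunk = f.read(block_size)
            # The first and last pieces may be partial lines, so drop them.
            lines = chunk.split(b"\n")[1:-1]
            if not lines:
                continue
            if per_block is not None and len(lines) > per_block:
                lines = rng.sample(lines, per_block)
            picked.extend(lines)
    return picked
```

For about 1000 sentences you could call sample_lines(path, 1000,
per_block=1); for many more, leave per_block as None and raise n_blocks.
Note that blocks are drawn with replacement, so duplicates are possible,
and a line that straddles a block boundary is never returned, so this is
an approximation of a uniform sample, not an exact one.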

-- george



