shuffle the lines of a large file
gry at ll.mit.edu
Mon Mar 7 09:38:49 EST 2005
As far as I can tell, what you ultimately want is to be able to extract
a random ("representative?") subset of sentences. Given the huge size
of the data, I would suggest not randomizing the file, but randomizing
accesses to the file. E.g. (sorry for off-the-cuff pseudo python):
[adjust 8192 == 2**13 to your disk block size]

    import os, random

    # filename and do_something are placeholders for your file and your code
    length_of_file = os.path.getsize(filename)
    f = open(filename, 'rb')
    while True:
        byteno = random.randrange(length_of_file)
        # align to disk block to avoid unnecessary IO
        byteno = (byteno >> 13) << 13  # zero out the bottom 13 bits
        f.seek(byteno)  # set the file pointer to a random position
        block = f.read(8192)  # read one block
        # omit both ends, which usually hold partial lines
        sentences = block.splitlines()[1:-1]
        do_something(sentences)
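If you only need 1000 sentences, use only one sentence from each block;
if you need 1M, then use them all. (Sentences in the same block are
neighbours in the file, so the fewer you take per block, the more evenly
the sample is spread over the file.)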
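
For what it's worth, here is a rough sketch of how the "one sentence per block" versus "all of them" choice could be made into a parameter. The function name sample_sentences and the arguments per_block, block_size and seed are my own inventions for illustration, not anything from a library. The caveats are the same as above: a block can be picked more than once, so duplicates are possible, and lines that straddle a block boundary are never returned, so this is only an approximately uniform sample.

```python
import os
import random

def sample_sentences(path, n, per_block=1, block_size=8192, seed=None):
    """Collect up to n lines by reading randomly chosen aligned blocks."""
    rng = random.Random(seed)
    size = os.path.getsize(path)
    if size == 0:
        return []
    n_blocks = (size + block_size - 1) // block_size
    picked = []
    with open(path, "rb") as f:
        # Cap the number of reads so the loop ends even if most blocks
        # contain no complete line (tiny file, very long lines).
        for _ in range(20 * n + 100):
            if len(picked) >= n:
                break
            # Seek to the start of a random block (aligned access).
            f.seek(rng.randrange(n_blocks) * block_size)
            # Drop the first and last pieces, which may be partial lines.
            lines = f.read(block_size).splitlines()[1:-1]
            if lines:
                k = min(per_block, len(lines))
                picked.extend(rng.sample(lines, k))
    return picked[:n]
```

Something like sample_sentences("corpus.txt", 1000, per_block=1) would then correspond to the small-sample case (corpus.txt being a made-up file name), and a very large per_block to the "use them all" case. The lines come back as bytes, since the file is opened in binary mode so that seek() works on plain byte offsets.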
[I hope I understood your problem]
-- george