shuffle the lines of a large file

Warren Postma wp at tekran__NOSP7M.com
Mon Mar 7 09:29:14 EST 2005


Joerg Schuster wrote:
> Unfortunately, none of the machines that I may use has 80G RAM.
> So, using a dictionary will not help.
> 
> Any ideas?
> 

Why don't you index the file?  I would store the byte offset of the
beginning of each line in an index file. Then you can generate a
random line number, look up its offset in the index file, open your
text file, seek to that position, read one line, and close the file.
Repeating this process lets you extract a random set of lines from
your 'corpus' text file without ever loading the whole file into RAM.
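
A minimal sketch of that indexing idea, in case it helps. The function
names, the index layout (one 8-byte little-endian offset per line) and
the file paths are my own assumptions for illustration, not anything
fixed by the approach:

```python
import os
import random
import struct

# Assumed index layout: one unsigned 64-bit little-endian offset per line.
OFFSET = struct.Struct("<Q")

def build_index(text_path, index_path):
    """Write the starting byte offset of each line; return the line count."""
    count = 0
    offset = 0
    with open(text_path, "rb") as text, open(index_path, "wb") as index:
        for line in text:              # streams the file, one line at a time
            index.write(OFFSET.pack(offset))
            offset += len(line)        # next line starts right after this one
            count += 1
    return count

def read_line(text_path, index_path, line_no):
    """Return line line_no (0-based) as bytes, using the offset index."""
    with open(index_path, "rb") as index:
        index.seek(line_no * OFFSET.size)
        (offset,) = OFFSET.unpack(index.read(OFFSET.size))
    with open(text_path, "rb") as text:
        text.seek(offset)
        return text.readline()

def random_line(text_path, index_path):
    """Pick one line uniformly at random via the index."""
    count = os.path.getsize(index_path) // OFFSET.size
    return read_line(text_path, index_path, random.randrange(count))
```

The files are opened in binary mode so that the stored offsets are real
byte positions. Calling random_line() repeatedly can return the same
line twice; for a true shuffle you would shuffle the line numbers
instead and read each one once.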

You should probably consider making a database of the file: keep the raw
text file for sure, but create a converted copy in bsddb or pytables format.
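
A rough sketch of the "converted copy" idea. Note the substitution: it
uses the standard library's dbm module as a stand-in for bsddb, so this
is not bsddb or pytables code. The key scheme (the line number as ASCII
digits, plus a "count" entry) and the function names are assumptions
made up for this example:

```python
import dbm

def build_line_db(text_path, db_path):
    """Copy each line into a key/value store keyed by its 0-based line number."""
    count = 0
    with open(text_path, "rb") as text, dbm.open(db_path, "n") as db:
        for line in text:
            db[str(count).encode("ascii")] = line
            count += 1
        # Hypothetical convention: remember how many lines were stored.
        db[b"count"] = str(count).encode("ascii")
    return count

def fetch_line(db_path, line_no):
    """Return the stored line (bytes) for the given 0-based line number."""
    with dbm.open(db_path, "r") as db:
        return db[str(line_no).encode("ascii")]
```

The raw text file is only read here, never modified, so it stays around
as the master copy.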

Warren


