shuffle the lines of a large file
Warren Postma
wp at tekran__NOSP7M.com
Mon Mar 7 09:29:14 EST 2005
Joerg Schuster wrote:
> Unfortunately, none of the machines that I may use has 80G RAM.
> So, using a dictionary will not help.
>
> Any ideas?
>
Why don't you index the file? I would store the byte offset of the
beginning of each line in an index file. Then you can generate a
random number from 1 to the number of lines in the file, look up that
entry in the index file, open your text file, seek to that position,
read one line, and close the file. Using this process you can extract a
somewhat random set of lines from your 'corpus' text file.
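Here is a rough sketch of the indexing scheme described above, not a tested solution for an 80G file. The function names, the fixed 8-byte offset format, and the 0-based line numbering are my own assumptions for illustration.

```python
import random
import struct

# Assumption: each index entry is one little-endian unsigned 64-bit byte offset.
OFFSET = struct.Struct("<Q")


def build_index(text_path, index_path):
    """Write the byte offset of the start of each line; return the line count."""
    count = 0
    offset = 0
    with open(text_path, "rb") as text, open(index_path, "wb") as index:
        for line in text:
            index.write(OFFSET.pack(offset))
            offset += len(line)  # binary mode, so len() is the true byte length
            count += 1
    return count


def read_line(text_path, index_path, n):
    """Return line n (0-based) as bytes by seeking to its stored offset."""
    with open(index_path, "rb") as index:
        index.seek(n * OFFSET.size)
        (offset,) = OFFSET.unpack(index.read(OFFSET.size))
    with open(text_path, "rb") as text:
        text.seek(offset)
        return text.readline()


def random_line(text_path, index_path, count, rng=random):
    """Pick one line uniformly at random; count is the value from build_index."""
    return read_line(text_path, index_path, rng.randrange(count))
```

Note that repeated calls to random_line can return the same line more than once. If you need every line exactly once in random order, you would have to permute the line numbers instead of drawing them independently.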
You should probably also consider making a database of the file. Keep the
raw text file for sure, but create a converted copy in bsddb or PyTables
format.
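To give an idea of what a converted copy might look like, here is a small sketch. It uses the standard-library sqlite3 module as a stand-in, which is my substitution and not one of the two formats named above. The table layout and function names are assumptions as well.

```python
import sqlite3


def build_line_db(text_path, db_path):
    """Copy every line into a table keyed by 0-based line number; return the count."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("DROP TABLE IF EXISTS lines")
            conn.execute("CREATE TABLE lines (n INTEGER PRIMARY KEY, text BLOB)")
            with open(text_path, "rb") as text:
                # enumerate() yields (line_number, line_bytes) pairs lazily,
                # so the whole file is never held in memory at once.
                conn.executemany(
                    "INSERT INTO lines (n, text) VALUES (?, ?)",
                    enumerate(text),
                )
        (count,) = conn.execute("SELECT COUNT(*) FROM lines").fetchone()
        return count
    finally:
        conn.close()


def fetch_line(db_path, n):
    """Return line n as bytes, or None if there is no such line."""
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute("SELECT text FROM lines WHERE n = ?", (n,)).fetchone()
        return None if row is None else row[0]
    finally:
        conn.close()
```

The raw text file is left untouched; the database is only a second copy that can be looked up by line number.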
Warren