shuffle the lines of a large file - filelist.py (0/1)
Christos TZOTZIOY Georgiou
tzot at sil-tec.gr
Mon Mar 7 10:01:29 EST 2005
On 7 Mar 2005 05:36:32 -0800, rumours say that "Joerg Schuster"
<joerg.schuster.REMOVETHIS at gmail.com> might have written:
>Hello,
>
>I am looking for a method to "shuffle" the lines of a large file.
[snip]
>So, it would be very useful to do one of the following things:
>
>- produce corpus.uniq in such a way that it is not sorted in any way
>- shuffle corpus.uniq > corpus.uniq.shuffled
>
>Unfortunately, none of the machines that I may use has 80G RAM.
>So, using a dictionary will not help.
To implement your 'shuffle' command in Python, you can use the following
algorithm, given a couple of assumptions:
ASSUMPTIONS
-----------
The total line count in your big file is less than sys.maxint.
The algorithm as given works for systems where eol is a single '\n'.
ALGORITHM
---------
Create a temporary filelist.FileList fl (see attached file) whose records are
struct.calcsize("q") bytes each (struct.pack and the 'q' format string are your
friends), to hold the offset of each line start in big_file. fl[0] would be 0,
fl[1] would be the length of the first line including its '\n', and so on.
Read big_file once, appending each line's start offset to fl (if you need help
with this, let me know).
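In case it helps, here is a rough sketch of that indexing pass. It does not use
filelist.FileList (the attachment is not reproduced here); instead it assumes a
plain binary index file of packed 'q' records, and the names
build_offset_index, big_path and index_path are made up for illustration. It
also assumes Python 3 and a single '\n' as eol, as above.

```python
import struct

OFFSET_FORMAT = "q"                            # signed 64-bit integer, as suggested above
OFFSET_SIZE = struct.calcsize(OFFSET_FORMAT)   # bytes per index record


def build_offset_index(big_path, index_path):
    """Write the start offset of every line of big_path into index_path.

    Hypothetical helper: the index is a flat file of packed 'q' records,
    standing in for the FileList object mentioned in the post.
    Returns the number of lines indexed.
    """
    count = 0
    offset = 0
    # Binary mode, so that offsets are exact byte positions usable with seek().
    with open(big_path, "rb") as big_file, open(index_path, "wb") as index:
        for line in big_file:          # iterates line by line, '\n' included
            index.write(struct.pack(OFFSET_FORMAT, offset))
            offset += len(line)        # the next line starts right after this one
            count += 1
    return count
```

Only one line of the big file is held in memory at a time, and the index costs
struct.calcsize("q") bytes per line on disk.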
random.shuffle(fl) # this is tested with the filelist.FileList as given
for offset_as_str in fl:
    offset = struct.unpack("q", offset_as_str)[0]
    big_file.seek(offset)
    sys.stdout.write(big_file.readline())
That's it. Redirect output to your preferred file. No promises for speed
though :)
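
If you would rather not depend on filelist.FileList, the same idea can be
sketched against a plain index file of packed 'q' offsets. Note that this swaps
random.shuffle(fl) for a hand-written Fisher-Yates shuffle done directly on the
index file; it is an illustration under those assumptions, not the attached
module. The function names and the rng parameter are hypothetical, and Python 3
is assumed.

```python
import random
import struct

OFFSET_FORMAT = "q"
OFFSET_SIZE = struct.calcsize(OFFSET_FORMAT)


def shuffle_index_in_place(index_path, rng=random):
    """Fisher-Yates shuffle of the fixed-size records stored in index_path."""
    with open(index_path, "r+b") as index:
        index.seek(0, 2)                         # jump to the end to get the size
        count = index.tell() // OFFSET_SIZE
        for i in range(count - 1, 0, -1):
            j = rng.randint(0, i)                # inclusive on both ends
            if i == j:
                continue
            # Read both records, then write them back swapped.
            index.seek(i * OFFSET_SIZE)
            rec_i = index.read(OFFSET_SIZE)
            index.seek(j * OFFSET_SIZE)
            rec_j = index.read(OFFSET_SIZE)
            index.seek(j * OFFSET_SIZE)
            index.write(rec_i)
            index.seek(i * OFFSET_SIZE)
            index.write(rec_j)


def write_lines_in_index_order(big_path, index_path, out):
    """Copy lines of big_path to out (a binary stream) in index order."""
    with open(big_path, "rb") as big_file, open(index_path, "rb") as index:
        while True:
            record = index.read(OFFSET_SIZE)
            if len(record) < OFFSET_SIZE:        # end of the index
                break
            (offset,) = struct.unpack(OFFSET_FORMAT, record)
            big_file.seek(offset)
            out.write(big_file.readline())
```

To write to standard output, pass sys.stdout.buffer as out. Every swap and
every output line costs a seek, so expect this to be slow on a really big file.
If the last line of the big file has no trailing '\n', it will run into
whatever line is written after it.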
--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...