shuffle the lines of a large file - filelist.py (0/1)

Mon Mar 7 10:01:29 EST 2005

On 7 Mar 2005 05:36:32 -0800, rumours say that "Joerg Schuster"
<joerg.schuster.REMOVETHIS at gmail.com> might have written:

>Hello,
>
>I am looking for a method to "shuffle" the lines of a large file.

[snip]

>So, it would be very useful to do one of the following things:
>
>- produce corpus.uniq in such a way that it is not sorted in any way
>- shuffle corpus.uniq > corpus.uniq.shuffled
>
>Unfortunately, none of the machines that I may use has 80G RAM.
>So, using a dictionary will not help.

To implement your 'shuffle' command in Python, you can use the following
algorithm, given a couple of assumptions:

ASSUMPTIONS
-----------

The total line count in your big file is less than sys.maxint.

The algorithm as given works for systems where eol is a single '\n'.

ALGORITHM
---------

Create a temporary filelist.FileList fl (see attached file) whose records are
struct.calcsize("q") bytes each (struct.pack and the 'q' format string are your
friends), to hold the offset of each line start in big_file.  fl[0] would be 0,
fl[1] would be the length of the first line including its '\n', and so on.
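
To make the record format concrete, here is a minimal sketch of packing one
offset into a fixed-size record and reading it back.  The helper names
pack_offset and unpack_offset are hypothetical, used only for illustration;
they are not part of filelist.py.

```python
import struct

# Size in bytes of one record; 'q' is a signed 64-bit integer.
RECORD_SIZE = struct.calcsize("q")


def pack_offset(offset):
    """Turn an integer file offset into one fixed-size record (hypothetical helper)."""
    return struct.pack("q", offset)


def unpack_offset(record):
    """Turn one fixed-size record back into an integer offset (hypothetical helper)."""
    return struct.unpack("q", record)[0]
```

Because every record has the same size, record number n always starts at
byte n * RECORD_SIZE of the temporary file, which is what makes random
access by index possible.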

Read big_file once, appending the start offset of each line to fl as you go
(if you need help with this, let me know).
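
The reading pass might look roughly like the sketch below.  It is only an
illustration under assumptions: it writes the packed offsets to a plain
binary file rather than to a filelist.FileList, and the function name
write_line_offsets and both path parameters are made up for this example.

```python
import struct


def write_line_offsets(big_path, index_path):
    """Write the start offset of every line of big_path to index_path.

    Returns the number of lines seen.  Both files are opened in binary
    mode so that offsets are exact byte positions.
    """
    count = 0
    offset = 0
    with open(big_path, "rb") as big_file, open(index_path, "wb") as index:
        for line in big_file:
            # The current line starts where the previous one ended.
            index.write(struct.pack("q", offset))
            offset += len(line)  # len() includes the trailing '\n'
            count += 1
    return count
```

Only one line is held in memory at a time, so the size of big_file does not
matter for this pass.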

random.shuffle(fl) # this is tested with the filelist.FileList as given

for offset_as_str in fl:
    offset = struct.unpack("q", offset_as_str)[0]
    big_file.seek(offset)
    sys.stdout.write(big_file.readline())

That's it.  Redirect output to your preferred file.  No promises for speed
though :)
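
For readers without the attached file, the shuffle-and-write step can be
sketched as follows.  This is not filelist.FileList: OffsetList is a
hypothetical stand-in that offers just enough of the list interface
(len, item get, item set) for random.shuffle to work on records stored on
disk.  The sketch assumes an index file of packed "q" offsets already
exists, and that out is a writable binary file object.

```python
import random
import struct

RECORD_SIZE = struct.calcsize("q")


class OffsetList:
    """Hypothetical stand-in for filelist.FileList: fixed-size records on disk."""

    def __init__(self, index_file):
        self.f = index_file  # must be opened in "r+b" mode
        self.f.seek(0, 2)    # seek to the end to learn the file size
        self.count = self.f.tell() // RECORD_SIZE

    def __len__(self):
        return self.count

    def __getitem__(self, i):
        if not 0 <= i < self.count:
            raise IndexError(i)
        self.f.seek(i * RECORD_SIZE)
        return self.f.read(RECORD_SIZE)

    def __setitem__(self, i, record):
        if not 0 <= i < self.count:
            raise IndexError(i)
        self.f.seek(i * RECORD_SIZE)
        self.f.write(record)


def write_shuffled(big_path, index_path, out):
    """Shuffle the index file in place, then write lines in that order to out."""
    with open(big_path, "rb") as big_file, open(index_path, "r+b") as index_file:
        fl = OffsetList(index_file)
        random.shuffle(fl)  # swaps records on disk, not in memory
        for i in range(len(fl)):
            offset = struct.unpack("q", fl[i])[0]
            big_file.seek(offset)
            out.write(big_file.readline())
```

One thing to watch: if the last line of big_file has no trailing '\n', it
will be glued to whatever line follows it in the shuffled output.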
-- 
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...