shuffle the lines of a large file

Simon Brunning simon.brunning at gmail.com
Fri Mar 11 04:20:09 EST 2005


On Fri, 11 Mar 2005 06:59:33 +0100, Heiko Wundram <modelnine at ceosg.de> wrote:
> On Tuesday 08 March 2005 15:55, Simon Brunning wrote:
> > Ah, but that's the clever bit; it *doesn't* store the whole list -
> > only the selected lines.
> 
> But that means that it'll only read several lines from the file, never do a
> shuffle of the whole file content...

Err, thing is, it *does* pick a random selection from the whole file,
without holding the whole file in memory. (It does hold all the
selected items in memory - I don't see any way to avoid that.) Why not
try it and see?
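
For anyone who would rather read than run: a minimal sketch of the general idea, usually called reservoir sampling. This is not necessarily the code posted earlier in the thread; the name random_lines and its parameters are made up for illustration. It reads the input once and keeps only k lines in memory at any time.

```python
import random


def random_lines(lines, k, rng=random):
    """Pick k lines uniformly at random from any iterable of lines.

    Hypothetical helper for illustration. Only the k selected lines
    are ever held in memory, never the whole input.
    """
    selected = []
    for index, line in enumerate(lines):
        if index < k:
            # Fill the reservoir with the first k lines.
            selected.append(line)
        else:
            # Line number index (0-based) replaces a random slot
            # with probability k / (index + 1).
            j = rng.randrange(index + 1)
            if j < k:
                selected[j] = line
    return selected
```

Usage would be something like `with open("big.txt") as f: sample = random_lines(f, 10)`, where "big.txt" is a placeholder. Each line position is picked at most once, so duplicates only show up if the file itself contains identical lines. The order of the result is not itself random; call random.shuffle() on it if that matters.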

> When you'd want to shuffle the file
> content, you'd have to set lines=1 and throw away repeating lines in
> subsequent runs, or you'd have to set lines higher, and deal with the
> resulting lines too in some way (throw away repeating ones... :-).

Eliminating duplicates is left as an exercise for the reader. ;-)

> Doesn't
> matter how, you'd have to store which lines you've already read
> (selected_lines). And in any case you'd need a line cache of 10^9 entries for
> this amount of data...

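Nope, you don't. Only the selected lines are kept, so memory grows
with the size of the selection, not the size of the file.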

-- 
Cheers,
Simon B,
simon at brunningonline.net,
http://www.brunningonline.net/simon/blog/


