shuffle the lines of a large file

Heiko Wundram modelnine at ceosg.de
Tue Mar 8 09:49:35 EST 2005


On Tuesday 08 March 2005 15:28, Simon Brunning wrote:
> This has the advantage that every line had the same chance of being
> picked regardless of its length. There is the chance that it'll pick
> the same line more than once, though.

The problem: if the file the OP is talking about really is 80GB in size, and
you assume an average sentence length of 80 bytes (it's likely shorter than
that), that makes 10^9 sentences in the file. Now multiply that by the memory
overhead of storing a list of 10^9 None(s), and reconsider whether that
algorithm really works under the posted conditions. I don't think any machine
I have access to has anywhere near enough memory just to store this list... ;)
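
To put rough numbers on the estimate above, here is a minimal sketch. It
assumes CPython, where a list stores one pointer per element (None itself is
a shared singleton, so only the list slots cost memory), and it takes the
80GB and 80-bytes-per-line figures from above. The names FILE_SIZE, AVG_LINE
and estimate_list_bytes are made up for illustration.

```python
import struct
import sys

FILE_SIZE = 80 * 10**9  # 80GB (decimal), as stated for the OP's file
AVG_LINE = 80           # assumed average line length in bytes


def estimate_list_bytes(n_items, pointer_size=struct.calcsize("P")):
    """Rough lower bound for a list of n_items references.

    Counts the empty-list header plus one pointer per slot; any
    over-allocation done by the interpreter is ignored.
    """
    return sys.getsizeof([]) + n_items * pointer_size


n_lines = FILE_SIZE // AVG_LINE
print("lines in file:", n_lines)
print("list size on this build (GB):", estimate_list_bytes(n_lines) / 10**9)
print("list size with 4-byte pointers (GB):",
      estimate_list_bytes(n_lines, pointer_size=4) / 10**9)
```

With 4-byte pointers that is roughly 4GB for the list alone, and roughly 8GB
with 8-byte pointers, before a single line of the file has been read.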

-- 
--- Heiko.