random writing access to a file in Python

Paul Rubin
Sun Aug 27 18:06:07 EDT 2006


Claudio Grondi <claudio.grondi at freenet.de> writes:
> >>The Windows XP SP 2 'sort' command (sorting of four Gigs of 20 byte
> >>records took 12 CPU hours and 18 wall-clock hours)....
> Ok, I see - the misunderstanding is that there were 4.294.967.296
> records, each 20 bytes long, which makes the actual file
> 85.899.345.920 bytes large (I just used 'Gigs' for the number of
> records, not the size of the file).
> Still not an acceptable sorting time?

I think that's not so bad, though probably still not optimal.  85 GB
divided by 18 hours is 1.3 MB/sec, which means if the program is
reading the file 8 times, it's getting 10 MB/sec through the Windows
file system, which is fairly reasonable throughput.
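(A quick back-of-the-envelope check of that arithmetic; the 8-pass
figure is just my guess, not anything we actually know about the
Windows sort utility:)

# Hypothetical throughput estimate for the numbers above.
file_bytes = 4294967296 * 20   # 2**32 records of 20 bytes, ~85.9 GB
elapsed_s = 18 * 3600          # 18 wall-clock hours
passes = 8                     # assumed number of passes over the data

net = file_bytes / elapsed_s / 1e6
print("net throughput: %.1f MB/s" % net)                  # -> 1.3
print("per-pass throughput: %.1f MB/s" % (net * passes))  # -> 10.6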

If you know something about the distribution of the data (e.g. the
records are random 20-byte hashes) you might be able to sort it in
essentially linear time (radix sorting).  But even with a general
purpose algorithm, if you have a few hundred MB of ram and some
scratch disk space to work with, you should be able to sort that much
data in much less than 18 hours.
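
Here's a minimal sketch of that idea, assuming fixed 20-byte records
whose leading byte is roughly uniform (true for random hashes): one
radix pass scattering the input into 256 scratch buckets, then an
in-memory sort of each bucket.  It's illustrative only - Python's
per-object overhead means a real implementation would want more
buckets or an array-based sort - and all file names are made up:

import os

REC = 20  # fixed record size in bytes

def external_sort(src, dst, scratch_dir):
    # Pass 1: scatter records into 256 buckets keyed on the first
    # byte.  With random hashes each bucket ends up holding roughly
    # 1/256 of the input (~340 MB for an 85 GB file).
    buckets = [open(os.path.join(scratch_dir, "bucket%03d" % i), "wb")
               for i in range(256)]
    with open(src, "rb") as f:
        while True:
            chunk = f.read(REC * 65536)   # read ~1.3 MB at a time
            if not chunk:
                break
            for i in range(0, len(chunk), REC):
                rec = chunk[i:i + REC]
                buckets[rec[0]].write(rec)
    for b in buckets:
        b.close()

    # Pass 2: sort each bucket in RAM and append it to the output.
    # Buckets are processed in order of their leading byte, so the
    # concatenation is globally sorted.
    with open(dst, "wb") as out:
        for i in range(256):
            path = os.path.join(scratch_dir, "bucket%03d" % i)
            with open(path, "rb") as f:
                data = f.read()
            recs = [data[j:j + REC] for j in range(0, len(data), REC)]
            recs.sort()
            out.write(b"".join(recs))
            os.remove(path)

That's two read passes and two write passes over the data in total,
so even at the 10 MB/sec figure above it should finish well inside
18 hours.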

But if you only had to do it once and it's finished now, why do you
still care how long it took?


