comparing multiple copies of terabytes of data?

Josiah Carlson jcarlson at uci.edu
Tue Oct 26 16:40:30 EDT 2004


Istvan Albert <ialbert at mailblocks.com> wrote:
> 
> Josiah Carlson wrote:
> 
> > measure, if shy by around 3 orders of magnitude in terms of time.
> 
> > That new one runs in 5 minutes 15 seconds total, because it exploits the
> 
> my point was never to say that it is not possible to write
> a better way, nor to imply that you could not do it; I simply
> said that there is no easy way around this problem.

Funny, I thought the implementation I offered was an easy way.
Nothing is easier than using someone else's code, and this one took
all of 5 minutes to write.


> Your solution while short and nice is not simple, and
> requires quite a bit of knowledge to understand
> why and how it works.

Anyone who has difficulty understanding the md5-based algorithm I
offered should seriously consider switching hobbies/jobs.  The
algorithm is trivial to understand.
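
To make that concrete, here is a minimal sketch of that flavor of
check (illustrative only, not the exact code from the earlier post):
read each file in fixed-size chunks, feed the chunks to md5, and
compare the digests, with a cheap early exit when the sizes already
differ.

import hashlib
import os

def file_md5(path, chunk_size=1 << 20):
    # Hash the file in 1 MB chunks so memory use stays flat no
    # matter how large the file is.
    h = hashlib.md5()
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()

def same_contents(path_a, path_b):
    # Different sizes means different contents; skip the hashing.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    return file_md5(path_a) == file_md5(path_b)

The chunked read is the whole trick: the working set stays at one
megabyte no matter how large the files are, so it works the same on
terabyte-sized files as on small ones.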


> other topic ... sometimes I have a hard time visualizing how much
> a terabyte of data actually is, and this is a good example of it:
> even this optimized algorithm would take over three
> days to perform a simple identity check ...

It would only take that long if you could only get 15 MB/second from
your drives (in that test it was the main boot drive and the files
were fragmented).  At multi-terabyte data sizes, RAID arrays are
common, so read speeds are generally much higher, which would reduce
the running time significantly.
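
Back of the envelope: at 15 MB/s, one terabyte is 10^12 / (15 * 10^6)
~= 66,000 seconds, call it 18.5 hours of pure reading, so a few
terabytes (or several copies of one) easily stretches past three
days.  At 150 MB/s from even a modest RAID array, the same terabyte
drops to under two hours.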

In the case of multiple drives that are not in a RAID array,
splitting the process up per drive could scale linearly, up to the
limits of drive and processor speed.
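
Something like the following sketch would do that splitting (it
assumes the file_md5 helper above, and that you have already grouped
your paths into one list per drive):

from multiprocessing import Pool

def hash_drive(paths):
    # Hash every file on one drive sequentially, keeping that
    # drive's read pattern nice and linear.
    return [(p, file_md5(p)) for p in paths]

def hash_drives(paths_by_drive):
    # One worker process per drive: the drives read in parallel
    # while each individual drive still sees sequential access.
    # (On Windows this needs the usual __main__ guard.)
    pool = Pool(len(paths_by_drive))
    try:
        results = pool.map(hash_drive, paths_by_drive)
    finally:
        pool.close()
        pool.join()
    return dict(pair for drive in results for pair in drive)

Grouping the paths by drive before handing them out is what keeps
each worker on its own spindle, and that is where the linear scaling
comes from.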


 - Josiah



