comparing multiple copies of terabytes of data?
Josiah Carlson
jcarlson at uci.edu
Tue Oct 26 16:40:30 EDT 2004
Istvan Albert <ialbert at mailblocks.com> wrote:
>
> Josiah Carlson wrote:
>
> > measure, if shy by around 3 orders of magnitude in terms of time.
>
> > That new one runs in 5 minutes 15 seconds total, because it exploits the
>
> my point was never to say that it is not possible to write
> a better way, nor to imply that you could not do it, I simply
> said that there is no easy way around this problem.
Funny, I thought the implementation I offered was an easy way. Nothing is
easier than using someone else's code, and it took all of 5 minutes to
write.
> Your solution while short and nice is not simple, and
> requires quite a bit of knowledge to understand
> why and how it works.
Anyone who has difficulty understanding the algorithm I offered that
used md5 should seriously consider switching hobbies/jobs. The algorithm
is trivial to understand.
> other topic ... sometimes I have a hard time visualizing how much
> a terabyte of data actually is, this is a good example
> for that, even this optimized algorithm would take over three
> days to perform a simple identity check ...
It would only take that long if you could only get 15 megs/second from
your drives (it was the main boot drive and the files were fragmented).
At multi-TB data sizes, RAID arrays are common, so read speed is
generally much higher, which would then reduce the running time
significantly.
In the case of multiple drives not in a RAID array, splitting the
process across the drives could scale nearly linearly, up to the limit
of drive and processor speed.
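The splitting idea can be sketched with a thread pool. This is a hypothetical illustration, not the poster's code; `hash_files_parallel` and the worker count are my own names, and the speedup only materializes when the paths really sit on separate physical drives:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def file_md5(path, chunk_size=1 << 20):
    """Hex MD5 digest of one file, streamed in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def hash_files_parallel(paths, workers=4):
    """Hash several files concurrently. Hashing is I/O-bound, so
    threads overlap the reads; with one file per spindle the total
    time approaches the aggregate bandwidth of all drives."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(paths, pool.map(file_md5, paths)))
```

Threads suffice here because the work is dominated by disk reads, during which the interpreter's lock is released; on files sharing one drive the same code runs correctly but gains little, since the single spindle remains the bottleneck.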
- Josiah