hard disk activity

Terry Hancock hancock at anansispaceworks.com
Tue Feb 14 03:27:46 EST 2006


On 13 Feb 2006 13:13:51 -0800
Paul Rubin <"http://phr.cx"@NOSPAM.invalid> wrote:
> "VSmirk" <vania.smirk at gmail.com> writes:
> > Awesome!!!  I got as far as segmenting the large file on
> > my own, and I ran out of ideas.  I kind of thought about
> > checksums, but I never put the two together.
> > 
> > Thanks.  You've helped a lot....
> 
> The checksum method I described works ok if bytes change
> in the middle of the file but don't get inserted (pieces of
> the file don't move around).  If you insert one byte in the
> middle of a 1GB file (so it becomes 1GB+1 byte) then all
> the checksums after the middle block change, which is no
> good for your purpose.

But of course, the OS will (I hope) give you the exact
length of the file, so you *could* assume that the beginning
and end are unchanged and work towards the middle from both
ends.  Somewhere in between, when you hit the insertion
point, both scans will start to disagree, and you've found
it.  Same for a deletion.
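Something like this, maybe -- an untested sketch, reading
both files into memory for brevity (a real version would
stream, and the frame size and single-edit assumption are
mine, not anything from the thread):

```python
import hashlib

FRAME = 4096  # frame size is an arbitrary choice


def frame_sums(data, frame=FRAME):
    """Checksums of consecutive fixed-size frames, front to back."""
    return [hashlib.md5(data[i:i + frame]).digest()
            for i in range(0, len(data), frame)]


def locate_change(old, new, frame=FRAME):
    """Return (lo, hi) byte offsets in `new` bounding the change,
    assuming a single insertion/deletion somewhere in the file."""
    # scan from the front until the frame checksums disagree
    lo = 0
    for a, b in zip(frame_sums(old, frame), frame_sums(new, frame)):
        if a != b:
            break
        lo += frame
    # scan from the back the same way, by hashing the reversed data
    # (so frames are aligned to the end of each file)
    hi = 0
    for a, b in zip(frame_sums(old[::-1], frame),
                    frame_sums(new[::-1], frame)):
        if a != b:
            break
        hi += frame
    # everything outside [lo, len(new) - hi) matched on both ends
    return lo, max(lo, len(new) - hi)
```

If many edits are scattered through the file, the two scans
meet early and the "changed" region balloons -- which is
just the expensive-transfer case below.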

Of course, if *many* changes have been made to the file,
then this will break down. But then, if that's the case,
you're going to have to do an expensive transfer anyway, so
expensive analysis is justified.

In fact, you could proceed by analyzing the top and bottom
checksum lists at the point of failure -- download that
frame, do a byte-by-byte compare, and see if you can derive
the frameshift.  Then compensate, and go back to comparing
checksums until they fail again.  Actually, that will work
just coming from the beginning, too.
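The frameshift step might look like this -- again just a
sketch, with the window and shift limits picked out of thin
air, and assuming a single contiguous insertion or deletion
inside the frame:

```python
def derive_frameshift(old_frame, new_frame, max_shift=64, window=32):
    """Byte-by-byte compare of a mismatching frame.

    Returns (pos, shift): the offset of the first differing byte
    and the apparent shift (> 0 means bytes were inserted in the
    new file, < 0 means deleted), or None if no shift within
    max_shift gets the two byte streams back in sync."""
    # find the first point of disagreement
    pos = next((i for i, (a, b) in
                enumerate(zip(old_frame, new_frame)) if a != b),
               min(len(old_frame), len(new_frame)))
    for shift in range(1, max_shift + 1):
        # `shift` bytes inserted into the new file at pos?
        if new_frame[pos + shift:pos + shift + window] == \
           old_frame[pos:pos + window]:
            return pos, shift
        # `shift` bytes deleted from the new file at pos?
        if old_frame[pos + shift:pos + shift + window] == \
           new_frame[pos:pos + window]:
            return pos, -shift
    return None
```

With highly repetitive data a small window can line up in
the wrong place, so you'd want to confirm the shift against
the following frames' checksums before trusting it.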

If instead the region remains unrecognizable all the way to
the end of the frame, then you need the next frame anyway.

Seems like it could get pretty close to optimal (though we
are probably re-inventing rsync).

Cheers,
Terry

-- 
Terry Hancock (hancock at AnansiSpaceworks.com)
Anansi Spaceworks http://www.AnansiSpaceworks.com




More information about the Python-list mailing list