[Borgbackup] Poor dedup with tar overlay

John Goerzen jgoerzen at complete.org
Mon Feb 13 14:25:34 EST 2017


Thanks -- I think I'm mostly following that.

I believe that borg uses a sliding window like rsync, so it ought to be
able to identify the start of a chunk properly, right?  But what you're
saying is that we'd have an issue with the last chunk of a file, since
in the tar case it could contain NUL padding or metadata for the next
file (or even data from the next file), right?
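
To make sure I'm picturing it right, I threw together a toy sketch of a
rolling-hash chunker (this is *not* borg's actual buzhash chunker, and the
parameters are made up and far smaller than the defaults) and ran it over
the same bytes standalone versus wrapped in a tar stream:

import hashlib
import io
import os
import tarfile

# Toy parameters: 48-byte rolling window, ~1/1024 cut probability,
# 2 KiB minimum and 16 KiB maximum chunk size (far smaller than borg's).
WINDOW, MASK, MIN, MAX = 48, 0x3FF, 2048, 16384

def chunks(data):
    """Yield content-defined chunks, cutting where a rolling sum hits the mask."""
    start = i = 0
    rolling = 0
    while i < len(data):
        rolling += data[i]
        if i - start >= WINDOW:
            rolling -= data[i - WINDOW]    # keep the sum over the last WINDOW bytes
        size = i - start + 1
        if (size >= MIN and (rolling & MASK) == MASK) or size >= MAX:
            yield data[start:i + 1]
            start, rolling = i + 1, 0
        i += 1
    if start < len(data):
        yield data[start:]

payload = os.urandom(100_000)              # stand-in for one file's contents

# The same payload the way tar stores it: a 512-byte header in front,
# NUL padding to the next 512-byte boundary (plus end-of-archive blocks) behind.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tf:
    info = tarfile.TarInfo("file1")
    info.size = len(payload)
    tf.addfile(info, io.BytesIO(payload))
tar_stream = buf.getvalue()

plain = {hashlib.sha256(c).hexdigest() for c in chunks(payload)}
tarred = {hashlib.sha256(c).hexdigest() for c in chunks(tar_stream)}
print(len(plain), "chunks plain,", len(tarred), "in tar, shared:", len(plain & tarred))

If I've understood you correctly, the cut points in the middle generally
re-synchronize after the header, and it's only the chunks touching the
header and the trailing NUL padding that fail to match.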

I also didn't realize that it doesn't attempt to dedup files smaller
than 512 kB.  (Or is it that it doesn't attempt to /chunk/ files smaller
than 512 kB?  I'm a little confused about the implication.)
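
If I'm reading the defaults right (chunker params 19,23,21,4095, which I
believe is what borg 1.x ships with), the thresholds work out to roughly
this:

# Assuming borg's default chunker params of 19,23,21,4095
# (CHUNK_MIN_EXP, CHUNK_MAX_EXP, HASH_MASK_BITS, HASH_WINDOW_SIZE).
CHUNK_MIN_EXP, CHUNK_MAX_EXP, HASH_MASK_BITS = 19, 23, 21

min_chunk = 2 ** CHUNK_MIN_EXP   # 524288 bytes (~512 kB): anything smaller ends up as one chunk
target    = 2 ** HASH_MASK_BITS  # ~2 MiB statistical target chunk size
max_chunk = 2 ** CHUNK_MAX_EXP   # 8 MiB hard upper bound

print(min_chunk, target, max_chunk)

So a file under that ~512 kB floor is stored as one chunk keyed on exactly
its own bytes, while inside the tar stream those bytes never occur on
their own -- they're always glued to the neighbouring headers.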

The dataset in question contained about 100,000 files, a great many of
which are probably very small.

So this is a very helpful conversation.  What I'm really after,
incidentally, is something like "borg compare": something that would take
a borg archive and a live filesystem, compare every file, permission bit,
etc. byte-for-byte, and make sure they match.  I figured that by storing
a tar file in the repo, I could approximate this by calculating the
sha256sum of the tar as it goes in, and later extracting and comparing it
at will.
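
In the meantime, as a stopgap, I've been considering just mounting the
archive with "borg mount repo::archive /mnt/archive" and walking both
trees myself -- something like this rough sketch (the paths are
hypothetical, and it only looks at regular files, their contents, and
mode bits):

import hashlib
import os
import stat
import sys

LIVE = "/home/jgoerzen/data"                    # live tree (hypothetical path)
MOUNT = "/mnt/archive/home/jgoerzen/data"       # same tree under the borg mount (hypothetical)

def sha256_of(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

problems = 0
for root, dirs, files in os.walk(LIVE):
    rel = os.path.relpath(root, LIVE)
    for name in files:
        live_path = os.path.join(root, name)
        arch_path = os.path.join(MOUNT, rel, name)
        st = os.lstat(live_path)
        if not stat.S_ISREG(st.st_mode):
            continue                            # skip symlinks, devices, etc. in this sketch
        if not os.path.exists(arch_path):
            print("missing from archive:", live_path)
            problems += 1
            continue
        ast = os.lstat(arch_path)
        if stat.S_IMODE(st.st_mode) != stat.S_IMODE(ast.st_mode):
            print("mode differs:", live_path)
            problems += 1
        if sha256_of(live_path) != sha256_of(arch_path):
            print("content differs:", live_path)
            problems += 1

sys.exit(1 if problems else 0)

That obviously doesn't cover ownership, timestamps, xattrs and so on, but
it would at least catch the byte-for-byte part.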

Thanks,

John

On 02/13/2017 12:56 PM, Marian Beermann wrote:
> Hi John,
>
> when working on separate files, the start of the first block is
> implicitly set by the start of the file.  When working on something
> like a tar archive this is not the case; instead, the tar stream looks
> something like:
>
> [header/metadata for file #1][contents of file #1][metadata for file #2][contents of file #2] ...
>
> So every metadata block that is interlaced between the contents of
> adjacent files influences the chunker, and will most likely end up in
> the last chunk of the preceding file (assuming big-ish files here), in
> the first chunk of the following file, or split across the two.
>
> This would mean that there is no efficient deduplication against files
> that are only 1-2 chunks long.
>
> Smaller files (which would not be considered for chunking on their own,
> <512 kB by default) would not deduplicate at all, since in the tar
> stream they are chunked together with their interlaced metadata as if
> they were one big file.
>
> Cheers, Marian
>
> On 13.02.2017 19:20, John Goerzen wrote:
>> Hi folks,
>>
>> Long story, but I've been running borg over a 60GB filesystem for a
>> while now.  This has been working fine.
>>
>> I had been thinking about verifiability, and figured that I could
>> pipe an uncompressed tar of the same data into borg.  This should,
>> theoretically, use very little space, since a tar stream is just some
>> metadata (highly compressible) plus NUL-padded blocks of file data,
>> and those data blocks would be exact matches for what's already in the
>> borg repo.
>>
>> To my surprise, however, this experiment consumed 12GB after compression
>> and dedup.  Any ideas why that might be?
>>
>> My chunker params are at the default.
>>
>> Thanks,
>>
>> John
