[Borgbackup] Poor dedup with tar overlay

Marian Beermann public at enkore.de
Mon Feb 13 13:56:46 EST 2017


Hi John,

When working on separate files, the start of the first block is implicitly
set by the start of the file. When working on something like a tar archive
this is not the case; instead, the tar archive looks something like:

  header/metadata for file #1
  contents of file #1
  header/metadata for file #2
  contents of file #2
  ...
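
You can see this interleaving directly with Python's tarfile module. Here is
a small sketch (file names and sizes are made up for illustration) that
prints where each member's 512-byte header block and its contents start:

# Toy sketch: build a small uncompressed tar in memory and print where each
# member's 512-byte header block and its contents start.
import io
import tarfile

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:   # plain, uncompressed tar
    for name, size in [("file1.bin", 4096), ("file2.bin", 8192)]:
        info = tarfile.TarInfo(name=name)
        info.size = size
        tar.addfile(info, io.BytesIO(b"\0" * size))

buf.seek(0)
with tarfile.open(fileobj=buf, mode="r:") as tar:
    for member in tar:
        # member.offset      -> start of the header block (metadata)
        # member.offset_data -> start of the member's contents
        print("%s: header at %d, data at %d, size %d"
              % (member.name, member.offset, member.offset_data, member.size))

Here that prints a header at offset 0 and data at offset 512 for file1.bin,
then the next header at 4608, right after file1's (padded) contents, i.e. a
metadata block sits between every two file contents.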

So every metadata block interleaved between the contents of adjacent files
most likely influences the chunker and will end up either in the last chunk
of the preceding file, in the first chunk of the following file, or split
across the two (assuming big-ish files here).

This means there is no efficient deduplication for files that are only 1-2
chunks long, since every one of their chunks borders on interleaved metadata.
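
To make that concrete, here is a toy content-defined chunker -- this is not
borg's buzhash implementation; window size, hash and mask are arbitrary
values picked so the effect shows up on small inputs. It chunks two random
"files" once as separate inputs and once as a tar-like stream with 512 bytes
of "metadata" spliced in between:

# Toy content-defined chunker -- NOT borg's buzhash; window, hash and mask
# are arbitrary illustration values (roughly 4 KiB average chunks).
import hashlib
import os

WINDOW = 48             # bytes of context the boundary decision looks at
MASK = (1 << 12) - 1

def chunks(data):
    """Cut data wherever a hash of the trailing WINDOW bytes hits the mask."""
    out, start = [], 0
    for i in range(WINDOW, len(data)):
        h = int.from_bytes(
            hashlib.blake2b(data[i - WINDOW:i], digest_size=4).digest(), "big")
        if h & MASK == 0:
            out.append(data[start:i])
            start = i
    out.append(data[start:])
    return out

def ids(chunk_list):
    return {hashlib.sha256(c).hexdigest() for c in chunk_list}

file1 = os.urandom(64 * 1024)
file2 = os.urandom(64 * 1024)

# chunks as seen when the two files are read separately
separate = ids(chunks(file1)) | ids(chunks(file2))
# chunks as seen in a tar-like stream: 512 bytes of "metadata" in between
tarred = ids(chunks(file1 + b"\0" * 512 + file2))

print("%d of %d chunks deduplicate" % (len(separate & tarred), len(separate)))

With "files" many chunks long, the chunk boundaries re-synchronize shortly
after each metadata block, so most chunks still deduplicate and only the ones
straddling the metadata differ. Shrink the files to one or two chunks each
and nothing matches any more.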

Smaller files (which would not be cut by the chunker at all when backed up
individually, < 512 kiB by default) would not deduplicate at all, since in
the tar stream they are chunked together with their interleaved metadata as
if they were part of one big file.
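
For reference, that 512 kiB is the minimum chunk size implied by the default
chunker params; the values below are quoted from memory for borg 1.x, see
the --chunker-params option of "borg create --help":

# Back-of-the-envelope for the default chunker params (19,23,21,4095)
chunk_min_exp, chunk_max_exp, hash_mask_bits = 19, 23, 21

print("minimum chunk size:", 2 ** chunk_min_exp // 1024, "KiB")            # 512 KiB
print("maximum chunk size:", 2 ** chunk_max_exp // 1024 ** 2, "MiB")       # 8 MiB
print("target (avg) chunk size:", 2 ** hash_mask_bits // 1024 ** 2, "MiB") # ~2 MiB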

Cheers, Marian

On 13.02.2017 19:20, John Goerzen wrote:
> Hi folks,
> 
> Long story, but I've been running borg over a 60GB filesystem for a while
> now.  This has been working fine.
> 
> I had a long thought regarding verifiability, and thought that I could
> pipe an uncompressed tar of the same data into borg.  This should,
> theoretically, use very little space, since tar has some metadata
> (highly compressible), and NULL-padded blocks of data.  These data
> blocks would be exact matches for what's already in the borg repo.
> 
> To my surprise, however, this experiment consumed 12GB after compression
> and dedup.  Any ideas why that might be?
> 
> My chunker params are at the default.
> 
> Thanks,
> 
> John
> _______________________________________________
> Borgbackup mailing list
> Borgbackup at python.org
> https://mail.python.org/mailman/listinfo/borgbackup
> 
