[Borgbackup] Poor dedup with tar overlay

Thomas Waldmann tw at waldmann-edv.de
Mon Feb 13 15:20:06 EST 2017


> when working on separate files the first block start is implicitly set
> by the file start. When working on something like a tar archive this is
> not the case, instead, the tar archive looks something like:
> 
> header metadata for file #1
> contents of file #1
> metadata for file #2
> contents of file #2
> ...
> 
> So every metadata block interlaced between the contents of adjacent
> files most likely influences the chunker, and will end up either in
> the last chunk of the preceding file (assuming big-ish files here),
> in the first chunk of the following file, or split across them.
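
For context: the default chunker is content-defined, i.e. it cuts
wherever a rolling hash over a small sliding window hits a trigger
value, so cut points depend only on the bytes near them, and the
interlaced metadata blocks just get absorbed into whichever
neighbouring chunk they happen to fall into. A much-simplified sketch
of the idea (not borg's actual buzhash chunker, and without the real
min/max chunk size handling):

    WINDOW = 48                  # sliding window size in bytes
    MASK = (1 << 16) - 1         # ~64 KiB average chunk size
    BASE = 257
    MOD = (1 << 61) - 1
    OUT = pow(BASE, WINDOW - 1, MOD)   # factor of the outgoing byte

    def chunk(data):
        """Split data (bytes) into content-defined chunks."""
        h = 0
        prev = 0
        for i, byte in enumerate(data):
            if i >= WINDOW:
                # slide the window: drop the byte falling out on the left
                h = (h - data[i - WINDOW] * OUT) % MOD
            h = (h * BASE + byte) % MOD
            if i >= WINDOW and (h & MASK) == 0:
                yield data[prev:i + 1]
                prev = i + 1
        if prev < len(data):
            yield data[prev:]    # the remainder becomes the last chunk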

BTW, we have a ticket about special chunkers for formats like tar: the
idea is to simulate separate files by understanding the tar format and
placing chunk boundaries at file starts / ends.

That is not implemented yet, though, and I think (if we ever implement
it) it should wait until after borg 1.2, because we will refactor the
internal architecture into separate workers (for worker threads) then.
It will likely be easier to swap out code for some components of borg
after that refactoring.

I am not sure whether it would be worth it for tar files, though.

An even simpler fixed-block chunker could support database files with
a fixed record size; there is also a ticket about that.
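
Such a chunker is almost trivial; a sketch (the block size would of
course have to match the database's page / record size, 4 KiB here is
just an example):

    def fixed_block_chunks(fileobj, block_size=4096):
        """Cut the input into equal-sized blocks, so an in-place record
        update only changes the blocks containing the touched records."""
        while True:
            block = fileobj.read(block_size)
            if not block:
                break
            yield block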

We also have a ticket about steering the chunker by file extension,
which would be needed to trigger these special chunkers while keeping
the normal rolling hash chunker for everything else.
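
Conceptually that is just a lookup from extension to chunker (again
only a sketch, the mapping and the names are made up, borg has no such
option today):

    import os

    CHUNKERS = {
        '.tar': 'tar-aware',        # hypothetical, see sketch above
        '.sqlite': 'fixed-block',   # hypothetical, see sketch above
    }

    def pick_chunker(path, default='rolling-hash'):
        """Pick a chunker name by file extension, falling back to the
        normal rolling hash chunker for everything else."""
        ext = os.path.splitext(path)[1].lower()
        return CHUNKERS.get(ext, default)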

But all of that is 1.3+ material (if ever), so let's concentrate on
getting 1.1 released first and then avoid packing too much into 1.2,
so it can be released in a timely manner.

-- 

GPG ID: 9F88FB52FAF7B393
GPG FP: 6D5B EF9A DD20 7580 5747 B70F 9F88 FB52 FAF7 B393


