[Borgbackup] Deduplication of tar files - doesn't seem to be giving good performance

public at enkore.de
Thu Apr 21 07:41:08 EDT 2016


On 21.04.2016 11:03, Sitaram Chamarty wrote:
> On 04/21/2016 01:58 PM, public at enkore.de wrote:
>> Since Borg doesn't know the structure of a tar file, my guess is that
>> changed metadata stored in-line with the file data will make
>> deduplication of the file data impossible for files that are smaller
>> than 1-2 average chunk sizes (>2 MB).
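
A rough way to see this effect (untested sketch; the repo path and /data
are just placeholders) is to feed two tars that differ only in metadata
to borg via stdin and compare the --stats output:

  $ tar cf - /data | borg create --stats /path/to/repo::tar-1 -
  $ touch /data/some-small-file               # metadata change only
  $ tar cf - /data | borg create --stats /path/to/repo::tar-2 -

With the current default chunker params (~2 MB target chunk size), the
second archive will likely still store roughly one new chunk per touched
small file, because the changed tar header sits in the same chunk as the
file data.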
> 
> Oh very nice; I had not thought of this but it makes perfect sense!
> 
>> For this specific use case I'd recommend using the old chunker params,
>> which should allow better deduplication; still, unchanged small files
>> with updated metadata won't deduplicate.
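
For reference, the old params can be set per archive at create time;
something like this (untested; the repo path is a placeholder, and
10,23,16,4095 are the old defaults, i.e. a ~64 kB target chunk size
instead of ~2 MB):

  $ tar cf - /data | borg create --stats \
      --chunker-params 10,23,16,4095 \
      /path/to/repo::tar-small-chunks -

Smaller chunks mean more chunks to track (bigger index, more RAM), but a
small file is more likely to end up in a chunk of its own, so a changed
tar header hurts less.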
>>
>> When deduplicating actual file systems this doesn't seem to be as
>> troublesome; my guess here is that most file systems tend to put inodes
>> (with the often-changing metadata) in one place and file data in
>> another, hence metadata updates don't affect data deduplication as much.
> 
> My guess would be that borg itself "knows" what is metadata and what is
> file data, and has different storage/dedup mechanisms for them.

My bad, I meant to write "deduplicating actual file system *images*".

When Borg makes archives from a file system (not an FS image), the
physical layout of the FS doesn't matter; it reads files and directories
through the normal APIs, like most programs do.

File contents go directly into content chunks; metadata goes into the
item stream (one item per file or directory), which is chunked with a
different, very fine-grained chunker.
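
That separation is easy to observe on a plain directory backup (again,
paths and repo names are only placeholders):

  $ borg create --stats /path/to/repo::run-1 /data
  $ find /data -exec touch {} +                # metadata changes only
  $ borg create --stats /path/to/repo::run-2 /data

The second run should add almost no deduplicated data: all content
chunks are already known, and only the small item-stream chunks carrying
the updated timestamps are new.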

Cheers, Marian

> 
> regards
> sitaram
> 
>>
>> Still, for optimal granularity you'll want Borg to be able to tell files
>> apart.
>>
>> Cheers, Marian
>>
>> On 21.04.2016 09:11, heiko.helmle at horiba.com wrote:
>>>> Borg isn't capable of handling duplicate pieces inside a file.
>>>>
>>>> oops; my apologies.  I reacted too fast and did not realise that borg was
>>>> getting an uncompressed file.
>>>>
>>>> I assume this means borg gets the file via STDIN?  If so, maybe it has
>>>> something to do with STDIN being less amenable to dedup?
>>>>
>>>> sorry again for my previous (useless) mail!
>>>
>>> I'm seeing something similar here. I used attic (and many early borg
>>> revisions) to back up a few work VMs here. A slightly bigger one (about
>>> 100 GB) was backed up daily. This backup took about half an hour (with
>>> -C lzma) and resulted in about 1-2 GB of new data (deduped and
>>> compressed) each time.
>>>
>>> Now with recent borg, the amount of new data jumped to about 17-20 GB
>>> per day and it took much longer (I had to scale back to zlib
>>> compression to have the backup finish before the LVM snapshot filled
>>> up). This indicates that the deduplication engine took a hit along the
>>> way and feeds much more data to lzma, which makes the overall runtime
>>> slower.
>>>
>>> This *might* coincide with the change in the default chunker params, but
>>> I'm not sure. Unfortunately I didn't pay attention to which release
>>> actually started the drop in dedup performance. If I find the time, I
>>> might start a trial run with the "classic" parameters (10,23,16,4095),
>>> but not this week :)
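
If someone gets to that trial, a rough sketch (untested; repo paths and
the snapshot mount point are placeholders) would be to back up the same
snapshot into two throwaway repos, once with each set of params:

  $ borg init -e none /tmp/repo-default
  $ borg init -e none /tmp/repo-classic
  $ borg create --stats -C zlib /tmp/repo-default::day1 /mnt/vm-snapshot
  $ borg create --stats -C zlib --chunker-params 10,23,16,4095 \
      /tmp/repo-classic::day1 /mnt/vm-snapshot

Repeating both create runs the next day and comparing the "Deduplicated
size" lines in the stats output should show which params dedup better.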
>>>
>>> Best Regards
>>>  Heiko
> 



More information about the Borgbackup mailing list