[Borgbackup] Deduplication not efficient on single file VM images

Thomas Waldmann tw at waldmann-edv.de
Mon Dec 4 15:32:51 EST 2017


> The backup image files are created by another tool (so these are
> proper backups, not live disk images) and I am piping them into borg
> stdin in my wrapper script.

The poor dedup is likely because of the data format used.

> The biggest problem right now is that Borg seems to fail to
> deduplicate most of the data:
> 
> # du -sh {zbackup,borg}/vm-100
> 1,9G    zbackup/vm-100
> 8,0G    borg/vm-100

I don't know zbackup details, but maybe you need finer granularity for
borg's chunker?

> Borg stats output for first, second and last borg create for vm-100:
>  ------------------------------------------------------------------------------
>  Archive name: vzdump-qemu-100-2017_11_20-15_52_32.vma

Ah, you use Proxmox? So I guess one needs to research that .vma format...

https://git.proxmox.com/?p=pve-qemu.git;a=blob;f=vma_spec.txt

It puts a UUID into the VMA extent headers.
Looks like this is always a different UUID in each .vma file.
So that spoils dedup for the chunks containing that UUID.

extent = 59 clusters x 64 KiB = ~3.8 MB

borg's default target chunk size is 2 MiB - so roughly every second
chunk will contain such a UUID (and thus not dedup with other .vma
files), and even the chunks without a UUID inside might not match
chunks from other .vma files due to different cutting places.

So, you need to lower the target chunk size significantly. You could
check what zbackup uses or just try some target chunk sizes >= 64 KiB.
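
Untested sketch (borg 1.1 syntax; repo path, archive name and the
command producing the image stream are placeholders), targeting
~64 KiB chunks via HASH_MASK_BITS=16, with 2^10 min and 2^23 max
chunk size:

  your-dump-command | borg create --chunker-params 10,23,16,4095 \
      /path/to/repo::vzdump-qemu-100-{now} -

Note that changed chunker params only apply to newly created archives,
so old archives chunked with the default params won't dedup against
the new, smaller chunks.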

> The machine itself is a simple shorewall based router and the image
> doesn't change much. The only content that is changing are the logs,
> so I am truly amazed why the deduplication performs so weakly.

You could try doing a snapshot manually and reading the raw image data
(from the block device or whatever) into borg.
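
Rough, untested sketch of that, assuming an LVM-backed disk (the VG/LV
names and repo path are made up):

  # lvcreate -s -n vm-100-snap -L 2G /dev/vg0/vm-100-disk-1
  # dd if=/dev/vg0/vm-100-snap bs=4M | borg create /path/to/repo::vm-100-raw-{now} -
  # lvremove -f /dev/vg0/vm-100-snap

That way the input bytes stay identical between runs (apart from the
blocks that really changed), so the chunker gets much friendlier input
than the .vma stream.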

> I guess I could run zerofill on the VM images, but on the other hand
> zbackup somehow managed to deduplicate most of the stuff, so I
> wouldn't think that this is the issue.

Yeah, looks like that's not the issue.

> Is there something I am missing from the documentation regarding
> tuning for my use-case?

--chunker-params maybe. See also docs/misc/...

But be aware that smaller chunks also mean more chunks and thus more
management overhead.
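
Ballpark for a 14 GB image (just dividing image size by target chunk
size):

  14 GiB / 2 MiB  target ~=   7,000 chunks
  14 GiB / 64 KiB target ~= 230,000 chunks

so the chunks index, cache and per-chunk metadata grow by roughly 30x.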

> processes using the same cache dir? Should the cache dir be separate
> for different repos?

No, it creates a separate dir per repo under the borg cache dir anyway.
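By default that is ~/.cache/borg/<repository-id>/, so different repos
never share cache state even when they use the same base cache dir.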

> Another problem is that the backup takes way longer (zbackup takes
> around 8 minutes to process the non-initial 14GB images, borg takes
> more than 2 hours every time).

That's likely the consequence of dedup not kicking in as much as expected.

> My assumption is that this difference
> is due to zbackup using multiple threads for lzma compression.

That too. But if you're in a hurry, just don't use lzma; use lz4 instead.
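
E.g. (untested, same placeholder repo and archive name as above):

  your-dump-command | borg create --compression lz4 \
      /path/to/repo::vzdump-qemu-100-{now} -

lz4 compression is also single-threaded in borg, but it is fast enough
that it usually isn't the bottleneck.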


-- 

GPG ID: 9F88FB52FAF7B393
GPG FP: 6D5B EF9A DD20 7580 5747 B70F 9F88 FB52 FAF7 B393


