[Borgbackup] Storage, CPU, RAM comparisons

Dmitry Astapov dastapov at gmail.com
Mon May 4 15:10:44 EDT 2020


On Mon, May 4, 2020 at 6:55 PM MRob <mrobti at insiberia.net> wrote:

> Dmitry thank you for your perspective,
>
> >> Maybe borg improved on attic, but that's not the point. Borg, attic,
> >> and duplicacy are based on deduplication yet use massively more
> >> storage space than duplicity (and rdiff-backup?). I don't understand
> >> why file-based deltas are more storage efficient than deduplication,
> >> which can consolidate chunks from all files in the repo. I expected
> >> the opposite storage use ratio.
> >
> > In real-life scenarios where taking full backups could be prohibitively
> > expensive (especially with massive amounts of data over high-latency,
> > low-speed links), the rdiff-backup/duplicity approach becomes simply
> > unviable, disk space savings or not: you either have to keep all
> > backups indefinitely (and eventually run out of storage space), or you
> > need to pay the price of a full backup once in a while (which will
> > likely overshadow all the time/disk space savings you made previously).
>
> I understand. In my case latency is not a concern. Also, most files are
> text (this is a server backup, not personal media). I use rsnapshot, so
> I can keep backup schedules for 1y, 6m, daily, etc. It is hard-link
> based, so it is not so bad for disk space, but I am exploring better
> choices; so far Borg looks best.
>
> Yet I still want to understand whether it is true or not that
> deduplication would reduce the disk space requirement. Isn't that the
> purpose of deduplication? Even if the compression choice was not fairly
> evaluated in that comparison, why doesn't deduplication plus (worse)
> compression come closer to duplicity?
>

Oh, but it is close. If you use zstd or zlib for compression, then (using
the files from the benchmark we are discussing) the first borg backup will
be 179 MB, which is roughly what duplicity produces.
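
For example (the paths and archive name below are just placeholders, and
this is only a sketch of how I would select the compression, not a full
backup script):

import subprocess

REPO = "/backup/repo"   # placeholder: an already-initialized borg repository
SRC = "/srv/data"       # placeholder: the data to back up

# --compression accepts e.g. "lz4" (the default), "zstd,3" or "zlib,6";
# a higher number means a better ratio at the cost of more CPU time.
subprocess.run(
    ["borg", "create", "--compression", "zstd,3", f"{REPO}::first-test", SRC],
    check=True,
)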


>
> My first test shows that with text data the default compression works
> nicely (nearly a 40% reduction), but common chunks are not very good (I
> think that's "total chunks" minus "unique chunks", right?),


if you use "borg info" on the backup, I think it will give you the
information you want (number of unique chunks in this backup vs the number
of shared chunks).
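
If you want those numbers in machine-readable form, "borg info --json"
prints the same statistics; here is a small sketch (the repository path is
a placeholder and the JSON key names are from memory, so double-check them
against your borg version):

import json
import subprocess

REPO = "/backup/repo"  # placeholder repository path

# "borg info --json" is available since borg 1.1.
out = subprocess.run(
    ["borg", "info", "--json", REPO],
    check=True, capture_output=True, text=True,
).stdout

# Key names as I remember them; verify against your borg version.
stats = json.loads(out)["cache"]["stats"]
total = stats["total_chunks"]
unique = stats["total_unique_chunks"]
print("total chunk references:", total)
print("unique chunks stored:  ", unique)
if unique:
    print(f"deduplication factor:   {total / unique:.2f}x")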


> a very small reduction (1%) from
> deduplication. The first-time transfer was also slow (even on a fast
> local link), but if it's a one-time operation that's OK.
>
> > So some fraction of the extra space taken by borg will be manifests
> > that tie blocks in the storage to the files that use them.
>
> I don't mind overhead but want to have a clear understanding of the
> costs and benefits. Your opinion takes the big picture of having a
> variety of backup snapshots. I agree with taking the big perspective,
> but can you also help me understand the detailed picture, i.e. whether
> deduplication really does better with storage than the others?
>
> Compared to a hardlink setup like rsnapshot, where a changed file causes
> an entire new copy to be kept, I expect that because common blocks are
> kept only once the storage reduction would be massive (is that correct?).


Yes, this is correct - each unique block is stored at most once.
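
Here is a toy sketch of what that means (this is not borg's actual
chunker, which uses content-defined boundaries and a keyed hash; it just
shows the store-by-content-hash idea):

import hashlib

def store_chunks(data: bytes, store: dict, chunk_size: int = 4096) -> list:
    """Split data into chunks and store each distinct chunk only once.

    The chunk id is derived from the chunk's content, so identical content
    always maps to the same id.  Borg itself uses content-defined (rolling
    hash) boundaries instead of fixed-size chunks, and a keyed hash, but
    the deduplication principle is the same.
    """
    ids = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        cid = hashlib.sha256(chunk).hexdigest()
        store.setdefault(cid, chunk)      # no-op if already stored
        ids.append(cid)
    return ids

store = {}
a = store_chunks(b"hello world" * 10_000, store)
b = store_chunks(b"hello world" * 10_000 + b"!", store)   # near-identical file
print(len(a) + len(b), "chunk references,", len(store), "chunks actually stored")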


> But something similar should be true for rdiff tools that only store
> deltas. Evaluating the features of borg vs. rdiff-backup/duplicity, borg
> is the clear winner, but the large difference in storage cost I saw in
> that comparison is too big to ignore.
>

So one difference, which I already covered, is that "storing deltas"
requires you to keep the "base" to which you can apply those deltas, and
this has implications for backup removal.

The second difference is that file names/paths are input for computing
deltas, so if you move/rename files, this would generate large deltas,
whereas borg will just see the same blocks all over again and "do nothing".

A third difference is that you can easily find out which portion of a
backup is "unique" and which is "shared" -- with deltas, you only know the
diff to the base.

Speaking of diffs, borg can easily diff any two backups, which with deltas
is not straightforward unless you happen to have just a single base backup.
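
That is literally one command, "borg diff"; a minimal sketch with
placeholder repository and archive names:

import subprocess

REPO = "/backup/repo"            # placeholder repository path
OLD, NEW = "monday", "tuesday"   # placeholder archive names

# "borg diff REPO::ARCHIVE1 ARCHIVE2" lists added/removed/modified files
# between two archives without extracting either of them.
subprocess.run(["borg", "diff", f"{REPO}::{OLD}", NEW], check=True)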

If anything happens to your base backup and you detect it, you have no
other choice than to take a new base backup (and discard all the deltas
you have against the corrupted base). With borg, if you detect corrupted
blocks, your next backup will simply store new copies of the blocks you
lost, at a fraction of the cost.
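
Detection itself is a routine operation in borg ("borg check"); a minimal
sketch with a placeholder repository path:

import subprocess

REPO = "/backup/repo"  # placeholder repository path

# "borg check" verifies repository/archive consistency; add --verify-data
# to also re-read and verify the contents of every chunk (much slower).
result = subprocess.run(["borg", "check", REPO])
if result.returncode != 0:
    # As described above, the next "borg create" will simply re-store any
    # chunks that are missing from the repository but still present in the
    # source data.
    print("borg check reported problems -- see its output above")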

With hardlinked rsync backups, the information about data blocks shared by
different files is stored in the filesystem's data structures (inodes in
the case of ext4, etc.). With borg, this information is kept in a data
structure that is written into archive storage for every archive. For
small files, this requires at least ~100 B/file plus some extra overhead
for directories and such. For a Linux kernel tree with ~60K files this
would be on the order of 5-10 MB (I think), which can easily add up in the
context of the benchmark we are discussing.
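
Back-of-the-envelope, using the figures above:

# Rough arithmetic for the per-archive metadata overhead mentioned above.
files = 60_000          # roughly a Linux kernel tree
bytes_per_file = 100    # ~100 B of item metadata per file (rough figure)
print(f"~{files * bytes_per_file / 1e6:.0f} MB of metadata per archive, "
      f"before any extra overhead for directories and such")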


>
> Another question: after, for example, a year of backups, does it require
> a lot more CPU/RAM to compute chunk deltas?


CPU time will generally be proportional to the total number of chunks you
have (it may vary from operation to operation).

Reading
https://borgbackup.readthedocs.io/en/stable/internals/data-structures.html
might help.
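
As a very rough illustration of how chunk count relates to data size and
index memory (the real constants are in that document; the ones below are
approximations I am quoting from memory):

# Very rough illustration only -- treat the constants as approximate.
data_size = 500e9            # e.g. 500 GB of unique source data
avg_chunk = 2 * 1024**2      # default chunker targets ~2 MiB average chunks
bytes_per_entry = 50         # per-chunk cost of the in-memory chunk index

chunks = data_size / avg_chunk
print(f"~{chunks / 1e3:.0f}K chunks, "
      f"~{chunks * bytes_per_entry / 1e6:.0f} MB of chunk index in RAM")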


> What are those costs?


-- 
D. Astapov