[Borgbackup] faster / better deletion, for a bounty?

Mario Emmenlauer mario at emmenlauer.de
Thu Dec 22 04:14:09 EST 2016


Hi John,

On 22.12.2016 01:37, John Goerzen wrote:
> On 12/21/2016 04:25 PM, Mario Emmenlauer wrote:
>> Hi Thomas,
>>
>> On 21.12.2016 14:31, Thomas Waldmann wrote:
>>>> (1) My archive is now 3.4 TB (reported with 'du'), but borg list says
>>>>     the deduplicated archive size is 1.82 TB. Why are the two numbers
>>>>     off by 50%? Below the full output of my borg list.
>>> Did you activate append-only mode for the repo?
>>>
>>> While append-only is set, borg prune/delete will not be able to really
>>> remove data.
>> This is actually before I performed any deletions. The disk usage is
>> reported as 3.4 TB by du and df, whereas borg reports the total dedup
>> size as "only" 1.8TB (so approx. 50% of the actual usage). Is this a
>> typical overhead, or is something fishy in my setup?
> 
> Hi Mario,
> 
> On my system (which is zfs-backed for the moment), zfs list and df actually show
> *less* space used than borg does.  I'm still trying to figure that one out ;-)

Haha that's interesting! Let me know what you find out :-)


> If I understand the dedup size correctly -- and that's an *if* since I have not
> been using borg for more than a few days -- its meaning is /how much space will
> be freed if you delete just this one archive/.  This makes a lot of sense to me,
> because it is exactly the same way zfs gives me the size of snapshots.
> 
> If you have very little change in your datasets but a high number of archives,
> it would be possible for you to have terabytes of data under management and a
> sum of the dedup size of almost zero.  This would not be an error, given the
> meaning listed.
> 
> It is also, therefore, expected that if you remove an archive, the dedup size
> listed in other archives may increase, since if there was a chunk in common
> between the deleted archive and the other one, it wouldn't have shown up in the
> dedup size of either (since deleting /just that one archive/ would not free its
> space), but once one of the two archives is gone, it would be counted to the other.
> 
> Does that make sense?
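The semantics John describes can be modelled in a few lines of Python (an illustrative sketch, not borg's actual code): a chunk referenced by more than one archive counts toward no archive's deduplicated size, so deleting one archive can raise the figure reported for the archives that remain.

```python
# Illustrative model of per-archive "deduplicated size" (not borg's
# real implementation): only chunks referenced by exactly one archive
# count toward that archive's figure.
from collections import Counter

def dedup_sizes(archives, chunk_sizes):
    """archives: name -> set of chunk ids; returns name -> bytes."""
    refs = Counter(c for chunks in archives.values() for c in chunks)
    return {
        name: sum(chunk_sizes[c] for c in chunks if refs[c] == 1)
        for name, chunks in archives.items()
    }

chunk_sizes = {"a": 100, "b": 200, "c": 300}
archives = {"mon": {"a", "b"}, "tue": {"b", "c"}}

print(dedup_sizes(archives, chunk_sizes))
# {'mon': 100, 'tue': 300} -- shared chunk "b" counts toward neither

del archives["mon"]
print(dedup_sizes(archives, chunk_sizes))
# {'tue': 500} -- "b" is now unique to "tue", so its figure grew by 200
```

This also matches the zfs "used" semantics John mentions below.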

What you say makes perfect sense for a single archive. But borg also reports
numbers for "all archives", which I understood to be the numbers for the full
repository. Am I on the wrong track there? Because "all archives" is not the
sum of the individual archives, I assumed it refers to the repo. For the repo,
however, I think the deduplicated size should be equal to the disk size (except
for overheads like metadata, the index, etc.). Therefore I was surprised to see
that for me, it's approx. 50% of the disk usage.

Here is the output of borg list for one of my archives:
Number of files: 1796064
                       Original size      Compressed size    Deduplicated size
This archive:               95.27 GB             70.53 GB            178.00 MB
All archives:               78.26 TB             65.13 TB              1.82 TB
                       Unique chunks         Total chunks
Chunk index:                 9733154            414693364
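A quick arithmetic check on the figures above (assuming, as Mario does, that the all-archives deduplicated size should roughly equal the total stored size of the unique chunks, i.e. the repo's payload on disk):

```python
# Sanity arithmetic on the numbers reported above; the 2x gap between
# borg's figure and du/df is what the question is about.
TB = 10**12
unique_chunks = 9_733_154
dedup_all = 1.82 * TB        # borg's "All archives" deduplicated size
on_disk = 3.4 * TB           # what du/df report for the repo

print(f"avg stored size per unique chunk: {dedup_all / unique_chunks / 1e3:.0f} kB")
print(f"disk usage per unique chunk:      {on_disk / unique_chunks / 1e3:.0f} kB")
print(f"ratio on-disk / dedup:            {on_disk / dedup_all:.2f}")
# ratio comes out around 1.87, i.e. roughly the factor 2 Mario observes
```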

Cheers,

    Mario



> How you count up space is a funny business when you have deduplication going
> on.  Same when you have hard links in your filesystem.  (du can say you've got
> 50GB in a directory, but you might find that rm -r on it only frees up 50K if
> there's a lot of hardlinks to other areas.)
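The hard-link analogy can be made concrete with a du-style walk (a self-contained sketch, not du itself): count each inode only once, so a second name for the same data adds nothing, just as a shared chunk adds nothing to borg's deduplicated sizes.

```python
# du-style accounting: identify files by (device, inode) so hardlinks
# are counted once, no matter how many directory entries point at them.
import os
import tempfile

def du(path):
    seen, total = set(), 0
    for root, _, files in os.walk(path):
        for name in files:
            st = os.stat(os.path.join(root, name))
            key = (st.st_dev, st.st_ino)
            if key not in seen:          # first sighting of this inode
                seen.add(key)
                total += st.st_size
    return total

with tempfile.TemporaryDirectory() as d:
    big = os.path.join(d, "big")
    with open(big, "wb") as f:
        f.write(b"x" * 50_000)
    os.link(big, os.path.join(d, "hardlink"))  # second name, same inode
    print(du(d))  # 50000 -- the data is counted once, not twice
```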
> 
> I think zfs might have a little clearer terminology on this: "referenced" is how
> much data is pointed to by a given snapshot, and "used" is how much space would
> be freed if only that one snapshot were deleted right now.  That's like borg's
> archive size and dedup size.
> 
> John
> 
> 
>>
>>
>>>> (2) In the last months, my backup size went up quite a lot, even though
>>>>     I did not change anything in borg. So I'd like to reverse engineer
>>>>     which archives (or which files) contribute to the sudden increase in
>>>>     size. I tried "borg list" on all archives, but only 7 have ~3 GB of
>>>>     deduplicated space, and all others have less than 1 GB of dedup space!
>>>>     I assumed 533 archives of ~1 GB dedup size = 533 GB total,
>>> No, that is only the sum of the space ONLY used by a single archive.
>>>
>>> As soon as the same chunks are used by more than 1 archive, it does not
>>> show up as "unique chunks" any more.
>>>
>>>>     How would I find the archives that free most space when deleted?
>>> For a single archive deletion, that is the unique chunks space
>>> ("deduplicated size") of that archive.
>>>
>>> For multiple archive deletions there is no easy way to see beforehand.
>> Would it be possible to somehow change this reporting in borg? I
>> think I (possibly accidentally) backed up a few huge files for a few
>> days, that now use up 50% of my archive space. Since the chunks are
>> shared, I have no way of knowing which archives are the "bad guys".
>> My only option seems to prune with a shotgun-approach until eventually
>> I get lucky and free significant disk space. If I'm unlucky I can
>> prune a lot before freeing any significant space...
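The reporting Mario asks for can be prototyped (a hypothetical sketch, not an existing borg feature): the space freed by deleting a whole *set* of archives is the total size of the chunks that no archive outside that set still references, which can be far more than the sum of the per-archive deduplicated sizes.

```python
# Hypothetical "what would deleting this set free?" calculation,
# using the same toy chunk model as above (not a borg feature).
def freed_by_deleting(to_delete, archives, chunk_sizes):
    """archives: name -> set of chunk ids; returns bytes freed."""
    kept = {c for name, chunks in archives.items()
            if name not in to_delete for c in chunks}
    doomed = {c for name in to_delete for c in archives[name]} - kept
    return sum(chunk_sizes[c] for c in doomed)

chunk_sizes = {"a": 100, "b": 200, "c": 300}
archives = {"mon": {"a", "b"}, "tue": {"b", "c"}, "wed": {"c"}}

print(freed_by_deleting({"mon"}, archives, chunk_sizes))         # 100
print(freed_by_deleting({"mon", "tue"}, archives, chunk_sizes))  # 300
```

Note how "tue" alone has a deduplicated size of zero (both its chunks are shared), yet deleting "mon" and "tue" together frees 300 bytes in this toy example: exactly the shotgun-search problem described above.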
>>
>> I think for example 'du' when used on hard links reports the shared
>> disk usage on the first directory it encounters, and does not duplicate
>> the size of hard links on subsequent directories. Would this be a sane
>> behaviour for borg too? Or add a new field for "shared chunks size"?
>>
>>
>> Thanks a lot for the help, and all the best,
>>
>>     Mario Emmenlauer
>>
>>
>> _______________________________________________
>> Borgbackup mailing list
>> Borgbackup at python.org
>> https://mail.python.org/mailman/listinfo/borgbackup
> 



Viele Gruesse,

    Mario Emmenlauer


--
BioDataAnalysis GmbH, Mario Emmenlauer      Tel. Buero: +49-89-74677203
Balanstr. 43                   mailto: memmenlauer * biodataanalysis.de
D-81669 München                          http://www.biodataanalysis.de/

