[Borgbackup] Inconsistencies in repository

Thomas Waldmann tw at waldmann-edv.de
Wed May 1 18:37:53 EDT 2024


> Killed stale lock ha-idg-3.scidom.de at 140947408721621.41751-0.
> Removed stale exclusive roster lock for host ha-idg-3.scidom.de at 140947408721621 pid 41751 thread 0.   <=====

borg kills stale locks only if it can be really sure they are stale (== 
invalid and not needed any more).

So killing the lock just removes a leftover lock from a long-dead
borg process.
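
Conceptually, the staleness test looks roughly like this (a
simplified sketch in Python, not borg's actual locking code; the
real roster also records a thread id, as visible in your log):

    import os
    import socket

    def lock_is_stale(lock_host, lock_pid):
        # Staleness can only be judged on the host that created the
        # lock: only there we can ask the kernel about the pid.
        if lock_host != socket.gethostname():
            return False          # can't verify -> assume in use
        try:
            os.kill(lock_pid, 0)  # signal 0: existence check only
        except ProcessLookupError:
            return True           # no such process -> dead leftover
        except PermissionError:
            return False          # process exists (other user)
        return False              # process is still running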

> After the prune I got the following errors:
> segment 10365 not found, but listed in compaction data ...

That's strange. Either the compaction data is a bit off (that would
be a minor issue) or segment files are gone that should still be
there (a major issue).

> and later on it seems the CIFS share is gone:
> BlockingIOError: [Errno 11] Resource temporarily unavailable: '/mnt/nas/Daten/AG_BioInformatik/Technik/borg_backup/lock.roster'

That might point to a severe issue with your NAS (hardware, software
or network connection), which could be the root cause of all the
trouble.

Be aware that trying "borg check --repair" on the repo without
fixing the root cause of the issues first might make the situation
even worse.

> Data integrity error: Segment entry checksum mismatch [segment 308, offset 22670132]

That is bad. It means that a segment entry (which could be a file
content chunk or an archive metadata stream chunk) failed the crc32
check. Somehow you have data corruption there.
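
The check itself is a plain crc32 comparison, roughly like this
(sketch; the exact field layout is simplified here):

    import zlib

    def entry_ok(stored_crc32, entry_bytes):
        # entry_bytes is everything the checksum was computed over
        # (for borg's segment entries: size, tag, key and data).
        return zlib.crc32(entry_bytes) & 0xffffffff == stored_crc32

If even one bit of the entry flipped on disk or in transit, the
comparison fails and borg reports exactly the error you saw.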

Check the RAM and CPU of the NAS.
Also check the SMART state of your disk(s).
Check the power supply.

> ID: 029179b026cc8f09fa5e23bc7c3a3a6fb414f8a227ca700a5883caa07ef80aef rebuilt index: <not found>      committed index: (308, 262593301)

That means that the on-disk committed repo index has an entry for a
chunk with that ID, while the in-memory rebuilt index does not.

That means the rebuild process did not process that chunk when
rebuilding the index from the segment files. That could be because
the segment file containing it is gone, or because the crc32 check
failed and the segment entry for that ID was considered invalid.
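
Schematically, the rebuild just replays all segment entries in order
(a sketch; iterate_entries() is a hypothetical stand-in for borg's
segment reader, which only yields entries that pass the crc32 check):

    def rebuild_index(segment_files):
        # segment_files sorted ascending; later entries win.
        index = {}
        for segment_no, segment in enumerate(segment_files):
            for tag, chunk_id, offset in iterate_entries(segment):
                if tag == 'PUT':
                    index[chunk_id] = (segment_no, offset)
                elif tag == 'DEL':
                    index.pop(chunk_id, None)
        return index

A PUT entry that is unreadable (missing segment file, failed crc32
check) is never replayed, so its chunk ID ends up missing from the
rebuilt index, while the committed index still has a
(segment, offset) pair for it.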

> 8035 orphaned objects found!

That could be a relatively harmless issue - or not.
It just means that these objects are not used any more.
But one reason why they are not used any more could be that the
metadata that pointed to them (used them) is now missing.
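
In essence, the orphan check is a set difference (sketch;
archive.chunk_ids() is a hypothetical accessor for all chunk IDs an
archive's metadata references):

    def find_orphans(repo_index, archives):
        referenced = set()
        for archive in archives:
            referenced.update(archive.chunk_ids())
        # orphans: in the repo index, but referenced by no archive
        return set(repo_index) - referenced

So if archive metadata was itself destroyed by the corruption, the
chunks it used to reference now show up as orphans, even though the
chunk data may be perfectly fine.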

> But what is a segment and what is a chunk ?

Files are read and cut into smaller pieces (chunks) by borg's
chunker, using various algorithms (default: buzhash). E.g. if you
have a 100 GB VM disk image, borg might cut that file into ~50000
chunks of ~2 MB each.
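
The point of content-defined chunking is that cut points depend on
the data itself (via a rolling hash over a sliding window), not on
fixed offsets, so an insertion near the start of a file does not
shift all later chunk boundaries. A much-simplified sketch (the hash
below is a toy, not borg's buzhash):

    def chunk_boundaries(data, min_size=4096, mask=0x1fffff):
        # Cut where the low 21 bits of the checksum are all zero:
        # that gives ~2 MiB (2**21 bytes) chunks on average,
        # similar to borg's defaults.
        h = 0
        start = 0
        for i, byte in enumerate(data):
            h = ((h << 1) ^ byte) & 0xffffffff
            if i - start >= min_size and (h & mask) == 0:
                yield data[start:i + 1]
                start = i + 1
        if start < len(data):
            yield data[start:]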

A segment file (short: segment) contains a sequence of segment entries.

A segment entry can be one of these types:
- PUT + chunkid + chunkdata
- DEL + chunkid
- COMMIT
Additionally, each segment entry has a crc32 checksum and an overall size.

More precise information about this can be found in the docs,
"internals" section.
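
As a rough illustration, reading a segment file looks about like
this (a sketch following the borg 1.x on-disk layout from the
internals docs, simplified - e.g. no handling of truncated entries):

    import struct
    import zlib

    MAGIC = b'BORG_SEG'             # segment file magic
    HEADER = struct.Struct('<LLB')  # crc32, entry size, tag
    TAG_PUT, TAG_DEL, TAG_COMMIT = 0, 1, 2

    def iter_segment(path):
        with open(path, 'rb') as f:
            if f.read(len(MAGIC)) != MAGIC:
                raise IOError('not a segment file')
            while True:
                header = f.read(HEADER.size)
                if not header:
                    break               # clean end of file
                crc, size, tag = HEADER.unpack(header)
                rest = f.read(size - HEADER.size)
                # the crc32 covers everything after the crc field
                if zlib.crc32(header[4:] + rest) & 0xffffffff != crc:
                    raise IOError('segment entry checksum mismatch')
                if tag == TAG_PUT:
                    yield 'PUT', rest[:32], rest[32:]  # id, data
                elif tag == TAG_DEL:
                    yield 'DEL', rest[:32], None       # id only
                elif tag == TAG_COMMIT:
                    yield 'COMMIT', None, None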

> And do you have any advice what I could do ?

Check for NAS hardware/software/network issues and fix them first.
Make sure the network and the NAS work correctly and reliably.

Then you can try "borg check --repair --progress -v REPO".
It will try to rescue whatever is still there and bring the repo
into a consistent state. It can't do wonders though, and it won't
bring lost or corrupted data back.
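
Running the same command without --repair first is safe: that is a
read-only check which only reports what is broken, so you can assess
the damage before anything gets modified:

    borg check -v --progress REPO             # read-only report
    borg check -v --progress --repair REPO    # actual repair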

> I tried to check the archive metadata separately for each archive to find out which one is broken, but
> an archive metadata check needs to check all archives.

Some checks can only be done for the whole repo / all archives,
especially the refcounting / orphans check, of course.

-- 
GPG Fingerprint: 6D5B EF9A DD20 7580 5747  B70F 9F88 FB52 FAF7 B393

