[Borgbackup] Baffling behavior on low cpu system with borg backup

Tue Jun 21 04:52:10 EDT 2016

On Mon, Jun 20, 2016 at 12:57:26AM +0200, Thomas Waldmann wrote:
> >To be clear, every single time I have blown away the cache and retried the
> >backup operation, borg has synchronized and completed the backup
> >successfully.
> 
> Can you check how big the "chunks" cache is after that?
I'm not sure how I would do that. Do you mean du -h ~/.cache ? I'll do a
test run on the weekend and let you know.  On the server right now, I get:
718M    /root/.cache/borg
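
A minimal sketch of one way to see which files make up that total, assuming
the cache lives in the default ~/.cache/borg location and that the chunks
index is one of the larger files under it. The helper names dir_sizes and
largest are made up for this example and are not part of borg.

```python
import os


def dir_sizes(root):
    """Return {relative_path: size_in_bytes} for every regular file under root."""
    sizes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a plain file.
            if os.path.isfile(full) and not os.path.islink(full):
                sizes[os.path.relpath(full, root)] = os.path.getsize(full)
    return sizes


def largest(sizes, n=10):
    """Return the n biggest (path, size) pairs, biggest first."""
    return sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)[:n]


if __name__ == "__main__":
    # Assumed default cache location; adjust if the cache lives elsewhere.
    root = os.path.expanduser("~/.cache/borg")
    for path, size in largest(dir_sizes(root)):
        print(f"{size / 2**20:8.1f}M  {path}")
```

Unlike a plain du -h on the top-level directory, this lists individual files,
so a single large index file should stand out from everything else.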

> >Believe it or not, I thought the rpi would be impossible to run borg - but
> >although on my main server borg runs with a bogglesome ~2.1GB of ram
> >allocated, which would *never* fit on the Rpi which has ~490MB of ram
> >available, working on the same remote repository I've seen it use as little
> >as ~130MB and at most a little more than 300MB while doing its thing, and
> >I'm not using any space saving switches either.
> 
> That sounds strange. I'd expect the memory usage to be similar. Maybe on a
> 64bit system a little bit more than on a 32bit system, but the chunks cache
> would be exactly the same amount of memory for both.
> 
> Are you maybe seeing the big slowdown in the moment it begins with paging
> memory to disk / SD card? With little memory that can happen suddenly when
> the hash table is 75% full and gets enlarged. In that moment, the old hash
> table and the new larger one both need to be in memory while it transfers
> the entries from old to new.
> 
> Maybe watch "top" (or htop?) while it is resyncing to see that. Have a look
> at memory and swap usage. Also watch the clock "top" displays to see whether
> the display is updating at all.
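
The resize spike described above can be put into rough numbers. This is only
a back-of-the-envelope sketch: the 75% threshold comes from the message, but
the 44 bytes per bucket and the doubling growth factor are illustrative
assumptions, not values taken from the borg source, and the function names
are invented for the example.

```python
def entries_at_trigger(old_buckets, max_load=0.75):
    """Number of entries at which a table with old_buckets slots gets enlarged."""
    return int(old_buckets * max_load)


def resize_footprint(old_buckets, entry_bytes=44, growth=2.0):
    """Estimate table memory (bytes) before, during and after one resize.

    entry_bytes and growth are assumptions for illustration only.
    """
    old_bytes = old_buckets * entry_bytes
    new_bytes = int(old_buckets * growth) * entry_bytes
    return {
        "before": old_bytes,
        # Old and new table coexist while entries are copied over.
        "during": old_bytes + new_bytes,
        "after": new_bytes,
    }


if __name__ == "__main__":
    for buckets in (1_000_000, 2_000_000, 4_000_000):
        fp = resize_footprint(buckets)
        print(
            f"{buckets:>9} buckets, resize at {entries_at_trigger(buckets)} entries: "
            + ", ".join(f"{k} {v / 2**20:.0f}M" for k, v in fp.items())
        )
```

Under these assumptions the peak during a resize is three times the size of
the old table, which is why a machine with little RAM could start paging
quite suddenly.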

I have run top before as well as monitored the swap/memory use. Quite
simply, there's no memory pressure for it to go into swap - it never uses
more than a few megabytes of swap, and the amount generally doesn't change
from before, during, and after it runs.

When it gets stuck, no memory is being measurably allocated to or freed from
the program in top output.  No network data is, as far as I can tell,
being sent or received by the program when I monitor the network traffic
from the server the repository is on.  While it is attempting to resync and
is stuck I can demonstrably access the rpi using ssh, run programs, read my
email etc.  I don't have a very good measurement of disk I/O, but from the
blinkenlights on the external disk, it doesn't seem to be doing much of
anything at all, and general use of the rpi is unburdened. Normally, trying
to do anything that touches the disk while it is loaded up with, say,
updatedb is an exercise in patience.  During a clean resync it's VERY
active, either pulling as much data as it can via the network (~2.4MB/s,
which is about the hardware limit for the Rpi) during the archive sync, or
when it's merging into the master chunk index the disk is very busy for an
extended time.  When it gets stuck, it just sits there - the program is
constantly in running state with cpu use at 100%, and gets nowhere.
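
For a better disk I/O measurement than the blinkenlights, one option is to
sample the per-process counters the Linux kernel exposes in /proc/<pid>/io.
The following is a sketch only: it assumes a kernel built with task I/O
accounting, that it runs as the same user as the process (or as root), and
the 10 second interval and function names are arbitrary choices for the
example.

```python
import sys
import time


def parse_proc_io(text):
    """Parse the 'key: value' lines of /proc/<pid>/io into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            counters[key.strip()] = int(value)
    return counters


def delta(before, after):
    """Per-counter difference between two samples."""
    return {k: after[k] - before[k] for k in after if k in before}


def read_io(pid):
    with open(f"/proc/{pid}/io") as fh:
        return parse_proc_io(fh.read())


if __name__ == "__main__" and len(sys.argv) > 1:
    pid = int(sys.argv[1])  # PID of the process to watch
    interval = 10           # seconds between samples (arbitrary)
    prev = read_io(pid)
    while True:
        time.sleep(interval)
        cur = read_io(pid)
        d = delta(prev, cur)
        print(
            time.strftime("%H:%M:%S"),
            f"read_bytes +{d.get('read_bytes', 0)}",
            f"write_bytes +{d.get('write_bytes', 0)}",
            f"syscalls r+{d.get('syscr', 0)} w+{d.get('syscw', 0)}",
        )
        prev = cur
```

If every counter stays flat while top shows 100% cpu, that would suggest the
process is spinning in userspace rather than waiting on disk or network.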

Something is obviously going wrong here.

The way my Rpi is configured is a little different from stock - it boots the
kernel off the sdcard in it with minimum graphics memory, then mounts root
off an external usb hard disk, which is much faster than the sd card and
also means I don't have to worry about swap wearing out the
card.  Not that I generally have to worry about it using swap too much.
Generally I only run into swap heavily when I've screwed up fiercely, and it
is *painfully* noticeable how overloaded it becomes - things like echoing
back what you typed in ssh can suddenly take several seconds to respond.

When I do that test this weekend I'll first try simplifying the required
steps and see if I get the same sort of hang; I'll try backing up a single
file, then muck around with the server side so it has to resync and tell it
to try backing up that file again.  If that succeeds in making it hang, it'll
reduce the amount of time required for each no-cache test by about an
hour.

I'll also run a full backup from a clean cache using time so you can see how
long it takes when it works, and record the memory use of borg (maybe using
a script to write ps output in 1 min intervals?) on both the Rpi and the
server when I have them doing tasks.  A full backup of the Rpi using the
server via sshfs should give you an interesting data point to contrast the
Rpi trying to do the same on its own hardware.
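
The memory logging script could look roughly like the sketch below. It
assumes a ps that understands the rss and vsz output fields (as procps on
Linux does, reporting KiB) and takes the PID to watch as its only argument;
the function names and the exact output format are made up for the example.

```python
import subprocess
import sys
import time


def parse_ps_line(line):
    """Turn one 'RSS VSZ' line (KiB, as printed by ps) into a pair of ints."""
    rss_kib, vsz_kib = (int(field) for field in line.split()[:2])
    return rss_kib, vsz_kib


def sample(pid):
    """Ask ps for the resident and virtual size of one process."""
    out = subprocess.run(
        ["ps", "-o", "rss=", "-o", "vsz=", "-p", str(pid)],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,  # ps exits non-zero once the process is gone
    ).stdout
    return parse_ps_line(out)


if __name__ == "__main__" and len(sys.argv) > 1:
    pid = int(sys.argv[1])
    while True:
        try:
            rss_kib, vsz_kib = sample(pid)
        except (subprocess.CalledProcessError, ValueError):
            break  # process has exited, stop logging
        print(
            time.strftime("%Y-%m-%d %H:%M:%S"),
            f"rss={rss_kib / 1024:.1f}M vsz={vsz_kib / 1024:.1f}M",
            flush=True,
        )
        time.sleep(60)  # 1 min intervals
```

Redirecting the output to a file on each machine would give two logs that
can be compared side by side afterwards.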

If I have misunderstood what you are asking for with the chunks cache,
please explain and I'll try to get what you want.

Tim McGrath