Saving a file "in the background" -- How?

Fri Oct 31 08:07:42 EDT 2014

Virgil Stokes <vs at it.uu.se> writes:

> While running a python program I need to save some of the data that is
> being created. I would like to save the data to a file on a disk
> according to a periodical schedule  (e.g. every 10
> minutes). Initially, the amount of data is small (< 1 MB) but after
> sometime the amount of data can be >10MB. If a problem occurs during
> data creation, then the user should be able to start over from the
> last successfully saved data.
>
> For my particular application, no other file is being saved and the
> data should always replace (not be appended to) the previous data
> saved. It is important that  the data be saved without any obvious
> distraction to the user who is busy creating more data. That is, I
> would like to save the data "in the background".
>
> What is a good method to perform this task using Python 2.7.8 on a
> Win32 platform?

There are several requirements:

- save data asynchronously -- "without any obvious distraction to the
  user"
- save data durably -- avoid corrupting previously saved data or
  writing only partial new data, e.g., in case of a power failure
- do it periodically -- handle drift/overlap gracefully in a documented
  way

A simple way to do asynchronous I/O on Python 2.7.8 on a Win32 platform
is to use threads:

  import threading

  t = threading.Thread(target=backup_periodically, kwargs=dict(period=600))
  t.daemon = True # stop if the program exits
  t.start()

where backup_periodically() backs up data every *period* seconds:

  import logging
  import time

  def backup_periodically(period, timer=time.time, sleep=time.sleep):
      start = timer()
      while True:
          try:
              backup()
          except Exception: # log exceptions and continue
              logging.exception("backup failed")
          # sleep until the next multiple of period since start
          sleep(period - (timer() - start) % period)

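To keep backup times from drifting, the sleep is aligned to the timer
using the modulo operation: each wake-up falls on a multiple of *period*
seconds after *start*, regardless of how long backup() took. If backup()
takes longer than *period* seconds (unlikely for 10MB per 10 minutes),
the missed steps are skipped and the loop resumes at the next multiple.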
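
To make the arithmetic concrete, here is a small sketch that isolates
the delay computation. next_delay() is a hypothetical helper introduced
only for illustration; it is the same expression that is passed to
sleep() above, with the clock readings given as plain numbers.

```python
def next_delay(start, now, period):
    """Seconds to sleep so that the wake-up lands on the next multiple
    of period counted from start (hypothetical helper, illustration only)."""
    return period - (now - start) % period


# backup() took 7 seconds: sleep 593 s and wake up at t = 600
print(next_delay(start=0, now=7, period=600))    # 593

# backup() took 650 seconds: the tick at t = 600 is skipped,
# sleep 550 s and wake up at t = 1200
print(next_delay(start=0, now=650, period=600))  # 550
```

Because the delay is always computed from *start*, the time spent in
backup() does not accumulate into the schedule.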

backup() makes sure that the data is saved and can be restored at any
time:

  def backup():
      with atomic_open('backup', 'w') as file: 
          file.write(get_data())

where atomic_open() [1] tries to overcome multiple issues with saving
data reliably:

- write to a temporary file so that the old data is always available
- rename the file when all new data is written, handle cases such as:
  * "antivirus opens old file thus preventing me from replacing it"

Either the operation succeeds and 'backup' contains the new data, or it
fails and 'backup' contains the untouched, ready-to-restore old data --
nothing in between.
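
The temp-file-then-rename idea can be sketched as follows. This is a
minimal illustration, not the code from atomicfile.py [1]:
atomic_write() is a hypothetical name, and the sketch assumes
os.replace(), which exists only on Python 3.3+. On Python 2.7 on
Windows, os.rename() fails if the destination already exists, which is
one reason a dedicated module is useful there. The sketch also makes no
attempt to handle the "antivirus holds the old file open" case.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace path with data (bytes); on failure the old file is untouched.

    Hypothetical helper for illustration; assumes Python 3.3+ (os.replace).
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory, so the final
    # rename never has to cross a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix='.tmp-')
    try:
        with os.fdopen(fd, 'wb') as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)  # swap the new file in for the old one
    except BaseException:
        # Clean up the temporary file; the old 'path' is still intact.
        try:
            os.remove(tmp_path)
        except OSError:
            pass
        raise
```

With such a helper, backup() would reduce to a single call like
atomic_write('backup', get_data()), assuming get_data() returns bytes.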

[1]: https://github.com/mitsuhiko/python-atomicfile/blob/master/atomicfile.py

I don't know how production-ready atomicfile.py is, but you should be
aware of the issues it is trying to solve if you want a reliable backup
solution.

--
Akira