[Tutor] Monitoring directories

Martin A. Brown martin at linux-ip.net
Sat Feb 15 04:08:41 EST 2020


Hello,

Filesystems are wonderful and powerful.  (Hardware disclaimer [0].)

Filesystems can also be tricky.

There are conventions for non-racy interactions with filesystems.

Fundamental items:

  * writing to a filesystem can have all sorts of problems
  * there are a few filesystem calls that are atomic

So, if you are doing lots of work with a filesystem, learn how to 
take advantage of the atomic operations to minimize your exposure to 
races and significantly reduce your risk of data loss (due to 
software).

Let's say I run a family grocery for hungry giraffes.

>I have a little script below that copies files from a directory and
>all of its sub directories over to a target dir. The target dir doesnt
>have any sub dir's and is exactly how i want it.  The copy checks if
>the file in the source dir exists in the target dir and then copies if
>it doesnt.

This seems deceptively simple.  So, maybe this is ./conveyer.py 
(because it conveys files from the source tree of subdirectories to 
a single target directory....)

Problem 1:  What happens if a file source/a/monkey.txt exists 
and there's also a file source/b/monkey.txt ?  It's possible that 
you know your data and/or applications and this would never 
happen ... but it is certainly possible from a filesystem 
perspective.

Problem 2:  How do you know the input file in sourcedir is complete? 
Has it been written successfully?  (Is shelf-stocker.py writing 
files atomically?  See below.)  (Cameron Simpson mentioned this 
problem, too.)

>Problem is is that i have another utility monitoring that target dir
>moving the files out to process them further so as it processes them
>the target dir will always be empty thus coping every file from source
>folder over and over depending on interval i run the script

Yes!  And, that makes the job of the program feeding data into the 
target dir even a bit harder, as you have already figured out.  
What happens when hungry-giraffe.py consumes the file out of the 
target directory?  How is conveyer.py going to know not to copy it 
from sourcedir into the target dir again?

Problem 3:  How do you know if you have copied the file already, 
especially if there's a hungry-giraffe.py eating the files out of 
the target directory?

>What i would like to do is change my below script from checking if 
>sourcedir filename exists in target dir, is to change it so it 
>monitors the source dir and copies the newest files over to the 
>target dir.

A filesystem makes some guarantees that can be quite useful.  There 
are some operations which are atomic on filesystems.  The Python 
calls os.link() and os.unlink() are my favorites.  These create an 
additional name for a file (link) and remove a name (unlink); 
together they can emulate a rename.  For example, if an os.link() 
call returns success, your file is now available under the new 
name.  And, if an os.unlink() call returns successfully, you have 
removed the file.

Safely handling files in a non-racy way requires care, but pretty 
reliable software can be written by trusting the filesystem 
atomicity guarantees afforded by these functions.
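
For instance, a minimal sketch (the file names here are purely 
illustrative, and monkey.txt is assumed to exist already):

  import os

  # -- create a second name (a hard link) for the existing file;
  #    if this call returns, monkey-final.txt is fully visible
  os.link("monkey.txt", "monkey-final.txt")

  # -- remove the original name; the data stays reachable through
  #    the remaining name
  os.unlink("monkey.txt")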

Copying data from one file (or filesystem) to another, writing 
directly from memory to disk, or generating output programmatically 
can fail in many ways.  Maybe the disk gets full, and you can't 
write any more.  Maybe you write too many bytes to a single file, 
and it can't get larger.  Maybe you cannot open the file for 
writing because of a permissions issue or inode exhaustion.  
(My inodes just flew in from Singapore, and they are very tired.)

In these (admittedly rare) cases, you don't know at what point in 
the attempted sequence of open(), f.write(), and f.flush() / 
os.fsync() calls the failure occurred, nor whether the data you 
just put into the file are correct.  So, what's a concerned 
software developer to do?  Here's an old technique (30+ years?) for 
taking advantage of filesystem atomicity to make sure you safely 
create / copy files (a sketch in code follows the list):

  - open up a temporary file, with a temporary name in your output 
    directory (either in the same directory, or if you are paranoid 
    / writing large numbers of files simultaneously, in a directory 
    on the same filesystem; see Maildir description for details [1])
    (If error: delete tempfile, bail, crash, try again or start over)

  - write every bit of data from memory (and check for errors on 
    your f.write() or print() calls) or, copy the data from the 
    input file to the tempfile
    (If error: delete tempfile, bail, crash, try again or start over)

  - close the tempfile
    (If error: delete tempfile, bail, crash, try again or start over)

  - os.link() the tempfile to the full name
    (If error: delete tempfile, bail, crash, try again or start over)

  - os.unlink() the tempfile
    (Programmer judgement: log the error and ... ignore or crash, 
    depending on what your preference is; if you have a separate 
    directory for your tempfiles, you can often just log and ignore 
    this and have a separate sweeping agent clean up the cruft)
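
Here is roughly how those steps might look in Python (a sketch 
only; atomic_write() and its arguments are names invented for 
illustration, and data is assumed to be bytes):

  import os
  import tempfile

  def atomic_write(targetdir, fname, data):
      # -- step 1: a temporary file, with a temporary name, in the
      #    output directory (same filesystem)
      fd, tmppath = tempfile.mkstemp(dir=targetdir)
      try:
          # -- steps 2 and 3: write every byte, then close; any
          #    exception falls through to the cleanup below
          with os.fdopen(fd, 'wb') as tf:
              tf.write(data)
          # -- step 4: atomically give the data its final name
          os.link(tmppath, os.path.join(targetdir, fname))
      finally:
          # -- step 5: remove the temporary name
          os.unlink(tmppath)

If the os.link() succeeds, anybody reading targetdir sees either 
the whole file under its final name or no file at all, never a 
half-written one.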

So, putting that together (conceptually only, at this point), and 
assuming that shelf-stocker.py (which puts things into the input 
directory) and conveyer.py (which puts things into the target 
directory for the giraffes) both use the above filesystem handling 
techniques, you have solved Problem 2.

If you write the file with a tempfile name, you'll sidestep Problem 
1.  If you want to keep the original name as part of the final 
name, but use the tempfile technique, you can do something like 
this (assuming target_tempfile is the tempfile you have just 
written and fname is the original file name):

  import os
  import tempfile

  # -- write data, avoid errors; if errors, bail and delete,
  #    log and crash, or maybe just loop and try again later ...
  #    whatever your application wants to do
  #
  target_tempfile.close()

  # -- get a guaranteed, non-colliding, new filename that keeps
  #    the original name as a prefix
  #
  tf = tempfile.NamedTemporaryFile(dir=".", prefix=fname, delete=False)
  tf.close()

  # -- now move the data over top of the just-created placeholder;
  #    os.link() refuses to overwrite an existing name, so use
  #    os.replace(), which renames atomically and removes the
  #    temporary name in one step
  #
  os.replace(target_tempfile.name, tf.name)

The downside to solving Problem 2 as I suggest above is that you 
can't tell, by the presence of a file in target_dir, whether a file 
has already been copied.  To detect that, you would have to keep 
some state somewhere.

So, as you may imagine now, it's safer to either A) duplicate the 
source directory hierarchy exactly into the target directory (so 
you don't have to worry about name collisions) or B) settle on a 
single flat directory for both the source and the target.

Problem 3 can be solved (as Alan Gauld pointed out) by storing the 
filenames you have seen in a file.  Or, you can try to take 
advantage of the filesystem's atomicity guarantees for this too, so 
that you know which files your conveyer.py has copied from the 
source directory to the target directory for the hungry-giraffe.py.

How?

    source:  dir with input files (sub-tree complicates)
    target:  dir for the giraffes to browse on
    state:  dir to track which input files have been copied

Since the os.symlink() call is an atomic filesystem 
operation, too, you can use it to create a symlink in the state 
directory that points to any file your conveyer.py program has 
already handled.  (Note, this does leave a very small race in the 
program; see the sketch after the two steps below.)

  - copy the file from source to a tempfile in target, then 
    os.link() to the final name in target
  - use os.symlink() to create a symlink in state that points to 
    the file in the source dir
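
Put together, the conveying of one file might look roughly like 
this (a sketch only; convey_one() and its arguments are 
hypothetical names):

  import os
  import shutil
  import tempfile

  def convey_one(srcpath, targetdir, statedir):
      fname = os.path.basename(srcpath)
      marker = os.path.join(statedir, fname)
      if os.path.islink(marker):
          return                 # -- already conveyed on a prior run
      # -- copy to a tempfile in targetdir, then os.link() to the
      #    final name (the pattern from above)
      fd, tmppath = tempfile.mkstemp(dir=targetdir)
      os.close(fd)
      try:
          shutil.copyfile(srcpath, tmppath)
          os.link(tmppath, os.path.join(targetdir, fname))
      finally:
          os.unlink(tmppath)
      # -- record the handled input; the very small race lives
      #    between the os.link() above and this os.symlink()
      os.symlink(srcpath, marker)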

When you introduce the state directory, though, you have introduced 
another race, because it's easy to imagine the conveyer.py program 
crashing after it has finished feeding the giraffes, but before it 
has updated its state directory.  So, upon the next execution, 
conveyer.py will copy the same file to the target_dir again.

>As this script is invoked currently via a cron job im not familiar
>with what i would need to do to make this python script monitor in
>daemon mode or how it would know  the newest files and the fact it
>haunt already copied them over.

Or loop forever?

  import time

  loop = True     # -- a flag; set it False (from a signal
                  #    handler, say) when it is time to stop

  while True:
      run_conveyer_logic( ... )
      if not loop:
          break
      time.sleep(30)  # -- sleep 30 seconds, and run the loop again

>The script i have already is below. Just looking for feedback on how i
>should improve this.  Would using something like Pyinotify work best
>in this situation?
>
>import shutil
>import os
>
>targetdir = "/home/target_dir"
>sourcedir = "/home/source/"
>
>dirlist = os.listdir(sourcedir)
>for x in dirlist :
>    directory = sourcedir + x + '/'
>
>    filelist = os.listdir(directory)
>    for file in filelist :
>        if not os.path.exists(os.path.join(targetdir,file)):
>            shutil.copy(directory+file,targetdir)

Note that shutil is already taking care of a great deal of the heavy 
lifting of safely interacting with the filesystem for you, so you 
could get away with simply using it as you have.  Any time you 
are writing a file (including with shutil.copy), there's a risk 
that the file will not be written to disk completely.  
If so, the safest thing to do is to delete the file and start over. 

This is why the write-to-a-temporary-file pattern exists -- once 
you are sure the data are written safely in that file, you can 
atomically link the file into place (with os.link()).

Anytime you are working with a filesystem, it's also good to review 
the entire sequence of open(), f.write(), f.close(), os.link() calls 
for racy interactions.  (What happens if two instances of the same 
program are running at the same time?  What happens if somebody 
accidentally blows away a state directory?  What happens if the 
target directory is not there?  Are temporary files cleaned up 
after encountering an error?  Many of us have filled up disks with 
programs that ran amok generating temporary files in loops.)

That is all.  I wish you good luck with your grocery business and 
hope that it conveys you to your goals, and keeps your giraffes as 
happy as mine,

-Martin

 [0] There was a recent discussion here about the eventually 
     inevitable failure or corruption of data, files, RAM on any
     computer.  However, for the purposes of my answer, I'll
     assume that the physical constraints of our modest universe
     do not apply, which is to say, that I'll answer your question
     as though this hardware failure or data corruption is not
     possible.  By making this assumption, we can remain squarely
     in the realm of high quality hardware.

     Accurate:  https://mail.python.org/pipermail/tutor/2020-January/115993.html
     Pragmatic: https://mail.python.org/pipermail/tutor/2020-January/115992.html

 [1] https://en.wikipedia.org/wiki/Maildir

-- 
Martin A. Brown
http://linux-ip.net/

