[Tutor] Monitoring directories
Martin A. Brown
martin at linux-ip.net
Sat Feb 15 04:08:41 EST 2020
Hello,
Filesystems are wonderful and powerful. (Hardware disclaimer [0].)
Filesystems can also be tricky.
There are conventions for non-racy interactions with filesystems.
Fundamental items:
* writing to a filesystem can have all sorts of problems
* there are a few filesystem calls that are atomic
So, if you are doing lots of work with a filesystem, learn how to
take advantage of the atomic operations to minimize your exposure to
races and significantly reduce your risk of data loss (due to
software).
Let's say I run a family grocery for hungry giraffes.
>I have a little script below that copies files from a directory and
>all of its sub directories over to a target dir. The target dir doesnt
>have any sub dir's and is exactly how i want it. The copy checks if
>the file in the source dir exists in the target dir and then copies if
>it doesnt.
This seems deceptively simple. So, maybe this is ./conveyer.py
(because it conveys files from the source tree of subdirectories to
a single target directory....)
Problem 1: What happens if a file under source/a/monkey.txt exists
and there's also file source/b/monkey.txt ? It's possible that you
know your data and/or applications and this would never
happen,...but it is certainly possible from a filesystem
perspective.
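Here is a minimal sketch of how you might check for that before flattening anything. The function name find_name_collisions is my own invention for illustration; it just walks the source tree and reports any filename that shows up in more than one subdirectory.

```python
import os
from collections import defaultdict


def find_name_collisions(source_dir):
    """Return {filename: [paths]} for filenames seen more than once."""
    seen = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(source_dir):
        for name in filenames:
            seen[name].append(os.path.join(dirpath, name))
    # keep only the names that occur in two or more places
    return {name: sorted(paths) for name, paths in seen.items() if len(paths) > 1}
```

If this returns an empty dict, flattening the tree into one target directory will not silently drop a file (at least not at the moment you checked).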
Problem 2: How do you know the input file in sourcedir is complete?
Has it been written successfully? (Is shelf-stocker.py writing
files atomically? See below.) (Cameron Simpson mentioned this
problem, too.)
>Problem is is that i have another utility monitoring that target dir
>moving the files out to process them further so as it processes them
>the target dir will always be empty thus coping every file from source
>folder over and over depending on interval i run the script
Yes! And, that makes the job of the program feeding data into the
target dir even a bit harder, as you have already figured out.
What happens if hungry-giraffe.py consumes the file out of the target
directory? How is conveyer.py going to know not to copy it from
sourcedir into the target dir again?
Problem 3: How do you know if you have copied the file already,
especially if there's a hungry-giraffe.py eating the files out of
the target directory?
>What i would like to do is change my below script from checking if
>sourcedir filename exists in target dir, is to change it so it
>monitors the source dir and copies the newest files over to the
>target dir.
A filesystem makes some guarantees that can be quite useful. There
are some operations which are atomic on filesystems. The Python
calls os.link() and os.unlink() are my favorites. These are the
equivalent of naming (link) and deleting (unlink) files: os.link()
gives an existing file an additional name and fails if that name is
already taken; os.unlink() removes a name. For example, if an
os.link() call returns success, your file is now available with the
new name. And, if an os.unlink() call returns successfully, you
have removed that name (and the file itself, once its last name is
gone).
Safely handling files in a non-racy way requires care, but pretty
reliable software can be written by trusting the filesystem
atomicity guarantees afforded by these functions.
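A small sketch of what that guarantee buys you, assuming a POSIX-style filesystem that supports hard links. The helper name claim_name is hypothetical; it only wraps os.link() so that "the name was already taken" becomes a return value instead of an exception.

```python
import os


def claim_name(existing_path, new_path):
    """Give existing_path the additional name new_path.

    Returns True if the link was created, False if new_path already exists.
    Any other OSError (permissions, cross-device link, ...) propagates.
    """
    try:
        os.link(existing_path, new_path)
    except FileExistsError:
        return False
    return True
```

Because the link either happens completely or not at all, two programs racing for the same new_path cannot both get True.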
In copying data from one file (or filesystem) to another, writing
directly from memory to disk, or generating output
programmatically, you can hit an error. Maybe the disk gets
full, and you can't write any more. Maybe you write too many bytes
to a single file, and it can't get larger. Maybe you cannot open
the file to write because of a permissions issue or inode exhaustion.
(My inodes just flew in from Singapore, and they are very tired.)
In these (admittedly rare) cases, you don't know where in the
attempted filesystem operation, f.write() or os.fsync(), things went
wrong, so you don't know whether the data you just put into the file
are correct. So, what's a concerned
software developer to do? Here's an old technique (30+ years?) for
taking advantage of filesystem atomicity to make sure you safely
create / copy files.
  - open up a temporary file, with a temporary name in your output
    directory (either in the same directory, or if you are paranoid
    / writing large numbers of files simultaneously, in a directory
    on the same filesystem; see Maildir description for details [1])
      (If error: delete tempfile, bail, crash, try again or start over)
  - write every bit of data from memory (and check for errors on
    your f.write() or print() calls) or, copy the data from the
    input file to the tempfile
      (If error: delete tempfile, bail, crash, try again or start over)
  - close the tempfile
      (If error: delete tempfile, bail, crash, try again or start over)
  - os.link() the tempfile to the full name
      (If error: delete tempfile, bail, crash, try again or start over)
  - os.unlink() the tempfile
      (Programmer judgement: log the error and ... ignore or crash,
      depending on what your preference is; if you have a separate
      directory for your tempfiles, you can often just log and ignore
      this and have a separate sweeping agent clean up the cruft)
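The steps above can be sketched roughly as follows. This is only one way to do it: the function name write_atomically and the ".tmp-" prefix are my own choices, the tempfile lives in the same directory as the final file, and any error simply propagates to the caller after the tempfile is cleaned up.

```python
import os
import tempfile


def write_atomically(data, final_path):
    """Write bytes to final_path by way of a tempfile in the same directory."""
    directory = os.path.dirname(os.path.abspath(final_path))
    # open a tempfile with a guaranteed-unique name next to the final file
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())       # ask the OS to push the data to disk
        # the tempfile is closed here; now give it its real name
        os.link(tmp_path, final_path)  # raises FileExistsError if name is taken
    finally:
        # success or failure, the temporary name is no longer needed
        os.unlink(tmp_path)
```

One thing to be aware of: mkstemp() creates the file readable and writable only by its owner, so the final file keeps those permissions unless you change them before linking.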
So, putting that together (conceptually, only, at this point), and
assuming that the shelf-stocker.py (which puts things into the input
directory) and conveyer.py (which puts things into the target
directory for the giraffes) both use the above filesystem handling
techniques, then you have solved Problem 2.
If you write the file with a tempfile name, you'll sidestep problem
1. If you want to keep the original name, but use a tempfile
technique, you can do:
  import os
  import tempfile

  # -- write data, avoid errors, if errors, bail and delete
  #    log and crash or maybe just loop and try again later ...
  #    whatever your application wants to do
  #
  target_tempfile.close()

  # -- get a guaranteed, non-colliding, new filename
  #
  tf = tempfile.NamedTemporaryFile(dir=".", prefix=fname, delete=False)
  tf.close()

  # -- now rename right over top of the just created tf; os.link()
  #    refuses to overwrite an existing name, so use the atomic
  #    os.replace(), which also removes the old tempfile name
  #
  os.replace(target_tempfile.name, tf.name)
The downside to solving problem 1 as I suggest above is that you
can't tell, by the presence of a file in target_dir, whether a file
has been copied. In order to detect that, you would absolutely have
to use some state file.
So, as you may imagine, now, it's safer to either A) duplicate the
source directory hierarchy exactly into the target directory (so
you don't have to worry about name collisions) or B) settle on a
single flat directory each for the source and the target.
Problem 3 can be solved (as Alan Gauld pointed out) by storing the
filenames you have seen in a file. Or, you can try to take
advantage of the filesystem's atomicity guarantees for this too, so
that you know which files your conveyer.py has copied from the
source directory to the target directory for the hungry-giraffe.py.
How?
source: dir with input files (sub-tree complicates)
target: dir for the giraffes to browse on
state: dir to track which input files have been copied
Since the os.symlink() call is an atomic filesystem
operation, too, you can use that to create a file in the state
directory that points to any file your conveyer.py program has
already handled. (Note, this does leave a very small race in the
program.)
  - copy file from source to tempfile in target, then os.link() to
    final name in target
  - use os.symlink() to create symlink in state that points to file
    in source dir
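A rough sketch of those two steps for a single file, under a few assumptions of mine: the function name convey is made up, filenames are unique across the source tree (Problem 1 is ignored here), the tempfile is created inside the target directory with a ".tmp-" prefix, and the giraffes know to leave ".tmp-" files alone.

```python
import os
import shutil
import tempfile


def convey(source_path, target_dir, state_dir):
    """Copy one file into target_dir unless state_dir says it was handled.

    Returns True if the file was copied, False if it was already marked.
    """
    name = os.path.basename(source_path)
    marker = os.path.join(state_dir, name)
    if os.path.lexists(marker):
        return False                  # handled on an earlier run

    # copy into a tempfile first, then link it to its final name
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, prefix=".tmp-")
    os.close(fd)
    try:
        shutil.copyfile(source_path, tmp_path)
        os.link(tmp_path, os.path.join(target_dir, name))
    finally:
        os.unlink(tmp_path)

    # the small race: a crash right here leaves the file copied but unmarked
    os.symlink(os.path.abspath(source_path), marker)
    return True
```

Once the marker exists, it does not matter whether hungry-giraffe.py has already eaten the file out of the target directory; the next run skips it either way.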
When you introduce the state directory, though, you have introduced
another race, because it's easy to imagine the conveyer.py program
crashing after it has completed feeding the giraffes, but before it
has updated its state directory. So, upon next execution,
conveyer.py will move the same file to the target_dir again.
>As this script is invoked currently via a cron job im not familiar
>with what i would need to do to make this python script monitor in
>daemon mode or how it would know the newest files and the fact it
>haunt already copied them over.
Or loop forever?
  import time

  # -- loop: a flag you set yourself; False means run once and exit
  while True:
      run_conveyer_logic( ... )
      if not loop:
          break
      time.sleep(30)  # -- sleep 30 seconds, and run loop again
>The script i have already is below. Just looking for feedback on how i
>should improve this. Would using something like Pyinotify work best
>in this situation?
>
>import shutil
>import os
>
>targetdir = "/home/target_dir"
>sourcedir = "/home/source/"
>
>dirlist = os.listdir(sourcedir)
>for x in dirlist :
>    directory = sourcedir + x + '/'
>
>    filelist = os.listdir(directory)
>    for file in filelist :
>        if not os.path.exists(os.path.join(targetdir,file)):
>            shutil.copy(directory+file,targetdir)
Note that shutil is already taking care of a great deal of the heavy
lifting of safely interacting with the filesystem for you, so you
could get away with simply using it as you have. Anytime you
are writing a file (including using shutil.copy), there's a risk
that there will be a problem completely writing the file to disk.
If so, the safest thing to do is to delete the file and start over.
This is why the pattern of write to a temporary file exists -- and
then once you are sure the data are written safely in that file, you
can atomically link the file into place (with os.link()).
Anytime you are working with a filesystem, it's also good to review
the entire sequence of open(), f.write(), f.close(), os.link() calls
for racy interactions. (What happens if two instances of the same
program are running at the same time? What happens if somebody
accidentally blows away a state directory? What happens if the
target directory is not there? Are temporary files cleaned up
after encountering an error? Many of us have filled up disks with
programs that ran amok generating temporary files in loops.)
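For the "two instances at the same time" question, one common approach (not the only one) is a lock file created with O_EXCL, which is another atomic create-or-fail operation. This is a minimal sketch; the names acquire_lock and release_lock are mine, and it does nothing about a stale lock left behind by a crashed run.

```python
import os


def acquire_lock(lock_path):
    """Return True if we created lock_path, False if it already exists."""
    try:
        # O_CREAT | O_EXCL: create the file, or fail if it is already there
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))   # record who holds the lock
    return True


def release_lock(lock_path):
    """Remove the lock file so the next run can proceed."""
    os.unlink(lock_path)
```

A cron-driven conveyer.py could call acquire_lock() first and simply exit if it returns False.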
That is all. I wish you good luck with your grocery business and
hope that it conveys you to your goals, and keeps your giraffes as
happy as mine,
-Martin
[0] There was a recent discussion here about the eventually
inevitable failure or corruption of data, files, RAM on any
computer. However, for the purposes of my answer, I'll
assume that the physical constraints of our modest universe
do not apply, which is to say, that I'll answer your question
as though this hardware failure or data corruption is not
possible. By making this assumption, we can remain squarely
in the realm of high quality hardware.
Accurate: https://mail.python.org/pipermail/tutor/2020-January/115993.html
Pragmatic: https://mail.python.org/pipermail/tutor/2020-January/115992.html
[1] https://en.wikipedia.org/wiki/Maildir
--
Martin A. Brown
http://linux-ip.net/