[Tutor] Monitoring directories

Cameron Simpson cs at cskk.id.au
Fri Feb 14 16:36:25 EST 2020


On 14Feb2020 12:02, Nathan D'Elboux <nathan.delboux at gmail.com> wrote:
>I have a little script below that copies files from a directory and
>all of its subdirectories over to a target dir. The target dir doesn't
>have any subdirs and is exactly how I want it.  The copy checks if
>the file in the source dir exists in the target dir and then copies if
>it doesn't.
>
>Problem is that I have another utility monitoring that target dir,
>moving the files out to process them further, so as it processes them
>the target dir will always be empty, thus copying every file from the
>source folder over and over depending on the interval I run the script at.
>
>What I would like to do is change my script below from checking if
>the sourcedir filename exists in the target dir to instead
>monitoring the source dir and copying the newest files over to the
>target dir.

Might I suggest that you rename the files in the source directory once 
they have been copied? This might be as simple as moving them 
(shutil.move) from the source directory to a parallel "processed" 
directory.
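
As a rough sketch of the "move to a processed directory" idea: the 
function name and the processeddir argument below are made up for 
illustration, and it assumes processeddir lives outside sourcedir and 
that file names are unique across the subdirectories.

```python
import os
import shutil


def copy_then_archive(sourcedir, targetdir, processeddir):
    """Copy every file under sourcedir into the flat targetdir, then move
    the original into processeddir so the next run does not see it again.

    Assumption: processeddir is NOT inside sourcedir, otherwise os.walk
    would find the archived files again on the next run.
    """
    os.makedirs(targetdir, exist_ok=True)
    os.makedirs(processeddir, exist_ok=True)
    handled = []
    for dirpath, dirnames, filenames in os.walk(sourcedir):
        for name in sorted(filenames):
            srcpath = os.path.join(dirpath, name)
            # Copy first, and only then move the original out of the way.
            shutil.copy(srcpath, targetdir)
            shutil.move(srcpath, os.path.join(processeddir, name))
            handled.append(name)
    return handled
```

A second run over the same source tree then finds nothing left to copy, 
with no timestamps or state file involved.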

If the source directory is your serious read only main repository of 
source files and you do not want to modify it, perhaps you should 
maintain a parallel "to process" source tree beside it; have the thing 
which puts things into the main source directory _also_ hard link them 
into the "to process" (or "spool") tree, which you are free to remove 
things from after the copy.
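
The hard link step might look something like this. It is only a sketch: 
link_into_spool is a hypothetical helper, to be called by whatever 
delivers files into the main directory, and it assumes the main and 
spool trees are on the same filesystem (hard links cannot cross 
filesystems).

```python
import os


def link_into_spool(mainpath, spooldir):
    """Hard link an already delivered file into the spool directory.

    Assumption: mainpath and spooldir are on the same filesystem.
    Raises FileExistsError if the spool already has an entry of that name.
    """
    os.makedirs(spooldir, exist_ok=True)
    spoolpath = os.path.join(spooldir, os.path.basename(mainpath))
    # Both names now refer to the same data; removing one leaves the other.
    os.link(mainpath, spoolpath)
    return spoolpath
```

Removing the spool entry after the copy leaves the file in the main 
directory untouched, since only one of its two names goes away.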

Either of these removes your need to keep extra state or play guesswork 
with file timestamps.

>As this script is currently invoked via a cron job, I'm not familiar
>with what I would need to do to make this python script monitor in
>daemon mode, or how it would know the newest files and the fact it
>hasn't already copied them over.

Because cron runs jobs repeatedly, you tend not to start things in 
"daemon" mode from it - daemons start once and continue indefinitely to 
provide whatever service they perform.

>The script I have already is below. Just looking for feedback on how I
>should improve this.  Would using something like Pyinotify work best
>in this situation?

Pyinotify would suit a daemon mode script (started not from cron but 
maybe the "at" command) as it works by reporting changes as they happen. 
That approach is perfectly valid and reasonable; you just wouldn't start 
it from cron (unless the crontab entry were to start it only if it was no 
longer running, which would require an "is it running?" check).
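
One possible form of the "is it running?" check is a lock file, which is 
my own assumption here rather than anything Pyinotify provides. The 
sketch below uses fcntl.flock, so it is Unix only, and both the function 
name and the lock path are hypothetical.

```python
import fcntl


def acquire_single_instance_lock(lockpath):
    """Try to take an exclusive lock on lockpath without blocking.

    Returns the open file (keep it open for as long as the daemon runs)
    or None if another instance already holds the lock.
    """
    lockfile = open(lockpath, "a")
    try:
        fcntl.flock(lockfile, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        # Somebody else holds the lock: the daemon is already running.
        lockfile.close()
        return None
    return lockfile


# Hypothetical usage at the top of the daemon script:
#     lock = acquire_single_instance_lock("/tmp/dirmonitor.lock")
#     if lock is None:
#         raise SystemExit(0)
```

The lock is released automatically when the process exits, so a crashed 
daemon does not leave a stale lock behind for the next cron run.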

>import shutil
>import os
>
>targetdir = "/home/target_dir"
>sourcedir = "/home/source/"
>
>dirlist = os.listdir(sourcedir)
>for x in dirlist :
>
>    directory = sourcedir + x + '/'

Consider using os.path.join for the "+ x" bit.
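
For example (the subdirectory name here is made up):

```python
import os

sourcedir = "/home/source/"
x = "batch1"  # hypothetical subdirectory name

# os.path.join inserts a separator only where one is needed, so it works
# whether or not sourcedir ends with a '/'.
directory = os.path.join(sourcedir, x)
filepath = os.path.join(directory, "report.txt")  # hypothetical file name
```

That also removes the need to append the trailing '/' by hand.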

>    filelist = os.listdir(directory)
>    for file in filelist :
>        if not os.path.exists(os.path.join(targetdir,file)):
>            shutil.copy(directory+file,targetdir)

The other component missing here is that scripts like this are prone to 
copying a source file before it is complete, i.e. as soon as its name 
shows up in the source tree, not _after_ all the data have been written 
into it. You will have a similar problem in the target directory with 
whatever is processing things there.

Tools like rsync deal with this by copying new files to a 
temporary name in the target tree beginning with a '.', e.g. 
.tempfile-blah, and only renaming the file to the real name when the 
copy of the data is complete. You should consider adapting your script to 
use a similar scheme, both for the copy to the target ("copy to a 
tempfile, then rename") and for the scan of the source (ignore 
filenames which start with a '.', thus supporting such a scheme for 
delivery into the source directory).
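
A minimal sketch of both halves, assuming the ".tempfile-" prefix from 
the example name above (rsync's real temporary names differ) and two 
hypothetical helper functions:

```python
import os
import shutil


def copy_atomically(srcpath, targetdir):
    """Copy srcpath into targetdir under a temporary dot name, then rename
    it to its real name once all the data has been written."""
    name = os.path.basename(srcpath)
    temppath = os.path.join(targetdir, ".tempfile-" + name)
    finalpath = os.path.join(targetdir, name)
    shutil.copy(srcpath, temppath)
    # A rename within one directory is atomic on POSIX filesystems, so the
    # real name only ever appears with the complete contents.
    os.replace(temppath, finalpath)
    return finalpath


def scan_complete_files(sourcedir):
    """Yield file paths under sourcedir, skipping names starting with '.',
    which are taken to be deliveries still in progress."""
    for dirpath, dirnames, filenames in os.walk(sourcedir):
        for name in sorted(filenames):
            if name.startswith("."):
                continue
            yield os.path.join(dirpath, name)
```

Whatever processes the target directory would need the same "ignore dot 
names" rule for this to help at that end.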

Cheers,
Cameron Simpson <cs at cskk.id.au>

