How to track files processed

Mon Apr 18 13:56:12 EDT 2016

Greetings,

>If you are parsing files in a directory what is the best way to 
>record which files were actioned?
>
>So that if i re-parse the directory i only parse the new files in 
>the directory?

How will you know that the files are new?

If a file has exactly the same content as another file, but a 
different name, is it new?

Often this depends on the characteristics of the system in which 
your (planned) software is operating.

Peter Otten has also asked for some more context, which would help 
us give you some tips that are more targetted to the problem you are 
trying to solve.

But, I'll just forge ahead and make some assumptions:

  * You are watching a directory for new/changed files.
  * New files are appearing regularly.
  * Contents of old files get updated and you want to know.

Have you ever seen an MD5SUMS file?  Do you know what a content hash 
is?  You could find a place to store the content hash (a.k.a. 
digest) of each file that you process.

Below is a program that should work in Python2 and Python3.  You 
could use this sort of approach as part of your solution.  In order 
to make sure you have handled a file before, you should store and 
compare two things.

  1.  The filename.
  2.  The content hash.

Note:  If you are sure the content is not going to change, then just 
use the filename to track whether you have handled something or not.

How would you use this tracking info ?

  * Create a dictionary (or a set), e.g.:

    handled = dict()
    handled[('410c35da37b9a25d9b5d701753b011e5','setup.py')] = time.time()

    Lasts only as long as the program runs.  But, you will know
    that you have handled any file by the tuple of its content hash 
    and filename.

  * Store the filename (and/or digest) in a database.  So many 
    options: sqlite, pickle, anydbm, text file of your own 
    crafting, SQLAlchemy ...

  * Create a file, hardlink or symlink in the filesystem (in the 
    same directory or another directory), e.g.:

    trackingfile = os.path.join('another-directory', 'setup.py')
    with open(trackingfile, 'w') as f:
        f.write('410c35da37b9a25d9b5d701753b011e5')

      OR

    os.symlink('setup.py', '410c35da37b9a25d9b5d701753b011e5-setup.py')

    Now, you can also examine your little cache of handled files to 
    compare for when the content hash changes.  If the system is an 
    automated system, then this can be perfectly fine.  If humans 
    create the files, I would suggest not doing this.  Humans tend 
    to be easily confused by such things (and then want to delete 
    the files or just be intimidated by them; scary hashes!).

There are lots of options, but without some more context, we can 
only make generic suggestions.  So, I'll stop with my generic 
suggestions now.

Have fun and good luck!

-Martin

#! /usr/bin/python

from __future__ import print_function

import os
import sys
import logging
import hashlib

logformat = '%(levelname)-9s %(name)s %(filename)s#%(lineno)s ' \
     + '%(funcName)s %(message)s'
logging.basicConfig(stream=sys.stderr, format=logformat, level=logging.ERROR)
logger = logging.getLogger(__name__)

def hashthatfile(fname):
    contenthash = hashlib.md5()
    try:
        with open(fname, 'rb') as f:
            contenthash.update(f.read())
        return contenthash.hexdigest()
    except IOError as e:
        logger.warning("See exception below; skipping file %s", fname)
        logger.exception(e)
        return None

def main(dirname):
    for fname in os.listdir(dirname):
        if not os.path.isfile(fname):
            logger.debug("Skipping non-file %s", fname)
            continue
        logger.info("Found file %s", fname)
        digest = hashthatfile(fname)
        logger.info("Computed MD5 hash digest %s", digest)
        print('%s %s' % (digest, fname,))
    return os.EX_OK

if __name__ == '__main__':
    if len(sys.argv) == 1:
        sys.exit(main(os.getcwd()))
    else:
        sys.exit(main(sys.argv[1]))

# -- end of file

-- 
Martin A. Brown
http://linux-ip.net/