Directory Caching, suggestions and comments?

Thu May 15 20:25:20 EDT 2014

On 15May2014 15:34, Benjamin Schollnick <benjamin at schollnick.net> wrote:
>I am going to be using this code as part of a web system, and I would love
>any feedback, comments and criticism. [...]
>I am using scandir from benhoyt to speed up the directory listings, and
>data collection. [...]
>I had considered using OrderedDicts, but I really didn't see how that would
>help the system.
>
>I'm not completely happy with the return_sort_* functions, since they
>return two different tuples, one goal was to try to keep everything in the
>dictionary, but I couldn't think of a better method.
>
>So any suggestions are welcome.

Firstly, I'm with ChrisA on the whole: let the filesystem/OS do the stat 
caching. It _is_ actually slower to rely on this than an in-memory cache such 
as yours, but at least it is reliable because you will then have exactly the 
same file info state as any other user of the OS. Stale caches are a usability 
PITA for the end user; witness the continued presence of "Shift-Reload" in most 
browsers because browser (and proxy) caches get stale.

His remark about relying on the mtime of a directory is correct also, and not 
just because of clock changes. On UNIX, the mtime of a directory changes _only_ 
when a filename is added/removed from the directory. To the point, it will not 
change if a file in the directory is just modified, and therefore your cache 
will become stale (and stay stale) in that circumstance.

Of course, you wouldn't even be bothering with scandir if you were not 
concerned about relying of the filesystem/OS. So, it is your call about 
remembering this stuff. I would include a real-time staleness in addition to 
the mtime check, eg if the mtime is updated _or_ the system time says it is 
more than N seconds since the last directory scan, with N being smallish.

Most of my other remarks have to do with style and implementation.

>Preqs -
>    Scandir - https://github.com/benhoyt/scandir
>    scandir is a module which provides a generator version of
>    os.listdir() that also exposes the extra file information the
>    operating system returns when you iterate a directory.

Personally, my habit is to use os.listdir() and accept the memory cost for two 
reasons: it is simple and the list is stable. If you iterate over a directory 
(eg via the C library readdir() facility) to my mind there is scope for the 
directory changing underneath you. A bit like iterating over a dictionary which 
is in active use by other code. Using os.listdir() minimises that window.

So I'd be going:

   names = list(scandir(directory_path))

to get a snapshot, and then working with that list.

>from stat import ST_MODE, ST_INO, ST_DEV, ST_NLINK, ST_UID, ST_GID, \
>                    ST_SIZE, ST_ATIME, ST_MTIME, ST_CTIME

You don't need these. os.lstat() and os.stat() return stat objects, and they 
have convenient .st_mtime etc attributes. So all your:

   data["st_nlink"] = st[ST_NLINK]

stuff can become:

   data["st_nlink"] = st.st_nlink

I'd also be making your "data" object just an object, and assigning directory 
to an attribute on it, thus:

   data.st_nlink = st.st_nlink

Much more readable.

You could also avoid the entire issue of manually copying each st.st_blah 
attribute by just going:

   data.st = st

Then just reach for data.st.st_mtime (or whatever) as needed.  Short, quick, 
avoids needing to worry about optional stat fields - just keep the original 
stat result.

>import time
>import scandir
>
>plugin_name = "dir_cache"
>
>#####################################################
>class   CachedDirectory(object):
>    """
>    For example:
>
>        To be added shortly.

It is _well_ work sketching this part out. Since this is just a cache, an 
example should be quite short.

I find it enormously convenient to sketch some of the use cases here - it helps 
keep the required method names nicely defined.  i.e.  sketch a little code that 
you need to write, then write methods to support the code. You get a much 
better match that way. I often revisit classes when I use them, because I write 
some code using the class and think "that was very painful, how would I _like_ 
to have been able to use this"?

>    """
>    def __init__(self):
>        self.files_to_ignore = ['.ds_store', '.htaccess']
>        self.root_path = None

You don't seem to use .root_path?

>            # This is the path in the OS that is being examined
>            #    (e.g. /Volumes/Users/username/)
>        self.directory_cache = {}

This is a CachedDirectory class. I would have just called this .cache - it will 
make all the code using it less wordy, and hopefully more readable. Unless 
there may be caches of other stuff too. This is almost entirely a matter of 
personal style though.

>    def _scan_directory_list(self, scan_directory):
>        """
>            Scan the directory "scan_directory", and save it to the
>            self.directory_cache dictionary.
>
>            Low Level function, intended to be used by the populate
>function.
>        """
>        scan_directory = os.path.abspath(scan_directory)
>        directories = {}
>        files = {}
>        self.directory_cache[scan_directory.strip().lower()] = {}
>        self.directory_cache[scan_directory.strip().lower()]["number_dirs"] = 0

You write "scan_directory.strip().lower()". It if probably worth going:

   norm_directory = scan_directory.strip().lower()

and saying "norm_directory" throughout. Also, you say:

   self.directory_cache[scan_directory.strip().lower()]

You can also bind:

   cache = self.directory_cache[norm_directory]

right after binding "norm_directory", and say "cache" throughout instead of the 
longer term. It is naming the same object.

>self.directory_cache[scan_directory.strip().lower()]["number_files"] = 0
>        for x in scandir.scandir(scan_directory):
>            st = x.lstat()
>            data = {}

As mentioned earlier, you could usefully define a tiny class:

   class Dirent(object):
     pass

and go:

   data = Dirent()

which will allow you to replace:

>            data["fq_filename"] = os.path.realpath(scan_directory).lower() + \
>                    os.sep+x.name.strip().lower()

with:

   data.fq_filename = os.path.realpath( os.path.join(norm_directory,
                                                     x.name.strip().lower()) )

Note also the use of os.path.join: it is portable across OSes (eg UNIX versus 
Windows).

Also, well worth defining:

   norm_name = x.name.strip().lower()

and just saying "norm_name" throughout. More readable, less prone to typing 
accidents.

Um. A serious point here: why do you go .strip().lower() to all your filenames 
and pathnames? On a real UNIX system that will usually mean a different object.  
Many other platforms may have case insensitive names, but will still complaint 
about names with and without whitespace as different objects. I would even 
expect os.path.realname to be unreliable if handed such mangled names.

If you are sanitising some input I would recommend doing that elsewhere, and 
not binding some much special knowledge into this class.

Maybe you have good reasons for that. At the least those reasons need to be 
expressed in comments (best in the opening docstring), because otherwise I 
would recommend strongly to make the cache class as naive about names as 
possible (i.e. do not mangle them).

>            data["parentdirectory"] = os.sep.join(\
>                    os.path.split(scan_directory)[0:-1])

Use os.path.dirname.

>            data["st_mode"] = st[ST_MODE]
>            data["st_inode"] = st[ST_INO]
>            data["st_dev"] = st[ST_DEV]
>            data["st_nlink"] = st[ST_NLINK]
>            data["st_uid"] = st[ST_UID]
>            data["st_gid"] = st[ST_GID]
>            data["compressed"] = st[ST_SIZE]

"compressed" is worth a comment.

>            data["st_size"] = st[ST_SIZE]       #10
>            data["st_atime"] = st[ST_ATIME]     #11
>            data["raw_st_mtime"] = st[ST_MTIME] #12
>            data["st_mtime"] = time.asctime(time.localtime(st[ST_MTIME]))

I strongly recommend against this.

Keep "st_mtime" (or data.st.st_mtime per my earlier suggestion) as the raw 
mtime. Any human readable transcription of the time should get a name like 
"human_st_mtime".

Better still, if you make a Dirent class for "data", you can give it a property 
like:

   @property
   def human_st_mtime(self):
     return time.asctime(time.localtime(self.st.st_mtime))

and just use ".human.st.mtime" like any other attribute later when needed.  
Note: no brackets.

>            data["st_ctime"] = st[ST_CTIME]
>            if not x.name.strip().lower() in self.files_to_ignore:

Personally I phrase this as "a not in b" instead of "not a in b".

Also, if you really are ignoring it, you could go:

   if norm_name in self.files_to_ignore:
     # do nothing else with this entry
     continue

and save yourself some indentation below. Not everyone likes this.

>    def directory_in_cache(self, scan_directory):
>        """
>            Pass the target directory
>
>            Will return True if the directory is already cached
>            Will return False if the directory is not already cached
>        """
>        scan_directory = os.path.realpath(scan_directory).lower().strip()

Probably worth making normalisation function:

   def norm_name(name):
     return name.lower().strip()

becuase you do this a lot. Using a function ensures you do it the same way 
everywhere and makes it obvious that it is the same thing with the same purpose 
everywhere.

>        return scan_directory in self.directory_cache.keys()

You can just say:

   return scan_directory in self.directory_cache

and for added points you can rename "directory_in_cache" to "__contains__".  
That way from outside (or inside) you use "in" directly. Example:

   if "foo" in self:

Cheers,
Cameron Simpson <cs at zip.com.au>