Directory Caching, suggestions and comments?

Chris Angelico rosuav at gmail.com
Thu May 15 15:49:04 EDT 2014


On Fri, May 16, 2014 at 5:34 AM, Benjamin Schollnick
<benjamin at schollnick.net> wrote:
> Just as a side note, I'm not completely PEP 8.  I know that, I use a
> slightly laxer setting in pylint, but I'm working my way up to it...
>
> I am using scandir from benhoyt to speed up the directory listings, and data
> collection.

First comment: You're running headlong into the two hardest problems
in computing - cache invalidation, and naming things. (And off-by-one
errors.)

More specifically, and leaving aside the naming issue since you're
aware of it, you have to cope with all sorts of stale-cache messes.
For instance, if you stat a directory and depend on its mtime, you
can't count on the mtime always being up to date, AND you can't rely
on the clock never shifting. (What happens, for instance, if the server's
clock not shifting. (What happens, for instance, if the server's
onboard clock gains time at a notable rate, and a firewall
misconfiguration is blocking NTP - and then you fix the firewall and
the clock suddenly jumps backward by a few hours? Yep. Happened to me.
Well, I think the clock jumped maybe half an hour, but it could easily
have been a lot more.)
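One way to soften the clock-jump problem is to invalidate on any
*change* in mtime rather than on "mtime is newer than what I cached" -
a backward jump then still busts the cache. Here's a minimal sketch of
that idea (my own illustration, not Benjamin's code; `listdir_cached`
and the module-level `_cache` dict are names I made up):

```python
import os

# Hypothetical directory-listing cache keyed on mtime. Comparing for
# *inequality* rather than "newer than" sidesteps backward clock jumps:
# any change to the observed mtime invalidates the entry, in either
# direction.
_cache = {}  # path -> (mtime, entries)

def listdir_cached(path):
    mtime = os.stat(path).st_mtime
    cached = _cache.get(path)
    if cached is not None and cached[0] == mtime:
        return cached[1]  # cache hit: mtime unchanged since last look
    entries = sorted(os.listdir(path))
    _cache[path] = (mtime, entries)
    return entries
```

Note this still trusts the filesystem to bump the directory's mtime on
every change, and mtime granularity can be coarse (a full second on
some filesystems), so two changes within one tick can slip through.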

What platform are you running this on? On all my Linux systems, the
file system caches stat() info for me. That has never been a problem,
because the FS knows when it needs to update/flush that cache. All I
need to know is that having spare RAM means performance improves :) I
can do a "sudo find / -name ..." and it chugs and chugs, and then I do
it again and it's fast. Windows, not so nice, but I'd still look at OS
or FS caching where possible. (Other platforms I don't personally use,
so I don't know whether or not they have good caching.)

ChrisA
