[Distutils] Entry points: specifying and caching

Fri Oct 27 04:10:54 EDT 2017

On Thu, Oct 26, 2017 at 9:02 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:
> Option 2: temporary (or persistent) per-user-session cache
>
> * Pro: only the first query per path entry per user session incurs a linear
> DB read
> * Pro: given persistent cache dirs (e.g. XDG_CACHE_HOME, ~/.cache) even that
> overhead can be avoided
> * Pro: sys.path directory mtimes are sufficient for cache invalidation
> (subject to filesystem timestamp granularity)

Timestamp granularity is a solvable problem. You just have to be
careful not to write out the cache unless the directory mtime is
sufficiently far in the past, like 10 seconds old, say. (This is an
old trick that VCSes use to make commands like 'git status'
fast-and-reliable.)

This does mean you can get in a weird state where if the directory
mtime somehow gets set to the future, then start time starts sucking
because caching goes away.

Note also that you'll want to explicitly write the observed directory
mtime to the cache file, rather than comparing it to the cache file's
mtime, to avoid the race condition where the directory gets modified
just after we scan it but before we write out the cache.

> * Pro: zero elevated privileges needed (cache would be stored in a per-user
> directory tree)
> * Con: interprocess locking likely needed to avoid the "thundering herd"
> cache update problem [1]

Interprocess filesystem locking is going to be far more painful than
any problem it might solve. Seriously. At least on Unix, the right
approach is to go ahead and regenerate the cache, and then atomically
write it to the given place, and if someone else overwrites it a few
milliseconds later then oh well.

I guess on Windows locking might be OK, given that it has no atomic
writes and less gratuitously broken filesystem locking. But you'd
still want to make sure you never block when acquiring the lock; if
the lock is already taken because someone else is in the middle of
updating the cache, then you need to fall back on doing a linear scan.
This is explicitly *not* avoiding the thundering herd problem, because
it's more important to avoid the "one process got stuck and now
everyone else freezes on startup waiting for it" problem.

> * Con: if a non-persistent storage location is used, zero benefit over an
> in-memory cache for throwaway environments (e.g. container startup)

You also have to be careful about whether you have a writeable storage
location at all, and if so whether you have the right permissions. (It
might be bad if 'sudo somescript.py' leaves me with root-owned cache
files in /home/njs/.cache/.)

Filesystems are just a barrel of fun.

> * Con: cost of the cache freshness check will still scale linearly with the
> number of sys.path entries
>
> Option 3: persistent per-path-entry cache
>
> * Pro: assuming cache freshness means zero runtime queries incur a linear DB
> read (cache creation becomes an install time cost)
> * Con: if you don't assume cache freshness, you need option 1 or 2 anyway,
> and the install time cache just speeds up that first linear read
> * Con: filesystem access control requires either explicit cache refresh or
> implicit metadata caching support in installers
> * Con: sys.path directory mtimes are no longer sufficient for cache
> invalidation (due to potential for directory relocation)

Not sure what problem you're thinking of here? In this model we
wouldn't be using mtimes for cache invalidation anyway, because it'd
be the responsibility of those modifying the directory to update the
cache. And if you rename a whole directory, that doesn't affect its
mtime anyway?

> * Con: interprocess locking arguably still needed to avoid the "thundering
> herd" cache update problem (just between installers rather than runtime
> processes)

If two installers are trying to rearrange the same directory at the
same time then they can conflict in lots of ways. For the most part
people get away with it because doing multiple 'pip install' runs in
parallel is generally considered a Bad Idea and unlikely to happen by
accident; and if it is a problem then we should add locking anyway
(like dpkg and rpm already do), regardless of the cache update part.

-n

-- 
Nathaniel J. Smith -- https://vorpus.org