[Distutils] Entry points: specifying and caching

Wed Oct 18 10:52:00 EDT 2017

We're increasingly using entry points in Jupyter to help integrate
third-party components. This brings up a couple of things that I'd like
to do:

1. Specification

As far as I know, there's no document describing the details of entry
points; it's a de facto standard established by setuptools. It seems to
work quite well, but it's worth writing down what is unofficially
standardised. I would like to see a document on
https://packaging.python.org/specifications/ saying:

- Where build tools should put entry points in wheels
- Where entry points live in installed distributions
- The file format (including allowed characters, case sensitivity...)

I guess I'm volunteering to write this, although if someone else wants
to, don't let me stop you. ;-)

I'd also be happy to hear that I'm wrong, that this specification
already exists somewhere. If it does, can we add a link from
https://packaging.python.org/specifications/ ?
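
To make the file format question concrete, here is a rough sketch of reading the INI-style entry_points.txt that setuptools writes today, as far as I understand it. The group and package names (example_pkg etc.) are made up for illustration, and the parser settings (only '=' as delimiter, case-sensitive names) are my reading of current behaviour, not something an existing spec guarantees.

```python
import configparser

# Hypothetical file content; the package and object names are invented.
SAMPLE = """\
[console_scripts]
jupyter-example = example_pkg.cli:main

[pygments.lexers]
ExampleLexer = example_pkg.lexers:ExampleLexer
"""


def parse_entry_points(text):
    """Return {group: {name: object_reference}} from entry_points.txt content."""
    # Only '=' separates name from value, so names may contain ':'.
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    # configparser lowercases option names by default; keep them as written.
    parser.optionxform = str
    parser.read_string(text)
    return {group: dict(parser.items(group)) for group in parser.sections()}


if __name__ == "__main__":
    for group, entries in parse_entry_points(SAMPLE).items():
        print(group, entries)
```

Details like these (which delimiters are allowed, whether names are case sensitive) are exactly what a written specification would need to pin down.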

2. Caching

"There are only two hard problems in computer science: cache
invalidation, naming things, and off-by-one errors"

I know that caching is going to make things more complex, but at present
a scan of available entry points requires a stat() for every installed
package, plus open()+read()+parse for every installed package that
provides entry points. This doesn't scale well, especially on spinning
hard drives. By eliminating a call to pygments which caused an entry
points scan, we cut the cold-start time of IPython almost in half on one
HDD system (11s -> 6s; PR 10859).
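
For illustration, an uncached scan of one directory looks roughly like the sketch below. This is not the pkg_resources implementation, just a simplified stand-in that assumes each installed package has a *.dist-info or *.egg-info directory which may contain entry_points.txt. The function name scan_directory is hypothetical.

```python
import configparser
import os


def scan_directory(path):
    """Scan one sys.path directory; return {group: {name: object_reference}}."""
    found = {}
    try:
        entries = os.listdir(path)
    except OSError:
        return found  # not a readable directory (e.g. a zip file on sys.path)
    for entry in entries:
        if not entry.endswith((".dist-info", ".egg-info")):
            continue
        ep_file = os.path.join(path, entry, "entry_points.txt")
        # One stat() for every installed package...
        if not os.path.isfile(ep_file):
            continue
        # ...plus open + read + parse for each one that provides entry points.
        parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
        parser.optionxform = str  # keep names case-sensitive
        parser.read(ep_file, encoding="utf-8")
        for group in parser.sections():
            found.setdefault(group, {}).update(parser.items(group))
    return found
```

The work grows with the number of installed packages, whether or not any of them provide the group you are looking for.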

As packaging improves, the trend is to break functionality into more,
smaller packages, which is only going to make this worse (though I hope
we never end up with a left-pad package ;-). Caching could allow entry
points to be used in places where the current performance penalty is too
much.

I envisage a cache working something like this:

- Each directory on sys.path can have a cache file, e.g.
  'entry-points.json'
  - I suggest JSON because Python can parse it efficiently, and it's not
    intended to be directly edited by humans. Other options? SQLite? Does
    someone want to do performance comparisons?
- There is a command to scan all packages in a directory and build the
  cache file
- After an install tool (e.g. pip) has added/removed packages from a
  directory, it should call that command to rebuild the cache.
- A second command goes through all directories on sys.path and rebuilds
  their cache files - this lets the user rebuild caches if something has
  gone wrong.
- Applications looking for entry points can choose from a range of
  behaviours depending on how important accuracy and performance are. E.g.
  ignore all caches, only use caches, use caches for directories where
  they exist, or try caches first and then scan packages if a key is
  missing.
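
A very rough sketch of how those pieces could fit together is below. Everything here is hypothetical: the function names, the mode strings, the JSON layout ({group: {name: object_reference}}), and the choice to treat the entry point group as the "key" are my assumptions, not an existing API. The scan is the same simplified one-directory scan of *.dist-info / *.egg-info directories described earlier.

```python
import configparser
import json
import os

CACHE_NAME = "entry-points.json"  # file name suggested above


def scan_directory(path):
    """Uncached scan of one directory; return {group: {name: object_reference}}."""
    found = {}
    try:
        entries = os.listdir(path)
    except OSError:
        return found
    for entry in entries:
        ep_file = os.path.join(path, entry, "entry_points.txt")
        if entry.endswith((".dist-info", ".egg-info")) and os.path.isfile(ep_file):
            parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
            parser.optionxform = str  # keep names case-sensitive
            parser.read(ep_file, encoding="utf-8")
            for group in parser.sections():
                found.setdefault(group, {}).update(parser.items(group))
    return found


def build_cache(path):
    """First command: rebuild the cache for one directory (what pip would call)."""
    data = scan_directory(path)
    tmp = os.path.join(path, CACHE_NAME + ".tmp")
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=1, sort_keys=True)
    os.replace(tmp, os.path.join(path, CACHE_NAME))  # swap in the finished file
    return data


def rebuild_all(paths):
    """Second command: rebuild every directory's cache; return those that failed."""
    failed = []
    for path in paths:
        if not os.path.isdir(path):
            continue
        try:
            build_cache(path)
        except OSError:  # e.g. directory not writable
            failed.append(path)
    return failed


def read_cache(path):
    """Return the cached data for a directory, or None if there is no usable cache."""
    try:
        with open(os.path.join(path, CACHE_NAME), encoding="utf-8") as f:
            return json.load(f)
    except (OSError, ValueError):
        return None


def find_group(group, paths, mode="prefer-cache"):
    """Look up one group.

    mode is one of 'ignore-cache', 'cache-only', 'prefer-cache'
    (use a cache where one exists, else scan) or 'cache-then-scan'
    (scan when the cache is absent or lacks the group).
    """
    result = {}
    for path in paths:
        cached = None if mode == "ignore-cache" else read_cache(path)
        if mode == "cache-only":
            data = cached or {}
        elif mode == "cache-then-scan":
            hit = cached is not None and group in cached
            data = cached if hit else scan_directory(path)
        else:  # 'ignore-cache' and 'prefer-cache'
            data = cached if cached is not None else scan_directory(path)
        for name, ref in data.get(group, {}).items():
            result.setdefault(name, ref)  # earlier path entries win
    return result
```

An application would call something like find_group("pygments.lexers", sys.path, mode="cache-only") when startup time matters most, and a stricter mode when it must not miss a freshly installed plugin. Note that 'prefer-cache' will return stale results if an installer forgets to rebuild the cache; that is the invalidation problem in a nutshell.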

In the best case, when the caches exist and you trust them, loading them
would cost one set of filesystem operations per sys.path entry, rather
than per package.

Thanks,
Thomas