[Python-Dev] Re: Change to sys.path[0] for Python 2.3

James C. Ahlstrom jim@interet.com
Tue, 27 Nov 2001 10:15:29 -0500


Jack Jansen wrote:
> 
> > I am not happy with changing current directories because I am trying
> > to speed up imports, and it is harder to cache directory contents
> > on multiple operating systems.  But I can do it if we need to.
> 
> Ah, at least now I understand what you're trying to do. BUT: have you done any
> measurements that show that this caching is actually beneficial? For many
> years we've used a special caching importer in a very big Python project here,
> and when we finally did some real measurements it turned out that it had
> slowed down imports all that time instead of speeding them up.

Yes, I have many benchmarks on both Linux and Windows.  They are
hard to do because the OS will cache directory contents itself.
So the comparison is not between cached/uncached, but rather
between Python-dictionary-cached versus file-system-cached.  In
particular, successive runs of a benchmark can speed up by
2X because the OS now has a fresh cache of directory contents.

I am focusing on benchmarks made with a "cold" OS directory
cache.  I reboot to empty the cache.  Roughly, my directory
cache cuts import time by ~40% for a Windows local drive
or an NFS drive.  My impression is that OS directory caching
is pretty good for Windows 2000 and Linux-NFS.
Improvements are possible but not dramatic.
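For illustration, a minimal sketch of the kind of measurement involved
(time_import is a hypothetical helper, not anything in Python itself).
Note that it only clears Python's own module cache; as described above,
emptying the OS directory cache requires a reboot:

```python
import sys
import time

def time_import(module_name):
    # Drop any cached module object so the import machinery does real
    # path-searching work.  This does NOT empty the OS directory cache;
    # only a reboot does that.
    sys.modules.pop(module_name, None)
    start = time.time()
    __import__(module_name)
    return time.time() - start
```

Successive calls will still be distorted by OS caching, which is why
the cold-cache numbers require a fresh boot per run.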

For Windows 2000 and a Linux/Samba network server, things are
different.  Improvements of 5X or more are easily achieved.
Apparently, this combination has good OS caching of file data
blocks, and poor caching of directory blocks.

Evidently the blizzard of fopen() calls Python currently makes
can be a problem for some network file systems.  It would also
be a problem for heavily loaded servers.

I also see that import times show a lot of scatter for OS caching,
but much less scatter for my caching.  Perhaps file server load
or just evil cache spirits are to blame.

Nevertheless, I do believe that Python imports need to be sped up.

> Hmm, why not cache on a sys.path entry-by-entry basis? Then, if the entry is a
> zipfile we always cache, if the entry is a relative pathname we never cache,
> if the entry is an absolute pathname we cache on the basis of a preference.
> Use the sys.path entry as a key in a dictionary, the result is either None
> (don't cache) or the cache for this sys.path entry. If the key isn't found
> this is the first time we come across this sys.path entry so we decide whether
> to cache or not.

This is almost exactly what my code does.  But I don't test for relative
paths, because that is a porting headache (though a test for "" or "."
could be an exception).  So the rule is:

zip files are always cached; otherwise, if os.listdir() exists and
has been imported, we always cache; else we use the current logic.
Each entry in sys.path is checked only once.

My problem is that this breaks the current feature of importing
from a relative directory path resolved against the current
getcwd().  I can fix this, but (1) is it worth it, except perhaps
for "."; (2) I don't want to support it for zip files, because I
must cache those; (3) the fix is a portability and speed problem,
because I must recognize a relative path (as os.path.abspath does)
and either call os.getcwd() or fall back to fopen() searching.
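The fix in point (3) might look something like the following sketch
(resolve is a hypothetical helper): an absolute entry can be cached
as-is, while a relative entry must be re-resolved against getcwd() on
every import, which is exactly the portability and speed cost described
above.

```python
import os

def resolve(entry):
    # Absolute paths are stable and safe to cache.
    if os.path.isabs(entry):
        return entry
    # Relative entries (including "" and ".") depend on the current
    # directory, so they must be recomputed at each import.
    return os.path.normpath(os.path.join(os.getcwd(), entry))
```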

JimA