[Python-Dev] Import redesign [LONG]

Greg Stein gstein@lyra.org
Fri, 19 Nov 1999 05:29:50 -0800 (PST)


On Thu, 18 Nov 1999, Guido van Rossum wrote:
> Gordon McMillan wrote:
>...
> > I think imputil's emulation of the builtin importer is more of a 
> > demonstration than a serious implementation. As for speed, it 
> > depends on the test. 
> 
> Agreed.  I like some of imputil's features, but I think the API
> needs to be redesigned.

In what ways? It sounds like you've applied some thought. Do you have any
concrete ideas yet, or "just a feeling" :-)  I'm working through some
changes from JimA right now, and would welcome other suggestions. I think
there may be some outstanding stuff from MAL, but I'm not sure (Marc?)

>...
> So here's a challenge: redesign the import API from scratch.

I would suggest starting with imputil and altering as necessary. I'll use
that viewpoint below.

> Let me start with some requirements.
> 
> Compatibility issues:
> ---------------------
> 
> - the core API may be incompatible, as long as compatibility layers
> can be provided in pure Python

Which APIs are you referring to? The "imp" module? The C functions? The
__import__ and reload builtins?

I'm guessing some of imp, the two builtins, and only one or two C
functions.

> - support for rexec functionality

No problem. I can think of a number of ways to do this.

> - support for freeze functionality

No problem. A function in "imp" must be exposed to Python to support this
within the imputil framework.

> - load .py/.pyc/.pyo files and shared libraries from files

No problem. Again, a function is needed for platform-specific loading of
shared libraries.

> - support for packages

No problem. A demo is in the current imputil.

> - sys.path and sys.modules should still exist; sys.path might
> have a slightly different meaning

I would suggest that both retain their *exact* meaning. We introduce
sys.importers -- a list of importers to check, in sequence. The first
importer on that list uses sys.path to look for and load modules. The
second importer loads builtins and frozen code (i.e. modules not on
sys.path).

Users can insert/append new importers or alter sys.path as before.

sys.modules continues to record name:module mappings.
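
To make that split concrete, here is a minimal sketch of how the top-level
hook might walk sys.importers. The import_top() method name and the two
importer classes mentioned in the comment are illustrative assumptions, and
package/fromlist handling is ignored:

    import sys

    def import_module(name):
        # sys.modules keeps its exact meaning: a name -> module cache.
        try:
            return sys.modules[name]
        except KeyError:
            pass

        # Ask each importer in sequence; the first one to return a module
        # wins. In the default setup, sys.importers might be
        # [PathImporter(sys.path), BuiltinAndFrozenImporter()].
        for importer in sys.importers:
            module = importer.import_top(name)
            if module is not None:
                return module
        raise ImportError("no importer handled " + name)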

> - $PYTHONPATH and $PYTHONHOME should still be supported

No problem.

> (I wouldn't mind a splitting up of importdl.c into several
> platform-specific files, one of which is chosen by the configure
> script; but that's a bit of a separate issue.)

Easy enough. The standard importer can select the appropriate
platform-specific module/function to perform the load. i.e. these can move
to Modules/ and be split into a module-per-platform.

> New features:
> -------------
> 
> - Integrated support for Greg Ward's distribution utilities (i.e. a
>   module prepared by the distutil tools should install painlessly)

I don't know the specific requirements/functionality that would be
required here (does Greg? :-), but I can't imagine any problem with this.

> - Good support for prospective authors of "all-in-one" packaging tools
>   like Gordon McMillan's win32 installer or /F's squish.  (But
>   I *don't* require backwards compatibility for existing tools.)

Um. *No* problem. :-)

> - Standard import from zip or jar files, in two ways:
> 
>   (1) an entry on sys.path can be a zip/jar file instead of a directory;
>       its contents will be searched for modules or packages

While this could easily be done, I might argue against it. Old
apps/modules that process sys.path might get confused.

If compatibility is not an issue, then "No problem."

An alternative would be an Importer instance added to sys.importers that
is configured for a specific archive (in other words, don't add the zip
file to sys.path, add ZipImporter(file) to sys.importers).

Another alternative is an Importer that looks at a "sys.py_archives" list.
Or an Importer that has a py_archives instance attribute.

>   (2) a file in a directory that's on sys.path can be a zip/jar file;
>       its contents will be considered as a package (note that this is
>       different from (1)!)

No problem. This will slow things down, as a stat() for *.zip and/or *.jar
must be done, in addition to *.py, *.pyc, and *.pyo.

>   I don't particularly care about supporting all zip compression
>   schemes; if Java gets away with only supporting gzip compression
>   in jar files, so can we.

I presume we would support whatever zlib gives us, and no more.

> - Easy ways to subclass or augment the import mechanism along
>   different dimensions.  For example, while none of the following
>   features should be part of the core implementation, it should be
>   easy to add any or all:
> 
>   - support for a new compression scheme to the zip importer

Presuming ZipImporter is a class (derived from Importer), then this
ability is wholly dependent upon the author of ZipImporter providing the
hook.

The Importer class is already designed for subclassing (and its interface 
is very narrow, which means delegation is also *very* easy; see
imputil.FuncImporter).
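
As a toy illustration of how narrow that interface is, here is a hypothetical
Importer that serves modules out of an in-memory dict of source strings. The
get_code() return convention shown (None, or a tuple of is-package flag, code
object, and extra module attributes) is my reading of the current imputil and
should be treated as an assumption:

    import imputil

    class DictImporter(imputil.Importer):
        "Toy importer: modules live in a dict of {name: source string}."

        def __init__(self, sources):
            self.sources = sources

        def get_code(self, parent, modname, fqname):
            # Return None if we don't know the module; otherwise return
            # (is-package flag, code object, extra module attributes).
            source = self.sources.get(modname)
            if source is None:
                return None
            code = compile(source, "<DictImporter: %s>" % fqname, "exec")
            return 0, code, {}

Everything else (sys.modules bookkeeping, package handling) stays in the base
class, which is exactly why delegation via FuncImporter is so cheap.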

>   - support for a new archive format, e.g. tar

A cakewalk. Gordon, JimA, and myself each have archive formats. :-)

>   - a hook to import from URLs or other data sources (e.g. a
>     "module server" imported in CORBA) (this needn't be supported
>     through $PYTHONPATH though)

No problem at all.

>   - a hook that imports from compressed .py or .pyc/.pyo files

No problem at all.

>   - a hook to auto-generate .py files from other filename
>     extensions (as currently implemented by ILU)

No problem at all.

>   - a cache for file locations in directories/archives, to improve
>     startup time

No problem at all.

>   - a completely different source of imported modules, e.g. for an
>     embedded system or PalmOS (which has no traditional filesystem)

No problem at all.

In each of the above cases, the Importer.get_code() method just needs to
grab the byte codes from the XYZ data source. That data source can be
compressed, across a network, generated on the fly, or whatever. Each
importer can certainly create a cache based on its concept of "location".
In some cases, that would be a mapping from module name to filesystem
path, or to a URL, or to a compiled-in, frozen module.
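
For instance, a network-backed importer might look roughly like this sketch
(the URL layout, the cache attribute, and the get_code() conventions are all
assumptions; only the shape matters):

    import urllib
    import imputil

    class URLImporter(imputil.Importer):
        "Sketch: fetch .py source from a base URL and compile it locally."

        def __init__(self, base_url):
            self.base_url = base_url
            self.code_cache = {}   # fqname -> code object already fetched

        def get_code(self, parent, modname, fqname):
            code = self.code_cache.get(fqname)
            if code is None:
                url = self.base_url + "/" + modname + ".py"
                try:
                    source = urllib.urlopen(url).read()
                except IOError:
                    return None
                code = compile(source, url, "exec")
                self.code_cache[fqname] = code
            return 0, code, {}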

> - Note that different kinds of hooks should (ideally, and within
>   reason) properly combine, as follows: if I write a hook to recognize
>   .spam files and automatically translate them into .py files, and you
>   write a hook to support a new archive format, then if both hooks are
>   installed together, it should be possible to find a .spam file in an
>   archive and do the right thing, without any extra action.  Right?

Ack. Very, very difficult.

The imputil scheme combines the concept of locating/loading into one step.
There is only one "hook" in the imputil system. Its semantic is "map this
name to a code/module object and return it; if you don't have it, then
return None."

Your compositing example is based on the capabilities of the
find-then-load paradigm of the existing "ihooks.py". One module finds
something (foo.spam) and the other module loads it (by generating a .py).

All is not lost, however. I can easily envision the get_code() hook as
allowing any kind of return type. If it isn't a code or module object,
then another hook is called to transform it.
[ actually, I'd design it similarly: a *series* of hooks would be called
  until somebody transforms the foo.spam into a code/module object. ]

The compositing would be limited only by the (Python-based) Importer
classes. For example, my ZipImporter might expect to zip up .pyc files
*only*. Obviously, you would want to alter this to support zipping any
file, then use the suffix to determine what to do at unzip time.
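
To make the secondary-hook idea concrete, here is a rough sketch. It assumes
get_code() may return an untransformed (suffix, data) pair instead of a code
object, and translate_spam_to_py() is imaginary; none of this is current
imputil behavior:

    # Each transformer takes (suffix, data) and returns a code object,
    # or None if it does not handle that suffix.
    transform_hooks = []

    def spam_transformer(suffix, data):
        if suffix != ".spam":
            return None
        source = translate_spam_to_py(data)     # imaginary translator
        return compile(source, "<spam>", "exec")

    def resolve(result):
        "Run a get_code() result through the hooks until one produces code."
        if not isinstance(result, tuple):
            return result                        # already a code/module
        suffix, data = result
        for hook in transform_hooks:
            code = hook(suffix, data)
            if code is not None:
                return code
        raise ImportError("no transformer for " + suffix)

    transform_hooks.append(spam_transformer)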

> - It should be possible to write hooks in C/C++ as well as Python

Use FuncImporter to delegate to an extension module.

This is one of the benefits of imputil's single/narrow interface.
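
For example (assuming FuncImporter simply wraps any callable with the
get_code() signature, which is how I read the current code), an extension
module only has to export one function:

    import imputil
    import _myimport    # hypothetical C extension exporting get_code()

    # Delegate the single narrow hook to the extension; everything else
    # (sys.modules handling, packages) stays in the Python framework.
    imputil.FuncImporter(_myimport.get_code).install()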

> - Applications embedding Python may supply their own implementations,
>   default search path, etc., but don't have to if they want to piggyback
>   on an existing Python installation (even though the latter is
>   fraught with risk, it's cheaper and easier to understand).

An application would have full control over the contents of sys.importers.

For a restricted execution app, it might install an Importer that loads
files from *one* directory only which is configured from a specific
Win32 Registry entry. That importer could also refuse to load shared
modules. The BuiltinImporter would still be present (although the app
would certainly omit all but the necessary builtins from the build).
Frozen modules could be excluded.
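
In other words, the embedding app's policy is just whatever it puts into
sys.importers. A sketch, with RestrictedDirImporter and trusted_dir as
purely hypothetical names:

    import sys
    import imputil

    # Restricted-execution policy: one trusted directory plus builtins,
    # nothing else -- no dynamic loading, no frozen modules.
    sys.importers = [
        RestrictedDirImporter(trusted_dir),   # loads .py/.pyc from one dir
        imputil.BuiltinImporter(),
    ]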

> Implementation:
> ---------------
> 
> - There must clearly be some code in C that can import certain
>   essential modules (to solve the chicken-or-egg problem), but I don't
>   mind if the majority of the implementation is written in Python.
>   Using Python makes it easy to subclass.

I posited once before that the cost of import is mostly I/O rather than
CPU, so using Python should not be an issue. MAL demonstrated that a good
design for the Importer classes is also required. Based on this, I'm a
*strong* advocate of moving as much as possible into Python (to get
Python's ease-of-coding with little relative cost).

The (core) C code should be able to search a path for a module and import
it. It does not require dynamic loading or packages. This will be used to
import exceptions.py, then imputil.py, then site.py.

The platform-specific module that performs dynamic loading must be a
statically linked module (in Modules/ ... it doesn't have to be in the
Python/ directory).

site.py can complete the bootstrap by setting up sys.importers with the
appropriate Importer instances (this is where an application can define
its own policy). sys.path was initially set by the import.c bootstrap code
(from the compiled-in path and environment variables).

Note that imputil.py would not install any hooks when it is loaded. That
is up to site.py. This implies the core C code will import a total of
three modules using its builtin system. After that, the imputil mechanism
would be importing everything (site.py would .install() an Importer which
then takes over the __import__ hook).
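
A sketch of what the site.py end of that bootstrap could look like; every
name below beyond sys and imputil is assumed rather than designed, and
PathImporter's actual loading logic is omitted:

    # site.py -- runs under the stripped-down C importer, finishes the
    # bootstrap, then hands importing over to imputil.
    import sys
    import imputil

    class PathImporter(imputil.Importer):
        """Would re-implement today's sys.path scan: .py/.pyc/.pyo files,
        packages, and dynamic loading via the platform-specific module."""

        def __init__(self, path):
            self.path = path

        def get_code(self, parent, modname, fqname):
            return None    # the real scan/load logic is omitted here

    path_importer = PathImporter(sys.path)   # sys.path keeps its meaning
    sys.importers = [path_importer, imputil.BuiltinImporter()]

    # Take over the __import__ hook; imputil itself installed nothing
    # when it was imported.
    path_importer.install()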

Further note that the "import" Python statement could be simplified to use
only the hook. However, this would require the core importer to inject
some module names into the imputil module's namespace (since it couldn't
use an import statement until a hook was installed). While this
simplification is "neat", it complicates the run-time system (the import
statement is broken until a hook is installed).

Therefore, the core C code must also support importing builtins. "sys" and
"imp" are needed by imputil to bootstrap.

The core importer should not need to deal with dynamic-load modules.

To support frozen apps, the core importer would need to support loading
the three modules as frozen modules.

The builtin/frozen importing would be exposed thru "imp" for use by
imputil for future imports. imputil would load and use the (builtin)
platform-specific module to do dynamic-load imports.
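
A sketch of that wrapper, using only functions I believe imp already exposes
(is_builtin, init_builtin, is_frozen, init_frozen); returning the initialized
module object in the code slot, and ignoring frozen packages, are assumptions
of this sketch:

    import imp
    import imputil

    class BuiltinFrozenImporter(imputil.Importer):
        "Sketch: serve builtin and frozen modules through get_code()."

        def get_code(self, parent, modname, fqname):
            if parent is not None:
                return None              # builtins are never submodules
            if imp.is_builtin(modname):
                return 0, imp.init_builtin(modname), {}
            if imp.is_frozen(modname):
                return 0, imp.init_frozen(modname), {}
            return None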

> - In order to support importing from zip/jar files using compression,
>   we'd at least need the zlib extension module and hence libz itself,
>   which may not be available everywhere.

Yes. I don't see this as a requirement, though. We wouldn't start to use
these by default, would we? Or insist on zlib being present? I see this as
more along the lines of "we have provided a standardized Importer to do
this, *provided* you have zlib support."

> - I suppose that the bootstrap is solved using a mechanism very
>   similar to what freeze currently uses (other solutions seem to be
>   platform dependent).

The bootstrap that I outlined above could be done in C code. The import
code would be stripped down dramatically because you'll drop package
support and dynamic loading.

Alternatively, you could probably do the path-scanning in Python and
freeze that into the interpreter. Personally, I don't like this idea as it
would not buy you much at all (it would still need to return to C for
accessing a number of scanning functions and module importing funcs).

> - I also want to still support importing *everything* from the
>   filesystem, if only for development.  (It's hard enough to deal with
>   the fact that exceptions.py is needed during Py_Initialize();
>   I want to be able to hack on the import code written in Python
>   without having to rebuild the executable all the time.)

My outline above does not freeze anything. Everything resides in the
filesystem. The C code merely needs a path-scanning loop and functions to
import .py*, builtin, and frozen types of modules.

If somebody nukes their imputil.py or site.py, then they return to Python
1.4 behavior where the core interpreter uses a path for importing (i.e. no
packages). They lose dynamically-loaded module support.

> Let's first complete the requirements gathering.  Are these
> requirements reasonable?  Will they make an implementation too
> complex?  Am I missing anything?

I'm not a fan of the compositing, since it requires changing semantics
that I believe are very useful and very clean. However, I outlined a
possible, clean solution to do that (a secondary set of hooks for
transforming get_code() return values).

The requirements are otherwise reasonable to me, as I see that they can
all be readily solved (i.e. they aren't burdensome).

While this email may be long, I do not believe the resulting system would
be complex. From the user-visible side of things, nothing would be
changed. sys.path is still present and operates as before. They *do* have
new functionality they can grow into, though (sys.importers). The
underlying C code is simplified, and the platform-specific dynamic-load
stuff can be distributed to distinct modules, as needed
(e.g. BeOS/dynloadmodule.c and PC/dynloadmodule.c).

> Finally, to what extent does this impact the desire for dealing
> differently with the Python bytecode compiler (e.g. supporting
> optimizers written in Python)?  And does it affect the desire to
> implement the read-eval-print loop (the >>> prompt) in Python?

If the three startup files require byte-compilation, then you could have
some issues (i.e. the byte-compiler must be present).

Once you hit site.py, you have a "full" environment and can easily detect
and import a read-eval-print loop module (i.e. why return to Python? just 
start things up right there).

site.py can also install new optimizers as desired, a new Python-based
parser or compiler, or whatever...  If Python is built without a parser or
compiler (I hope that's an option!), then the three startup modules would
simply be frozen into the executable.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/