[Python-Dev] Import redesign [LONG]

Thu, 02 Dec 1999 15:43:46 -0500

Here's the promised response to Greg's response to my wishlist.

> On Thu, 18 Nov 1999, Guido van Rossum wrote:
> > Gordon McMillan wrote:
> >...
> > > I think imputil's emulation of the builtin importer is more of a 
> > > demonstration than a serious implementation. As for speed, it 
> > > depends on the test. 
> > 
> > Agreed.  I like some of imputil's features, but I think the API
> > needs to be redesigned.
> 
> In what ways? It sounds like you've applied some thought. Do you have any
> concrete ideas yet, or "just a feeling" :-)  I'm working through some
> changes from JimA right now, and would welcome other suggestions. I think
> there may be some outstanding stuff from MAL, but I'm not sure (Marc?)

I actually think that the way the PVM (Python VM) calls the importer
ought to be changed.  Assigning to __builtin__.__import__ is a crock.
The API for __import__ is a crock.

> >...
> > So here's a challenge: redesign the import API from scratch.
> 
> I would suggest starting with imputil and altering as necessary. I'll use
> that viewpoint below.
> 
> > Let me start with some requirements.
> > 
> > Compatibility issues:
> > ---------------------
> > 
> > - the core API may be incompatible, as long as compatibility layers
> > can be provided in pure Python
> 
> Which APIs are you referring to? The "imp" module? The C functions? The
> __import__ and reload builtins?

> I'm guessing some of imp, the two builtins, and only one or two C
> functions.

All of those.

> > - support for rexec functionality
> 
> No problem. I can think of a number of ways to do this.

Agreed, I think that imputil can do this.

> > - support for freeze functionality
> 
> No problem. A function in "imp" must be exposed to Python to support this
> within the imputil framework.

Agreed.  It currently exports init_frozen() which is about the right
functionality.

> > - load .py/.pyc/.pyo files and shared libraries from files
> 
> No problem. Again, a function is needed for platform-specific loading of
> shared libraries.

Is it useful to expose the platform differences?  The current
imp.load_dynamic() should suffice.

> > - support for packages
> 
> No problem. Demos are in the current imputil.
> 
> > - sys.path and sys.modules should still exist; sys.path might
> > have a slightly different meaning
> 
> I would suggest that both retain their *exact* meaning. We introduce
> sys.importers -- a list of importers to check, in sequence. The first
> importer on that list uses sys.path to look for and load modules. The
> second importer loads builtins and frozen code (i.e. modules not on
> sys.path).

This is looking like the redesign I was looking for.  (Note that
imputil's current chaining is not good since it's impossible to remove
or reorder importers, which I think is a required feature; an explicit
list would solve this.)

Actually, the order is the other way around, but by now you should
know that.  It makes sense to have separate ones for builtin and
frozen modules -- these have nothing in common.
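
A rough sketch of the explicit-list idea, written in present-day
Python and not taken from imputil.  DictImporter, get_module() and
import_from() are made-up names for illustration; the only point is
that an ordinary list can be reordered or edited, which a hidden
chain cannot:

```python
import types


class DictImporter:
    """Hypothetical importer serving modules from a dict of source strings."""

    def __init__(self, sources):
        self.sources = sources

    def get_module(self, name):
        source = self.sources.get(name)
        if source is None:
            return None  # "not mine": let the next importer try
        module = types.ModuleType(name)
        exec(source, module.__dict__)
        return module


def import_from(importers, name, modules):
    """Ask each importer in list order; the first hit wins and is cached."""
    if name in modules:
        return modules[name]
    for importer in importers:
        module = importer.get_module(name)
        if module is not None:
            modules[name] = module
            return module
    raise ImportError("no importer could find %r" % name)


first = DictImporter({"spam": "origin = 'first'"})
second = DictImporter({"spam": "origin = 'second'", "eggs": "origin = 'second'"})
importers = [first, second]

spam_before = import_from(importers, "spam", {})
eggs = import_from(importers, "eggs", {})
importers.reverse()  # reordering is just a list operation
spam_after = import_from(importers, "spam", {})
```

Here the `modules` argument stands in for sys.modules, and a fresh
dict is passed each time only so that the cache does not hide the
effect of reordering.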

There's another issue, which isn't directly addressed by imputil,
although with clever use of inheritance it might be doable.  I'd like
more support for this however.  Quite orthogonally to the issue of
having separate importers, I might want to recognize new extensions.
Take the example of the ILU folks.  They want to be able to drop a
file "foo.isl" in any directory on sys.path and have the ILU stubber
automatically run if you try to import foo (the client stubs) or
foo__skel (the server skeleton).

This doesn't fit in the sys.importers strategy, because they want to
be able to drop their .isl files in any directory along sys.path.
(Or, more likely, they want to have control over where in sys.path
the directory/directories with .isl files are placed.)  This requires
an ugly modification to the _fs_import() function.  (Which should have
been a method, by the way, to make overriding it in a subclass of
PathImporter easier!)

I've been thinking here along the lines of a strategy where the
standard importer (the one that walks sys.path) has a set of hooks
that define various things it could look for, e.g. .py files, .pyc
files, .so or .dll files.  This list of hooks could be changed to
support looking for .isl files.
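
To make this concrete, here is a minimal sketch of such a hook list,
in present-day Python.  It is not imputil's API: PathImporter here is
a hypothetical class, and load_isl() is only a stand-in for the ILU
stubber (it records the file it found instead of generating stubs,
and it ignores the foo__skel case):

```python
import os
import tempfile
import types


def load_py(path, name):
    """Compile and run a plain .py source file as a new module."""
    module = types.ModuleType(name)
    module.__file__ = path
    with open(path) as f:
        code = compile(f.read(), path, "exec")
    exec(code, module.__dict__)
    return module


def load_isl(path, name):
    """Stand-in for the ILU stubber: only records which .isl file was found."""
    module = types.ModuleType(name)
    module.__file__ = path
    module.generated_from = os.path.basename(path)
    return module


class PathImporter:
    """Hypothetical importer that walks a path and tries each suffix hook."""

    def __init__(self, path, suffix_hooks):
        self.path = path                  # list of directories, like sys.path
        self.suffix_hooks = suffix_hooks  # ordered list of (suffix, loader)

    def get_module(self, name):
        for directory in self.path:
            for suffix, loader in self.suffix_hooks:
                candidate = os.path.join(directory, name + suffix)
                if os.path.isfile(candidate):
                    return loader(candidate, name)
        return None


# Demo: a scratch directory that holds nothing but foo.isl.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "foo.isl"), "w") as f:
    f.write("INTERFACE foo;\n")

importer = PathImporter([demo_dir], [(".py", load_py)])
before = importer.get_module("foo")      # no .isl hook yet
importer.suffix_hooks.append((".isl", load_isl))
after = importer.get_module("foo")       # now the .isl file is recognized
```

The .isl file is found in whatever directory it was dropped into,
because the hook list is consulted for every directory on the path.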

There's an old, subtle issue that could be solved through this as
well: whether a .pyc file without a .py file should be accepted.
Long ago (in Python 0.9.8) a .pyc file alone would never be
loaded.  This was changed at the request of a small but vocal minority
of Python developers who wanted to distribute .pyc files without .py
files.  It has occasionally caused frustration because sometimes
developers move .py files around but forget to remove the .pyc files,
and then the .pyc file is silently picked up if it occurs on sys.path
earlier than where the .py was moved to.

Having a set of hooks for various extensions would make it possible to
have a default where lone .pyc files are ignored, but where one can
insert a .pyc importer in the list of hooks that does the right thing
here.  (Of course, it may be possible that this whole feature of lone
.pyc files should be replaced since the same need is easily taken care
of by zip importers.)
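
The stale-.pyc problem and the proposed default can be shown with a
small sketch.  locate() is a hypothetical helper, not existing code;
it only decides which file would be picked, given the suffixes that
are allowed to satisfy an import on their own:

```python
import os
import tempfile


def locate(name, path, suffixes=(".py",)):
    """Return the first file on `path` named `name` plus an allowed suffix."""
    for directory in path:
        for suffix in suffixes:
            candidate = os.path.join(directory, name + suffix)
            if os.path.isfile(candidate):
                return candidate
    return None


# Demo: mod.py was moved from old/ to new/, but old/mod.pyc was left behind.
root = tempfile.mkdtemp()
old_dir = os.path.join(root, "old")
new_dir = os.path.join(root, "new")
os.mkdir(old_dir)
os.mkdir(new_dir)
with open(os.path.join(old_dir, "mod.pyc"), "wb") as f:
    f.write(b"stale bytecode placeholder")
with open(os.path.join(new_dir, "mod.py"), "w") as f:
    f.write("value = 1\n")

search_path = [old_dir, new_dir]
strict = locate("mod", search_path)                     # lone .pyc ignored
lenient = locate("mod", search_path, (".py", ".pyc"))   # lone .pyc accepted
```

With the default suffix list the moved source file is found; once
".pyc" is added, the forgotten file earlier on the path silently
wins, which is exactly the frustration described above.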

I also want to support (Jim A notwithstanding :-) a feature whereby
different things besides directories can live on sys.path, as long as
they are strings -- these could be added from the PYTHONPATH env
variable.  Every piece of code that I've ever seen that uses sys.path
doesn't care if a directory named in sys.path doesn't exist -- it may
try to stat various files in it, which also don't exist, and as far as
it is concerned that is just an indication that the requested module
doesn't live there.

Again, we would have to dissect imputil to support various hooks that
deal with different kinds of entities in sys.path.  The default hook
list would consist of a single item that interprets the name as a
directory name; other hooks could support zip files or URLs.  Jack's
"magic cookies" could also be supported nicely through such a
mechanism.
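
A sketch of what per-entry hooks might look like, under the
assumption that each hook is simply a function taking a path entry
and a module name.  directory_hook(), zip_hook() and find_source()
are illustrative names; the zipfile module used here is today's
standard library, not the CNRI code:

```python
import os
import tempfile
import zipfile


def directory_hook(entry, name):
    """Treat `entry` as a directory; return the source of name.py or None."""
    candidate = os.path.join(entry, name + ".py")
    if os.path.isfile(candidate):
        with open(candidate) as f:
            return f.read()
    return None


def zip_hook(entry, name):
    """Treat `entry` as a zip archive; return the source of name.py or None."""
    if not (os.path.isfile(entry) and zipfile.is_zipfile(entry)):
        return None
    with zipfile.ZipFile(entry) as archive:
        try:
            return archive.read(name + ".py").decode("utf-8")
        except KeyError:  # the archive has no such member
            return None


def find_source(name, path, entry_hooks):
    """Offer every path entry to every hook; the first hit wins."""
    for entry in path:
        for hook in entry_hooks:
            source = hook(entry, name)
            if source is not None:
                return entry, source
    return None


# Demo: the path names a directory that does not exist, then a zip file.
demo_dir = tempfile.mkdtemp()
archive_path = os.path.join(demo_dir, "bundle.zip")
with zipfile.ZipFile(archive_path, "w") as archive:
    archive.writestr("spam.py", "x = 1\n")

search_path = [os.path.join(demo_dir, "missing"), archive_path]
dirs_only = find_source("spam", search_path, [directory_hook])
with_zip = find_source("spam", search_path, [directory_hook, zip_hook])
```

Note that the nonexistent directory causes no error: each hook just
reports that the module does not live there, and the search moves on.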

> Users can insert/append new importers or alter sys.path as before.
> 
> sys.modules continues to record name:module mappings.

Yes.

Note that the interpretation of __file__ could be problematic.  To
what value do you set __file__ for a module loaded from a zip archive?

> > - $PYTHONPATH and $PYTHONHOME should still be supported
> 
> No problem.
> 
> > (I wouldn't mind a splitting up of importdl.c into several
> > platform-specific files, one of which is chosen by the configure
> > script; but that's a bit of a separate issue.)
> 
> Easy enough. The standard importer can select the appropriate
> platform-specific module/function to perform the load. i.e. these can move
> to Modules/ and be split into a module-per-platform.

Again: what's the advantage of exposing the platform specificity?

> > New features:
> > -------------
> > 
> > - Integrated support for Greg Ward's distribution utilities (i.e. a
> >   module prepared by the distutil tools should install painlessly)
> 
> I don't know the specific requirements/functionality that would be
> required here (does Greg? :-), but I can't imagine any problem with this.

Probably more support is required from the other end: once it's common
for modules to be imported from zip files, the distutil code needs to
support the creation and installation of such zip files.  Also, there
is a need for the install phase of distutil to communicate the
location of the zip file to the Python installation.

> > - Good support for prospective authors of "all-in-one" packaging
> >   tools like Gordon McMillan's win32 installer or /F's squish.  (But
> >   I *don't* require backwards compatibility for existing tools.)
> 
> Um. *No* problem. :-)

:-)

> > - Standard import from zip or jar files, in two ways:
> > 
> >   (1) an entry on sys.path can be a zip/jar file instead of a directory;
> >       its contents will be searched for modules or packages

Note that this is what I mention above for distutil support.

> While this could easily be done, I might argue against it. Old
> apps/modules that process sys.path might get confused.

Above I argued that this shouldn't be a problem.

> If compatibility is not an issue, then "No problem."
> 
> An alternative would be an Importer instance added to sys.importers that
> is configured for a specific archive (in other words, don't add the zip
> file to sys.path, add ZipImporter(file) to sys.importers).

This would be harder for distutil: where does Python get the initial
list of importers?

> Another alternative is an Importer that looks at a "sys.py_archives" list.
> Or an Importer that has a py_archives instance attribute.

OK, but again distutil needs to be able to add to this list when it
installs a package.  (Note that package deinstallation should also be
supported!)

(Of course I don't require this to affect Python processes that are
already running; but it should be possible to easily change the
default search path for all newly started instances of a given Python
installation.)

> >   (2) a file in a directory that's on sys.path can be a zip/jar file;
> >       its contents will be considered as a package (note that this is
> >       different from (1)!)
> 
> No problem. This will slow things down, as a stat() for *.zip and/or *.jar
> must be done, in addition to *.py, *.pyc, and *.pyo.

Fine, this is where the caching comes in handy.

> >   I don't particularly care about supporting all zip compression
> >   schemes; if Java gets away with only supporting gzip compression
> >   in jar files, so can we.
> 
> I presume we would support whatever zlib gives us, and no more.

That's it. :-)

> > - Easy ways to subclass or augment the import mechanism along
> >   different dimensions.  For example, while none of the following
> >   features should be part of the core implementation, it should be
> >   easy to add any or all:
> > 
> >   - support for a new compression scheme to the zip importer
> 
> Presuming ZipImporter is a class (derived from Importer), then this
> ability is wholly dependent upon the author of ZipImporter providing the
> hook.

Agreed.  But since we're likely going to provide this as a standard
feature, we must ensure that it provides this hook.

> The Importer class is already designed for subclassing (and its interface 
> is very narrow, which means delegation is also *very* easy; see
> imputil.FuncImporter).

But maybe it's *too* narrow; some of the hooks I suggest above seem to
require extra interfaces -- at least in some of the subclasses of the
Importer base class.

Note: I looked at the doc string for get_code() and I don't understand
what the difference is between the modname and fqname arguments.  If I
write "import foo.bar", what are modname and fqname?  Why are both
present?  Also, while you claim that the API is narrow, the multiple
return values (also the different types for the second item) make it
complicated.

> >   - support for a new archive format, e.g. tar
> 
> A cakewalk. Gordon, JimA, and myself each have archive formats. :-)
> 
> >   - a hook to import from URLs or other data sources (e.g. a
> >     "module server" imported in CORBA) (this needn't be supported
> >     through $PYTHONPATH though)
> 
> No problem at all.
> 
> >   - a hook that imports from compressed .py or .pyc/.pyo files
> 
> No problem at all.
> 
> >   - a hook to auto-generate .py files from other filename
> >     extensions (as currently implemented by ILU)
> 
> No problem at all.

See above -- I think this should be more integrated with sys.path than
you are thinking of.  The more I think about it, the more I see that
the problem is that for you, the importer that uses sys.path is a
final subclass of Importer (i.e. it is itself not further subclassed).
Several of the hooks I want seem to require additional hooks in the
PathImporter rather than new importers.

> >   - a cache for file locations in directories/archives, to improve
> >     startup time
> 
> No problem at all.
> 
> >   - a completely different source of imported modules, e.g. for an
> >     embedded system or PalmOS (which has no traditional filesystem)
> 
> No problem at all.
> 
> In each of the above cases, the Importer.get_code() method just needs to
> grab the byte codes from the XYZ data source. That data source can be
> compressed, across a network, on-the-fly generated, or whatever. Each
> importer can certainly create a cache based on its concept of "location".
> In some cases, that would be a mapping from module name to filesystem
> path, or to a URL, or to a compiled-in, frozen module.

See above for sys.path integration remark.

> > - Note that different kinds of hooks should (ideally, and within
> >   reason) properly combine, as follows: if I write a hook to recognize
> >   .spam files and automatically translate them into .py files, and you
> >   write a hook to support a new archive format, then if both hooks are
> >   installed together, it should be possible to find a .spam file in an
> >   archive and do the right thing, without any extra action.  Right?
> 
> Ack. Very, very difficult.

Actually, I take most of this back.  Importers that deal with new
extension types often have to go through a file system to transform
their data to .py files, and this is just too complicated.  However, it
would still be nice if there was code sharing between the code that
looks for .py and .pyc files in a zip archive and the code that does
the same in a filesystem.  Hm, maybe even that shouldn't be necessary,
the zip file probably should contain only .pyc files...

(Unrelated remark: I should really try to release the set of modules
we've written here at CNRI to deal with zip files.  Unfortunately zip
files are hairy and so is our code.)

> The imputil scheme combines the concept of locating/loading into one step.
> There is only one "hook" in the imputil system. Its semantic is "map this
> name to a code/module object and return it; if you don't have it, then
> return None."

That's fine.  I actually don't recall where the find-then-load API
came from; I think it may be an artefact of the original
implementation strategy.  It is currently used as follows: we try to
see if there's a .pyc and then we try to see if there's a .py; if both
exist we compare the timestamps etc. to choose which one.  But that's
still a red herring.
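
For reference, the .py/.pyc choice just described can be sketched as
follows.  This is a deliberate simplification: choose_file() is a
hypothetical helper that compares the two files' modification times,
whereas the real check compares the source's mtime against the
timestamp stored inside the .pyc file:

```python
import os
import tempfile


def choose_file(directory, name):
    """Pick name.pyc if it is at least as new as name.py, else name.py."""
    source = os.path.join(directory, name + ".py")
    compiled = os.path.join(directory, name + ".pyc")
    have_source = os.path.isfile(source)
    have_compiled = os.path.isfile(compiled)
    if have_source and have_compiled:
        if os.path.getmtime(compiled) >= os.path.getmtime(source):
            return compiled
        return source  # bytecode is out of date
    if have_source:
        return source
    if have_compiled:
        return compiled  # a lone .pyc; whether to accept it is a policy choice
    return None


# Demo: set the timestamps explicitly so the outcome does not depend on timing.
demo_dir = tempfile.mkdtemp()
source_path = os.path.join(demo_dir, "mod.py")
compiled_path = os.path.join(demo_dir, "mod.pyc")
with open(source_path, "w") as f:
    f.write("value = 1\n")
with open(compiled_path, "wb") as f:
    f.write(b"bytecode placeholder")

os.utime(source_path, (1000, 1000))
os.utime(compiled_path, (2000, 2000))
fresh = choose_file(demo_dir, "mod")   # .pyc is newer than .py

os.utime(source_path, (3000, 3000))
stale = choose_file(demo_dir, "mod")   # .py was edited after compiling
```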

> Your compositing example is based on the capabilities of the
> find-then-load paradigm of the existing "ihooks.py". One module finds
> something (foo.spam) and the other module loads it (by generating a .py).

I still don't understand why ihooks.py had to be so complicated.  I
guess I just had much less of an understanding of the issues.  (It was
also partly a compromise with an alternative design by Ken Manheimer,
who basically forced me to support packages, originally through ni.py.)

> All is not lost, however. I can easily envision the get_code() hook as
> allowing any kind of return type. If it isn't a code or module object,
> then another hook is called to transform it.
> [ actually, I'd design it similarly: a *series* of hooks would be called
>   until somebody transforms the foo.spam into a code/module object. ]

OK.  This could be a feature of a subclass of Importer.
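
A minimal sketch of such a transform chain, assuming each hook is a
function that either converts a value or returns None to pass.
Everything here (run_transforms(), SpamSource, the two hooks) is
invented for illustration and is not part of imputil:

```python
import types

FINAL_TYPES = (types.CodeType, types.ModuleType)


def run_transforms(result, transforms):
    """Apply hooks in order until the value is a code or module object."""
    for transform in transforms:
        if isinstance(result, FINAL_TYPES):
            break
        converted = transform(result)
        if converted is not None:
            result = converted
    if not isinstance(result, FINAL_TYPES):
        raise ImportError("no hook could turn %r into code" % (result,))
    return result


class SpamSource:
    """Hypothetical intermediate value: the text of a foo.spam file."""

    def __init__(self, text):
        self.text = text


def spam_to_python(value):
    """Toy translator: turn .spam text into Python source."""
    if isinstance(value, SpamSource):
        return value.text.replace("SPAM", "42")
    return None


def python_to_code(value):
    """Compile Python source text into a code object."""
    if isinstance(value, str):
        return compile(value, "<spam>", "exec")
    return None


code = run_transforms(SpamSource("answer = SPAM\n"),
                      [spam_to_python, python_to_code])
namespace = {}
exec(code, namespace)
```

If the .spam hook is left out of the list, run_transforms() raises
ImportError, since nothing can turn the SpamSource value into code.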

> The compositing would be limited only by the (Python-based) Importer
> classes. For example, my ZipImporter might expect to zip up .pyc files
> *only*. Obviously, you would want to alter this to support zipping any
> file, then use the suffix to determine what to do at unzip time.
> 
> > - It should be possible to write hooks in C/C++ as well as Python
> 
> Use FuncImporter to delegate to an extension module.

Maybe not so great, since it sounds like the C code can't benefit from
any of the infrastructure that imputil offers.  I'm not sure about
this one though.

> This is one of the benefits of imputil's single/narrow interface.

Plus its vague specs? :-)

> > - Applications embedding Python may supply their own implementations,
> >   default search path, etc., but don't have to if they want to piggyback
> >   on an existing Python installation (even though the latter is
> >   fraught with risk, it's cheaper and easier to understand).
> 
> An application would have full control over the contents of sys.importers.
> 
> For a restricted execution app, it might install an Importer that loads
> files from *one* directory only which is configured from a specific
> Win32 Registry entry. That importer could also refuse to load shared
> modules. The BuiltinImporter would still be present (although the app
> would certainly omit all but the necessary builtins from the build).
> Frozen modules could be excluded.

Actually there's little reason to exclude frozen modules or any
.py/.pyc modules -- by definition, bytecode can't be dangerous.  It's
the builtins and extensions that need to be censored.

We currently do this by subclassing ihooks, where we mask the test for
builtins with a comparison to a predefined list of names.
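
The masking idea can be sketched like this.  It is not the actual
ihooks subclass; RestrictedBuiltinImporter and its method names are
hypothetical, and sys.builtin_module_names is used as the test for
"is a builtin":

```python
import importlib
import sys


class RestrictedBuiltinImporter:
    """Hypothetical importer that only admits builtins from an allowed list."""

    def __init__(self, ok_builtin_modules):
        self.ok_builtin_modules = frozenset(ok_builtin_modules)

    def is_builtin(self, name):
        # Mask the real test with a comparison against the predefined names.
        return (name in sys.builtin_module_names
                and name in self.ok_builtin_modules)

    def get_module(self, name):
        if not self.is_builtin(name):
            return None  # censored or not a builtin: pretend it is absent
        return importlib.import_module(name)


restricted = RestrictedBuiltinImporter(["sys"])
allowed = restricted.get_module("sys")
censored = restricted.get_module("marshal")
```

A builtin that is not on the list simply looks nonexistent to code
running under this importer, so the rest of the import machinery
needs no special cases.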

> > Implementation:
> > ---------------
> > 
> > - There must clearly be some code in C that can import certain
> >   essential modules (to solve the chicken-or-egg problem), but I don't
> >   mind if the majority of the implementation is written in Python.
> >   Using Python makes it easy to subclass.
> 
> I posited once before that the cost of import is mostly I/O rather than
> CPU, so using Python should not be an issue. MAL demonstrated that a good
> design for the Importer classes is also required. Based on this, I'm a
> *strong* advocate of moving as much as possible into Python (to get
> Python's ease-of-coding with little relative cost).

Agreed.  But how do you explain the slowdown (from 9 to 13
seconds, as I recall)?  Are you a lousy coder? :-)

> The (core) C code should be able to search a path for a module and import
> it. It does not require dynamic loading or packages. This will be used to
> import exceptions.py, then imputil.py, then site.py.

It does, however, need to import builtin modules.  imputil currently
imports imp, sys, strop and __builtin__, struct and marshal; note that
struct can easily be a dynamic loadable module, and so could strop in
theory.  (Note that strop will be unnecessary in 1.6 if you use string
methods.)

I don't think that this chicken-or-egg problem is particularly
problematic though.

> The platform-specific module that performs dynamic-loading must be a
> statically linked module (in Modules/ ... it doesn't have to be in the
> Python/ directory).

See earlier comments.

> site.py can complete the bootstrap by setting up sys.importers with the
> appropriate Importer instances (this is where an application can define
> its own policy). sys.path was initially set by the import.c bootstrap code
> (from the compiled-in path and environment variables).

I think that algorithm (currently in getpath.c / getpathp.c) might
also be moved to Python code -- imported frozen.  Sadly, rebuilding
with a new version of a frozen module might be more complicated than
rebuilding with a new version of a C module, but writing and
maintaining this code in Python would be *sooooooo* much easier that I
think it's worth it.

> Note that imputil.py would not install any hooks when it is loaded. That
> is up to site.py. This implies the core C code will import a total of
> three modules using its builtin system. After that, the imputil mechanism
> would be importing everything (site.py would .install() an Importer which
> then takes over the __import__ hook).

(Three not counting the builtin modules.)

> Further note that the "import" Python statement could be simplified to use
> only the hook. However, this would require the core importer to inject
> some module names into the imputil module's namespace (since it couldn't
> use an import statement until a hook was installed). While this
> simplification is "neat", it complicates the run-time system (the import
> statement is broken until a hook is installed).

Same chicken-or-egg.  We can be pragmatic.

For a developer, I'd like a bit of robustness (all this makes it
rather hard to debug a broken imputil, and that's a fair amount of
code!).

> Therefore, the core C code must also support importing builtins. "sys" and
> "imp" are needed by imputil to bootstrap.
> 
> The core importer should not need to deal with dynamic-load modules.

Same question.  Since that all has to be coded in C anyway, why not?

> To support frozen apps, the core importer would need to support loading
> the three modules as frozen modules.

I'd like to see a description of how someone like Jim A would build a
single-file application using the new mechanism.  This could
completely replace freeze.  (Freeze currently requires a C compiler;
that's bad.)

> The builtin/frozen importing would be exposed thru "imp" for use by
> imputil for future imports. imputil would load and use the (builtin)
> platform-specific module to do dynamic-load imports.

Sure.

> > - In order to support importing from zip/jar files using compression,
> >   we'd at least need the zlib extension module and hence libz itself,
> >   which may not be available everywhere.
> 
> Yes. I don't see this as a requirement, though. We wouldn't start to use
> these by default, would we? Or insist on zlib being present? I see this as
> more along the lines of "we have provided a standardized Importer to do
> this, *provided* you have zlib support."

Agreed.  Zlib support is easy to get, but there are probably platforms
where it's not.  (E.g. maybe the Mac?  I suppose that on the Mac,
there would be some importer classes to import from a resource fork.)

> > - I suppose that the bootstrap is solved using a mechanism very
> >   similar to what freeze currently uses (other solutions seem to be
> >   platform dependent).
> 
> The bootstrap that I outlined above could be done in C code. The import
> code would be stripped down dramatically because you'll drop package
> support and dynamic loading.

Not the dynamic loading.  But yes the package support.

> Alternatively, you could probably do the path-scanning in Python and
> freeze that into the interpreter. Personally, I don't like this idea as it
> would not buy you much at all (it would still need to return to C for
> accessing a number of scanning functions and module importing funcs).
> 
> > - I also want to still support importing *everything* from the
> >   filesystem, if only for development.  (It's hard enough to deal with
> >   the fact that exceptions.py is needed during Py_Initialize();
> >   I want to be able to hack on the import code written in Python
> >   without having to rebuild the executable all the time.)
> 
> My outline above does not freeze anything. Everything resides in the
> filesystem. The C code merely needs a path-scanning loop and functions to
> import .py*, builtin, and frozen types of modules.

Good.  Though I think there's also a need for freezing everything.
And when we go the route of the zip archive, the zip archive handling
code needs to be somewhere -- frozen seems to be a reasonable choice.

> If somebody nukes their imputil.py or site.py, then they return to Python
> 1.4 behavior where the core interpreter uses a path for importing (i.e. no
> packages). They lose dynamically-loaded module support.

But if the path guessing is also done by site.py (as I propose) the
path will probably be wrong.  A warning should be printed.

> > Let's first complete the requirements gathering.  Are these
> > requirements reasonable?  Will they make an implementation too
> > complex?  Am I missing anything?
> 
> I'm not a fan of the compositing due to it requiring a change to semantics
> that I believe are very useful and very clean. However, I outlined a
> possible, clean solution to do that (a secondary set of hooks for
> transforming get_code() return values).

As you may see from my responses, I'm a big fan of having several
different sets of hooks.  I do withdraw the composition requirement
though.

> The requirements are otherwise reasonable to me, as I see that they can
> all be readily solved (i.e. they aren't burdensome).
> 
> While this email may be long, I do not believe the resulting system would
> be complex. From the user-visible side of things, nothing would be
> changed. sys.path is still present and operates as before. They *do* have
> new functionality they can grow into, though (sys.importers). The
> underlying C code is simplified, and the platform-specific dynamic-load
> stuff can be distributed to distinct modules, as needed
> (e.g. BeOS/dynloadmodule.c and PC/dynloadmodule.c).
> 
> > Finally, to what extent does this impact the desire for dealing
> > differently with the Python bytecode compiler (e.g. supporting
> > optimizers written in Python)?  And does it affect the desire to
> > implement the read-eval-print loop (the >>> prompt) in Python?
> 
> If the three startup files require byte-compilation, then you could have
> some issues (i.e. the byte-compiler must be present).

Another chicken-or-egg.  No biggie.

> Once you hit site.py, you have a "full" environment and can easily detect
> and import a read-eval-print loop module (i.e. why return to Python? just 
> start things up right there).

You mean "why return to C?"  I agree.  It would be cool if somehow
IDLE and Pythonwin would also be bootstrapped using the same
mechanisms.  (This would also solve the question "which interactive
environment am I using?" that some modules and apps want to see
answered because they need to do things differently when run under
IDLE, for example.)

> site.py can also install new optimizers as desired, a new Python-based
> parser or compiler, or whatever...  If Python is built without a parser or
> compiler (I hope that's an option!), then the three startup modules would
> simply be frozen into the executable.

More power to hooks!

--Guido van Rossum (home page: http://www.python.org/~guido/)