[Python-Dev] Draft PEP: "Simplified Package Layout and Partitioning"

P.J. Eby pje at telecommunity.com
Wed Jul 20 05:58:55 CEST 2011


So, over on the Import-SIG, we were talking about the implementation 
and terminology for PEP 382, and it became increasingly obvious that 
things were, well, not entirely okay in the "implementation is easy 
to explain" department.

Anyway, to make a long story short, we came up with an alternative 
implementation plan that actually solves some other problems besides 
the one that PEP 382 sets out to solve, and whose implementation is a 
bit easier to explain.  (In fact, for users coming from various 
other languages, it hardly needs any explanation at all.)

However, for long-time users of Python, the approach may require a 
bit more justification, which is why roughly 2/3rds of the PEP 
consists of a detailed rationale, specification overview, rejected 
alternatives, and backwards-compatibility discussion...  which is 
still a lot less verbiage than reading through the lengthy Import-SIG 
threads that led up to the proposal.  ;-)  (The remaining 1/3rd of 
the PEP is the short, sweet, and easy-to-explain implementation detail.)

Anyway, the PEP has already been discussed on the Import-SIG, and is 
proposed as an alternative to PEP 382 ("Namespace packages").  We 
expect, however, that many people will be interested in it for 
reasons having little to do with the namespace packaging use case.

So, we would like to submit this for discussion, hole-finding, and 
eventual Pronouncement.  As Barry put it, "I think it's certainly 
worthy of posting to python-dev to see if anybody else can shoot 
holes in it, or come up with useful solutions to open 
questions.  I'll be very interested to see Guido's reaction to it. :)"

So, without further ado, here it is:

PEP: XXX
Title: Simplified Package Layout and Partitioning
Version: $Revision$
Last-Modified: $Date$
Author: P.J. Eby
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 12-Jul-2011
Python-Version: 3.3
Post-History:
Replaces: 382

Abstract
========

This PEP proposes an enhancement to Python's package importing
to:

* Surprise users of other languages less,
* Make it easier to convert a module into a package, and
* Support dividing packages into separately installed components
   (ala "namespace packages", as described in PEP 382)

The proposed enhancements do not change the semantics of any
currently-importable directory layouts, but make it possible for
packages to use a simplified directory layout (that is not importable
currently).

In addition, the proposed changes do NOT add any performance overhead
to the importing of existing modules or packages, and performance for
the new directory layout should be about the same as that of previous
"namespace package" solutions (such as ``pkgutil.extend_path()``).


The Problem
===========

.. epigraph::

     "Most packages are like modules.  Their contents are highly
     interdependent and can't be pulled apart.  [However,] some
     packages exist to provide a separate namespace. ...  It should
     be possible to distribute sub-packages or submodules of these
     [namespace packages] independently."

     -- Jim Fulton, shortly before the release of Python 2.3 [1]_


When new users come to Python from other languages, they are often
confused by Python's packaging semantics.  At Google, for example,
Guido received complaints from "a large crowd with pitchforks" [2]_
that the requirement for packages to contain an ``__init__`` module
was a "misfeature", and should be dropped.

In addition, users coming from languages like Java or Perl are
sometimes confused by a difference in Python's import path searching.

In most other languages that have a similar path mechanism to Python's
``sys.path``, a package is merely a namespace that contains modules
or classes, and can thus be spread across multiple directories in
the language's path.  In Perl, for instance, a ``Foo::Bar`` module
will be searched for in ``Foo/`` subdirectories all along the module
include path, not just in the first such subdirectory found.

Worse, this is not just a problem for new users: it prevents *anyone*
from easily splitting a package into separately-installable
components.  In Perl terms, it would be as if every possible ``Net::``
module on CPAN had to be bundled up and shipped in a single tarball!

For that reason, various workarounds for this latter limitation exist,
circulated under the term "namespace packages".  The Python standard
library has provided one such workaround since Python 2.3 (via the
``pkgutil.extend_path()`` function), and the "setuptools" package
provides another (via ``pkg_resources.declare_namespace()``).

The workarounds themselves, however, fall prey to a *third* issue with
Python's way of laying out packages in the filesystem.

Because a package *must* contain an ``__init__`` module, any attempt
to distribute modules for that package must necessarily include that
``__init__`` module, if those modules are to be importable.

However, the very fact that each distribution of modules for a package
must contain this (duplicated) ``__init__`` module, means that OS
vendors who package up these module distributions must somehow handle
the conflict caused by several module distributions installing that
``__init__`` module to the same location in the filesystem.

This led to the proposal of PEP 382 ("Namespace Packages") -- a way
to signal to Python's import machinery that a directory was
importable, using unique filenames per module distribution.

However, there was more than one downside to this approach.
Performance for all import operations would be affected, and the
process of designating a package became even more complex.  New
terminology had to be invented to explain the solution, and so on.

As terminology discussions continued on the Import-SIG, it soon became
apparent that the main reason it was so difficult to explain the
concepts related to "namespace packages" was that Python's
current way of handling packages is somewhat underpowered, when
compared to other languages.

That is, in other popular languages with package systems, no special
term is needed to describe "namespace packages", because *all*
packages generally behave in the desired fashion.

Rather than being an isolated single directory with a special marker
module (as in Python), packages in other languages are typically just
the *union* of appropriately-named directories across the *entire*
import or inclusion path.

In Perl, for example, the module ``Foo`` is always found in a
``Foo.pm`` file, and a module ``Foo::Bar`` is always found in a
``Foo/Bar.pm`` file.  (In other words, there is One Obvious Way to
find the location of a particular module.)

This is because Perl considers a module to be *different* from a
package: the package is purely a *namespace* in which other modules
may reside, and is only *coincidentally* the name of a module as well.

In current versions of Python, however, the module and the package are
more tightly bound together.  ``Foo`` is always a module -- whether it
is found in ``Foo.py`` or ``Foo/__init__.py`` -- and it is tightly
linked to its submodules (if any), which *must* reside in the exact
same directory where the ``__init__.py`` was found.

On the positive side, this design choice means that a package is quite
self-contained, and can be installed, copied, etc. as a unit just by
performing an operation on the package's root directory.

On the negative side, however, it is non-intuitive for beginners, and
requires a more complex step to turn a module into a package.  If
``Foo`` begins its life as ``Foo.py``, then it must be moved and
renamed to ``Foo/__init__.py``.

Conversely, if you intend to create a ``Foo.Bar`` module from the
start, but have no particular module contents to put in ``Foo``
itself, then you have to create an empty and seemingly-irrelevant
``Foo/__init__.py`` file, just so that ``Foo.Bar`` can be imported.

(And these issues don't just confuse newcomers to the language,
either: they annoy many experienced developers as well.)

So, after some discussion on the Import-SIG, this PEP was created
as an alternative to PEP 382, in an attempt to solve *all* of the
above problems, not just the "namespace package" use cases.

And, as a delightful side effect, the solution proposed in this PEP
does not affect the import performance of ordinary modules or
self-contained (i.e. ``__init__``-based) packages.


The Solution
============

In the past, various proposals have been made to allow more intuitive
approaches to package directory layout.  However, most of them failed
because of an apparent backward-compatibility problem.

That is, if the requirement for an ``__init__`` module were simply
dropped, it would open up the possibility for a directory named, say,
``string`` on ``sys.path``, to block importing of the standard library
``string`` module.

Paradoxically, however, the failure of this approach does *not* arise
from the elimination of the ``__init__`` requirement!

Rather, the failure arises because the underlying approach takes for
granted that a package is just ONE thing, instead of two.

In truth, a package comprises two separate, but related entities: a
module (with its own, optional contents), and a *namespace* where
*other* modules or packages can be found.

In current versions of Python, however, the module part (found in
``__init__``) and the namespace for submodule imports (represented
by the ``__path__`` attribute) are both initialized at the same time,
when the package is first imported.

And, if you assume this is the *only* way to initialize these two
things, then there is no way to drop the need for an ``__init__``
module, while still being backwards-compatible with existing directory
layouts.

After all, as soon as you encounter a directory on ``sys.path``
matching the desired name, that means you've "found" the package, and
must stop searching, right?

Well, not quite.


A Thought Experiment
--------------------

Let's hop into the time machine for a moment, and pretend we're back
in the early 1990s, shortly before Python packages and ``__init__.py``
have been invented.  But, imagine that we *are* familiar with
Perl-like package imports, and we want to implement a similar system
in Python.

We'd still have Python's *module* imports to build on, so we could
certainly conceive of having ``Foo.py`` as a parent ``Foo`` module
for a ``Foo`` package.  But how would we implement submodule and
subpackage imports?

Well, if we didn't have the idea of ``__path__`` attributes yet,
we'd probably just search ``sys.path`` looking for ``Foo/Bar.py``.

But we'd *only* do it when someone actually tried to *import*
``Foo.Bar``.

NOT when they imported ``Foo``.

And *that* lets us get rid of the backwards-compatibility problem
of dropping the ``__init__`` requirement, back here in 2011.

How?

Well, when we ``import Foo``, we're not even *looking* for ``Foo/``
directories on ``sys.path``, because we don't *care* yet.  The only
point at which we care, is the point when somebody tries to actually
import a submodule or subpackage of ``Foo``.

That means that if ``Foo`` is a standard library module (for example),
and I happen to have a ``Foo`` directory on ``sys.path`` (without
an ``__init__.py``, of course), then *nothing breaks*.  The ``Foo``
module is still just a module, and it's still imported normally.


Self-Contained vs. "Virtual" Packages
-------------------------------------

Of course, in today's Python, trying to ``import Foo.Bar`` will
fail if ``Foo`` is just a ``Foo.py`` module (and thus lacks a
``__path__`` attribute).

So, this PEP proposes to *dynamically* create a ``__path__``, in the
case where one is missing.

That is, if I try to ``import Foo.Bar``, the proposed change to the
import machinery will notice that the ``Foo`` module lacks a
``__path__``, and will therefore try to *build* one before proceeding.

And it will do this by making a list of all the existing ``Foo/``
subdirectories of the directories listed in ``sys.path``.

If the list is empty, the import will fail with ``ImportError``, just
like today.  But if the list is *not* empty, then it is saved in
a new ``Foo.__path__`` attribute, making the module a "virtual
package".

That is, because it now has a valid ``__path__``, we can proceed
to import submodules or subpackages in the normal way.
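
As a rough sketch of that list-building step (not the proposed
implementation, which goes through PEP 302 importers as described in
the Specification), plain directory checks are enough to show the
idea.  The helper name ``build_virtual_path`` and its ``search_path``
parameter are made up for this illustration:

```python
import os
import sys


def build_virtual_path(name, search_path=None):
    """Return every existing ``<entry>/<name>`` directory, in path order."""
    if search_path is None:
        search_path = sys.path
    found = []
    for entry in search_path:
        # An empty string on sys.path stands for the current directory.
        candidate = os.path.join(entry or os.curdir, name)
        if os.path.isdir(candidate):
            found.append(candidate)
    return found
```

An empty result corresponds to the ``ImportError`` case; a non-empty
one is what would be stored in ``Foo.__path__``.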

Now, notice that this change does not affect "classic", self-contained
packages that have an ``__init__`` module in them.  Such packages
already *have* a ``__path__`` attribute (initialized at import time)
so the import machinery won't try to create another one later.

This means that (for example) the standard library ``email`` package
will not be affected in any way by you having a bunch of unrelated
directories named ``email`` on ``sys.path``.  (Even if they contain
``*.py`` files.)

But it *does* mean that if you want to turn your ``Foo`` module into
a ``Foo`` package, all you have to do is add a ``Foo/`` directory
somewhere on ``sys.path``, and start adding modules to it.

But what if you only want a "namespace package"?  That is, a package
that is *only* a namespace for various separately-distributed
submodules and subpackages?

For example, if you're Zope Corporation, distributing dozens of
separate tools like ``zc.buildout``, each in packages under the ``zc``
namespace, you don't want to have to make and include an empty
``zc.py`` in every tool you ship.  (And, if you're a Linux or other
OS vendor, you don't want to deal with the package installation
conflicts created by trying to install ten copies of ``zc.py`` to the
same location!)

No problem.  All we have to do is make one more minor tweak to the
import process: if the "classic" import process fails to find a
self-contained module or package (e.g., if ``import zc`` fails to find
a ``zc.py`` or ``zc/__init__.py``), then we once more try to build a
``__path__`` by searching for all the ``zc/`` directories on
``sys.path``, and putting them in a list.

If this list is empty, we raise ``ImportError``.  But if it's
non-empty, we create an empty ``zc`` module, and put the list in
``zc.__path__``.  Congratulations: ``zc`` is now a namespace-only,
"pure virtual" package!  It has no module contents, but you can still
import submodules and subpackages from it, regardless of where they're
located on ``sys.path``.
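
To make the "pure virtual" case concrete, here is a minimal sketch
under the same simplifying assumption (plain directory checks instead
of importer objects).  The function name
``make_pure_virtual_package`` is hypothetical, and the sketch
deliberately does not touch ``sys.modules``:

```python
import os
import sys
import types


def make_pure_virtual_package(name, search_path=None):
    """Create an empty module whose __path__ lists every <entry>/<name> dir."""
    if search_path is None:
        search_path = sys.path
    path = []
    for entry in search_path:
        candidate = os.path.join(entry or os.curdir, name)
        if os.path.isdir(candidate):
            path.append(candidate)
    if not path:
        # No matching directory anywhere: behave like a failed import.
        raise ImportError("No module named %r" % name)
    module = types.ModuleType(name)  # no module contents of its own
    module.__path__ = path           # but a namespace for submodules
    return module
```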

(By the way, both of these additions to the import protocol (i.e. the
dynamically-added ``__path__``, and dynamically-created modules)
apply recursively to child packages, using the parent package's
``__path__`` in place of ``sys.path`` as a basis for generating a
child ``__path__``.  This means that self-contained and virtual
packages can contain each other without limitation, with the caveat
that if you put a virtual package inside a self-contained one, it's
gonna have a really short ``__path__``!)


Backwards Compatibility and Performance
---------------------------------------

Notice that these two changes *only* affect import operations that
today would result in ``ImportError``.  As a result, the performance
of imports that do not involve virtual packages is unaffected, and
potential backward compatibility issues are very restricted.

Today, if you try to import submodules or subpackages from a module
with no ``__path__``, it's an immediate error.  And of course, if you
don't have a ``zc.py`` or ``zc/__init__.py`` somewhere on ``sys.path``
today, ``import zc`` would likewise fail.

Thus, the only potential backwards-compatibility issues are:

1. Tools that expect package directories to have an ``__init__``
    module, that expect directories without an ``__init__`` module
    to be unimportable, or that expect ``__path__`` attributes to be
    static, will not recognize virtual packages as packages.

    (In practice, this just means that tools will need updating to
    support virtual packages, e.g. by using ``pkgutil.walk_packages()``
    instead of using hardcoded filesystem searches.)

2. Code that *expects* certain imports to fail may now do something
    unexpected.  This should be fairly rare in practice, as most sane,
    non-test code does not import things that are expected not to
    exist!

The biggest likely exception to the above would be when a piece of
code tries to check whether some package is installed by importing
it.  If this is done *only* by importing a top-level module (i.e., not
checking for a ``__version__`` or some other attribute), *and* there
is a directory of the same name as the sought-for package on
``sys.path`` somewhere, *and* the package is not actually installed,
then such code could *perhaps* be fooled into thinking a package is
installed that really isn't.

However, even in the rare case where all these conditions line up to
happen at once, the failure is more likely to be annoying than
damaging.  In most cases, after all, the code will simply fail a
little later on, when it actually tries to DO something with the
imported (but empty) module.  (And code that checks ``__version__``
attributes or for the presence of some desired function, class, or
module in the package will not see a false positive result in the
first place.)
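
A sketch of the more robust style of check described above might look
like the following.  The helper name ``looks_installed`` is invented
for this example; the attribute to test for is whatever the calling
code actually needs from the package:

```python
import importlib


def looks_installed(name, required_attr):
    """Return True only if `name` imports AND exposes `required_attr`."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return False
    # An empty virtual package would import fine but lack real contents.
    return hasattr(module, required_attr)
```

For example, ``looks_installed("json", "dumps")`` is true for the
standard library ``json`` package, while an empty module of the same
name would fail the attribute test.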

Meanwhile, tools that expect to locate packages and modules by
walking a directory tree can be updated to use the existing
``pkgutil.walk_packages()`` API, and tools that need to inspect
packages in memory should use the other APIs described in the
`Standard Library Changes/Additions`_ section below.
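
For instance, a tool that currently walks the filesystem by hand could
delegate to the standard library's ``pkgutil.walk_packages()`` along
these lines.  The wrapper name ``list_importable_names`` is only an
illustration:

```python
import pkgutil


def list_importable_names(search_dirs, prefix=""):
    """Return sorted names of modules/packages visible under `search_dirs`."""
    # walk_packages() yields (finder, name, ispkg) triples; to recurse it
    # tries to import each package it finds, silently skipping those
    # whose import raises ImportError.
    return sorted(
        name
        for _finder, name, _ispkg in pkgutil.walk_packages(search_dirs, prefix)
    )
```

Note that in current Python versions this reports only modules and
``__init__``-based packages; a bare directory is not listed, which is
the behaviour the ``pkgutil`` changes proposed below would extend.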


Specification
=============

Two changes are made to the existing import process.

First, the built-in ``__import__`` function must not raise an
``ImportError`` when importing a submodule of a module with no
``__path__``.  Instead, it must attempt to *create* a ``__path__``
attribute for the parent module first, as described in `__path__
creation`_, below.

Second, if searching ``sys.meta_path`` and ``sys.path`` (or a parent
package ``__path__``) fails to find a module being imported, the
import process must attempt to create a ``__path__`` attribute for
the missing module.  If the attempt succeeds, an empty module is
created and its ``__path__`` is set.  Otherwise, importing fails.

In both of the above cases, if a non-empty ``__path__`` is created,
the name of the module whose ``__path__`` was created is added to
``sys.virtual_packages`` -- an initially-empty ``set()`` of package
names.

(This way, code that extends ``sys.path`` at runtime can find out
what virtual packages are currently imported, and thereby add any
new subdirectories to those packages' ``__path__`` attributes.  See
`Standard Library Changes/Additions`_ below for more details.)

Conversely, if an empty ``__path__`` results, an ``ImportError``
is immediately raised, and the module is not created or changed, nor
is its name added to ``sys.virtual_packages``.
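
The bookkeeping rules above can be sketched as follows.  Since
``sys.virtual_packages`` does not exist today, a module-level set
stands in for it, and ``ensure_virtual_path`` and ``compute_path`` are
names invented for this example (``compute_path`` represents the
``__path__`` creation step described in the next section):

```python
# Stand-in for the proposed sys.virtual_packages (hypothetical today).
virtual_packages = set()


def ensure_virtual_path(module, compute_path):
    """Give `module` a __path__ if it lacks one, or raise ImportError.

    `compute_path` is any callable that takes a module name and returns
    the list of subdirectories found for it.
    """
    if hasattr(module, "__path__"):
        return module  # self-contained package: leave it untouched
    path = compute_path(module.__name__)
    if not path:
        # Empty result: fail without changing the module or the set.
        raise ImportError("%s is not a package" % module.__name__)
    module.__path__ = path
    virtual_packages.add(module.__name__)
    return module
```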


``__path__`` Creation
---------------------

A virtual ``__path__`` is created by obtaining a PEP 302 "importer"
object for each of the path entries found in ``sys.path`` (for a
top-level module) or the parent ``__path__`` (for a submodule).

(Note: because ``sys.meta_path`` importers are not associated with
``sys.path`` or ``__path__`` entry strings, such importers do *not*
participate in this process.)

Each importer is checked for a ``get_subpath()`` method, and if
present, the method is called with the full name of the module/package
the ``__path__`` is being constructed for.  The return value is either
a string representing a subdirectory for the requested package, or
``None`` if no such subdirectory exists.

The strings returned by the importers are added to the ``__path__``
being built, in the same order as they are found.  (``None`` values
and missing ``get_subpath()`` methods are simply skipped.)

In Python code, the algorithm would look something like this::

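     import pkgutil
     import sys

     def get_virtual_path(modulename, parent_path=None):
         # Top-level modules are searched for along sys.path;
         # submodules along their parent package's __path__
         if parent_path is None:
             parent_path = sys.path

         path = []

         for entry in parent_path:
             # Obtain a PEP 302 importer object - see pkgutil module
             importer = pkgutil.get_importer(entry)

             # Importers without get_subpath() are simply skipped
             if hasattr(importer, 'get_subpath'):
                 subpath = importer.get_subpath(modulename)
                 if subpath is not None:
                     path.append(subpath)

         return path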

And a function like this one should be exposed in the standard
library as e.g. ``imp.get_virtual_path()``, so that people creating
``__import__`` replacements or ``sys.meta_path`` hooks can reuse it.
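
To show what an importer's side of this protocol might look like, here
is a toy sketch.  ``DirectoryImporter`` is a made-up stand-in rather
than a real PEP 302 importer, and the choice to map only the last
component of the dotted name onto a subdirectory is an assumption of
this example.  The loop is the same as in the algorithm above, except
that the importer factory is passed in, so the sketch runs without any
changes to the interpreter:

```python
import os


class DirectoryImporter:
    """Toy stand-in for a sys.path importer; NOT a real PEP 302 importer."""

    def __init__(self, path_entry):
        self.path_entry = path_entry

    def get_subpath(self, fullname):
        # Assumption: only the last name component names the directory,
        # so "zc.buildout" searched along zc.__path__ maps to
        # "<entry>/buildout".
        subdir = os.path.join(self.path_entry, fullname.rpartition(".")[2])
        return subdir if os.path.isdir(subdir) else None


def get_virtual_path(modulename, parent_path, importer_factory=DirectoryImporter):
    """Same loop as the algorithm above, with a pluggable importer factory."""
    path = []
    for entry in parent_path:
        importer = importer_factory(entry)
        if hasattr(importer, "get_subpath"):
            subpath = importer.get_subpath(modulename)
            if subpath is not None:
                path.append(subpath)
    return path
```

Importers returned by ``pkgutil.get_importer()`` today do not define
``get_subpath()``, so without the proposed changes every entry would
simply be skipped.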


Standard Library Changes/Additions
----------------------------------

The ``pkgutil`` module should be updated to handle this
specification appropriately, including any necessary changes to
``extend_path()``, ``iter_modules()``, etc.

Specifically, the proposed changes and additions to ``pkgutil`` are:

* A new ``extend_virtual_paths(path_entry)`` function, to extend
   existing, already-imported virtual packages' ``__path__`` attributes
   to include any portions found in a new ``sys.path`` entry.  This
   function should be called by applications extending ``sys.path``
   at runtime, e.g. when adding a plugin directory or an egg to the
   path.

   The implementation of this function does a simple top-down traversal
   of ``sys.virtual_packages``, and performs any necessary
   ``get_subpath()`` calls to identify what path entries need to
   be added to each package's ``__path__``, given that `path_entry`
   has been added to ``sys.path``.  (Or, in the case of sub-packages,
   adding a derived subpath entry, based on their parent namespace's
   ``__path__``.)

* A new ``iter_virtual_packages(parent='')`` function to allow
   top-down traversal of virtual packages in ``sys.virtual_packages``,
   by yielding the child virtual packages of `parent`.  For example,
   calling ``iter_virtual_packages("zope")`` might yield ``zope.app``
   and ``zope.products`` (if they are imported virtual packages listed
   in ``sys.virtual_packages``), but **not** ``zope.foo.bar``.
   (This function is needed to implement ``extend_virtual_paths()``,
   but is also potentially useful for other code that needs to inspect
   imported virtual packages.)

* ``ImpImporter.iter_modules()`` should be changed to also detect and
   yield the names of modules found in virtual packages.
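
The first two functions in the list above could be sketched roughly as
follows.  This is only an illustration under stated assumptions: the
set of virtual package names and the module table are passed in
explicitly (standing in for ``sys.virtual_packages`` and
``sys.modules``), and plain directory checks stand in for the
``get_subpath()`` calls:

```python
import os


def iter_virtual_packages(virtual_packages, parent=""):
    """Yield names in `virtual_packages` that are direct children of `parent`."""
    prefix = parent + "." if parent else ""
    for name in sorted(virtual_packages):
        # A direct child has the prefix and no further dots after it.
        if name.startswith(prefix) and "." not in name[len(prefix):]:
            yield name


def extend_virtual_paths(path_entry, virtual_packages, modules, parent=""):
    """Add portions found under `path_entry` to imported virtual packages."""
    for name in iter_virtual_packages(virtual_packages, parent):
        module = modules.get(name)
        if module is None:
            continue  # listed, but not (or not yet) imported
        subpath = os.path.join(path_entry, name.rpartition(".")[2])
        if not os.path.isdir(subpath):
            continue  # this entry contributes nothing to the package
        if subpath not in module.__path__:
            module.__path__.append(subpath)
        # Child packages are looked for beneath the portion just found.
        extend_virtual_paths(subpath, virtual_packages, modules, parent=name)
```

The traversal is top-down, so a child package's new portion is always
derived from the portion just added to its parent.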

In addition to the above changes, the ``zipimport`` importer should
have its ``iter_modules()`` implementation similarly changed.  (Note:
current versions of Python implement this via a shim in ``pkgutil``,
so technically this is also a change to ``pkgutil``.)

Last, but not least, the ``imp`` module (or ``importlib``, if
appropriate) should expose the algorithm described in the `__path__
creation`_ section above, as a
``get_virtual_path(modulename, parent_path=None)`` function, so that
creators of ``__import__`` replacements can use it.


Implementation Notes
--------------------

For users, developers, and distributors of virtual packages:

* While virtual packages are easy to set up and use, there is still
   a time and place for using self-contained packages.  While it's not
   strictly necessary, adding an ``__init__`` module to your
   self-contained packages lets users of the package (and Python
   itself) know that *all* of the package's code will be found in
   that single subdirectory.  In addition, it lets you define
   ``__all__``, expose a public API, provide a package-level docstring,
   and do other things that make more sense for a self-contained
   project than for a mere "namespace" package.

* ``sys.virtual_packages`` is allowed to contain non-existent or
   not-yet-imported package names; code that uses its contents should
   not assume that every name in this set is also present in
   ``sys.modules`` or that importing the name will necessarily succeed.

* If you are changing a currently self-contained package into a
   virtual one, it's important to note that you can no longer use its
   ``__file__`` attribute to locate data files stored in a package
   directory.  Instead, you must search ``__path__`` or use the
   ``__file__`` of a submodule adjacent to the desired files, or
   of a self-contained subpackage that contains the desired files.

   (Note: this caveat is already true for existing users of "namespace
   packages" today.  That is, it is an inherent result of being able
   to partition a package, that you must know *which* partition the
   desired data file lives in.  We mention it here simply so that
   *new* users converting from self-contained to virtual packages will
   also be aware of it.)

* XXX what is the ``__file__`` of a "pure virtual" package?  ``None``?
   Some arbitrary string?  The path of the first directory with a
   trailing separator?  No matter what we put, *some* code is
   going to break, but the last choice might allow some code to
   accidentally work.  Is that good or bad?
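
As an illustration of the data-file caveat in the list above, code
that used to rely on a package's ``__file__`` could instead search
each portion of the package.  The helper name ``find_data_file`` is
hypothetical:

```python
import os


def find_data_file(package, filename):
    """Return the first `filename` found in any portion of `package`."""
    # A virtual package may be spread over several directories, so each
    # __path__ entry is checked in order.
    for portion in getattr(package, "__path__", []):
        candidate = os.path.join(portion, filename)
        if os.path.isfile(candidate):
            return candidate
    return None
```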


For those implementing PEP 302 importer objects:

* Importers that support the ``iter_modules()`` method (used by
   ``pkgutil`` to locate importable modules and packages) and want to
   add virtual package support should modify their ``iter_modules()``
   method so that it discovers and lists virtual packages as well as
   standard modules and packages.  To do this, the importer should
   simply list all immediate subdirectory names in its jurisdiction
   that are valid Python identifiers.

   XXX This might list a lot of not-really-packages.  Should we
   require importable contents to exist?  If so, how deep do we
   search, and how do we prevent e.g. link loops, or traversing onto
   different filesystems, etc.?  Ick.

* "Meta" importers (i.e., importers placed on ``sys.meta_path``) do
   not need to implement ``get_subpath()``, because the method
   is only called on importers corresponding to ``sys.path`` entries
   and ``__path__`` entries.  If a meta importer wishes to support
   virtual packages, it must do so entirely within its own
   ``find_module()`` implementation.

   Unfortunately, it is unlikely that any such implementation will be
   able to merge its package subpaths with those of other meta
   importers or ``sys.path`` importers, so the meaning of "supporting
   virtual packages" for a meta importer is currently undefined!

   (However, since the intended use case for meta importers is to
   replace Python's normal import process entirely for some subset of
   modules, and the number of such importers currently implemented is
   quite small, this seems unlikely to be a big issue in practice.)
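
For a filesystem-based importer, the subdirectory listing suggested
for ``iter_modules()`` in the list above might be sketched like this.
The function name ``candidate_package_names`` is made up, and
excluding Python keywords is an extra assumption of this example
(the text above only asks for valid identifiers):

```python
import keyword
import os


def candidate_package_names(directory):
    """Return subdirectory names of `directory` that could be package names."""
    names = []
    for entry in sorted(os.listdir(directory)):
        if not os.path.isdir(os.path.join(directory, entry)):
            continue  # plain files are handled by the normal module scan
        # Must be an identifier; keywords such as "class" can't be imported.
        if entry.isidentifier() and not keyword.iskeyword(entry):
            names.append(entry)
    return names
```

As the XXX note above warns, such a listing will also include
directories that merely look like packages.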


References
==========

.. [1] "namespace" vs "module" packages (mailing list thread)
    (http://mail.zope.org/pipermail/zope3-dev/2002-December/004251.html)

.. [2] "Dropping __init__.py requirement for subpackages"
    (http://mail.python.org/pipermail/python-dev/2006-April/064400.html)


Copyright
=========

This document has been placed in the public domain.


..
    Local Variables:
    mode: indented-text
    indent-tabs-mode: nil
    sentence-end-double-space: t
    fill-column: 70
    coding: utf-8
    End:


