[Import-SIG] Proto-PEP: Redesigning extension module loading

Fri Feb 20 15:56:50 CET 2015

Hello list,

I have taken up Nick's challenge of extension module loading.
I've read some of the relevant discussions, and bounced my ideas off Nick
to see if I missed anything important.

The main idea I realized, which was not obvious from the discussion,
was that in addition to playing well with PEP 451 (ModuleSpec) and supporting
subinterpreters and multiple Py_Initialize/Py_Finalize cycles,
Nick's Create/Exec proposal allows executing the module in a "foreign",
externally created module object. The main use case for that would be runpy and
__main__, but lazy-loading mechanisms were mentioned that would benefit as well.

As I was writing this down, I realized that once pre-created modules are
allowed, it makes no sense to insist that they actually are module
instances -- PyModule_Type provides little functionality above a plain object
subclass. I'm not sure there are any use cases for this, but I don't see a
reason to limit things artificially. Any bugs caused by allowing
non-ModuleType modules are unlikely to be subtle, unless the custom object
passes the "asked for it" line.

Comments appreciated.

---

PEP: XXX
Title: Redesigning extension module loading
Version: $Revision$
Last-Modified: $Date$
Author: Petr Viktorin <encukou at gmail.com>,
        Stefan Behnel <stefan_ml at behnel.de>,
        Nick Coghlan <ncoghlan at gmail.com>
BDFL-Delegate: "???"
Discussions-To: "???"
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 11-Aug-2013
Python-Version: 3.5
Post-History: 23-Aug-2013, 20-Feb-2015
Resolution:

Abstract
========

This PEP proposes a redesign of the way in which extension modules interact
with the import machinery. That interaction was last revised for Python 3.0
in PEP 3121, which did not solve all problems at the time. The goal is to
solve the remaining ones by bringing extension modules closer to the way
Python modules behave, specifically by hooking into the ModuleSpec-based
loading mechanism introduced in PEP 451.

Two ways to initialize a module, depending on the desired functionality,
are proposed.

The preferred form allows extension modules to be executed in pre-defined
namespaces, paving the way for extension modules being runnable with Python's
``-m`` switch.

Other modules can use arbitrary custom types for their module implementation,
and are no longer restricted to types.ModuleType.

Both ways make it easy to support properties at the module
level and to safely store arbitrary global state in the module that is
covered by normal garbage collection and supports reloading and
sub-interpreters.
Extension authors are encouraged to take these issues into account
when using the new API.

Motivation
==========

Python modules and extension modules are not set up in the same way.
For Python modules, the module is created and set up first, then the module
code is executed (PEP 302).
A ModuleSpec object (PEP 451) is used to hold information about the module,
and is passed to the relevant hooks.
For extensions, i.e. shared libraries, the module
init function is executed straight away and does both the creation and
initialisation. The initialisation function is not passed ModuleSpec
information about the loaded module, such as the __file__ or fully-qualified
name.
This hinders relative imports and resource loading.

This is specifically a problem for Cython generated modules, for which it's
not uncommon that the module init code has the same level of complexity as
that of any 'regular' Python module. Also, the lack of __file__ and __name__
information hinders the compilation of __init__.py modules, i.e. packages,
especially when relative imports are being used at module init time.

The other disadvantage of the discrepancy is that existing Python programmers
learning C cannot effectively map concepts between the two domains.
As long as extension modules are fundamentally different from pure Python ones
in the way they're initialised, they are harder for people to pick up without
relying on something like cffi, SWIG or Cython to handle the actual extension
module creation.

Currently, extension modules are also not added to sys.modules until they are
fully initialized, which means that a (potentially transitive)
re-import of the module will really try to reimport it and thus run into an
infinite loop when it executes the module init function again.
Without the fully qualified module name, it is not trivial to correctly add
the module to sys.modules either.

Furthermore, the majority of currently existing extension modules have
problems with sub-interpreter support and/or reloading, and, while it is
possible with the current infrastructure to support these
features, it is neither easy nor efficient.
Addressing these issues was the goal of PEP 3121, but many extensions
took the least-effort approach to porting to Python 3, leaving many of these
issues unresolved.
This PEP keeps the backwards-compatible behavior, which should reduce pressure
and give extension authors adequate time to consider these issues when porting.

The current process
===================

Currently, extension modules export an initialisation function named
"PyInit_modulename", named after the file name of the shared library. This
function is executed by the import machinery and must return either NULL in
the case of an exception, or a fully initialised module object. The
function receives no arguments, so it has no way of knowing about its
import context.

During its execution, the module init function creates a module object
based on a PyModuleDef struct. It then continues to initialise it by adding
attributes to the module dict, creating types, etc.

Behind the scenes, the shared library loader keeps a note of the fully
qualified module name of the last module that it loaded, and when a module
gets created that has a matching name, this global variable is used to
determine the fully qualified name of the module object. This is not entirely
safe as it relies on the module init function creating its own module object
first, but this assumption usually holds in practice.

The proposal
============

The current extension module initialisation will be deprecated in favour of
a new initialisation scheme. Since the current scheme will continue to be
available, existing code will continue to work unchanged, including binary
compatibility.

Extension modules that support the new initialisation scheme must export one
or both of the public symbols "PyModuleCreate_modulename" and
"PyModuleExec_modulename", where "modulename" is the
name of the shared library. This mimics the previous naming convention for
the "PyInit_modulename" function.

These symbols, if defined, must resolve to C functions with the following
signatures, respectively::

    PyObject* (*PyModuleCreateFunction)(PyObject* module_spec)
    int (*PyModuleExecFunction)(PyObject* module)
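
How a loader could dispatch to the two hooks can be modelled in pure Python.
The sketch below is only an analogy, not the proposed C implementation: the
``symbols`` dict stands in for the symbols exported by the shared library,
``HookLoader`` is a hypothetical name, and the hooks are ordinary Python
callables that follow the C conventions (Create returns an object, Exec
returns 0 on success).

```python
import importlib.abc
import importlib.util
import types


class HookLoader(importlib.abc.Loader):
    """Hypothetical pure-Python model of a loader that uses the two hooks."""

    def __init__(self, symbols, shortname):
        self.symbols = symbols      # stands in for the library's exported symbols
        self.shortname = shortname  # the "modulename" part of the hook names

    def create_module(self, spec):
        create = self.symbols.get("PyModuleCreate_" + self.shortname)
        # Returning None asks the import machinery for a default module object.
        return None if create is None else create(spec)

    def exec_module(self, module):
        execute = self.symbols.get("PyModuleExec_" + self.shortname)
        if execute is not None and execute(module) != 0:
            raise ImportError("PyModuleExec_%s failed" % self.shortname)


def PyModuleExec_spam(module):
    # Model of an Exec hook: populate the module it is handed.
    module.answer = 42
    return 0


loader = HookLoader({"PyModuleExec_spam": PyModuleExec_spam}, "spam")
spec = importlib.util.spec_from_loader("spam", loader)
module = importlib.util.module_from_spec(spec)  # calls create_module(spec)
loader.exec_module(module)                      # calls PyModuleExec_spam
```

Because no Create hook is present in this sketch, the import machinery
supplies a plain ``types.ModuleType`` instance and the Exec hook fills it in.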

The PyModuleCreate function
---------------------------

The PyModuleCreate function is used to implement "loader.create_module"
defined in PEP 451.

By exporting the "PyModuleCreate_modulename" symbol, an extension module
indicates that it uses a custom module object.

This prevents loading the extension in a pre-created module,
but gives greater flexibility in allowing a custom C-level layout
of the module object.

The "module_spec" argument receives a "ModuleSpec" instance, as defined in
PEP 451.

When called, this function must create and return a module object.

If "PyModuleExec_modulename" is undefined, this function must also initialize
the module; see the PyModuleExec function for details on initialization.

There is no requirement for the returned object to be an instance of
types.ModuleType. Any type can be used. This follows the current
support for allowing arbitrary objects in sys.modules and makes it easier
for extension modules to define a type that exactly matches their needs for
holding module state.
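
As a rough pure-Python analogy of a Create hook that returns a custom module
type, consider the sketch below. ``CounterModule``, ``CreateLoader`` and the
module name ``counter`` are hypothetical; the real hook would be a C function
returning a ``PyObject*``. The sketch also shows a module-level property,
which a plain ``types.ModuleType`` instance cannot provide.

```python
import importlib.abc
import importlib.util


class CounterModule:
    """Hypothetical custom module type; it does not derive from ModuleType."""

    def __init__(self):
        self._calls = 0  # per-module state held on the module object itself

    @property
    def calls(self):
        # A module-level property: evaluated on every attribute access.
        self._calls += 1
        return self._calls


def PyModuleCreate_counter(module_spec):
    # Create-only model: the hook both creates and initializes the module.
    module = CounterModule()
    module.created_for = module_spec.name
    return module


class CreateLoader(importlib.abc.Loader):
    def create_module(self, spec):
        return PyModuleCreate_counter(spec)

    def exec_module(self, module):
        pass  # no PyModuleExec hook in this sketch


spec = importlib.util.spec_from_loader("counter", CreateLoader())
module = importlib.util.module_from_spec(spec)  # also sets __name__, __spec__
first, second = module.calls, module.calls
```

The import machinery sets the ModuleSpec-related attributes on the custom
object just as it would on a regular module, as long as the object accepts
attribute assignment.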

The PyModuleExec function
-------------------------

The PyModuleExec function is used to implement "loader.exec_module"
defined in PEP 451.
It is called after ModuleSpec-related attributes such as ``__loader__``,
``__spec__`` and ``__name__`` are set on the module.
(The full list is in PEP 451 [#pep-0451-attributes]_.)

The "PyModuleExec_modulename" function will be called to initialize a module.
This happens in two situations: when the module is first initialized for
a given (sub-)interpreter, and when the module is reloaded.

The "module" argument receives the module object.
If PyModuleCreate is defined, this will be the object returned by it.
If PyModuleCreate is not defined, PyModuleExec is expected to operate
on any Python object for which attributes can be added by PyObject_SetAttr*
and retrieved by PyObject_GetAttr*.
Specifically, as the module may not be a PyModule_Type subclass,
PyModule_* functions should not be used on it, unless they explicitly support
operating on all objects.
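
The requirement to rely on generic attribute access only can be illustrated
with a pure-Python stand-in for an Exec hook. This is a sketch under the
assumption that ``setattr`` and ``getattr`` play the role of
PyObject_SetAttr* and PyObject_GetAttr*; the names ``greeting`` and
``banner`` are made up for the example.

```python
import types


def PyModuleExec_spam(module):
    # Only generic attribute access is used, so any object that allows
    # setting attributes is acceptable; no ModuleType-specific API is needed.
    setattr(module, "greeting", "hello")
    name = getattr(module, "__name__", "<unnamed>")
    setattr(module, "banner", "%s from %s" % (module.greeting, name))
    return 0  # mirrors the C convention: 0 on success


real_module = types.ModuleType("spam")
foreign_object = types.SimpleNamespace(__name__="not_a_module")

results = [PyModuleExec_spam(real_module), PyModuleExec_spam(foreign_object)]
```

The same hook initializes both a regular module and an unrelated object,
which is the property an Exec-only extension is expected to preserve.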

Helper functions
----------------

For two initialization tasks previously done by PyModule_Create,
two functions are introduced::

    int PyModule_SetDocString(PyObject *m, const char *doc)
    int PyModule_AddFunctions(PyObject *m, PyMethodDef *functions)

These set the module docstring, and add the module functions, respectively.
Both will work on any Python object that supports setting attributes.
They return zero on success, and on failure, they set the exception
and return -1.
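
In pure-Python terms, the two helpers amount to plain attribute assignment.
The sketch below is a model only: ``set_doc_string`` and ``add_functions``
are hypothetical stand-ins for the C functions, a list of ``(name,
callable)`` pairs stands in for the ``PyMethodDef`` array, and failures
surface as ordinary exceptions rather than as a -1 return value.

```python
import types


def set_doc_string(module, doc):
    # Model of PyModule_SetDocString: plain attribute assignment.
    module.__doc__ = doc


def add_functions(module, functions):
    # Model of PyModule_AddFunctions: `functions` stands in for a
    # PyMethodDef array, here a list of (name, callable) pairs.
    for name, function in functions:
        setattr(module, name, function)


def double(x):
    return 2 * x


target = types.SimpleNamespace()  # deliberately not a ModuleType instance
set_doc_string(target, "Example module")
add_functions(target, [("double", double)])
```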

Other changes
-------------

The following functions and macros will be modified to work on any object
that supports attribute access:

    * PyModule_GetNameObject
    * PyModule_GetName
    * PyModule_GetFilenameObject
    * PyModule_GetFilename
    * PyModule_AddIntConstant
    * PyModule_AddStringConstant
    * PyModule_AddIntMacro
    * PyModule_AddStringMacro
    * PyModule_AddObject

Usage
=====

This PEP allows three new ways of creating modules, each with its
advantages and disadvantages.

Exec-only
---------

The preferred way to create C extensions is to define "PyModuleExec_modulename"
only. This brings the following advantages:

* The extension can be loaded into a pre-created module, making it possible
  to run it as ``__main__``, participate in certain lazy-loading schemes
  [#lazy_import_concerns]_, or enable other creative uses.
* The module can be reloaded in the same way as Python modules.

As Exec-only extension modules do not have C-level storage,
all module-local data must be stored in the module object's attributes,
possibly using the PyCapsule mechanism.
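
The following pure-Python sketch models an Exec-only extension that keeps all
of its state in module attributes. It is an analogy rather than C code: the
hook name ``PyModuleExec_tally`` and the attributes ``items`` and ``add`` are
invented for the example, and a C extension would typically wrap non-Python
data in a PyCapsule instead of storing a list directly.

```python
import types


def PyModuleExec_tally(module):
    # All state lives in the module's attributes, never in globals of the
    # extension itself, so every module object gets an independent copy.
    module.items = []

    def add(item):
        module.items.append(item)
        return len(module.items)

    module.add = add
    return 0


# A pre-created module, similar to what runpy prepares for ``-m``.
main_like = types.ModuleType("__main__")
other = types.ModuleType("tally")
PyModuleExec_tally(main_like)
PyModuleExec_tally(other)

main_like.add("a")
main_like.add("b")
other.add("x")
```

Each target object ends up with its own ``items`` list, so executing the
extension in ``__main__`` does not interfere with a regular import of it.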

XXX: Provide an example?

Create-only
-----------

Extensions defining only the "PyModuleCreate_modulename" hook behave similarly
to current extensions.

This is the easiest way to create modules that require custom module objects,
or substantial per-module state at the C level (using positive
``PyModuleDef.m_size``).

When the PyModuleCreate function is called, the module has not yet been added
to sys.modules.
Attempts to load the module again (possibly transitively) will result in an
infinite loop.
If user code needs to be called in module initialization,
module authors are advised to do so from the PyModuleExec function.

Reloading a Create-only module does nothing, except re-setting
ModuleSpec-related attributes described in PEP 0451 [#pep-0451-attributes]_.

XXX: Provide an example? (It would be similar to the one in PEP 3121)

Exec and Create
---------------

Extensions that need to create a custom module object,
and either need to run user code during initialization or support reloading,
should define both "PyModuleCreate_modulename" and "PyModuleExec_modulename".

XXX: Provide an example?

Legacy Init
-----------

If neither PyModuleExec nor PyModuleCreate is defined, the module is
initialized using the "PyInit_modulename" hook, as described in PEP 3121.

If PyModuleExec or PyModuleCreate is defined, "PyInit_modulename" will be
ignored.
Modules requiring compatibility with previous versions of CPython may implement
"PyInit_modulename" in addition to the new hooks.
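
The precedence between the new hooks and the legacy one can be written down
as a small selection function. This is a hypothetical pure-Python model:
``select_hooks`` is not part of any API, ``symbols`` stands in for the set of
names exported by the shared library, and the legacy hook is looked up under
the "PyInit_modulename" name described in "The current process".

```python
def select_hooks(symbols, shortname):
    """Return the hook names a loader would use, following the rule above."""
    new_style = [
        name
        for name in ("PyModuleCreate_" + shortname, "PyModuleExec_" + shortname)
        if name in symbols
    ]
    if new_style:
        return new_style  # the legacy init hook is ignored
    legacy = "PyInit_" + shortname
    return [legacy] if legacy in symbols else []


# An extension that ships both styles for backwards compatibility.
dual = select_hooks({"PyInit_spam", "PyModuleExec_spam"}, "spam")
# An extension that has not been ported yet.
old = select_hooks({"PyInit_spam"}, "spam")
```

On an interpreter implementing this PEP, the dual-style extension is loaded
through its Exec hook; older interpreters would see only the legacy hook.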

Subinterpreters and Interpreter Reloading
-----------------------------------------

Extensions using the new initialization scheme are expected to support
subinterpreters and multiple Py_Initialize/Py_Finalize cycles correctly.
The mechanism is designed to make this easy, but care is still required
on the part of the extension author.
No user-defined functions, methods, or instances may leak to different
interpreters.
To achieve this, all module-level state should be kept in either the module
dict, or in the module object.
A simple rule of thumb is: Do not define any static data, except built-in types
with no mutable or user-settable class attributes.
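
The rule of thumb can be illustrated with a pure-Python analogy in which a
Python global plays the role of C static data and two module objects stand in
for the same extension loaded by two (sub-)interpreters. The hook names and
the ``cache`` attribute are hypothetical.

```python
import types

_static_cache = []  # anti-pattern: shared by every module object


def PyModuleExec_leaky(module):
    module.cache = _static_cache  # the same list leaks into each interpreter
    return 0


def PyModuleExec_isolated(module):
    module.cache = []  # fresh state stored on the module object
    return 0


# Two module objects stand in for the same extension loaded by two
# (sub-)interpreters.
leaky_a, leaky_b = types.ModuleType("m"), types.ModuleType("m")
PyModuleExec_leaky(leaky_a)
PyModuleExec_leaky(leaky_b)
leaky_a.cache.append("secret")

iso_a, iso_b = types.ModuleType("m"), types.ModuleType("m")
PyModuleExec_isolated(iso_a)
PyModuleExec_isolated(iso_b)
iso_a.cache.append("secret")
```

With the static list, data written through one module object is visible
through the other; with per-module state it is not.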

Module Reloading
----------------

Extensions that support reloading must define PyModuleExec, which is called
in reload() to re-initialize the module in place.
The same caveats apply to reloading an extension module as to reloading
a Python module.
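
The in-place behaviour matches what ``importlib.reload`` already does for any
loader that defines ``exec_module``. The sketch below demonstrates this with
a pure-Python loader standing in for an extension's PyModuleExec hook;
``ExecLoader``, ``DemoFinder`` and the module name ``pep_reload_demo`` are
hypothetical and exist only for the demonstration.

```python
import importlib
import importlib.abc
import importlib.util
import sys

DEMO_NAME = "pep_reload_demo"  # hypothetical module name used only here


class ExecLoader(importlib.abc.Loader):
    def create_module(self, spec):
        return None  # use a default module object

    def exec_module(self, module):
        # Stands in for PyModuleExec: runs on first import and on reload.
        module.exec_count = getattr(module, "exec_count", 0) + 1


class DemoFinder(importlib.abc.MetaPathFinder):
    def find_spec(self, fullname, path, target=None):
        if fullname == DEMO_NAME:
            return importlib.util.spec_from_loader(fullname, ExecLoader())
        return None


finder = DemoFinder()
sys.meta_path.insert(0, finder)
try:
    module = importlib.import_module(DEMO_NAME)
    count_after_import = module.exec_count
    reloaded = importlib.reload(module)
    count_after_reload = reloaded.exec_count
finally:
    # Undo the global changes made for the demonstration.
    sys.meta_path.remove(finder)
    sys.modules.pop(DEMO_NAME, None)
```

The reload re-runs the exec step on the same module object, so attributes
that the exec step does not overwrite survive, exactly as for Python modules.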

Note that due to limitations in shared library loading (both dlopen on POSIX
and LoadLibraryEx on Windows), it is not generally possible to load a modified
library after it has changed on disk.
Therefore, reloading extension modules is of limited use.

Multiple modules in one library
-------------------------------

To support multiple Python modules in one shared library, the library
must export all appropriate PyModuleExec_<name> or PyModuleCreate_<name> hooks
for each exported module.
The modules are loaded using a ModuleSpec with origin set to the name of the
library file, and name set to the module name.
Note that this mechanism can only be used to *load* such modules,
not to *find* them.
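
A pure-Python model of loading two modules from one library might look as
follows. Everything here is a stand-in: ``LIBRARY_SYMBOLS`` plays the role of
the symbols exported by the shared library, ``LIBRARY_PATH`` is a made-up
file name, and ``LibraryLoader`` and ``load_from_library`` are hypothetical
helpers, not part of the proposal.

```python
import importlib.abc
import importlib.machinery
import importlib.util


def _exec_alpha(module):
    module.value = "alpha"
    return 0


def _exec_beta(module):
    module.value = "beta"
    return 0


# Stands in for one shared library exporting hooks for two modules.
LIBRARY_PATH = "/hypothetical/path/multi.so"
LIBRARY_SYMBOLS = {
    "PyModuleExec_alpha": _exec_alpha,
    "PyModuleExec_beta": _exec_beta,
}


class LibraryLoader(importlib.abc.Loader):
    def create_module(self, spec):
        return None  # no Create hooks in this sketch

    def exec_module(self, module):
        # The hook is selected by the module name recorded in the spec.
        hook = LIBRARY_SYMBOLS["PyModuleExec_" + module.__spec__.name]
        if hook(module) != 0:
            raise ImportError("module initialization failed")


def load_from_library(name):
    # The spec names the module; its origin names the library file.
    spec = importlib.machinery.ModuleSpec(
        name, LibraryLoader(), origin=LIBRARY_PATH
    )
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


alpha = load_from_library("alpha")
beta = load_from_library("beta")
```

The caller has to know both the library file and the module name in advance,
which is why this mechanism covers loading but not finding.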

XXX: Provide an example of how to load such modules

Implementation
==============

XXX - not started

Open issues
===========

Now that PEP 442 is implemented, it would be nice if module finalization
did not set all attributes to None.

In this scheme, it is not possible to create a module with C-level state,
which would be able to exec itself in any externally provided module object,
short of putting PyCapsules in the module dict.

The proposal repurposes PyModule_SetDocString, PyModule_AddObject,
PyModule_AddIntMacro et al. to work on any object.
Would it be better to have these in the PyObject namespace?

We should expose some kind of API in importlib.util (or a better place?) that
can be used to check that a module works with reloading and subinterpreters.

The runpy module will need to be modified to take advantage of PEP 451
and this PEP. This might be out of scope for this PEP.

Previous Approaches
===================

Stefan Behnel's initial proto-PEP [#stefans_protopep]_
had a "PyInit_modulename" hook that would create a module class,
whose ``__init__`` would be then called to create the module.
This proposal did not correspond to the (then nonexistent) PEP 451,
where module creation and initialization is broken into distinct steps.
It also did not support loading an extension into pre-existing module objects.

Nick Coghlan proposed the Create and Exec hooks, and wrote a prototype
implementation [#nicks-prototype]_.
At that time PEP 451 was still not implemented, so the prototype
does not use ModuleSpec.

References
==========

.. [#lazy_import_concerns]
   https://mail.python.org/pipermail/python-dev/2013-August/128129.html

.. [#pep-0451-attributes]
   https://www.python.org/dev/peps/pep-0451/#attributes

.. [#stefans_protopep]
   https://mail.python.org/pipermail/python-dev/2013-August/128087.html

.. [#nicks-prototype]
   https://mail.python.org/pipermail/python-dev/2013-August/128101.html

Copyright
=========

This document has been placed in the public domain.