[Import-SIG] Proto-PEP: Redesigning extension module loading

Mon Feb 23 14:18:25 CET 2015

On Sat, Feb 21, 2015 at 1:19 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:
> On 21 February 2015 at 00:56, Petr Viktorin <encukou at gmail.com> wrote:
>> Hello list,
>>
>> I have taken Nick's challenge of extension module loading.
<snip>
>>
>> Comments appreciated.
>
> This generally looks good to me. Some more specific feedback inline below.

Thanks! I'll reply to the points where we don't 100% agree.

>> PEP: XXX
>> Title: Redesigning extension module loading
>
> For the BDFL-Delegate question: Brett would you be happy tackling this one?
>
<snip>
>
>> The proposal
>> ============
>>
>> The current extension module initialisation will be deprecated in favour of
>> a new initialisation scheme. Since the current scheme will continue to be
>> available, existing code will continue to work unchanged, including binary
>> compatibility.
>>
>> Extension modules that support the new initialisation scheme must export one
>> or both of the public symbols "PyModuleCreate_modulename" and
>> "PyModuleExec_modulename", where "modulename" is the
>> name of the shared library. This mimics the previous naming convention for
>> the "PyInit_modulename" function.
>>
>> These symbols, if defined, must resolve to C functions with the following
>> signatures, respectively::
>>
>>     PyObject* (*PyModuleCreateFunction)(PyObject* module_spec)
>>     int (*PyModuleExecFunction)(PyObject* module)
>
> For the Python level, the model we ended up with for 3.5 is:
>
> 1. create_module must exist, but may return None
> 2. exec_module must exist, but may have no effect on the module state

It would make sense that PyModuleCreate may return None, just to
better mirror PEP 451.
I'll also point out that exec_module can put another object in
sys.modules, to replace the module being loaded.
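As a Python-level illustration of both points (create_module returning None, and exec_module swapping the sys.modules entry), here is a minimal sketch. The names ReplacingLoader, DemoFinder and "demo_replaced" are made up for this example, and it relies on the import system returning whatever object is in sys.modules once exec_module has run.

```python
import importlib
import importlib.abc
import importlib.machinery
import sys
import types


class ReplacingLoader(importlib.abc.Loader):
    """Hypothetical loader, for illustration only."""

    def create_module(self, spec):
        # Returning None asks the import machinery to create a default module.
        return None

    def exec_module(self, module):
        # Put a different object in sys.modules in place of the module
        # that is being loaded.
        replacement = types.SimpleNamespace(
            answer=42,
            original_type=type(module).__name__,
        )
        sys.modules[module.__name__] = replacement


class DemoFinder(importlib.abc.MetaPathFinder):
    """Hypothetical finder that only knows about 'demo_replaced'."""

    def find_spec(self, fullname, path, target=None):
        if fullname == "demo_replaced":
            return importlib.machinery.ModuleSpec(fullname, ReplacingLoader())
        return None


sys.meta_path.insert(0, DemoFinder())
mod = importlib.import_module("demo_replaced")
```

Here the import returns the replacement object rather than the module that was passed to exec_module.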

<snip>
>
>> The PyModuleExec function
>> -------------------------
>>
>> This PyModuleExec function is used to implement "loader.exec_module"
>> defined in PEP 451.
>> It is called after ModuleSpec-related attributes such as ``__loader__``,
>> ``__spec__`` and ``__name__`` are set on the module.
>> (The full list is in PEP 451 [#pep-0451-attributes]_)
>>
>> The "PyModuleExec_modulename" function will be called to initialize a module.
>> This happens in two situations: when the module is first initialized for
>> a given (sub-)interpreter, and when the module is reloaded.
>>
>> The "module" argument receives the module object.
>> If PyModuleCreate is defined, this will be the object returned by it.
>> If PyModuleCreate is not defined, PyModuleExec is expected to operate
>> on any Python object for which attributes can be added by PyObject_SetAttr*
>> and retrieved by PyObject_GetAttr*.
>> Specifically, as the module may not be a PyModule_Type subclass,
>> PyModule_* functions should not be used on it, unless they explicitly support
>> operating on all objects.
>
> I think this is too permissive on the interpreter side of things, thus
> making things more complicated than we'd like them to be for extension
> module authors.

What complications are you thinking about? I was worried about this
too, but I don't see the complications. I don't think there is enough
difference between PyModule_Type and any object with getattr/setattr,
either on the C or Python level. After initialization, the differences
are:
- Modules have a __dict__. But, as the docs say, "It is recommended
extensions use other PyModule_*() and PyObject_*() functions rather
than directly manipulate a module’s __dict__." This would become a
requirement.
- The finalization is special. There have been efforts to remove this
difference. Any problems here are for the custom-module-object
provider (e.g. the lazy-load library) to sort out, the extension
author shouldn't have to do anything extra.
- There's a PyModuleDef usable for registration.
- There's a custom __repr__.
Currently there are a number of convenience functions/macros that only
work on modules but do little more than get/setattr. They can easily be
made to work on any object.
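To make the "any object with getattr/setattr" idea concrete, here is a rough Python-level analogue. exec_module_body is a hypothetical stand-in for a PyModuleExec-style function; it only uses setattr, so it does not care whether it is given a real module or some other object.

```python
import types


def exec_module_body(target):
    """Hypothetical stand-in for a PyModuleExec-style function.

    It populates the target using nothing but setattr, so any object
    that accepts new attributes will do.
    """
    setattr(target, "VERSION", "1.0")
    setattr(target, "add", lambda a, b: a + b)
    return 0  # mirrors the int status of the proposed C signature


class PlainObject:
    """An arbitrary non-module object that supports attribute assignment."""


real_module = types.ModuleType("demo")
plain = PlainObject()

# The same exec function initializes both kinds of object.
statuses = [exec_module_body(target) for target in (real_module, plain)]
```

Nothing in the function body depends on the module type itself, which is the property the proposal asks extension authors to preserve.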

> If PyModuleCreate_* is defined, PyModuleExec_* will receive the object
> returned there, while if it isn't defined, the interpreter *will*
> provide a PyModule_Type instance, as per PEP 451.
>
> However, permitting module authors to make the PyModule_Type (or a
> subclass) assumption in their implementation does introduce a subtle
> requirement on the implementation of both the load_module method, and
> on custom PyModuleExec_* functions that are paired with a
> PyModuleCreate_* function.
>
> Firstly, we need to enforce the following constraint in load_module:
> if the underlying C module does *not* define a custom PyModuleCreate_*
> function, and we're passed a module execution environment which is
> *not* an instance of PyModule_Type, then we should throw TypeError.
>
> By contrast, in the presence of a custom PyModuleCreate_* function,
> the requirement for checking the type of the execution environment
> (and throwing TypeError if the module can't handle it) should be
> delegated to the PyModuleExec_* function, and that will need to be
> documented appropriately.
>
> That keeps things simple in the default case (extension module authors
> just using PyModuleExec_* can continue to assume the use of
> PyModule_Type or a subclass), while allowing more flexibility in the
> "power user" case of creating your own module object.

I see a different kind of simplicity in my proposal: Modules are just
objects with a custom __repr__.
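For comparison, the check Nick describes above could be sketched at the Python level roughly like this. run_exec and its parameters are hypothetical names for this example, not part of the PEP or of importlib.

```python
import types


def run_exec(environment, exec_func, has_custom_create):
    """Sketch of the proposed constraint (hypothetical helper).

    Without a custom create function, the exec function is allowed to
    assume a real module object, so anything else is rejected up front.
    With a custom create function, type checking is left to exec_func.
    """
    if not has_custom_create and not isinstance(environment, types.ModuleType):
        raise TypeError(
            "expected a module object, got %s" % type(environment).__name__
        )
    return exec_func(environment)


def simple_exec(module):
    # Example exec function that only sets an attribute.
    module.ready = True
    return 0


namespace = types.SimpleNamespace()
try:
    run_exec(namespace, simple_exec, has_custom_create=False)
    rejected = False
except TypeError:
    rejected = True

accepted_status = run_exec(namespace, simple_exec, has_custom_create=True)
```

Under my proposal the isinstance check would simply not exist, and simple_exec would be expected to work in both calls.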

<snip>
>
>> Subinterpreters and Interpreter Reloading
>> -----------------------------------------
>>
>> Extensions using the new initialization scheme are expected to support
>> subinterpreters and multiple Py_Initialize/Py_Finalize cycles correctly.
>> The mechanism is designed to make this easy, but care is still required
>> on the part of the extension author.
>> No user-defined functions, methods, or instances may leak to different
>> interpreters.
>> To achieve this, all module-level state should be kept in either the module
>> dict, or in the module object.
>> A simple rule of thumb is: Do not define any static data, except built-in types
>> with no mutable or user-settable class attributes.
>
> Worth noting here that this is why we consider it desirable to provide
> a utility somewhere in the standard library to make it easy to do
> these kinds of checks.
>
> At the very least we need it in the test.support module to do our own
> tests, but it would be preferable to have it as a supported API
> somewhere in the standard library.
>
> This isn't the only area where this kind of question of making it
> easier for people to test whether or not they're implementing or
> emulating a protocol correctly has come up - it's applicable to
> testing things like total ordering support in custom objects, operand
> precedence handling, ABC compliance, code generation, exception
> traceback manipulation, etc.
>
> Perhaps we should propose a new unittest submodule for compatibility
> and compliance tests that are too esoteric for the module top level,
> but we also don't want to ask people to write for themselves?

The unittest submodule is out of scope here, but something I'd like to
get involved in later.
For now I'm going to put tests in test.support.

<snip>
>
>> Open issues
>> ===========
>>
>> Now that PEP 442 is implemented, it would be nice if module finalization
>> did not set all attributes to None,
>
> Antoine added that in 3.4: http://bugs.python.org/issue18214
>
> However, it wasn't entirely effective, as several extension modules
> still need to be hit with a sledgehammer to get them to drop
> references properly. Asking "Why is that so?" is actually one of the
> things that got me started digging into this area a couple of years
> back.

Ah. I had a note about this from reading the discussion, but now I see
this is out of scope (aside from checking things don't break on this
front). Issue closed :)

>> In this scheme, it is not possible to create a module with C-level state,
>> which would be able to exec itself in any externally provided module object,
>> short of putting PyCapsules in the module dict.
>
> I suspect "PyCapsule in the module dict" may be the right answer here,
> in which case some suitable documentation and perhaps some convenience
> APIs could be a good way to go.

Right, I'll put them in the next version.

> Relying on PyCapsule also has the advantage of potentially supporting
> better collaboration between extension modules, without needing to
> link them with each other directly.

Well, I'd argue that in most cases where you want two extensions to
collaborate, a public Python API would be useful enough to justify its
costs. Maintaining an ABI for capsule contents, on the other hand,
might not be worth it. I'd rather promote cffi/Cython wrappers than
this.
But of course, it is possible if people want to go for it.

<snip>
>> The runpy module will need to be modified to take advantage of PEP 451
>> and this PEP. This might be out of scope for this PEP.
>
> I think it's out of scope, but runpy *does* need an internal redesign
> to take full advantage of PEP 451. Currently it works by attempting to
> extract the code object directly in most situations, whereas PEP 451
> should let it rely almost entirely on exec_module instead (with direct
> execution used only when it's actually given a path directly to a
> Python source or bytecode file).

Yes. If it was simple I'd include it, but this effort wants its own
issue, if not a PEP.
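Just to show the general direction, not how runpy works today: executing a module through its spec and loader, without extracting a code object, could look roughly like this. exec_fresh is a hypothetical helper, and "colorsys" is only used because it is a small pure-Python stdlib module.

```python
import importlib.util
import sys


def exec_fresh(name):
    """Execute the named module in a fresh module object via its loader.

    Hypothetical helper: the result is not stored in sys.modules, and no
    code object is extracted by hand.
    """
    spec = importlib.util.find_spec(name)
    if spec is None or spec.loader is None:
        raise ImportError("cannot find a loadable spec for %r" % name)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


fresh = exec_fresh("colorsys")
hsv = fresh.rgb_to_hsv(1.0, 0.0, 0.0)
```

Handling of __main__, sys.argv and package-relative execution is exactly the part that needs the redesign, and is left out here.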