[Python-Dev] Pre-PEP: Redesigning extension modules

Sun Aug 25 13:54:30 CEST 2013

Nick Coghlan, 24.08.2013 23:43:
> On 25 Aug 2013 01:44, "Stefan Behnel" wrote:
>> Nick Coghlan, 24.08.2013 16:22:
>>> The new _PyImport_CreateAndExecExtensionModule function does the heavy
>>> lifting:
>>>
>>> https://bitbucket.org/ncoghlan/cpython_sandbox/src/081f8f7e3ee27dc309463b48e6c67cf4880fca12/Python/importdl.c?at=new_extension_imports#cl-65
>>>
>>> One key point to note is that it *doesn't* call
>>> _PyImport_FixupExtensionObject, which is the API that handles all the
>>> PEP 3121 per-module state stuff. Instead, the idea will be for modules
>>> that don't need additional C level state to just implement
>>> PyImportExec_NAME, while those that *do* need C level state implement
>>> PyImportCreate_NAME and return a custom object (which may or may not
>>> be a module subtype).
>>
>> Is it really a common case for an extension module not to need any C level
>> state at all? I mean, this might work for very simple accelerator modules
>> with only a few stand-alone functions. But anything non-trivial will almost
>> certainly have some kind of global state, cache, external library, etc.,
>> and that state is best stored at the C level for safety reasons.
> 
> I'd prefer to encourage people to put that state on an exported *type*
> rather than directly in the module global state. So while I agree we need
> to *support* C level module globals, I'd prefer to provide a simpler
> alternative that avoids them.

But that has an impact on the API then. Why do you want the users of an
extension module to go through a separate object (even if it's just a
singleton, for example) instead of going through functions at the module
level? We don't currently encourage or propose this design for Python
modules either. Quite the contrary, it's extremely common for Python
modules to provide most of their functionality at the function level. And
IMHO that's a good thing.

Note that even global functions usually hold state, be it in the form of
globally imported modules, global caches, constants, ...

> We also need the create/exec split to properly support reloading. Reload
> *must* reinitialize the object already in sys.modules instead of inserting
> a different object or it completely misses the point of reloading modules
> over deleting and reimporting them (i.e. implicitly affecting the
> references from other modules that imported the original object).

Interesting. I never thought of it that way.

I'm not sure this can be done in general. What if the module has threads
running that access the global state? In that case, reinitialising the
module object itself would almost certainly lead to a crash.

And what if you do "from extmodule import some_function" in a Python
module? Then reloading couldn't replace that reference, just as for normal
Python modules. Meaning that you'd still have to keep both modules properly
alive in order to prevent crashes due to lost global state of the imported
function.
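
To illustrate the from-import case with a plain Python module (a minimal
sketch; the module name demo_extmodule and the temporary file are made up
for this example):

```python
import importlib
import os
import sys
import tempfile

# Hypothetical throwaway module, written to a temp directory for the demo.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "demo_extmodule.py")


def write_version(n):
    # Rewrite the module source so that some_function() returns n.
    with open(path, "w") as f:
        f.write(f"def some_function():\n    return {n}\n")


sys.dont_write_bytecode = True   # no .pyc, so reload always recompiles
write_version(1)
sys.path.insert(0, tmpdir)
importlib.invalidate_caches()

import demo_extmodule
from demo_extmodule import some_function   # direct reference to the function

module_before = demo_extmodule
write_version(2)
importlib.reload(demo_extmodule)

# Reload re-executed the code in the *same* module object ...
same_object = sys.modules["demo_extmodule"] is module_before
# ... so attribute access through the module sees the new function,
via_module = demo_extmodule.some_function()
# while the from-imported name still points at the old function object.
via_from_import = some_function()
```

After the reload, same_object is true and via_module is 2, but
via_from_import is still 1: the old function (and whatever state it relies
on) has to stay alive.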

The difference from Python modules here is that in Python code, you'll get
some kind of exception if state is lost during a reload. In C code, you'll
most likely get a crash.

How would you even make sure global state is properly cleaned up? Would you
call tp_clear() on the module object before re-running the init code? Or
how else would you enable the init code to do the right thing during both
the first run (where global state is uninitialised) and subsequent runs
(where global state may hold valid state and owned Python references)?

Even tp_clear() may not be enough, because it's only meant to clean up
Python references, not C-level state. Basically, for reloading to be
correct without changing the object reference, it would have to go all the
way through tp_dealloc(), catch the object at the very end, right before it
gets freed, and then re-initialise it.
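
As a rough Python analogy of init code that copes with both the first run
and a re-run on the same object (a sketch only; StatefulModule, Resource,
_teardown and _exec are hypothetical names, not part of any proposed API):

```python
import types


class Resource:
    """Stand-in for C-level state, e.g. a handle from an external library."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class StatefulModule(types.ModuleType):
    """Hypothetical module type whose exec step tolerates being re-run."""

    def _teardown(self):
        # Release whatever a previous run left behind, if anything.
        old = self.__dict__.pop("_resource", None)
        if old is not None:
            old.close()

    def _exec(self):
        # Safe on the first run (nothing to release) and on later runs.
        self._teardown()
        self._resource = Resource()


mod = StatefulModule("extmodule")
mod._exec()                # first initialisation
first = mod._resource
mod._exec()                # simulated reload on the same module object
second = mod._resource
```

The old resource is closed before the new one is created, and the module
object itself never changes. Note that this does nothing for threads that
still hold on to the old resource, which is exactly the crash case above.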

This sounds like we need some kind of indirection (as you mentioned above),
but without the API impact that a separate type implies. Simply making
modules an arbitrary extension type, as I proposed, cannot solve this.

(Actually, my intuition tells me that if it can't really be made to work
100% for Python modules, e.g. due to the from-import case, why bother with
it for extension types?)

>>> Such modules can still support reloading (e.g.
>>> to pick up reloaded or removed module dependencies) by providing
>>> PyImportExec_NAME as well.
>>>
>>> (in a PEP 451 world, this would likely be split up as two separate
>>> functions, one for create, one for exec)
>>
>> Can't we just always require extension modules to implement their own type?
>> Sure, it's a lot of boilerplate code, but that could be handled by a
>> simple C code generator or maybe even a copy&paste example in the docs. I
>> would like to avoid making it too easy for users in the future to get
>> anything wrong with reloading or sub-interpreters. Most people won't test
>> these things for their own code and the harder it is to make them not work,
>> the more likely it is that a given set of dependencies will properly work
>> in a sub-interpreter.
>>
>> If users are required to implement their own type, I think it would be more
>> obvious where to put global module state, how to define functions (i.e.
>> module methods), how to handle garbage collection at the global module
>> level, etc.
> 
> Take a look at the current example - everything gets stored in the module
> dict for the simple case with no C level global state.

Well, you're storing types there. And those types are your module API. I
understand that it's just an example, but I don't think it matches a common
case. As far as I can see, the types are not even interacting with each
other, let alone doing any C-level access of each other. We should try to
focus on the normal case that needs C-level state and C-level field access
of extension types. Once that's solved, we can still think about how to
make the really simple cases simpler, if it turns out that they are not
simple enough.

Keeping everything in the module dict is a design that (IMHO) is too error
prone. C state should be kept safely at the C level, outside of the reach
of Python code. I don't want users of my extension module to be able to
provoke a crash by saying "extmodule._xyz = None".
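
A small Python sketch of the problem (the modules here are built by hand
with types.ModuleType and are purely illustrative; the closure stands in
for state kept at the C level):

```python
import types

# Variant 1: internal state lives in the module dict, next to the API.
extmodule = types.ModuleType("extmodule")
exec(
    "_xyz = {}\n"
    "def lookup(key):\n"
    "    return _xyz.get(key, 'missing')\n",
    extmodule.__dict__,
)

before = extmodule.lookup("a")
extmodule._xyz = None          # user code clobbers the internal state
try:
    extmodule.lookup("a")
    outcome = "ok"
except AttributeError:
    # Python code gets an exception here; C code reading the same slot
    # without checking would dereference garbage instead.
    outcome = "AttributeError"


# Variant 2: state is hidden in a closure, out of reach of attribute access.
def _make_lookup():
    xyz = {}

    def lookup(key):
        return xyz.get(key, "missing")

    return lookup


safe_extmodule = types.ModuleType("safe_extmodule")
safe_extmodule.lookup = _make_lookup()
safe_extmodule._xyz = None     # only adds an unrelated attribute
after = safe_extmodule.lookup("a")
```

In the first variant a single assignment from user code breaks lookup(); in
the second, the same assignment has no effect on it.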

I didn't know about PyType_FromSpec(), BTW. It looks like a nice addition
for manually written code (although useless for Cython).

Stefan