[Python-Dev] Strange artifacts with PEP 3121 and monkey-patching sys.modules (in csv, ElementTree and others)

Sun Aug 11 14:01:43 CEST 2013

Antoine Pitrou, 11.08.2013 12:33:
> On Sat, 10 Aug 2013 17:12:53 -0700 Eli Bendersky wrote:
>> Note how doing some sys.modules acrobatics and re-importing suddenly
>> changes the internal state of a previously imported module. This happens
>> because:
>>
>> 1. The first import of 'csv' (which then imports `_csv) creates
>> module-specific state on the heap and associates it with the current
>> sub-interpreter. The list of dialects, amongst other things, is in that
>> state.
>> 2. The 'del's wipe 'csv' and '_csv' from the cache.
>> 3. The second import of 'csv' also creates/initializes a new '_csv' module
>> because it's not in sys.modules. This *replaces* the per-sub-interpreter
>> cached version of the module's state with the clean state of a new module
> 
> I would say this is pretty much expected. The converse would be a bug
> IMO (but perhaps Martin disagrees). PEP 3121's stated goal is not only
> subinterpreter support:
> 
>   "Extension module initialization currently has a few deficiencies.
>   There is no cleanup for modules, the entry point name might give
>   naming conflicts, the entry functions don't follow the usual calling
>   convention, and multiple interpreters are not supported well."
> 
> Re-initializing state when importing a module anew makes extension
> modules more like pure Python modules, which is a good thing.

It's the same as defining a type or function in a loop, or inside of a
closure. The whole point of reimporting is that you get a new module.

However, it should not change the content of the old module, just create a
new one.

> So, the PEP 3121 "module state" pointer (the optional opaque void*
> thing) should only be used to hold non-PyObjects.  PyObjects should go
> to the module dict, like they do in normal Python modules.  Now, the
> reason our PEP 3121 extension modules abuse the module state pointer to
> keep PyObjects is two-fold:
> 
> 1. it's surprisingly easier (it's actually a one-liner if you don't
> handle errors - a rather bad thing, but all PEP 3121 extension modules
> currently don't handle a NULL return from PyState_FindModule...)
> 
> 2. it protects the module from any module dict monkeypatching. It's not
> important if you are using a generic API on the PyObject, but it is if
> the PyObject is really a custom C type with well-defined fields.

Yes, it's a major safety problem if you can crash the interpreter by
assigning None to a module attribute.

> Those two issues can be addressed if we offer an API for it. How about:
> 
>   PyObject *PyState_GetModuleAttr(struct PyModuleDef *def,
>                                   const char *name,
>                                   PyObject *restrict_type)
> 
>   *def* is a pointer to the module definition.
>   *name* is the attribute to look up on the module dict.
>   *restrict_type*, if non-NULL, is a type object the looked up attribute
>   must be an instance of.
> 
>   Lookup an attribute in the current interpreter's extension module
>   instance for the module definition *def*.
>   Returns a *new* reference (!), or NULL if an error occurred.
>   An error can be:
>   - no such module exists for the current interpreter (ImportError?
>       RuntimeError? SystemError?)
>   - no such attribute exists in the module dict (AttributeError)
>   - the attribute doesn't conform to *restrict_type* (TypeError)
> 
> So code can be written like:
> 
>   PyObject *dialects = PyState_GetModuleAttr(
>       &_csvmodule, "dialects", &PyDict_Type);
>   if (dialects == NULL)
>       return NULL;

At least for Cython it's unlikely that it'll ever use this. It's just way
too much overhead for looking up a global name. Plus, not all global names
are visible in the module dict, e.g. it's common to have types that are
only used internally to keep some kind of state. Those would still have to
live in the internal per-module state.

ISTM that this is not a proper solution for the problem, because it only
covers the simple use cases. Rather, I'd prefer making the handling of
names in the per-module instance state safer.

Essentially, with PEP 3121, modules are just one form of an extension type.
So what's wrong with giving them normal extension type fields? Functions
are essentially methods of the module, global types are just inner classes.
Both should keep the module alive (on the one side) and be tied to it (on
the other side). If you reimport a module, you'd get a new set of
everything, and the old module would just linger in the background until
the last reference to it dies.

In other words, I don't see why modules should be any special.

Stefan