[issue42969] pthread_exit & PyThread_exit_thread from PyEval_RestoreThread etc. are harmful

Thu Sep 23 14:29:24 EDT 2021

Jeremy Maitin-Shepard <jeremy at jeremyms.com> added the comment:

In general, I view hanging threads as the least bad thing to do when faced with either acquiring the GIL or not returning at all.  There is a lot of existing usage of Python that currently poses a risk of random crashes and memory corruption while Python is exiting, and I would like to fix that.

However, I would certainly recommend that code using the Python C API attempt to avoid threads getting to that state in the first place.  I added a "finalize block" mechanism to that PR which is intended to provide a way to attempt to acquire the GIL in a way that ensures the GIL won't get hung.  I would welcome feedback on that.  A common use case for that API might be a non-Python created thread that wants to invoke some sort of asynchronous callback handler via Python APIs.

For Python daemon threads that you control, you can avoid them hanging by registering an atexit function that signals them to exit and then waits until they do.

vsteinner: Regarding processing the Windows messages, I updated the PR to include a link to the MSDN documentation that led me to think it was a good idea.

vstinner: As for random code outside of Python itself that is using `PyThread_exit_thread`: although I suppose there are legitimate use cases for `pthread_exit` and `_endthreadex`, these functions are only safe with the cooperation of the entire call stack of the thread.  Additionally, while `pthread_exit` and `_endthreadex` have similar behavior for C code, they don't have the same behavior for C++ code, and that difference seems like a likely source of problems.  Finally, I would say Python itself does not guarantee that its call stacks safely cooperate with `pthread_exit` (at the very least, there are sure to be memory leaks).  Therefore, I think Python should not encourage its use by leaving it as a non-deprecated public API.  If a user wishes to terminate a thread, they can invoke `pthread_exit` or `_endthreadex` directly, ideally without any Python functions in the call stack, and can refer to the documentation of those functions to understand the implications.

gps: The reasons I believe hanging the thread is better than `pthread_exit`:
- `pthread_exit` essentially throws an exception.  In theory that means you could do proper cleanup via C++ destructors and/or re-throwing catch blocks.  But in practice existing extension code is not designed to do that, and it would be quite a complex task to modify it to do proper cleanup, and on Windows the cleanup wouldn't run anyway.
- Additionally, throwing an exception means if there is a `noexcept` function in the call stack, the program terminates.  We would need to document that you aren't allowed to call Python APIs if there is a `noexcept` function in the call stack.  If you have a `catch(...)` in the call stack, then you may inadvertently catch the exception and return control back to Python at a point that assumes it owns the GIL, which will cause all sorts of havoc.  We would likewise need to document that you can't have a non-rethrowing `catch(...)` block in the call stack (I believe pybind11 has some of those).
- Throwing an exception also means C++ destructors run.  pybind11 has a smart pointer type that holds a `PyObject` and whose destructor calls `Py_DECREF`.  That causes a crash when `pthread_exit` unwinds the stack, since the thread doesn't own the GIL.

Those are the additional problems specific to `pthread_exit`.  As gps noted, there is the additional problem of memory corruption common to both `pthread_exit` and `_endthreadex`:
- Data structures that are accessible from other threads may contain pointers to data on the thread's stack.  For example, with certain types of locks/signalling mechanisms it is common to store a linked list node on the stack that as then added to a list of waiting threads.  If we destroy the thread stack without proper cleanup (and that proper cleanup definitely won't happen with `_endthreadex`, and probably in most cases still won't happen with `pthread_exit`), the data structure has now become corrupted.

I don't think hanging the thread really increases the risk of deadlock over the status quo.  In theory someone could have a C++ destructor that does some cleanup that safely pevents deadlock, but that is not portable to Windows, and I expect that properly implemented `pthread_exit`-safe code is extremely rare.

I think we would want to ensure that Python itself is implemented in such a way as to not deadlock if Python-created threads with only Python functions in the call stack hang.  Mostly that would amount to not holding mutexes while calling functions that may transitively attempt to acquire the GIL (or release and then re-acquire the GIL).  That is probably a good practice for avoiding deadlock even when not finalizing.

We would also want to document that external code using the Python API should be safe from deadlock if a thread hangs, or should use the new "finalize block" mechanism to ensure that it doesn't hang, but I would say that is much easier to achieve than `pthread_exit`/`_endthreadex` safety, and I expect is already satisfied by most code.

Regarding vstinner's point that we would leak hung threads in an application that embeds Python and keeps running after `Py_FinalizeEx`:
What are the existing use cases for initializing/finalizing the interpreter multiple times?   Under the status quo we leak a bunch of memory when finalizing, e.g. because any Python object references held by the threads that are killed will be leaked, among many other things.  That would seem to rule out its suitability for use in production code.  At best it might be suitable for test code.  Leaking a hung thread basically just amounts to leaking the thread stack.  Additionally, if the thread is not created by Python, and the application is not exiting, then only the application should decide whether to kill one of its threads.

Adding a `Py_SetThreadExitCallback(func)` function seems reasonable, but I would propose that the only acceptable behaviors are: abort, quick exit, and hang.  I don't think we should promise that Python's own code base is safe for use with `pthread_exit` / `_endthreadex`, nor should we require that extension code is safe for that.  Furthermore, `abort` would seem to only be useful if you have already taken measures (such as signaling daemon threads to exit and using the "finalize block" mechanism) to ensure it is never reached, and it just serves as an additional sanity check.

----------

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue42969>
_______________________________________