How to solve a deadlock in an orthodox way

Wed Mar 4 16:29:20 EST 2020

Hi folks,

During the development of a Python library that uses a C++ library under
the hood (via Cython), we found a deadlock between a lock maintained by
the C++ library and the GIL.

The problem appeared when we tried to offload all IO operations to an
isolated thread, basically a thread running an asyncio loop. Once this
thread was added, we had to start using call_soon_threadsafe to schedule
new functions into the IO loop, and that is when the deadlocks started
appearing.
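
For reference, a minimal sketch of that kind of setup in plain Python. The
names start_io_loop and send_from_any_thread are hypothetical and only
illustrate the pattern; this is not our actual code.

```python
import asyncio
import threading


def start_io_loop():
    """Start an asyncio loop in a dedicated daemon thread (hypothetical helper)."""
    loop = asyncio.new_event_loop()
    ready = threading.Event()

    def run():
        asyncio.set_event_loop(loop)
        loop.call_soon(ready.set)  # signal once the loop is actually running
        loop.run_forever()

    thread = threading.Thread(target=run, name="io-loop", daemon=True)
    thread.start()
    ready.wait()
    return loop, thread


def send_from_any_thread(loop, results, payload):
    """Schedule a stand-in IO job on the loop from whatever thread calls this."""
    done = threading.Event()

    def job():
        # Runs inside the IO thread.
        results.append((threading.current_thread().name, payload))
        done.set()

    # Thread-safe scheduling. To wake the loop, CPython's loops write a byte
    # to an internal socket, and that write releases the GIL.
    loop.call_soon_threadsafe(job)
    return done


loop, io_thread = start_io_loop()
results = []
done = send_from_any_thread(loop, results, b"hello")
done.wait(timeout=5)

# Shut the loop down cleanly from the calling thread.
loop.call_soon_threadsafe(loop.stop)
io_thread.join(timeout=5)
loop.close()
```

The relevant detail is the wake-up write inside call_soon_threadsafe: it
is a blocking-capable system call, so the interpreter drops the GIL around
it.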

Why?

Before adding the new thread for offloading the IO operations we were
not releasing the GIL on calls into the C++ library, and at first we
tried to stick to that plan, which turned out to be harmful: it is what
allowed the deadlocks to appear.

Two threads might have the following behaviour, which would result in
a deadlock:

1 - (thread 1) (Cython) calls a C++ function
2 - (thread 1) (C++ code) creates and locks a mutex X
3 - (thread 1) (C++ code) calls a Cython callback
4 - (thread 1) (Cython) calls call_soon_threadsafe
5 - (thread 1) (Python) releases the GIL
6 - (thread 1) (Python) sock.send(...)
7 - (thread 2) (Python) acquires the GIL
8 - (thread 2) (Cython) calls a C++ function
9 - (thread 2) (C++ code) tries to acquire the mutex X (blocks while holding the GIL)
10 - (thread 1) (Python) tries to reacquire the GIL (blocks while holding the mutex X)
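
The same ordering can be reproduced in pure Python as a simulation. In
this sketch two ordinary threading.Lock objects stand in for the GIL and
for the mutex X (neither is the real thing), and timeouts replace the
infinite waits so that the script terminates and reports who got stuck.

```python
import threading

gil = threading.Lock()        # stand-in for the GIL, NOT the real one
mutex_x = threading.Lock()    # stand-in for the C++ library's mutex X
gil_released = threading.Event()
t2_holds_gil = threading.Event()
both_reported = threading.Barrier(2)
blocked = {}


def thread_1():
    gil.acquire()                       # thread 1 is running Python/Cython code
    mutex_x.acquire()                   # step 2: the C++ code locks mutex X
    gil.release()                       # step 5: the socket write releases the GIL
    gil_released.set()
    t2_holds_gil.wait()                 # step 6: write in progress, thread 2 runs
    got = gil.acquire(timeout=0.2)      # step 10: try to get the GIL back
    blocked["thread 1 on GIL"] = not got
    both_reported.wait()                # keep mutex X until thread 2 has reported
    if got:
        gil.release()
    mutex_x.release()


def thread_2():
    gil_released.wait()
    gil.acquire()                       # step 7: thread 2 takes the GIL
    t2_holds_gil.set()
    got = mutex_x.acquire(timeout=0.2)  # step 9: C++ call made without releasing the GIL
    blocked["thread 2 on mutex X"] = not got
    both_reported.wait()                # keep the GIL until thread 1 has reported
    if got:
        mutex_x.release()
    gil.release()


threads = [threading.Thread(target=thread_1), threading.Thread(target=thread_2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(blocked)
```

Each thread holds the lock the other one needs, which is the classic
lock-order inversion; without the timeouts neither would ever make
progress.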

The IO thread synchronization, which is done by writing to a specific
FD, releases the GIL. That gives other threads the chance to run and to
block on the mutex that is already locked while they hold the GIL, ending
in a fatal deadlock.

To address the situation we explicitly released the GIL on each C++
call, which solved the issue at a non-negligible performance cost.

Do you have any idea how this deadlock could be prevented without
having to release the GIL on each C++ call? Keep in mind that we are
not free to modify the C++ code.

We have a solution that might work, but it is not easy to implement, at
least not in all environments. The idea is basically to avoid running
the piece of code that implicitly releases the GIL and to defer its
execution until after the C++ function has returned. It seems to work
quite well, since the C++ code is basically an asynchronous framework
that only needs to schedule a callback. This is doable in environments
where we already have an asyncio thread: it is a matter of calling
`call_soon(offload_io_work_into_another_thread, cb)`. In environments
where we do not have an automatic mechanism for deferring the execution
of code, we would need to implement one, which might not be a
straightforward task.
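
For the environments without a loop, one possible shape for such a
mechanism is a per-thread queue that is drained once the outermost C++
call has returned. This is only a rough sketch under that assumption:
DeferredCalls, call_native, defer and fake_native_call are hypothetical
names, and a plain Python function holding a lock stands in for the C++
code.

```python
import threading
from collections import deque


class DeferredCalls:
    """Queue work requested inside a native call; run it after the call returns."""

    def __init__(self):
        self._local = threading.local()

    def call_native(self, native_fn, *args):
        if getattr(self._local, "pending", None) is not None:
            # Nested native call: let the outermost call do the draining.
            return native_fn(*args)
        pending = self._local.pending = deque()
        try:
            result = native_fn(*args)
        finally:
            self._local.pending = None
        # The native call is over, so its mutex is no longer held here.
        # (If native_fn raised, the queued work is dropped in this sketch.)
        while pending:
            fn, fn_args = pending.popleft()
            fn(*fn_args)
        return result

    def defer(self, fn, *args):
        pending = getattr(self._local, "pending", None)
        if pending is None:
            fn(*args)                   # not inside a native call: run right away
        else:
            pending.append((fn, args))  # inside a native call: postpone


mutex_x = threading.Lock()  # stand-in for the C++ library's mutex X
events = []
deferred = DeferredCalls()


def io_work(payload):
    # Stand-in for the code that implicitly releases the GIL.
    events.append(("io", payload, mutex_x.locked()))


def cython_callback(payload):
    events.append(("callback", payload, mutex_x.locked()))
    deferred.defer(io_work, payload)


def fake_native_call(payload):
    # Stand-in for the C++ function: holds mutex X while invoking the callback.
    with mutex_x:
        cython_callback(payload)
    return len(payload)


result = deferred.call_native(fake_native_call, b"ping")
print(events)
```

The third field of each event records whether the stand-in mutex was held
at that moment: the callback runs with it held, while the deferred IO work
runs only after it has been released.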

Thoughts?

Thanks!

-- 
--pau