global interpreter lock not working as it should

Bengt Richter bokr at oz.net
Fri Aug 2 21:43:13 EDT 2002


On Fri, 2 Aug 2002 12:43:35 -0400, anton wilson <anton.wilson at camotion.com> wrote:

>On Friday 02 August 2002 11:48 am, Mark Day wrote:
>> In article <ddc19db7.0208020705.4ef1e3e9 at posting.google.com>,
>> Armin Steinhoff <a-steinhoff at web.de> wrote:
>> > I see your point. The release_lock / acquire_lock only make sense if
>> > the threads have different priorities. But all Python threads have the
>> > same priority by default ... so it makes absolute sense to include the
>> > sched_yield for fair scheduling.
>> >
>> > Without the sched_yield ... a Python thread can hold the CPU forever
>> > (if not preempted by a task with a higher priority).
>>
>> In many (most?) schedulers, if you have other eligible threads at the
>> same priority as the running thread, those other eligible threads will
>> all get run eventually.  This is commonly done by using some periodic
>> timer to give the scheduler an opportunity to reschedule and pick
>> another thread at the same priority, or perhaps notice higher priority
>> threads that have become eligible (vs. systems that immediately
>> reschedule when a higher priority thread becomes eligible).  That time
>> period is often known as a "time slice".
>>
>
>This is true. Therefore, the only time another thread WILL grab the GIL under 
>Linux is if 
>
>1) the GIL is released by the currently running thread
>2) the thread that just released the GIL depletes its timeslice before it
>   can grab the lock again
>3) the OS notices the process has depleted its timeslice and yanks it
>   from the CPU (this happens 100 times per second by default on i386)
>4) the waiting thread that received the GIL release signal is chosen to run
>
>Therefore, we now have a large set of coincidences for CPU-bound Python 
>threads. The only reason it works at all is because the timeslice check 
>happens 100 times per second and the GIL is released frequently by default. 
>So there is a sufficient probability that these four conditions will 
>coincide.
>
Hm. If that's an accurate description (and I am skeptical ;-), I don't think I like it.
Seems like, if another thread is waiting for the GIL, then after a full time slice the
current GIL-holder ought to be told to give it up ASAP, and be given a new time slice
only long enough to do the minimum it needs to do that.

But an OS-level scheduler won't know about that, so something else would have
to get in between and set an indication that the interpreter could notice and
respond to. In effect, a GIL time_expired flag. A crude but probably
effective way to do it would be to have a high-priority periodic thread
(C code, not involving interpretation) within the interpreter which would just
increment a GIL timer variable that gets reset every time the GIL is
acquired. The interpreter could then check very cheaply whether the timer is >= 2,
maybe every few byte codes (or via the VM, with a byte code inserted by the compiler?),
and release the GIL and yield to the scheduler ASAP if so. If the C equivalent of time.clock
were fast enough, clock() could be read and stored at GIL acquisition time,
and the interpreter could read clock() and check elapsed time instead of testing a
tick variable >= 2, but I don't think it could be as fast (and see below for zero
overhead in the byte code loop).
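
To make that concrete, here is a rough sketch in C of the crude version.
Everything in it is invented for illustration (gil_ticks, ticker(), and
interp_thread() are made-up names, and this is nothing like the actual
ceval code):

    #include <pthread.h>
    #include <sched.h>
    #include <unistd.h>

    static pthread_mutex_t gil = PTHREAD_MUTEX_INITIALIZER;
    static volatile int gil_ticks;     /* bumped by ticker, reset on acquire;
                                          approximate counts are fine here */

    static void *ticker(void *arg)     /* the periodic high-priority thread */
    {
        (void)arg;
        for (;;) {
            usleep(10000);             /* 10 ms tick period */
            gil_ticks++;
        }
    }

    static void acquire_gil(void) { pthread_mutex_lock(&gil); gil_ticks = 0; }
    static void release_gil(void) { pthread_mutex_unlock(&gil); }

    static void *interp_thread(void *arg)
    {
        (void)arg;
        acquire_gil();
        for (long i = 0; i < 5000000; i++) {
            /* ... interpret a byte code or two here ... */
            if (gil_ticks >= 2) {      /* at least one full tick has passed */
                release_gil();         /* give any waiter a chance ... */
                sched_yield();
                acquire_gil();         /* ... then carry on */
            }
        }
        release_gil();
        return NULL;
    }

    int main(void)
    {
        pthread_t t, a, b;
        pthread_create(&t, NULL, ticker, NULL);
        pthread_create(&a, NULL, interp_thread, NULL);
        pthread_create(&b, NULL, interp_thread, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
    }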

You could do something more sophisticated, but any technique would presumably
involve setting an indication that the interpreter could poll efficiently, and
it should be set some fixed time after the GIL was acquired, without a lot of
overhead of its own. A tick that gets incremented periodically and asynchronously
avoids trying to schedule a timeout every time the GIL is acquired, but a single tick
could happen right away, so you have to look for >=2 (or >1) to guarantee at least
one full tick time, which means sometimes you'll get almost two. If your periodic tick
was 10ms, that would translate to 10 to 20 ms. If there were an equivalent scheduling
tick available somewhere in memory, that could be used like clock(). You could use
smaller ticks, but the scheduler may not resolve things any finer anyway, and it
would be overhead.

If you wanted zero time overhead in the byte code loop, you could conceivably
atomically grab, save, and change the byte code pointer from outside (from a
timeout-driven thread), something like causing a VM interrupt. The byte-coded
"VM interrupt service routine" would have a VM instruction byte to release the GIL
and block, and then reacquire the GIL and "return from interrupt" when it got control
again." Depending on how things are implemented ... ;-)

Threads which unblock from i/o and then try to get the GIL to run some code,
only to block again almost immediately, would presumably get control faster, and
if they only had a line or two of code to run before blocking on i/o again
(releasing the GIL), you'd expect better i/o throughput overall, IWT.
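
(For reference, blocking calls in the interpreter's C code are already
bracketed so the GIL is dropped while they block. Roughly this pattern;
read_without_gil is a hypothetical helper, but the two macros are the
real C API:)

    #include <Python.h>
    #include <unistd.h>

    static ssize_t read_without_gil(int fd, char *buf, size_t n)
    {
        ssize_t got;
        Py_BEGIN_ALLOW_THREADS     /* releases the GIL */
        got = read(fd, buf, n);    /* may block; other threads run freely */
        Py_END_ALLOW_THREADS       /* reacquires the GIL */
        return got;
    }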

But I'd be surprised if Timbot & Co haven't thought of all this and more, so I am skeptical
that scheduling of Python threads is as dependent on "coincidences" as you describe
(I haven't looked at the relevant code).

>> And some schedulers even allow lower priority threads to run
>> occasionally to prevent starvation.
>>
Yes, but no scheduler will run a blocked thread, and Anton's description
suggests that threads only become unblocked by "coincidence" when another thread
is CPU-bound and holding the GIL, even as time slices expire repeatedly ;-/

Regards,
Bengt Richter


