Python threading (was: Re: global interpreter lock not working as it should)

Thu Aug 8 20:52:13 EDT 2002

Hi, Martin v. Lowis !

 On Thu, Aug 08, 2002 at 10:33:20AM +0200, Martin v. Lowis wrote:

> DIG <dig.list at telkel.net> writes:
> 
> > > Whatever changes you make, they can't increase the performance. If you
> > > need performance, you better avoid threads. On a single processor,
> > > threads can only slow down the entire computation.
> > 
> > And ?..
> 
> Armin said that he makes these changes "To have an acceptable
> ->thread performance<-". I just say that this won't be possible.

I am sorry if I was a little unclear here.

I asked you "And ?.." because: 
(1) you already said this in [1]; 
(2) as far as I understood Armin, he did not say that his changes would increase thread performance [2]. 

You are right in [1] -- it is simply not possible (for a given CPU-bound task). In my opinion (and please correct me if I am wrong), Armin proposed these changes not to increase thread performance, but to switch between the threads more often.

(By the way, your assumption in [1] about increasing thread performance is right for an ideal world. In the real world, you should specify WHAT overall performance means for you in YOUR case, because (in my opinion) performance can rarely be defined simply as the number of instructions per unit of time. And even so, I think you were talking about the effectiveness of your program and of the whole system: how effectively you use your processor and your system to solve your problem.)

As Armin demonstrated in [3] and especially in [4], the proposed changes make the threads switch more often. At the same time, they do not decrease overall performance dramatically. (Anyway, the decrease in total counts from 3002779 for the unmodified version to 2917223 for Armin's version is much less spectacular than the increase in thread switch frequency, from 9221 to 18788, as demonstrated in [4].) This is my interpretation of Armin's words "To have an acceptable ->thread performance<-" in [2] (and it is not in any way "an increase of thread performance").
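
To make the comparison concrete, here is a small Python sketch that redoes the arithmetic on the figures quoted in [4]. The variable names are mine, and labeling the first block of [4] as the patched build follows my reading above; only the numbers themselves come from Armin's post.

```python
# Per-thread figures copied from [4]; names are illustrative only.
patched_counts = [202038, 312444, 322712, 206103, 216143,
                  323075, 361574, 279071, 335451, 358612]
patched_switches = [1855, 1890, 1892, 1858, 1852,
                    1889, 1899, 1877, 1888, 1888]

plain_counts = [286616, 328785, 305904, 284464, 313173,
                281241, 308752, 295980, 317234, 280630]
plain_switches = [927, 937, 923, 915, 932, 914, 925, 930, 914, 904]

# Fraction of total work lost by the patched build relative to the unmodified one.
count_loss = 1 - sum(patched_counts) / sum(plain_counts)

# How many times more often the patched build switched threads.
switch_gain = sum(patched_switches) / sum(plain_switches)

print(f"work lost:    {count_loss:.2%}")
print(f"switch ratio: {switch_gain:.2f}x")
```

On these numbers the patched build does a little under 3% less work while switching threads roughly twice as often.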

I can imagine tasks that need lower latency, even at the cost of a small decrease in overall performance (I agree that the notion of "small" is very subjective). Sometimes it may be important for an application to switch between threads more often. As Jonathan said in [5], there is no ideal solution. Of course. Even the OS layer under your application decreases its performance. There is more on this in Bengt's post [6].

As Dave said in [7], the big question is: how would this patch affect existing applications (if applied)? The same question, I suppose, is asked before ANY change to the Python interpreter.

Why not give the user an opportunity to change thread-related behavior by means of the sys module? Something like sys.thread_turbo_switch( 1 or 0 )? Of course, with the counterpart: sys.thread_overall_performance_decrease( 0 or 1 ) :-))
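
For what it is worth, the two names above are made up, but CPython does have a knob in this spirit: sys.setcheckinterval() in the bytecode-counting interpreter, and sys.setswitchinterval() / sys.getswitchinterval() in Python 3.2 and later. A minimal sketch, assuming Python 3; the helper name switch_interval is hypothetical and not part of any library.

```python
import sys
from contextlib import contextmanager


@contextmanager
def switch_interval(seconds):
    """Temporarily change how often the interpreter offers a thread switch.

    Hypothetical helper: wraps the real sys.setswitchinterval() call and
    restores the previous value on exit, even if the body raises.
    """
    previous = sys.getswitchinterval()
    sys.setswitchinterval(seconds)
    try:
        yield previous
    finally:
        sys.setswitchinterval(previous)


# Ask for more frequent switches (lower latency) around a latency-sensitive
# section, accepting some loss of overall throughput.
with switch_interval(0.0005) as old_value:
    print("interval inside block:", sys.getswitchinterval())
print("interval restored to:", sys.getswitchinterval())
```

The trade-off is the one discussed above: a smaller interval means more switches and lower latency, paid for with somewhat less total work done.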

References:
~~~~~~~~~~~

(irrelevant parts are skipped)

[1] From: martin at v.loewis.de (Martin v. Loewis)
    Message-ID: <m34re7g4kz.fsf at mira.informatik.hu-berlin.de>
    Date: 06 Aug 2002 21:38:52 +0200

I'm all in favour of efficiency. However, adding more thread switches
is likely to hurt efficiency, instead of increasing it. Notice that
the total amount of work to do is fixed, and it consumes a certain
amount of time when done strictly sequentially. Adding thread switches
extends completion time, and thus decreases performance.

[2] From: a-steinhoff at web.de (Armin Steinhoff)
    Message-ID: <ddc19db7.0208070034.13c671d3 at posting.google.com>
    Date: 7 Aug 2002 01:34:23 -0700

To have an acceptable ->thread performance<- for POSIX systems I
would propose to do a separation of the lock handling at application
(thread) level and the handling of the GIL at system level ... just 
to cleanup the internal design!

In the moment it makes no sense to use the 'Python threads' for POSIX
systems 'if and only if' you need performance (or real-time performance).

[3] From: a-steinhoff at web.de (Armin Steinhoff)
    Message-ID: <ddc19db7.0208050743.590e56bc at posting.google.com>
    Date: 5 Aug 2002 08:43:38 -0700

I have build three versions of python by inserting a sched_yield and a
delay of 1ms in the code of ceval.c below ... and did run Jonathans testcode.

[4] From: a-steinhoff at web.de (Armin Steinhoff)
    Message-ID: <ddc19db7.0208061220.6cd88693 at posting.google.com>
    Date: 6 Aug 2002 13:20:06 -0700

Counts:
[202038, 312444, 322712, 206103, 216143, 323075, 361574, 279071, 335451, 358612]
Total = 2917223

Switches:
[1855, 1890, 1892, 1858, 1852, 1889, 1899, 1877, 1888, 1888]
Total = 18788

ceval.c unmodified:

Counts:
[286616, 328785, 305904, 284464, 313173, 281241, 308752, 295980, 317234, 280630]
Total = 3002779

Switches:
[927, 937, 923, 915, 932, 914, 925, 930, 914, 904]
Total = 9221

[5] From: Jonathan Hogg <jonathan at onegoodidea.com>
    Message-ID: <B9746363.F238%jonathan at onegoodidea.com>
    Date: Mon, 05 Aug 2002 17:16:19 +0100

It's not really an ideal solution all things considered. If you have any
higher-priority I/O going on in other threads then increasing the check
interval will introduce long latencies.

[6] From: bokr at oz.net (Bengt Richter)
    Message-ID: <aiu5u1$vfv$0 at 216.39.172.122>
    Date: 8 Aug 2002 16:23:29 GMT

Don't forget that a disk controller is effectively blocking and waiting
for attention if you don't give it work to do when there is disk work to do
(although that can be mitigated with OS/file system readahead for sequential
access etc.) So part of managing "not to block in system calls" may be getting
the disk controller to start filling a new buffer in parallel with your single
thread as soon as it's ready to, so by the time you need the data, you won't block.

In a single thread, the code to do that will likely be ugly and/or inefficient.
Polling is effectively a time-distributed busy wait, so if you need to do that
in order to keep i/o going, you are not really avoiding busy waiting, you are
just diluting it with added latency. And worse, if you do it by writing Python
code to poll, you will be hugely more inefficient than letting ceval.c do it
in the byte code loop, even if the latter is not as optimum as it could be.

Yes, but again, to avoid blocking you need pretty much vanilla sequential i/o
that the OS can anticipate your needs with, and to be compute-bound otherwise.

Yes, but for many situations convenience is crucial in getting programmers
to deal with problems of managing parallel system activity so as to have
at least one unblocked thread available most of the time to keep the CPU busy.

Of course you are right that there is nothing to be gained from chopping up
what would otherwise be an unbroken stream of computation ;-)

[7] From: brueckd at tbye.com
    Message-ID: <Pine.LNX.4.44.0208061509280.16247-100000 at ns0.tbye.com>
    Date: Tue, 6 Aug 2002 15:10:33 -0700 (PDT)

So now the big question is: how does this affect the performance of Python 
programs that already work great (i.e. multithreaded Python programs that 
actually do real work)?

Thank you for your time.
Regards,

-- 
DIG (Dmitri I GOULIAEV)