[CentralOH] Fwd: Re: Stackless Python

Harris, Bryan W. Bryan.Harris at udri.udayton.edu
Mon Nov 2 16:38:21 CET 2009


Hi Guys,
I actually did some benchmarks between the threading and processing libraries.  I didn't do anything rigorous, and I don't want to "publish" the results because I don't understand them very well.  However, I have some general results below for anyone who is interested.
 
"Threading" (mt) all runs from within a single process and rarely uses more than a single core no matter what.  I did notice some usage in the 120% range, indicating that it was using more than a single processor, but that's pretty pitiful compared to a theoretical max of 400%.  I can't imagine why it takes 280% overhead to move number=3.14*i+number to a different core and run it there.  Maybe, I'm missing something simple.
 
"Processing" (mp) opens a new process for each thread, so if you are running top, you will see a new python instance for each process.  It was kind of satisfying to see my 100 "thread" test case fill top with "pythons".  It was nice to see any evidence at all that something was different.  However, as above, the processor usage rarely got very much over 100%.  In some cases I got something like 150% or 160% cpu usage.
 
Here is what I got in general:
I simulated a compute-intensive operation by incrementing a counter and then multiplying it by a floating point number.  I then printed the number, which I know takes a really long time.  I piped this output to a file to speed things up a bit, but I fear this test may be I/O bound, which could be what is slowing down my benchmarks.  I could get the same number of total operations by increasing the number of threads and decreasing the number of iterations: 10 threads of 10,000 operations ~= 20 threads of 5,000.  It's possible larger numbers could lead to overflows/underflows, but I don't think that should have slowed things down too much.
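The timing itself was just wall-clock time around the start/join loop, with the per-thread iteration count scaled to the thread count.  Something like this (a sketch -- TOTAL_OPS and timed_run are names I'm using here for illustration, not from the original script):

    import time
    import threading

    TOTAL_OPS = 100000

    def work(iterations):
        number = 0.0
        for i in range(iterations):
            number = 3.14 * i + number
            print(number)

    def timed_run(num_threads):
        # Hold total work constant: 10 threads x 10,000 ~= 20 threads x 5,000.
        iterations = TOTAL_OPS // num_threads
        threads = [threading.Thread(target=work, args=(iterations,))
                   for _ in range(num_threads)]
        start = time.time()
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return time.time() - start

    # Run with stdout piped to a file so printing doesn't dominate even more.
    for n in (1, 2, 4, 5, 10, 20):
        print('%2d threads: %.2fs' % (n, timed_run(n)))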
 
Python mp was faster when the number of processes was <= the number of cores.  I have a 4-core machine, and mp ran ~30-50% faster than mt for threads = 2, 3, and 4; at 1 thread the two were about the same, as one might expect.  However, as soon as I hit 5 threads, mp slowed down dramatically.  The performance of mt degraded gradually as I increased the number of threads, with no such noticeable drop at threads = 5.  At threads = 5, mp and mt had about the same performance.  The mp version dropped off a cliff after that, quickly taking 10x or 100x longer to complete than the mt program.
 
 
Bryan Harris
Research Engineer
Structures and Materials Evaluation Group
bryan.harris at udri.udayton.edu
(937) 229-5561

________________________________

From: centraloh-bounces+harrisbw=notes.udayton.edu at python.org on behalf of Brian Costlow
Sent: Mon 11/2/2009 9:37 AM
To: centraloh
Subject: Re: [CentralOH] Fwd: Re: Stackless Python


Just for my own education, I did a little further digging this weekend to get a better understanding of why the GIL keeps Python threads from running in parallel on multi-core/multi-processor systems.

I had a pretty accurate, but high-level, understanding of how the GIL works. I also thought that aside from the GIL, Python's thread implementation was pretty much: let the kernel/OS deal with scheduling, priorities, context switches, etc. So although I read a brief comment on a board from a well-known Pythonista that it "won't" run across multiple cores, that didn't make sense to me. If Python just uses the OS's underlying thread model, why won't it run across multiple cores when you have I/O-bound threads, or when your thread is C extension code that releases the lock while it does stuff outside the interpreter?

So before I stuck my foot in my mouth again, I did some more searching and reading.

As it turns out, Python does try to use more than one core/processor, but the GIL implementation can cause terrible lock contention.

A quick, simplified summary:

The thread holding the GIL releases it every 100 "ticks" (a tick can be thought of, loosely, as 1-6 Python bytecode instructions, depending on the bytecodes).
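In CPython 2.x you can inspect and change that interval through the sys module (Python 3.2 later replaced it with the time-based sys.setswitchinterval):

    import sys

    print(sys.getcheckinterval())   # 100 by default
    sys.setcheckinterval(1000)      # check (and possibly release) the GIL less often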

The OS then gets an opportunity to context switch to another thread. However, the scheduler usually doesn't switch at every opportunity, so the running thread will simply try to reacquire the GIL. On a single processor, when the OS does context switch, the running thread is stopped and another one is woken up and given control.

On a multiple core/processor system, say we have threads 1 and 2 running on different processors. Thread 1 holds the GIL and is processor-bound, while thread 2 is doing some I/O. Now thread 2 needs access to the Python interpreter again, so it waits until it can acquire the GIL. Thread 1 releases the GIL; the OS starts to wake up thread 2, but thread 1 is still running and also tries to reacquire the GIL. Thread 1 usually wins, because of the overhead of waking up thread 2. (In the links below, in one case, thread 2 attempted to get the lock 1400 times before it was successful.) All that extra work is what causes the slowdown.
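Dave's talk makes this concrete with a tiny CPU-bound function; a sketch of that demo is below. On a multi-core machine, the threaded run is often slower than doing the same work sequentially:

    import time
    import threading

    def count(n):
        # Pure-Python CPU-bound loop; it never blocks, so it only gives up
        # the GIL at the periodic check described above.
        while n > 0:
            n -= 1

    N = 10000000

    # Sequential: two calls back to back.
    start = time.time()
    count(N)
    count(N)
    print('sequential: %.2fs' % (time.time() - start))

    # Threaded: the same total work split across two threads.
    start = time.time()
    t1 = threading.Thread(target=count, args=(N,))
    t2 = threading.Thread(target=count, args=(N,))
    t1.start(); t2.start()
    t1.join(); t2.join()
    print('threaded:   %.2fs' % (time.time() - start))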

See this great talk by David Beazley:

http://blip.tv/file/2232410/

Accompanying slides (PDF):

http://www.dabeaz.com/python/GIL.pdf

Long but interesting thread on the Python concurrency-sig. It gets really interesting, and overlaps with/references Dave's talk, when the subject line changes to "Inside the Python GIL."

http://mail.python.org/pipermail/concurrency-sig/2009-June/000001.html




