[Numpy-discussion] Python ctypes and OpenMP mystery
Eric Carlson
ecarlson at eng.ua.edu
Sat Feb 12 15:19:39 EST 2011
Hello All,
I have been toying with OpenMP through f2py and ctypes. On the whole,
the results of my efforts have been very encouraging. That said, some
results are a bit perplexing.
I have written identical routines that I run both directly as a compiled C
executable and through ctypes as a shared library. I am running the
tests on a dual-Xeon Ubuntu system with 12 cores and 24 threads. The C
executable is SLIGHTLY faster than the ctypes version at lower thread
counts, but the C version eventually reaches a speedup ratio of 12+,
while the Python-invoked version caps off at 7.7, as shown below:
threads   C-speedup   Python-speedup
   1        1            1
   2        2.07         1.98
   3        3.1          2.96
   4        4.11         3.93
   5        4.97         4.75
   6        5.94         5.54
   7        6.83         6.53
   8        7.78         7.3
   9        8.68         7.68
  10        9.62         7.42
  11       10.38         7.51
  12       10.44         7.26
  13        7.19         6.04
  14        7.7          5.73
  15        8.27         6.03
  16        8.81         6.29
  17        9.37         6.55
  18        9.9          6.67
  19       10.36         6.9
  20       10.98         7.01
  21       11.45         6.97
  22       11.92         7.1
  23       12.2          7.08
These ratios are quite consistent from 100KB double arrays up to 100MB
double arrays, so I do not think this reflects Python call overhead.
There is no question the routine is memory-bandwidth constrained, and I
feel lucky to squeeze out the eventual 12+ ratio, but I am very perplexed
as to why the performance of the Python-invoked routine seems to cap off.
Does anyone have an explanation for the caps? Am I seeing some effect
from ctypes, or the Python engine, or what?
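For completeness, the thread counts above were swept one value at a time.
One common way to drive such a sweep from Python (an assumption about
methodology, not necessarily what I did) is to set OMP_NUM_THREADS and
launch a fresh process per setting, since the OpenMP runtime typically
reads that variable once at startup. The inline child process here just
echoes the value back to keep the sketch self-contained:

```python
import os
import subprocess
import sys

seen = []
for nthreads in (1, 2, 4):
    # Fresh environment per run: OMP_NUM_THREADS is read when the
    # OpenMP runtime initializes, so it must be set before launch.
    env = dict(os.environ, OMP_NUM_THREADS=str(nthreads))
    out = subprocess.run(
        [sys.executable, "-c",
         "import os; print(os.environ['OMP_NUM_THREADS'])"],
        env=env, capture_output=True, text=True, check=True)
    seen.append(out.stdout.strip())

print(seen)  # ['1', '2', '4']
```

In a real benchmark the inline `-c` snippet would be replaced by the
script that loads the shared library and times the kernel.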
Cheers,
Eric