[Python-Dev] proposal+patch: sys.gettickeraccumulation()

Ralf W. Grosse-Kunstleve rwgk at cci.lbl.gov
Mon Dec 6 00:37:12 CET 2004


Brett C. wrote:
> Ralf W. Grosse-Kunstleve wrote:
> >  Brett C. wrote:
> >  
> > >I have been mulling over this proposal and I think I am finally
> > >settling on +0 as long as it can be shown that it does not hurt
> > >performance at all (yes, it shouldn't but this is the eval loop we are
> > >talking about and that thing is a touchy beast).
> >  
> >  
> >  Is there a standard benchmark I could use to demonstrate the
> >  impact of the two new additions on the runtime?
> >  Thanks!
> 
> =)  Parrotbench and PyBench would be enough for me to sign off on any 
> performance hit.  There is also pystone.

I have done my homework now, based on parrotbench. Timings are below.
The most important conclusion from my viewpoint is that my original
goal can be achieved with a small patch and with a runtime penalty that
is in the noise. However, mysterious runtime behavior of
Objects/longobject.c led me to change my original patch and left the
unanswered question: "could small rearrangements in longobject.c
potentially boost performance by 45% on some platforms?"

Please don't be put off by the size of this message. My patch
is essentially just one line in ceval.c (_Py_TickerAccumulation++)!

To restate my original goal:

  I am looking for a simple way to answer the question: How much of a
  speedup can I expect if I reimplement a piece of Python code in C or
  C++?

Note that I am not asking how much time is spent in Python including
all the extensions that come with it, but how much time is spent in the
Python bytecode interpreter loop. To clarify, look at the timings for
two tests in parrotbench:

   b1      5.8287 time        30049474 ticks        0.1940 us/tick
   b2      1.5944 time          482584 ticks        3.3039 us/tick

Each of the tests was run with range(10) in b.py to increase accuracy.
The first column shows time.time() differences, the second the number
of bytecodes processed in ceval.c, and the last column shows
microseconds per tick. Why is the us/tick value for b2 17 times larger
than that for b1?
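The us/tick column is just wall-clock time divided by the tick count.
As a minimal sketch (the helper name us_per_tick is mine, not part of
b_timings.py), the computation is:

```python
def us_per_tick(seconds, ticks):
    """Microseconds of wall-clock time per interpreted bytecode."""
    return seconds * 1e6 / ticks

# Values from the b1 and b2 rows above:
print(round(us_per_tick(5.8287, 30049474), 4))  # 0.194
print(round(us_per_tick(1.5944, 482584), 4))    # 3.3039
```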

The answer is that b2 makes heavy use of long integers, and that a lot
of time is spent in Objects/longobject.c doing long-integer arithmetic.
Compared to b1, relatively little time is spent interpreting bytecodes.
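To see why, note that a single multiplication of two Python longs is
one bytecode in the eval loop but arbitrarily much work in
Objects/longobject.c. A small illustration (the operand size here is
an arbitrary choice of mine):

```python
import time

a = b = 1 << 100000          # two ~100000-bit longs

t0 = time.time()
c = a * b                    # one BINARY_MULTIPLY bytecode ...
elapsed = time.time() - t0   # ... but many iterations of the C loops
                             # in Objects/longobject.c

assert c == 1 << 200000
```

One tick, measurable C time: exactly the pattern that inflates b2's
us/tick value.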

Back to the original question: how much sense does it make to
reimplement b1 or b2 in C? Simply looking at the us/tick values is
telling me that I can expect much more of a performance boost by
reimplementing b1 rather than b2. This is because b2 spends a lot of
time in C already. A similar situation arises if extensions like
Numeric are used. Computationally intensive operations like matrix
multiplication and matrix inversion are already done at C speed.

The us/tick values let us quickly estimate how disappointed we would
be after moving Python code to C or C++. We get this estimate without
having to do a detailed source code analysis, and before wasting time
on the recoding (we have been through this a couple of times ...).

The timings below show that the patched Python which counts the number
of interpreted bytecodes is in the worst case 0.6% slower than the
original Python, but some tests even run faster at face value:

$gcc296_compiled/original/python -O b_timings.py | grep reporter
   b0     16.8831 time
   b1      5.9553 time
   b2      1.5914 time
   b3     10.8149 time
   b4      5.2528 time
   b5     11.6437 time
   b6     11.0248 time
   b7     27.6960 time
  all     90.9276 time
$gcc296_compiled/work/python -O b_timings.py | grep reporter
   b0     16.9193 time        50979585 ticks        0.3319 us/tick
   b1      5.8287 time        30049474 ticks        0.1940 us/tick
   b2      1.5944 time          482584 ticks        3.3039 us/tick
   b3     10.7931 time        43577404 ticks        0.2477 us/tick
   b4      5.2865 time        18463044 ticks        0.2863 us/tick
   b5     11.6086 time        17065750 ticks        0.6802 us/tick
   b6     10.0992 time        60000464 ticks        0.1683 us/tick
   b7     27.6830 time         1928644 ticks       14.3536 us/tick
  all     89.8760 time       222548399 ticks        0.4038 us/tick

The b7 test is the same as b2 but with xrange(10000) instead of
xrange(1000), and with all print statements removed.

Ratios (rounded to 3 decimals):
    16.9193/16.8831=1.002
     5.8287/5.9553 =0.979
     1.5944/1.5914 =1.002
    10.7931/10.8149=0.998
     5.2865/5.2528 =1.006
    11.6086/11.6437=0.997
    10.0992/11.0248=0.916
    27.6830/27.6960=1.000

Therefore I'd argue that the runtime penalty for the one additional
long long increment in ceval.c (_Py_TickerAccumulation++) is in the
noise.

The timings were collected on a 2.8GHz Dual Xeon, Redhat WS 3.
Python was compiled under Red Hat 7.3 with gcc 2.96. See below why.

My latest patch can be found at the end of this posting.
It can be applied to the Python-2.4 distribution like this:

    tar zxf Python-2.4.tgz
    patch --strip=1 -d Python-2.4 < patch_file

------------------------------------------------------------------------------
------------------------------------------------------------------------------
------------------------------------------------------------------------------

Now to the mysterious runtime behavior of Objects/longobject.c

In the original Python 2.4 sources _Py_Ticker is decremented in
two places:

In the interpreter loop in Python/ceval.c:

                if (--_Py_Ticker < 0) {

In Objects/longobject.c:

#define SIGCHECK(PyTryBlock) \
        if (--_Py_Ticker < 0) { \
                _Py_Ticker = _Py_CheckInterval; \
                if (PyErr_CheckSignals()) { PyTryBlock; } \
        }

This macro is used in four places.
My initial patch was to insert updates of a new, global
_Py_TickerAccumulation variable in these two places:

Python/ceval.c:

                                    a try: finally: block uninterruptable. */
                                 goto fast_next_opcode;
                         }
+			_Py_TickerAccumulation += _Py_CheckInterval - _Py_Ticker;
 			_Py_Ticker = _Py_CheckInterval;
 			tstate->tick_counter++;

Objects/longobject.c:

 #define SIGCHECK(PyTryBlock) \
 	if (--_Py_Ticker < 0) { \
+		_Py_TickerAccumulation += _Py_CheckInterval - _Py_Ticker; \
 		_Py_Ticker = _Py_CheckInterval; \
 		if (PyErr_CheckSignals()) { PyTryBlock; } \
 	}

This led to the timings below. All timings were collected on the
same machine, but the original and the patched Pythons were each
compiled with two different compilers, gcc 2.96 and gcc 3.2.3. The
important timings to look at are those for b2 and b7 again.

$gcc296_compiled/original/python -O b_timings.py | grep reporter
   b0     16.8831 time
   b1      5.9553 time
   b2      1.5914 time
   b3     10.8149 time
   b4      5.2528 time
   b5     11.6437 time
   b6     11.0248 time
   b7     27.6960 time
  all     90.9276 time
$gcc323_compiled/original/python -O b_timings.py | grep reporter
   b0     18.6390 time
   b1      6.4681 time
   b2      2.2588 time
   b3     11.0215 time
   b4      5.6490 time
   b5     12.3022 time
   b6     10.3815 time
   b7     40.2735 time
  all    107.0636 time

This shows that the gcc 2.96 optimizer is generally a little better
than the gcc 3.2.3 optimizer, though not by much, except for the b2
and b7 tests. E.g. 40.2735/27.6960 = 1.454!

Now with the patches in both ceval.c and longobject.c:

$gcc296_compiled/ticker/python -O b_timings.py | grep reporter
   b0     17.1190 time        60792625 ticks        0.2816 us/tick
   b1      6.1171 time        30049474 ticks        0.2036 us/tick
   b2      1.8764 time          705754 ticks        2.6587 us/tick
   b3     10.7070 time        43577404 ticks        0.2457 us/tick
   b4      5.2677 time        22944694 ticks        0.2296 us/tick
   b5     11.7448 time        17433190 ticks        0.6737 us/tick
   b6     10.9603 time        60000504 ticks        0.1827 us/tick
   b7     33.2320 time         2837124 ticks       11.7133 us/tick
  all     97.0893 time       238342235 ticks        0.4074 us/tick
$gcc323_compiled/ticker/python -O b_timings.py | grep reporter
   b0     18.3115 time        60792625 ticks        0.3012 us/tick
   b1      6.1193 time        30049474 ticks        0.2036 us/tick
   b2      2.2522 time          705754 ticks        3.1912 us/tick
   b3     11.2202 time        43577404 ticks        0.2575 us/tick
   b4      5.4596 time        22944694 ticks        0.2379 us/tick
   b5     11.8197 time        17433190 ticks        0.6780 us/tick
   b6     10.7407 time        60000504 ticks        0.1790 us/tick
   b7     40.2397 time         2837124 ticks       14.1833 us/tick
  all    106.2283 time       238342235 ticks        0.4457 us/tick

The gcc 3.2.3 timings for b7 are hardly affected by my patch in
longobject.c, but the gcc 2.96 timing shoots up from 27.6960 to
33.2320. Still not as bad as the gcc 3.2.3 timing, but why?

Interestingly, if I declare _Py_TickerAccumulation as long instead
of PY_LONG_LONG, the gcc 2.96 timing is also hardly affected by
my patch in longobject.c.

Even more interestingly, if I change longobject.c like this...

#define SIGCHECK(PyTryBlock) \
        { \
                if (PyErr_CheckSignals()) { PyTryBlock; } \
        }

the runtime is also hardly affected (times not shown in this posting)
even though PyErr_CheckSignals() is executed 100 times more often.

Curious to learn how other compilers deal with longobject.c, I
collected timings with this compiler:
Compaq C V6.3-028 on Compaq Tru64 UNIX V5.1 (Rev. 732)

$compaq_compiled/original/python -O b_timings.py | grep reporter 
   b1     19.4492 time
   b2      4.9395 time
   b3     23.6787 time
   b5     28.2568 time
   b6     21.6230 time
   b7     86.5283 time
  all    192.0498 time
$compaq_compiled/ticker/python -O b_timings.py | grep reporter 
   b1     19.4775 time        30049474 ticks        0.6482 us/tick
   b2      4.9971 time          704784 ticks        7.0902 us/tick
   b3     24.2559 time        43577404 ticks        0.5566 us/tick
   b5     26.3320 time        17420746 ticks        1.5115 us/tick
   b6     21.7227 time        60000464 ticks        0.3620 us/tick
   b7     86.9395 time         2836154 ticks       30.6540 us/tick
  all    190.8936 time       165555937 ticks        1.1530 us/tick

The b0 and b4 timings are missing because of exceptions that I didn't
investigate. As with the gcc 3.2.3 compilation, my longobject.c
patch hardly makes a difference.

What does all this mean? Does the gcc 2.96 surprise tell us that small
rearrangements in longobject.c could potentially boost performance by
45% on some platforms?

Since I am not interested in optimizing longobject.c, this is where I
stopped. Eventually I decided that the problematic patch in
longobject.c does not help with my original goal as stated near the
beginning of this posting. I am only interested in counting the
iterations of the interpreter loop. However, the _Py_Ticker decrements
in longobject.c are not inside the interpreter loop, but in C loops!
This means _Py_Ticker is useless for my purposes. Therefore I decoupled
_Py_Ticker and _Py_TickerAccumulation.
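With the decoupled counter, a wrapper along these lines could answer
the original question for any code fragment. This is only a sketch:
sys.gettickeraccumulation() exists only on an interpreter built with
the proposed patch, so the code guards for its absence.

```python
import sys
import time

def us_per_tick_of(fn, *args):
    """Wall-clock microseconds per eval-loop iteration while fn runs,
    or None on a stock interpreter without the proposed patch."""
    get = getattr(sys, "gettickeraccumulation", None)
    if get is None:
        return None          # stock build: the counter is not available
    ticks0, t0 = get(), time.time()
    fn(*args)
    return (time.time() - t0) * 1e6 / (get() - ticks0)
```

A high return value would suggest the fragment already spends its time
in C, so recoding it in C or C++ would gain little.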

BTW: I think that, strictly speaking, this docstring is incorrect
because of the _Py_Ticker manipulations in longobject.c:

    sys.setcheckinterval.__doc__:
    Tell the Python interpreter to check for asynchronous events every
    n instructions.

I feel very dizzy now after dealing with all the timings and the nasty
details. A lot of fallout, caused by the simple idea of adding one
innocent line to ceval.c. I hope it makes sense to somebody!

Cheers,
        Ralf

------------------------------------------------------------------------------
------------------------------------------------------------------------------
------------------------------------------------------------------------------

diff -r -u original/Include/ceval.h work/Include/ceval.h
--- original/Include/ceval.h	2004-10-10 19:40:35.000000000 -0700
+++ work/Include/ceval.h	2004-11-30 23:27:11.000000000 -0800
@@ -68,6 +68,11 @@
 
 /* this used to be handled on a per-thread basis - now just two globals */
 PyAPI_DATA(volatile int) _Py_Ticker;
+#ifndef HAVE_LONG_LONG
+PyAPI_DATA(volatile long) _Py_TickerAccumulation;
+#else
+PyAPI_DATA(volatile PY_LONG_LONG) _Py_TickerAccumulation;
+#endif
 PyAPI_DATA(int) _Py_CheckInterval;
 
 /* Interface for threads.
diff -r -u original/PC/os2emx/python24.def work/PC/os2emx/python24.def
--- original/PC/os2emx/python24.def	2004-10-10 19:40:50.000000000 -0700
+++ work/PC/os2emx/python24.def	2004-11-30 23:27:11.000000000 -0800
@@ -743,6 +743,7 @@
   "_Py_CheckRecursionLimit"
   "_Py_CheckInterval"
   "_Py_Ticker"
+  "_Py_TickerAccumulation"
 
 ; From python24_s.lib(compile)
   "PyCode_New"
diff -r -u original/Python/ceval.c work/Python/ceval.c
--- original/Python/ceval.c	2004-11-23 10:06:08.000000000 -0800
+++ work/Python/ceval.c	2004-12-03 19:39:36.000000000 -0800
@@ -373,6 +373,7 @@
 	pendinglast = j;
 
 	_Py_Ticker = 0;
+	_Py_TickerAccumulation = 0;
 	things_to_do = 1; /* Signal main loop */
 	busy = 0;
 	/* XXX End critical section */
@@ -476,6 +477,11 @@
    per thread, now just a pair o' globals */
 int _Py_CheckInterval = 100;
 volatile int _Py_Ticker = 100;
+#ifndef HAVE_LONG_LONG
+volatile long _Py_TickerAccumulation = 0;
+#else
+volatile PY_LONG_LONG _Py_TickerAccumulation = 0;
+#endif
 
 PyObject *
 PyEval_EvalCode(PyCodeObject *co, PyObject *globals, PyObject *locals)
@@ -770,6 +776,7 @@
 		   async I/O handler); see Py_AddPendingCall() and
 		   Py_MakePendingCalls() above. */
 
+                _Py_TickerAccumulation++;
 		if (--_Py_Ticker < 0) {
                         if (*next_instr == SETUP_FINALLY) {
                                 /* Make the last opcode before
diff -r -u original/Python/sysmodule.c work/Python/sysmodule.c
--- original/Python/sysmodule.c	2004-08-12 11:19:17.000000000 -0700
+++ work/Python/sysmodule.c	2004-12-03 19:36:52.000000000 -0800
@@ -442,6 +442,20 @@
 "getcheckinterval() -> current check interval; see setcheckinterval()."
 );
 
+static PyObject *
+sys_gettickeraccumulation(PyObject *self, PyObject *args)
+{
+#ifndef HAVE_LONG_LONG
+	return PyInt_FromLong(_Py_TickerAccumulation);
+#else
+	return PyLong_FromLongLong(_Py_TickerAccumulation);
+#endif
+}
+
+PyDoc_STRVAR(gettickeraccumulation_doc,
+"gettickeraccumulation() -> current count of bytecodes processed by the interpreter."
+);
+
 #ifdef WITH_TSC
 static PyObject *
 sys_settscdump(PyObject *self, PyObject *args)
@@ -763,6 +777,8 @@
 	 setcheckinterval_doc}, 
 	{"getcheckinterval",	sys_getcheckinterval, METH_NOARGS,
 	 getcheckinterval_doc}, 
+	{"gettickeraccumulation", sys_gettickeraccumulation, METH_NOARGS,
+	 gettickeraccumulation_doc}, 
 #ifdef HAVE_DLOPEN
 	{"setdlopenflags", sys_setdlopenflags, METH_VARARGS, 
 	 setdlopenflags_doc},

