Speed up Python by up to 5% ?

Thu Jul 25 11:14:42 EDT 2002

It may be possible to speed up the Python interpreter 3% to 5% with a
trivial modification.

When stripped of conditionals, the main loop of the Python 2.2.1 interpreter
is as follows. (I've added comments to show C code equivalent to the
expansions of the macros.  Actual expansion may be slightly different.)

[begin code]

why = WHY_NOT; err = 0;
w = NULL; x = Py_None; /* Not a reference, just anything non-NULL */
for (;;) {

  if (things_to_do || --tstate->ticker < 0)
    { /* handle periodic things */ }

  opcode = NEXTOP();    // opcode = *next_instr++

  if (HAS_ARG(opcode))  // if (opcode >= HAVE_ARGUMENT) {
    oparg = NEXTARG();  //   next_instr += 2;
                        //   oparg = ((next_instr[-1] << 8) +
                        //      next_instr[-2]) }
  switch (opcode)
  { /* opcode cases */ }

  if (why == WHY_NOT)
    if (err == 0 && x != NULL)
      continue; /* Normal, fast path */

} // end for

[end code]

BTW, many common cases end with "continue", so even the trailing code
following the opcode switch doesn't necessarily contribute that much to
program running time.

In short, this is a very tight loop.  But can't it be improved? The byte
code is littered with SET_LINENO opcodes.  Has anyone considered moving the
test:

if (things_to_do || --tstate->ticker < 0)
    { /* handle periodic things */ } // not usually executed.

out of the main loop an into the case for SET_LINENO? The code might be
executed slightly less regularly, but would that matter?  This would seem to
be a real improvement precisely because the main loop is already so tight.

The only problem I can think of is that there would be some loop that does
not involve any SET_LINENO opcode at all.  Can anyone come up with such an
example?  Is there any other reason for moving the "periodic things" test
out of the main loop? Even if there were places where a loop contains no
SET_LINENO opcode, it might be worthwhile to have the compiler insert an
DO_PERIODIC_THINGS opcode in the bytecode for the loop.

How much time might we expect to save? Let us look at the most favorable
case first.  As it happens, the SET_LINENO case of the main loop takes about
the _least_ amount of time to execute (on average) so the speedup relative
to this case will be the _greatest_.  Here is the code that typically gets
executed:

case SET_LINENO:
  f->f_lineno = oparg;
  if (tstate->c_tracefunc == NULL || tstate->tracing)
    continue;  // typically _is_ executed.
  // typically _not_ executed.

As a rough estimate, let us just count C instructions in the loop through
this case:

  if (things_to_do || --tstate->ticker < 0)
    { /* typically not executed. */ }
  opcode = *next_instr++
  if (opcode >= HAVE_ARGUMENT) {
    next_instr += 2;
    oparg = ((next_instr[-1] << 8) + next_instr[-2])
  }
  switch (opcode)
  {
    case SET_LINENO:
      f->f_lineno = oparg;
      if (tstate->c_tracefunc == NULL || tstate->tracing)
        continue;  // typically _is_ executed.
      // typically not executed.
  }

Depending on how you count, there are 8-10 instruction here, so saving the
line:

    if (things_to_do || --tstate->ticker < 0)

might save up to 10% to 15% of the time through this loop.  This is a very
rough estimate, and the actual value may be less.  Notice that code marked
"typically not executed" has little effect on "amortized" time.

Are typical opcodes much more expensive than the simplest SET_LINENO case?
I wrote a simple program based on dis to count the static occurrences of
bytecodes.  The results of running this program on most of the modules of
Leo are as follows:

 18686 LOAD_FAST
 14276 SET_LINENO
  9635 LOAD_CONST
  9555 POP_TOP
  6721 LOAD_GLOBAL **
  5890 CALL_FUNCTION **
  5657 STORE_FAST
  3957 JUMP_IF_FALSE
  3473 RETURN_VALUE
  2887 JUMP_FORWARD
  2777 COMPARE_OP
  2229 LOAD_ATTR **
  1109 INPLACE_ADD
   887 BUILD_TUPLE
   876 BINARY_SUBSCR
   839 BINARY_ADD
   818 POP_BLOCK
   756 JUMP_IF_TRUE
   730 JUMP_ABSOLUTE
   729 SETUP_LOOP
   283 UNARY_NOT
   ...
total 96454

** denotes "heavy hitter"

Dynamic execution frequencies will not match these static frequencies
exactly, but I see no reason to suppose they will be wildly different.  If
anyone has real dynamic data, I would like to see it :-)

Typical "fast" opcodes are LOAD_FAST and LOAD_CONST.  The code for them is
as follows:

case LOAD_FAST:
 x = GETLOCAL(oparg);            // x = fastlocals[oparg]
 if (x == NULL) {                // if (x == NULL) ...
  /* Typically not executed. */
 }
 Py_INCREF(x);                   // x->ob_refcnt++
 PUSH(x);                        // *stack_pointer++ = x
 if (x != NULL) continue;
 break;  // Typically not executed.

case LOAD_CONST:
  x = GETCONST(oparg); // x = f->f_code->co_consts->ob_item[oparg]
  Py_INCREF(x);        // x->ob_refcnt++
  PUSH(x);             // *stack_pointer++ = x
  continue;

Again we can see that eliminating:

  if (things_to_do || --tstate->ticker < 0)

from the main loop might save up to 10% of the time taken to execute these
instructions.

Some opcodes _will_ take much longer than others to execute.  I have marked
the "heavy hitters" above with **.  Suppose we say that the heavy hitters
will take at least 10 times longer than the other opcodes to execute, so we
shall add 10 times their total count to the total count of 95454 shown
above.  The new total then becomes roughly 300,000 rather than 100,000, so
perhaps by this very rough estimate we might save 3-5% rather than 10-15%.

Still, this might be a real savings. What do you think?

Edward

P.S.  The code needed to execute the "periodic" tasks in the SET_LINENO case
might be simpler than the code in the main loop, because we would execute
the periodic tasks _every time_ the SET_LINENO is seen.

The "do periodic things" code is:

@ Doing this every time through the loop would add too much overhead, so we
do it only every Nth instruction.  We also do it if ``things_to_do'' is set,
i.e. when an asynchronous event needs attention (e.g. a signal handler or
async I/O handler); see Py_AddPendingCall() and Py_MakePendingCalls() above.
@c

tstate->ticker = tstate->interp->checkinterval;
if (things_to_do) {
 if (Py_MakePendingCalls() < 0) {
  why = WHY_EXCEPTION;
  goto on_error;
 }
}

#if !defined(HAVE_SIGNAL_H) || defined(macintosh)
 /* If we have true signals, the signal handler
    will call Py_AddPendingCall() so we don't
    have to call sigcheck().  On the Mac and
    DOS, alas, we have to call it. */
 if (PyErr_CheckSignals()) {
  why = WHY_EXCEPTION;
  goto on_error;
 }
#endif

#ifdef WITH_THREAD
 if (interpreter_lock) {
  /* Give another thread a chance */

  if (PyThreadState_Swap(NULL) != tstate)
   Py_FatalError("ceval: tstate mix-up");
  PyThread_release_lock(interpreter_lock);

  /* Other threads may run now */

  PyThread_acquire_lock(interpreter_lock, 1);
  if (PyThreadState_Swap(tstate) != NULL)
   Py_FatalError("ceval: orphan tstate");
 }
#endif

It's not clear how expensive this code will be when executed.  It depends on
preprocessor options and run-time variables.  Again, dynamic data would be
helpful.

E.K.R.
--------------------------------------------------------------------
Edward K. Ream   email:  edream at tds.net
Leo: Literate Editor with Outlines
Leo: http://personalpages.tds.net/~edream/front.html
--------------------------------------------------------------------