Speed up Python by up to 5%?

Michael Hudson mwh at python.net
Thu Jul 25 13:12:19 EDT 2002


"Edward K. Ream" <edream at tds.net> writes:

> It may be possible to speed up the Python interpreter 3% to 5% with a
> trivial modification.

Maybe.

[snip]
> In short, this is a very tight loop.  But can't it be improved? The byte
> code is littered with SET_LINENO opcodes.  Has anyone considered moving the
> test:
> 
> if (things_to_do || --tstate->ticker < 0)
>     { /* handle periodic things */ } // not usually executed.
> 
> out of the main loop and into the case for SET_LINENO?

I don't think so, because a more seriously considered modification has
been removing SET_LINENO entirely (it's *only* used for tracing, these
days).  There's a patch on SF about this.

> The only problem I can think of is that there would be some loop that does
> not involve any SET_LINENO opcode at all.  Can anyone come up with such an
> example?

Anything run under "python -O"...

> Is there any other reason for moving the "periodic things" test
> out of the main loop? Even if there were places where a loop contains no
> SET_LINENO opcode, it might be worthwhile to have the compiler insert an
> DO_PERIODIC_THINGS opcode in the bytecode for the loop.

This might be a reasonable strategy anyway.  It would require
complicating the compiler to ensure one gets into every loop, etc.,
but that should be easier once Jeremy's ast-branch lands.

> How much time might we expect to save?

Oh no, you're not going to sucker anyone with that one <wink>.  Except
maybe yourself <double-wink>.

> Let us look at the most favorable case first.  As it happens, the
> SET_LINENO case of the main loop takes about the _least_ amount of
> time to execute (on average) so the speedup relative to this case
> will be the _greatest_.  Here is the code that typically gets
> executed:
> 
> case SET_LINENO:
>   f->f_lineno = oparg;
>   if (tstate->c_tracefunc == NULL || tstate->tracing)
>     continue;  // typically _is_ executed.
>   // typically _not_ executed.
> 
> As a rough estimate, let us just count C instructions in the loop through
> this case:
> 
>   if (things_to_do || --tstate->ticker < 0)
>     { /* typically not executed. */ }
>   opcode = *next_instr++
>   if (opcode >= HAVE_ARGUMENT) {
>     next_instr += 2;
>     oparg = ((next_instr[-1] << 8) + next_instr[-2])
>   }
>   switch (opcode)
>   {
>     case SET_LINENO:
>       f->f_lineno = oparg;
>       if (tstate->c_tracefunc == NULL || tstate->tracing)
>         continue;  // typically _is_ executed.
>       // typically not executed.
>   }
> 
> Depending on how you count, there are 8-10 instruction here, 

I make it rather a lot more than that, esp on RISC architectures...

> so saving the
> line:
> 
>     if (things_to_do || --tstate->ticker < 0)
> 
> might save up to 10% to 15% of the time through this loop.  This is a very
> rough estimate, and the actual value may be less.  Notice that code marked
> "typically not executed" has little effect on "amortized" time.
> 
> Are typical opcodes much more expensive than the simplest SET_LINENO case?
> I wrote a simple program based on dis to count the static occurrences of
> bytecodes.  

You should look at Jeremy's work counting instructions for individual
bytecodes:

http://www.zope.org/Members/jeremy/CurrentAndFutureProjects/PerformanceMeasurements

Also, there are the DXPAIRS and related macros you can build Python
with.  For one thing, the static frequency of bytecodes is not (in
general) the same as their dynamic frequency.

[...]

> Dynamic execution frequencies will not match these static frequencies
> exactly, but I see no reason to suppose they will be wildly different.  If
> anyone has real dynamic data, I would like to see it :-)

Ah.  As Tim once said, this editor doesn't go backwards :)

You can make real data yourself.  grep for DYNAMIC_EXECUTION_PROFILE.

> Typical "fast" opcodes are LOAD_FAST and LOAD_CONST.  The code for them is
> as follows:
> 
> case LOAD_FAST:
>  x = GETLOCAL(oparg);            // x = fastlocals[oparg]
>  if (x == NULL) {                // if (x == NULL) ...
>   /* Typically not executed. */
>  }
>  Py_INCREF(x);                   // x->ob_refcnt++
>  PUSH(x);                        // *stack_pointer++ = x
>  if (x != NULL) continue;
>  break;  // Typically not executed.
> 
> case LOAD_CONST:
>   x = GETCONST(oparg); // x = f->f_code->co_consts->ob_item[oparg]
>   Py_INCREF(x);        // x->ob_refcnt++
>   PUSH(x);             // *stack_pointer++ = x
>   continue;
> 
> Again we can see that eliminating:
> 
>   if (things_to_do || --tstate->ticker < 0)
> 
> from the main loop might save up to 10% of the time taken to execute these
> instructions.
> 
> Some opcodes _will_ take much longer than others to execute.  I have marked
> the "heavy hitters" above with **.  Suppose we say that the heavy hitters
> will take at least 10 times longer than the other opcodes to execute, so we
> shall add 10 times their total count to the total count of 95454 shown
> above.  The new total then becomes roughly 300,000 rather than 100,000, so
> perhaps by this very rough estimate we might save 3-5% rather than 10-15%.
> 
> Still, this might be a real savings. What do you think?

I think you should get some real data.

> P.S.  The code needed to execute the "periodic" tasks in the SET_LINENO case
> might be simpler than the code in the main loop, because we would execute
> the periodic tasks _every time_ the SET_LINENO is seen.

Uhh, what about sys.setcheckinterval?

[snip]
> Again, dynamic data would be helpful.

No kidding!  Don't want to discourage you, but finding performance
enhancements for Python is not all that easy any more -- most of the
low-hanging fruit isn't that tasty :-/

Cheers,
M.

-- 
  Academic politics is the most vicious and bitter form of politics,
  because the stakes are so low.                      -- Wallace Sayre
