Speed up Python by up to 5%?
Michael Hudson
mwh at python.net
Thu Jul 25 13:12:19 EDT 2002
"Edward K. Ream" <edream at tds.net> writes:
> It may be possible to speed up the Python interpreter 3% to 5% with a
> trivial modification.
Maybe.
[snip]
> In short, this is a very tight loop. But can't it be improved? The byte
> code is littered with SET_LINENO opcodes. Has anyone considered moving the
> test:
>
> if (things_to_do || --tstate->ticker < 0)
> { /* handle periodic things */ } // not usually executed.
>
> out of the main loop and into the case for SET_LINENO?
I don't think so, because a more seriously considered modification has
been removing SET_LINENO entirely (it's *only* used for tracing, these
days). There's a patch on SourceForge about this.
> The only problem I can think of is that there would be some loop that does
> not involve any SET_LINENO opcode at all. Can anyone come up with such an
> example?
Anything run under "python -O"...
> Is there any other reason for moving the "periodic things" test
> out of the main loop? Even if there were places where a loop contains no
> SET_LINENO opcode, it might be worthwhile to have the compiler insert an
> DO_PERIODIC_THINGS opcode in the bytecode for the loop.
This might be a reasonable strategy anyway. It would require
complicating the compiler to ensure one gets inserted into every loop,
but that should be easier when Jeremy's ast-branch lands.
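To make the DO_PERIODIC_THINGS idea concrete, here is a toy dispatch loop in Python (all names here -- the OP_* constants, run() -- are hypothetical illustrations, not CPython's actual ceval loop): the hot opcodes carry no ticker test at all, and only the compiler-inserted periodic opcode does that work.

```python
# Toy dispatch loop illustrating the proposal: the per-iteration
# "periodic things" test is gone from the hot path; a dedicated
# opcode, inserted once per loop by the compiler, does that work.
OP_LOAD, OP_ADD, OP_PERIODIC, OP_HALT = range(4)

def run(code, data):
    stack = []
    pc = 0
    ticker = 0                      # only touched by OP_PERIODIC
    while True:
        op = code[pc]
        pc += 1
        if op == OP_LOAD:           # hot opcodes: no ticker check
            stack.append(data[code[pc]])
            pc += 1
        elif op == OP_ADD:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == OP_PERIODIC:     # stands in for "handle periodic things"
            ticker += 1
        elif op == OP_HALT:
            return stack.pop(), ticker

# (LOAD 0) (LOAD 1) ADD PERIODIC HALT
result, ticks = run(
    [OP_LOAD, 0, OP_LOAD, 1, OP_ADD, OP_PERIODIC, OP_HALT], [2, 3])
```

The point of the sketch is only that the common-case opcodes dispatch with one fewer test per iteration; the cost of the periodic work moves to wherever the compiler plants the new opcode.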
> How much time might we expect to save?
Oh no, you're not going to sucker anyone with that one <wink>. Except
maybe yourself <double-wink>.
> Let us look at the most favorable case first. As it happens, the
> SET_LINENO case of the main loop takes about the _least_ amount of
> time to execute (on average) so the speedup relative to this case
> will be the _greatest_. Here is the code that typically gets
> executed:
>
> case SET_LINENO:
> f->f_lineno = oparg;
> if (tstate->c_tracefunc == NULL || tstate->tracing)
> continue; // typically _is_ executed.
> // typically _not_ executed.
>
> As a rough estimate, let us just count C instructions in the loop through
> this case:
>
> if (things_to_do || --tstate->ticker < 0)
> { /* typically not executed. */ }
> opcode = *next_instr++
> if (opcode >= HAVE_ARGUMENT) {
> next_instr += 2;
> oparg = ((next_instr[-1] << 8) + next_instr[-2])
> }
> switch (opcode)
> {
> case SET_LINENO:
> f->f_lineno = oparg;
> if (tstate->c_tracefunc == NULL || tstate->tracing)
> continue; // typically _is_ executed.
> // typically not executed.
> }
>
> Depending on how you count, there are 8-10 instructions here,
I make it rather a lot more than that, especially on RISC
architectures...
> so saving the
> line:
>
> if (things_to_do || --tstate->ticker < 0)
>
> might save up to 10% to 15% of the time through this loop. This is a very
> rough estimate, and the actual value may be less. Notice that code marked
> "typically not executed" has little effect on "amortized" time.
>
> Are typical opcodes much more expensive than the simplest SET_LINENO case?
> I wrote a simple program based on dis to count the static occurrences of
> bytecodes.
You should look at Jeremy's work counting instructions for individual
bytecodes:
http://www.zope.org/Members/jeremy/CurrentAndFutureProjects/PerformanceMeasurements
Also, there are the DXPAIRS and related macros you can build Python with.
For one thing, static frequency of bytecodes != (in general) dynamic
frequency.
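The kind of static counter Edward describes is easy to sketch with the dis module (a minimal sketch counting opcodes in one compiled source string, rather than across a whole library as he presumably did):

```python
import dis
from collections import Counter

def static_opcode_counts(source):
    """Count static occurrences of each opcode in compiled source."""
    counts = Counter()
    # Walk the top-level code object plus any nested ones, which
    # live in co_consts (function and class bodies).
    todo = [compile(source, "<string>", "exec")]
    while todo:
        code = todo.pop()
        for instr in dis.get_instructions(code):
            counts[instr.opname] += 1
        todo.extend(c for c in code.co_consts if hasattr(c, "co_code"))
    return counts

counts = static_opcode_counts("def f(x):\n    return x + 1\n")
```

Run over a big enough body of code, this gives the static frequency table; it says nothing about how often each opcode actually *executes*, which is the distinction being made above.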
[...]
> Dynamic execution frequencies will not match these static frequencies
> exactly, but I see no reason to suppose they will be wildly different. If
> anyone has real dynamic data, I would like to see it :-)
Ah. As Tim once said, this editor doesn't go backwards :)
You can make real data yourself. grep for DYNAMIC_EXECUTION_PROFILE.
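(In modern CPython, 3.7 and later, you can also get rough dynamic counts from pure Python without a special build, via per-opcode tracing -- a sketch of an alternative to the DYNAMIC_EXECUTION_PROFILE build, not what Michael is referring to:)

```python
import dis
import sys
from collections import Counter

counts = Counter()

def tracer(frame, event, arg):
    if event == "call":
        frame.f_trace_opcodes = True   # request per-opcode events (3.7+)
    elif event == "opcode":
        # f_lasti is the offset of the instruction about to execute.
        op = frame.f_code.co_code[frame.f_lasti]
        counts[dis.opname[op]] += 1
    return tracer

def work():
    total = 0
    for i in range(100):
        total += i
    return total

sys.settrace(tracer)
result = work()
sys.settrace(None)
```

Tracing itself is slow, so this distorts absolute timings badly, but the relative dynamic frequencies are exactly the data being asked for.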
> Typical "fast" opcodes are LOAD_FAST and LOAD_CONST. The code for them is
> as follows:
>
> case LOAD_FAST:
> x = GETLOCAL(oparg); // x = fastlocals[oparg]
> if (x == NULL) { // if (x == NULL) ...
> /* Typically not executed. */
> }
> Py_INCREF(x); // x->ob_refcnt++
> PUSH(x); // *stack_pointer++ = x
> if (x != NULL) continue;
> break; // Typically not executed.
>
> case LOAD_CONST:
> x = GETCONST(oparg); // x = f->f_code->co_consts->ob_item[oparg]
> Py_INCREF(x); // x->ob_refcnt++
> PUSH(x); // *stack_pointer++ = x
> continue;
>
> Again we can see that eliminating:
>
> if (things_to_do || --tstate->ticker < 0)
>
> from the main loop might save up to 10% of the time taken to execute these
> instructions.
>
> Some opcodes _will_ take much longer than others to execute. I have marked
> the "heavy hitters" above with **. Suppose we say that the heavy hitters
> will take at least 10 times longer than the other opcodes to execute, so we
> shall add 10 times their total count to the total count of 95454 shown
> above. The new total then becomes roughly 300,000 rather than 100,000, so
> perhaps by this very rough estimate we might save 3-5% rather than 10-15%.
>
> Still, this might be a real savings. What do you think?
I think you should get some real data.
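(For what it's worth, the arithmetic of the estimate above can be made explicit. A sketch using Edward's rough numbers -- the 10x weight is his guess, and the heavy-hitter total is inferred from his "roughly 300,000", not measured:)

```python
# Edward's back-of-envelope: ~95,454 static opcodes total, heavy
# hitters weighted 10x (adding ~200,000 to reach his ~300,000),
# and one C "instruction" saved out of ~10 per cheap opcode.
cheap = 95_454                 # total static opcode count
heavy_extra = 200_000          # extra weight for the expensive opcodes
weighted_total = cheap + heavy_extra

saved_per_cheap = 1 / 10       # one instruction out of roughly ten
savings = (cheap * saved_per_cheap) / weighted_total
print(f"{savings:.1%}")        # lands in the claimed 3-5% range
```

Which is just to say the 3-5% figure follows mechanically from the guessed weights -- the guesses are the part that needs real data.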
> P.S. The code needed to execute the "periodic" tasks in the SET_LINENO case
> might be simpler than the code in the main loop, because we would execute
> the periodic tasks _every time_ the SET_LINENO is seen.
Uhh, what about sys.setcheckinterval?
[snip]
> Again, dynamic data would be helpful.
No kidding! I don't want to discourage you, but finding performance
enhancements for Python is not all that easy any more -- most of the
low-hanging fruit isn't that tasty :-/
Cheers,
M.
--
Academic politics is the most vicious and bitter form of politics,
because the stakes are so low. -- Wallace Sayre