[Python-Dev] Byte code arguments from two to one byte: did anyone try this?

Mon Jan 31 10:17:51 CET 2011

I tried to find any research on this subject, but I couldn't find any,
so I'll be daring and vulnerable and just try it out to see what your  
thoughts
are.
I single stepped a simple loop in Python to see where the efficiency  
bottlenecks are.
I was impressed by the optimizations already in there, but I still  
dare to suggest an optimization that from my estimates might shave  
off a few cycles, speeding up Python about 5%.
The idea is simple: change the byte code argument values from two  
bytes to one.
Implications are:
- code changes are relatively simple, see below
- fewer memory reads, which are becoming more and more expensive
- saves three instructions for every opcode with args (i.e. most of  
them)

Code changes are, as far as I could find:
compile.c:
assemble_emit must produce extended opcodes
     for all cases of more than 8 bits instead of 16

ceval.c:
NEXTARG and PEEKARG need adjustment
EXTENDED_ARG needs adjustment
     (this will be a four byte instruction, which is ugly, I agree)

peephole.c:
GETARG, SETARG, need adjustment
also GETJUMPTGT, CODESIZE
routine tuple_of_constants, fold_binops_on_constants, PyCode_Optimize
     are dependent on instruction length, which will be 2 instead of 3
(search for the digit 3 will find all cases, as far as I checked)
you probably will have to write a macro for codestr[i+3]
there is a check for code length >32700, but I think this one might  
stay,
maybe if a few extra checks are added.

dis:
minor adjustments

Estimation of speed impact:
about 80% of the instructions seem to have an argument, and I never  
saw an  opcode >255 while looking at bytecode, so they are probably  
not frequent.

The NEXTARG macro expands on my Macbook to:

mov    -408(%ebp),%edx        (next_instr)
movzbl 2(%edx),%eax           (*second byte)
shl    $0x8,%eax              (*shift)
movzbl 1(%edx),%edx           (first byte)
add    %edx,%eax              (*combine)

and the starred instructions will vanish.
The main loop is approximately 40 instructions, so a saving of three  
instructions is significant. I don't dare to claim 3/40 = 7.5% savings,
but I think 5% may be realistic.

Did anyone try this already? If not, I might take up the gauntlet
and try it myself, but I never did this before...

- Jurjen

PS I also saw that some scratch variables, mainly v and x, are  
carefull stored back in memory by the compiler and the end of the big  
interpreter loop, while their value isn't used anymore, of course.
A few carefully placed braces might tell the compiler how useless  
this is and
save another few percent.