[Python-Dev] Byte code arguments from two to one byte: did anyone try this?
Jurjen N.E. Bos
j.bos-interpay at xs4all.nl
Mon Jan 31 10:17:51 CET 2011
I tried to find research on this subject but couldn't find any, so
I'll be daring and vulnerable and just put the idea out here to see
what your thoughts are.
I single-stepped a simple loop in Python to see where the efficiency
bottlenecks are.
I was impressed by the optimizations already in there, but I still
dare to suggest an optimization that, by my estimate, might shave
off a few cycles, speeding up Python by about 5%.
The idea is simple: change the byte code argument values from two
bytes to one.
Implications are:
- code changes are relatively simple, see below
- fewer memory reads, which are becoming more and more expensive
- saves three instructions for every opcode with args (i.e. most of them)
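To make the size implication concrete, here is a rough sketch in Python (not CPython's own code) that counts the bytes a short instruction sequence occupies under both layouts. The helper name code_size and the opcode numbers are made up for illustration, and EXTENDED_ARG prefixes are ignored.

```python
def code_size(instructions, arg_bytes):
    """Return the encoded size in bytes of a list of (opcode, arg) pairs.

    arg is None for opcodes without an argument; every other opcode is
    followed by arg_bytes bytes of argument. EXTENDED_ARG is ignored.
    """
    return sum(1 if arg is None else 1 + arg_bytes
               for _opcode, arg in instructions)


# Hypothetical loop body: three opcodes with an argument, one without.
body = [(124, 0), (100, 1), (23, None), (125, 0)]

current = code_size(body, arg_bytes=2)   # three-byte instructions
proposed = code_size(body, arg_bytes=1)  # two-byte instructions
print(current, proposed)
```

Every opcode that carries an argument shrinks by one byte, so the interpreter has less bytecode to read for the same work.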
Code changes are, as far as I could find:
compile.c:
    assemble_emit must produce extended opcodes
    for all cases of more than 8 bits instead of 16
ceval.c:
    NEXTARG and PEEKARG need adjustment
    EXTENDED_ARG needs adjustment
    (this will be a four byte instruction, which is ugly, I agree)
peephole.c:
    GETARG and SETARG need adjustment,
    also GETJUMPTGT and CODESIZE
    the routines tuple_of_constants, fold_binops_on_constants and
    PyCode_Optimize depend on the instruction length, which will be
    2 instead of 3
    (searching for the digit 3 finds all cases, as far as I checked)
    you will probably have to write a macro for codestr[i+3]
    there is a check for code length > 32700, but I think this one
    can stay, perhaps with a few extra checks added.
dis:
    minor adjustments
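The EXTENDED_ARG change can be sketched as a small Python model. This is not the code in compile.c or ceval.c; the function names emit and decode are hypothetical, the opcode number chosen for EXTENDED_ARG is only illustrative, HAVE_ARGUMENT is taken to be 90, and the sketch assumes arguments never exceed 16 bits.

```python
HAVE_ARGUMENT = 90   # assumed threshold: opcodes below it take no argument
EXTENDED_ARG = 144   # illustrative opcode number, assumed for this sketch


def emit(opcode, arg=None):
    """Encode one instruction using one-byte arguments."""
    if arg is None:
        return bytes([opcode])
    if not 0 <= arg <= 0xFFFF:
        raise ValueError("sketch only handles arguments up to 16 bits")
    out = bytearray()
    if arg > 0xFF:
        # High byte travels in an EXTENDED_ARG prefix: four bytes in total.
        out += bytes([EXTENDED_ARG, arg >> 8])
    out += bytes([opcode, arg & 0xFF])
    return bytes(out)


def decode(code):
    """Return (opcode, arg) pairs, folding EXTENDED_ARG into the next arg."""
    result = []
    i = 0
    ext = 0
    while i < len(code):
        opcode = code[i]
        if opcode < HAVE_ARGUMENT:
            result.append((opcode, None))
            i += 1
            continue
        arg = (ext << 8) | code[i + 1]
        i += 2
        if opcode == EXTENDED_ARG:
            ext = arg          # remember the high bits for the next opcode
        else:
            result.append((opcode, arg))
            ext = 0
    return result
```

In this model an argument below 256 costs two bytes, and a larger one costs four, which is the "ugly" four-byte case mentioned above.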
Estimation of speed impact:
about 80% of the instructions seem to have an argument, and I never
saw an argument > 255 while looking at bytecode, so those are probably
not frequent.
The NEXTARG macro expands on my MacBook to:
    mov    -408(%ebp),%edx    (next_instr)
    movzbl 2(%edx),%eax       (*second byte)
    shl    $0x8,%eax          (*shift)
    movzbl 1(%edx),%edx       (first byte)
    add    %edx,%eax          (*combine)
and the starred instructions will vanish.
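The same comparison can be written as a small Python model of the argument fetch. It is only a sketch: the function names are hypothetical, and next_instr is modelled as an index pointing at the first argument byte rather than as a C pointer.

```python
def nextarg_two_bytes(code, next_instr):
    """Current scheme: load two bytes, shift the high one, combine."""
    low = code[next_instr]           # first byte
    high = code[next_instr + 1]      # second byte (extra load)
    arg = (high << 8) + low          # extra shift and extra add
    return arg, next_instr + 2


def nextarg_one_byte(code, next_instr):
    """Proposed scheme: a single byte load yields the argument."""
    return code[next_instr], next_instr + 1


# Opcode 124 is just a placeholder; only the argument bytes matter here.
old_code = bytes([124, 0x2C, 0x01])  # argument 300 stored low byte first
new_code = bytes([124, 44])          # argument 44 fits in one byte

print(nextarg_two_bytes(old_code, 1))
print(nextarg_one_byte(new_code, 1))
```

The second load, the shift and the add in the first function correspond to the three starred instructions.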
The main loop is approximately 40 instructions, so a saving of three
instructions is significant. I don't dare to claim 3/40 = 7.5% savings,
but I think 5% may be realistic.
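The back-of-the-envelope arithmetic can be spelled out as follows. This is only a sketch of the estimate, counting instructions rather than cycles; the function name is hypothetical, and weighting by the 80% figure is an extra step that the message itself does not take.

```python
def estimated_saving(saved, loop_length, fraction_with_arg=1.0):
    """Fraction of main-loop instructions saved per executed opcode.

    saved: instructions removed from the argument fetch
    loop_length: approximate instructions per trip through the main loop
    fraction_with_arg: share of executed opcodes that carry an argument
    """
    return fraction_with_arg * saved / loop_length


upper_bound = estimated_saving(3, 40)         # every opcode has an argument
weighted = estimated_saving(3, 40, 0.80)      # only 80% have an argument

print("upper bound: {:.1%}".format(upper_bound))
print("weighted:    {:.1%}".format(weighted))
```

With these inputs the upper bound is about 7.5% and the weighted figure about 6%, so a guess of 5% sits a little below both.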
Did anyone try this already? If not, I might take up the gauntlet
and try it myself, but I never did this before...
- Jurjen
PS I also saw that some scratch variables, mainly v and x, are
carefully stored back in memory by the compiler at the end of the big
interpreter loop, even though their values aren't used anymore.
A few carefully placed braces might tell the compiler how useless
this is and save another few percent.