[Python-Dev] Opcode cache in ceval loop

Mon Feb 1 16:21:37 EST 2016

Hi Damien,

On 2016-02-01 3:59 PM, Damien George wrote:
> Hi Yury,
>
> That's great news about the speed improvements with the dict offset cache!
>
>> The cache struct is defined in code.h [2], and is 32 bytes long. When a
>> code object becomes hot, it gets a cache offset table allocated for it
>> (+1 byte for each opcode) + an array of cache structs.
> Ok, so each opcode has a 1-byte cache that sits separately to the
> actual bytecode.  But a lot of opcodes don't use it so that leads to
> some wasted memory, correct?

Each code object has a list of opcodes and their arguments
(bytes object == unsigned char array).

"Hot" code objects have an offset table (unsigned chars), and
a cache entries array (hope your email client will display
the following correctly):

    opcodes          offset       cache entries
                     table

     OPCODE            0            cache for 1st LOAD_ATTR
     ARG1              0            cache for 1st LOAD_GLOBAL
     ARG2              0            cache for 2nd LOAD_ATTR
     OPCODE            0            cache for 1st LOAD_METHOD
     LOAD_ATTR         1            ...
     ARG1              0
     ARG2              0
     OPCODE            0
     LOAD_GLOBAL       2
     ARG1              0
     ARG2              0
     LOAD_ATTR         3
     ARG1              0
     ARG2              0
     ...              ...
     LOAD_METHOD       4
     ...              ...
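
To put rough numbers on the memory question, here is a
back-of-the-envelope sketch in Python.  It assumes the layout
in the diagram (one offset-table byte per bytecode byte) and
the 32-byte cache struct quoted above; the function name is
made up for illustration and is not part of the patch:

```python
CACHE_ENTRY_SIZE = 32  # bytes per cache struct, as quoted above


def cache_overhead_bytes(bytecode_len, cached_opcode_count):
    """Extra memory a hot code object pays for its caches.

    Assumes one offset-table byte per bytecode byte, plus one
    cache struct per cached opcode.
    """
    offset_table = bytecode_len
    cache_entries = CACHE_ENTRY_SIZE * cached_opcode_count
    return offset_table + cache_entries


# 14 bytes of bytecode containing 3 cached opcodes
overhead = cache_overhead_bytes(14, 3)
```

So the offset table does carry zeros for every opcode and
argument byte that has no cache, but only hot code objects
pay for it.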

When, say, a LOAD_ATTR opcode executes, it first checks if the
code object has a non-NULL cache-entries table.

If it does, that LOAD_ATTR then uses the offset table (indexing
with its `INSTR_OFFSET()`) to find its position in
cache-entries.
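
The lookup can be modelled in a few lines of Python.  This is
only a sketch of the idea, not the C code from the patch: the
function name is hypothetical, and it assumes that a zero in
the offset table means "no cache" and that non-zero values are
1-based indexes into cache-entries, as the diagram suggests:

```python
def find_cache_entry(offset_table, cache_entries, instr_offset):
    """Return the cache entry for the opcode at instr_offset, or None."""
    if offset_table is None or cache_entries is None:
        return None  # code object is not hot: nothing was allocated
    slot = offset_table[instr_offset]
    if slot == 0:
        return None  # this opcode (or argument byte) has no cache
    return cache_entries[slot - 1]  # assumption: slots are 1-based


# First 14 bytes of the layout from the diagram:
# OPCODE ARG1 ARG2 OPCODE LOAD_ATTR ARG1 ARG2 OPCODE
# LOAD_GLOBAL ARG1 ARG2 LOAD_ATTR ARG1 ARG2
offset_table = bytes([0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 3, 0, 0])
cache_entries = ["1st LOAD_ATTR", "1st LOAD_GLOBAL", "2nd LOAD_ATTR"]

entry = find_cache_entry(offset_table, cache_entries, 8)
```

Here the opcode at bytecode offset 8 is the LOAD_GLOBAL, so
`entry` is its cache slot.  No opcode counting is needed at run
time; the bytecode offset alone is enough.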

>
> But then how do you index the cache, do you keep a count of the
> current opcode number?  If I remember correctly, CPython has some
> opcodes taking 1 byte, and some taking 3 bytes, so the offset into the
> bytecode cannot be easily mapped to a bytecode number.

First, when a code object is created, it has neither
an offset table nor cache entries (both are set to NULL).

Each code object has a new field to count how many times
it was called.  Each time a code object is called with
PyEval_EvalFrameEx, that field is incremented.

Once a code object is called more than 1024 times we:

1. Allocate memory for its offset table.

2. Iterate through its opcodes and count how many
LOAD_ATTR, LOAD_METHOD and LOAD_GLOBAL opcodes it has.

3. As part of (2), initialize the offset table with the
correct mapping.  Some opcodes will have a non-zero
entry in the offset table, some won't.  Opcode args
will always have zeros in the offset table.

4. Allocate the cache-entries table, with one entry
per opcode counted in (2).
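
The steps above can be sketched in Python roughly as follows.
This is a model for illustration, not the C implementation:
the `CodeObject` class, the function names and the opcode
numbers are placeholders, and it assumes opcodes at or above
`HAVE_ARGUMENT` take a 2-byte argument (3 bytes in total)
while the rest take 1 byte:

```python
# Placeholder opcode numbers, for illustration only.
HAVE_ARGUMENT = 90   # opcodes >= this are followed by two argument bytes
POP_TOP = 1
LOAD_ATTR = 106
LOAD_GLOBAL = 116
LOAD_METHOD = 160
CACHED_OPCODES = {LOAD_ATTR, LOAD_GLOBAL, LOAD_METHOD}
HOT_THRESHOLD = 1024


class CodeObject:
    def __init__(self, bytecode):
        self.bytecode = bytecode
        self.call_count = 0        # the new "how many calls" field
        self.offset_table = None   # NULL until the code object is hot
        self.cache_entries = None  # NULL until the code object is hot


def build_cache_tables(code):
    # Step 1: one zero-initialized byte per bytecode byte.
    table = bytearray(len(code.bytecode))
    count = 0
    i = 0
    while i < len(code.bytecode):
        op = code.bytecode[i]
        if op in CACHED_OPCODES:
            # Steps 2 and 3: count the opcode and record its 1-based slot.
            # A one-byte slot cannot exceed 255; this sketch does not
            # handle code objects with more cached opcodes than that.
            count += 1
            table[i] = count
        # Skip the argument bytes so they keep their zero entries.
        i += 3 if op >= HAVE_ARGUMENT else 1
    code.offset_table = bytes(table)
    # Step 4: one (initially empty) cache entry per counted opcode.
    code.cache_entries = [None] * count


def on_call(code):
    """Model of the bookkeeping done each time the code object is called."""
    code.call_count += 1
    if code.call_count > HOT_THRESHOLD and code.cache_entries is None:
        build_cache_tables(code)


code = CodeObject(bytes([
    POP_TOP,
    LOAD_ATTR, 0, 0,
    POP_TOP,
    LOAD_GLOBAL, 0, 0,
    LOAD_ATTR, 1, 0,
    LOAD_METHOD, 2, 0,
]))
for _ in range(HOT_THRESHOLD):
    on_call(code)
still_cold = code.cache_entries is None   # exactly 1024 calls: not hot yet
on_call(code)                             # call 1025 allocates the tables
```

After the last call, `code.offset_table` holds a non-zero slot
only at the offsets of the four cached opcodes, and
`code.cache_entries` has four entries.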

Yury