[Python-Dev] Speeding up CPython 5-10%

Wed Jan 27 17:58:11 EST 2016

On 2016-01-27 3:46 PM, Glenn Linderman wrote:
> On 1/27/2016 12:37 PM, Yury Selivanov wrote:
>>
>>>
>>> MicroPython also has dictionary lookup caching, but it's a bit
>>> different to your proposal.  We do something much simpler: each opcode
>>> that has a cache ability (eg LOAD_GLOBAL, STORE_GLOBAL, LOAD_ATTR,
>>> etc) includes a single byte in the opcode which is an offset-guess
>>> into the dictionary to find the desired element.  Eg for LOAD_GLOBAL
>>> we have (pseudo code):
>>>
>>> CASE(LOAD_GLOBAL):
>>> key = DECODE_KEY;
>>> offset_guess = DECODE_BYTE;
>>> if (global_dict[offset_guess].key == key) {
>>>      // found the element straight away
>>> } else {
>>>      // not found, do a full lookup and save the offset
>>>      offset_guess = dict_lookup(global_dict, key);
>>>      UPDATE_BYTECODE(offset_guess);
>>> }
>>> PUSH(global_dict[offset_guess].elem);
>>>
>>> We have found that such caching gives a massive performance increase,
>>> on the order of 20%.  The issue (for us) is that it increases bytecode
>>> size by a considerable amount, requires writeable bytecode, and can be
>>> non-deterministic in terms of lookup time.  Those things are important
>>> in the embedded world, but not so much on the desktop.
>>
>> That's a neat idea!  You're right, it does require bytecode to become 
>> writeable.
>
> Would it?
>
> Remember "fixup lists"?  Maybe they still exist for loading function 
> addresses from one DLL into the code of another at load time?
>
> So the equivalent for bytecode requires a static table of 
> offset_guess, and the offsets into that table are allocated by the 
> byte-code loader at byte-code load time, and the byte-code is "fixed 
> up" at load time to use the correct offsets into the offset_guess 
> table.  It takes one more indirection to find the guess, but if the 
> result is a 20% improvement, maybe you'd still get 19%...

Right, in my current patch I have an offset table per code object. 
Essentially, this offset table adds 8bits per opcode.  It also means 
that only first 255 LOAD_GLOBAL/LOAD_METHOD opcodes *per-code-object* 
are optimized (because the offset table only can store 8bit offsets), 
which is usually enough (I think you need to have more than a 500 lines 
of code function to reach that limit).

Yury