[Python-Dev] Speeding up CPython 5-10%
Yury Selivanov
yselivanov.ml at gmail.com
Wed Jan 27 17:58:11 EST 2016
On 2016-01-27 3:46 PM, Glenn Linderman wrote:
> On 1/27/2016 12:37 PM, Yury Selivanov wrote:
>>
>>>
>>> MicroPython also has dictionary lookup caching, but it's a bit
>>> different to your proposal. We do something much simpler: each opcode
>>> that has a cache ability (eg LOAD_GLOBAL, STORE_GLOBAL, LOAD_ATTR,
>>> etc) includes a single byte in the opcode which is an offset-guess
>>> into the dictionary to find the desired element. Eg for LOAD_GLOBAL
>>> we have (pseudo code):
>>>
>>> CASE(LOAD_GLOBAL):
>>> key = DECODE_KEY;
>>> offset_guess = DECODE_BYTE;
>>> if (global_dict[offset_guess].key == key) {
>>> // found the element straight away
>>> } else {
>>> // not found, do a full lookup and save the offset
>>> offset_guess = dict_lookup(global_dict, key);
>>> UPDATE_BYTECODE(offset_guess);
>>> }
>>> PUSH(global_dict[offset_guess].elem);
>>>
>>> We have found that such caching gives a massive performance increase,
>>> on the order of 20%. The issue (for us) is that it increases bytecode
>>> size by a considerable amount, requires writeable bytecode, and can be
>>> non-deterministic in terms of lookup time. Those things are important
>>> in the embedded world, but not so much on the desktop.
>>
>> That's a neat idea! You're right, it does require bytecode to become
>> writeable.
>
> Would it?
>
> Remember "fixup lists"? Maybe they still exist for loading function
> addresses from one DLL into the code of another at load time?
>
> So the equivalent for bytecode requires a static table of
> offset_guess, and the offsets into that table are allocated by the
> byte-code loader at byte-code load time, and the byte-code is "fixed
> up" at load time to use the correct offsets into the offset_guess
> table. It takes one more indirection to find the guess, but if the
> result is a 20% improvement, maybe you'd still get 19%...
Right, in my current patch I have an offset table per code object.
Essentially, this offset table adds 8bits per opcode. It also means
that only first 255 LOAD_GLOBAL/LOAD_METHOD opcodes *per-code-object*
are optimized (because the offset table only can store 8bit offsets),
which is usually enough (I think you need to have more than a 500 lines
of code function to reach that limit).
Yury
More information about the Python-Dev
mailing list