[Python-Dev] Speeding up CPython 5-10%

Victor Stinner victor.stinner at gmail.com
Tue Feb 2 04:28:43 EST 2016


Hi,

I'm back from the FOSDEM event in Brussels, it was really cool. I gave
a talk about FAT Python and I got good feedback. But friends told me
that people now have expectations on FAT Python. It looks like people
care about Python performance :-)

FYI the slides of my talk:
https://github.com/haypo/conf/raw/master/2016-FOSDEM/fat_python.pdf
(a video was recorded, I don't know when it will be online)

I took a first look at your patch and, sorry, I'm skeptical about the
design. I have to play with it a little bit more to check whether there
is a better design.

To be clear, FAT Python with your work looks more and more like a
cheap JIT compiler :-) Guards, specialization, optimizing at runtime
after a threshold... all these things come from JIT compilers. I like
the idea of a kind of JIT compiler without having to pay the high cost
of a large dependency like LLVM. I like baby steps in CPython: they are
faster to implement, and feasible in a single release cycle (one minor
Python release, Python 3.6). Integrating a JIT compiler into CPython
already failed with Unladen Swallow :-/

PyPy has a completely different design (and has serious issues with the
Python C API), Pyston is restricted to Python 2.7, Pyjion looks
specific to Windows (CoreCLR), and Numba is specific to numeric
computations (numpy). IMHO none of these projects can easily be merged
into CPython "quickly" (again, in a single Python release cycle). By
the way, Pyjion still looks very young (I heard that they are still
working on compatibility with CPython, not on performance yet).


2016-01-27 19:25 GMT+01:00 Yury Selivanov <yselivanov.ml at gmail.com>:
> tl;dr The summary is that I have a patch that improves CPython performance
> up to 5-10% on macro benchmarks.  Benchmarks results on Macbook Pro/Mac OS
> X, desktop CPU/Linux, server CPU/Linux are available at [1].  There are no
> slowdowns that I could reproduce consistently.

That's really impressive, great job Yury :-) Getting a non-negligible
speedup on large macro benchmarks has become really hard in CPython,
which is already well optimized in all corners. It looks like overall
Python performance still depends heavily on the performance of
dictionary and attribute lookups. Even though that is well known, I
didn't expect up to 10% speedup on *macro* benchmarks.
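
For example, a quick micro-benchmark (just an illustration of lookup
cost, unrelated to Yury's benchmark suite) shows the difference between
resolving a name through LOAD_GLOBAL and through LOAD_FAST:

import timeit

# 'len' is not assigned in the statement, so each call resolves it via
# LOAD_GLOBAL (globals dict lookup, then builtins dict lookup).
global_lookup = timeit.timeit("len(s)", setup="s = 'x' * 10",
                              number=10**6)

# 'f' is assigned in the setup, so it is a local of the timing function
# and each call resolves it via LOAD_FAST (array indexing, no dict).
local_lookup = timeit.timeit("f(s)", setup="s = 'x' * 10; f = len",
                             number=10**6)

print(global_lookup, local_lookup)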


> LOAD_METHOD & CALL_METHOD
> -------------------------
>
> We had a lot of conversations with Victor about his PEP 509, and he sent me
> a link to his amazing compilation of notes about CPython performance [2].
> One optimization that he pointed out to me was LOAD/CALL_METHOD opcodes, an
> idea first originated in PyPy.
>
> There is a patch that implements this optimization, it's tracked here: [3].
> There are some low level details that I explained in the issue, but I'll go
> over the high level design in this email as well.

Your cache is stored directly in code objects. Currently, code objects
are immutable.

Antoine Pitrou's patch adding a LOAD_GLOBAL cache stores the cache on
functions, with an "alias" in each frame object:
http://bugs.python.org/issue10401

Andrea Griffini's patch, which also adds a cache for LOAD_GLOBAL,
stores the cache on code objects:
https://bugs.python.org/issue1616125

I don't know which is the best place to store the cache.
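
One detail which matters for this choice: the same code object can be
shared by several function objects, and so can run against several
global namespaces. A small illustration using only the plain types API
(nothing from the patch):

import types

def f():
    return x

# Two functions sharing f's code object, each bound to its own globals.
g1 = types.FunctionType(f.__code__, {"x": 1})
g2 = types.FunctionType(f.__code__, {"x": 2})
print(g1(), g2())   # -> 1 2

A cache attached to the code object has to cope with that, whereas a
cache attached to the function or the frame does not see this problem.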

I vaguely recall a patch which uses a single unique global cache, but
maybe I'm wrong :-p


> The opcodes we want to optimize are LOAD_GLOBAL, 0 and 3.  Let's look at the
> first one, that loads the 'print' function from builtins.  The opcode knows
> the following bits of information:

I tested your latest patch. It looks like LOAD_GLOBAL never invalidates
the cache on a cache miss (never "deoptimizes" the instruction).
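
Here is the behaviour I would expect, as a rough Python-level sketch:
the per-entry version tags, the cache object and the dict_version()
helper are my own stand-ins in the spirit of PEP 509, not Yury's actual
C code.

# Sketch of a LOAD_GLOBAL inline cache which refreshes its entry on
# every miss instead of keeping a stale value.
def load_global_cached(cache, globals_dict, builtins_dict, name):
    # dict_version() stands for the PEP 509 ma_version tag
    # (hypothetical helper, not a real CPython API).
    if (cache.globals_version == dict_version(globals_dict)
            and cache.builtins_version == dict_version(builtins_dict)):
        return cache.value          # fast path: no dict lookup at all
    # Slow path: do the regular lookup, then refill the cache entry.
    try:
        value = globals_dict[name]
    except KeyError:
        value = builtins_dict[name]
    cache.globals_version = dict_version(globals_dict)
    cache.builtins_version = dict_version(builtins_dict)
    cache.value = value
    return value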

I suggest always invalidating the cache on each cache miss. Not only is
it common to modify global variables, but there is also the issue of
different namespaces being used with the same code object. Examples:

* late global initialization: see for example the _a85chars cache of
base64.a85encode.
* a code object created in a temporary namespace and then always run
in a different global namespace: see for example
collections.namedtuple() (a sketch follows this list). I'm not sure
that it's the best example because it looks like the Python code only
loads builtins, not globals. But it looks like your code keeps a copy
of the version of the global namespace dict.
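
A minimal sketch of that second pattern, assuming nothing about the
patch itself: the same code object is executed against two different
globals dicts, so a cache keyed on only one namespace's version would
be stale for the other.

# The same code object run in two different global namespaces,
# roughly what collections.namedtuple() does with its class template.
code = compile("result = x + 1", "<demo>", "exec")

ns1 = {"x": 1}
ns2 = {"x": 100}
exec(code, ns1)    # 'x' resolved in ns1
exec(code, ns2)    # same code object, different globals dict
print(ns1["result"], ns2["result"])    # -> 2 101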

I tested with a threshold of 1: always optimize all code objects.
Maybe with your default threshold of 1024 runs, the issue with
different namespaces doesn't occur in practice.


> A straightforward way to implement such a cache is simple, but consumes a
> lot of memory, that would be just wasted, since we only need such a cache
> for LOAD_GLOBAL and LOAD_METHOD opcodes. So we have to be creative about the
> cache design.

I'm not sure that it's worth developing complex dynamic logic to only
enable optimizations after a threshold (a design very close to a JIT
compiler). What is the overhead (% of RSS memory) on a concrete
application when all code objects are optimized at startup?
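
To get a ballpark number, the resource module is enough; this only
measures the whole process RSS, not the caches themselves:

import resource   # Unix-only

# Peak resident set size of the current process.
# On Linux ru_maxrss is in kilobytes, on macOS in bytes.
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print("peak RSS:", peak)

Run the same workload once with all code objects optimized and once
with the optimization disabled, and compare the two numbers.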

Maybe we need a global boolean flag to disable the optimization? Or
even a compilation option?

I mean that all these new counters have a cost, and the code may be
even faster without these counters if everything is always optimized,
no?

I'm not sure that the storage for the cache is really efficient. It's
a compact data structure, but it looks "expensive" to access (there is
one level of indirection). I understand that it's compact to reduce
the memory footprint overhead.

I'm not sure that the threshold of 1000 runs is OK for short scripts.
It would be nice to also optimize scripts which only call a function
900 times :-) The classic memory vs CPU trade-off :-)

I'm just thinking aloud :-)

Victor

