[Python-Dev] New calling convention to avoid temporary tuples when calling functions

Victor Stinner victor.stinner at gmail.com
Mon Aug 8 18:25:36 EDT 2016


Hi,

tl;dr I found a way to make CPython 3.6 faster and I validated that
there is no performance regression. I'm requesting approval from core
developers to start pushing changes.

In 2014, during a lunch at PyCon, Larry Hastings told me that he would
like to get rid of the temporary tuples used to call functions in
Python. In Python, positional arguments are passed to C functions as a
tuple: "PyObject *args". Larry wrote Argument Clinic, which gives more
control over how C functions are called. But I guess that Larry didn't
have time to finish his implementation, since he never published a patch.
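
To make the cost of that tuple concrete, here is a minimal sketch of
the classic convention; "demo_add" is a hypothetical function used
only for illustration:

    #include <Python.h>

    /* Classic METH_VARARGS convention: the caller must pack the
       positional arguments into a temporary tuple. */
    static PyObject *
    demo_add(PyObject *self, PyObject *args)
    {
        PyObject *x, *y;
        /* Unpack the temporary tuple into C variables. */
        if (!PyArg_ParseTuple(args, "OO", &x, &y)) {
            return NULL;
        }
        return PyNumber_Add(x, y);
    }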

While trying to optimize CPython 3.6, I wrote a proof-of-concept
patch, and the results were promising:
https://bugs.python.org/issue26814#msg264003
https://bugs.python.org/issue26814#msg266359

C functions get a C array "PyObject **args, int nargs". Getting the
nth argument becomes "arg = args[n];" at the C level. This format is
not new: it is already used internally in Python/ceval.c. A Python
function call made from a Python function already avoids a temporary
tuple in most cases: we pass the stack of the first function as the
list of arguments to the second function. My patch generalizes this
idea to C functions. It works in all directions (C=>Python, Python=>C,
C=>C, etc.).
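
Here is a minimal sketch of what a function using this convention
could look like, based on the "PyObject **args, int nargs" signature
above; "demo_add_fast" is again a hypothetical function:

    #include <Python.h>

    /* FASTCALL-style convention: positional arguments arrive as a C
       array (in practice, a slice of the caller's stack), so no
       temporary tuple is allocated. */
    static PyObject *
    demo_add_fast(PyObject *self, PyObject **args, int nargs)
    {
        if (nargs != 2) {
            PyErr_SetString(PyExc_TypeError,
                            "expected exactly 2 arguments");
            return NULL;
        }
        /* Getting the nth argument is a plain array access: args[n]. */
        return PyNumber_Add(args[0], args[1]);
    }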

With my full patch, many function calls become not only faster than in
Python 3.5, but even faster than in Python 2.7! For multiple reasons
(not interesting here), the tested functions are slower in Python 3.4
than in Python 2.7. Python 3.5 is better than Python 3.4, but still
slower than Python 2.7 in a few cases. With my "FASTCALL" patch, all
tested function calls become as fast as, or faster than, Python 2.7!

But when I ran the CPython benchmark suite, I found some major
performance regressions. In fact, it took me 3 months to understand
that I wasn't running the benchmarks correctly and that most
benchmarks in the CPython benchmark suite are very unstable. I wrote
articles explaining how benchmarks should be run to get stable
results, and I patched all benchmarks to use my new perf module, which
runs each benchmark in multiple processes and computes the average to
make the results more stable.

In the end, my minimal FASTCALL patch (issue #27128) doesn't show any
major performance regression if you run the benchmarks "correctly" :-)
https://bugs.python.org/issue27128#msg272197

Most benchmark results are not significant; 14 benchmarks are faster,
and only 4 are slower.

According to benchmarks of the "full" FASTCALL patch, the slowdowns
are temporary and should quickly turn into speedups (with further
changes).

My question now is: can I push fastcall-2.patch of issue #27128? This
patch only adds the infrastructure needed to start working on more
useful optimizations; more patches will come, and I expect more
exciting benchmark results.

For an overview of the initial FASTCALL patch, see my first message on
the issue:
https://bugs.python.org/issue27128#msg266422

--

Note: my full FASTCALL patch changes the C API; this is out of scope
for this first simple FASTCALL patch. I will open a separate
discussion to decide whether it's worth it and, if so, how it should
be done.

Victor
