[Python-Dev] Add tp_fastcall to PyTypeObject

Victor Stinner victor.stinner at gmail.com
Fri Jan 13 08:17:48 EST 2017


Hi,

tl;dr Python 3.7 is going to be faster without breaking backward
compatibility: say hello to the new "tp_fastcall" slot!
=> http://bugs.python.org/issue29259


Python 3.6 got a new "FASTCALL" calling convention which avoids
creating a temporary tuple to pass positional arguments and a
temporary dictionary to pass keyword arguments. But callable objects
having a __call__() method implemented in Python don't benefit from
FASTCALL yet.
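
For illustration, the kind of object meant here is an instance of a
class that defines __call__() in Python. A minimal sketch (the
Multiplier class is hypothetical, it is not taken from the patch):

```python
class Multiplier:
    """Callable object whose __call__() is implemented in Python."""

    def __init__(self, factor):
        self.factor = factor

    def __call__(self, value):
        return value * self.factor


triple = Multiplier(3)
# In CPython, calling the instance is dispatched through the tp_call
# slot of its type, which receives a tuple (and optionally a dict).
result = triple(14)
```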

I tried to reuse the tp_call slot with a new flag in tp_flags, but I
hit two major blockers:

* It deeply breaks the backward compatibility of the C API: calling
tp_call directly (with tuple/dict) would crash immediately if the
object uses FASTCALL

* Each "tp_call" function has to be duplicated to get a new
"tp_fastcall" flavor. It wasn't easy to share the function body.


Good news, I found a new design which doesn't have any of these issues!
=> http://bugs.python.org/issue29259

I chose to add a new tp_fastcall field to PyTypeObject and to use a
tiny wrapper calling tp_fastcall as tp_call, to keep backward
compatibility.
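
The wrapper idea can be modelled in pure Python. This is only a toy
sketch, not the actual C patch: the (stack, nargs, kwnames) layout is
assumed to mirror the existing 3.6 FASTCALL convention, and every name
used (add_fastcall, make_tp_call, legacy_call) is hypothetical.

```python
def add_fastcall(stack, nargs, kwnames):
    """Fastcall-style implementation (toy model).

    Positional values are stack[:nargs]; keyword values follow them in
    the same sequence and are named, in order, by the kwnames tuple.
    """
    total = sum(stack[:nargs])
    for name, value in zip(kwnames, stack[nargs:]):
        if name != "extra":
            raise TypeError(f"unexpected keyword argument {name!r}")
        total += value
    return total


def make_tp_call(fastcall):
    """Build the old (tuple, dict) entry point on top of a fastcall
    function, so callers using the old convention keep working."""

    def tp_call(args, kwargs=None):
        if not kwargs:
            # Positional-only call: the items of the tuple are passed
            # through unchanged, nothing has to be converted.
            return fastcall(args, len(args), ())
        # Keyword call: the dict is flattened into extra values plus a
        # tuple of names, which is the more expensive path.
        stack = list(args) + list(kwargs.values())
        return fastcall(stack, len(args), tuple(kwargs))

    return tp_call


legacy_call = make_tp_call(add_fastcall)
positional_result = legacy_call((1, 2, 3))
keyword_result = legacy_call((1, 2), {"extra": 10})
```

In this model the implementation is written once, in the fastcall
flavor, and the old entry point is derived from it rather than
duplicated.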


The goal is to get optimizations "for free" when calling functions.
The best expected speedup on a microbenchmark is around 1.56x faster
(-36%) when calling an object supporting FASTCALL. Example with
property_descr_get() without its "cached args" hack, comparing the
result without fastcall ("py34") to the result with fastcall
("fastcall_wrapper"):

Median +- std dev: [py34] 75.0 ns +- 1.7 ns -> [fastcall_wrapper] 48.2 ns +- 1.5 ns: 1.56x faster (-36%)
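
As a quick check of how the two figures relate, they can be recomputed
from the medians quoted above. The variable names are only
illustrative:

```python
before_ns = 75.0  # median without fastcall ("py34")
after_ns = 48.2   # median with fastcall ("fastcall_wrapper")

# "1.56x faster" is the ratio of the old time to the new time.
speedup = before_ns / after_ns

# "-36%" is the relative change in time per call.
change = (after_ns - before_ns) / before_ns

summary = f"{speedup:.2f}x faster ({change:+.0%})"
print(summary)  # prints: 1.56x faster (-36%)
```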

But please don't expect such a large speedup on macro-benchmarks.


tp_fastcall makes it possible to remove the "cached args" hack, an
old optimization used in various performance-critical parts of the
Python core. This hack causes various kinds of complex bugs in corner
cases, which can lead to a crash in the worst case.


The patch to support tp_fastcall is tiny, but you should expect a long
list of tiny changes to replace tp_call with tp_fastcall in various
types.


Final bonus point: existing code (calling functions) doesn't need to
be modified (nor recompiled) to get the speedup. Even if tp_call is
called directly, fastcall will provide a speedup, but only if it is
called with positional arguments only.

About the tp_call wrapper: keyword arguments require converting a
Python dictionary to a C array, which might be more expensive. I didn't
try to measure the performance, since this case is very rare. Almost
no C code calls functions with keyword arguments, simply because
passing keyword arguments is much more complex and requires too much
C code (and it's not simpler with fastcall, sorry).

Victor

