|Title:||Vectorcall: a fast calling protocol for CPython|
|Author:||Mark Shannon <mark at hotpy.org>, Jeroen Demeyer <J.Demeyer at UGent.be>|
- New C API and changes to CPython
- Internal CPython changes
- Third-party extension classes using vectorcall
- Performance implications of these changes
- Stable ABI
- Alternative Suggestions
- Reference implementation
This PEP introduces a new C API to optimize calls of objects. It introduces a new "vectorcall" protocol and calling convention. This is based on the "fastcall" convention, which is already used internally by CPython. The new features can be used by any user-defined extension class.
NOTE: This PEP deals only with the Python/C API; it does not affect the Python language or standard library.
The choice of a calling convention impacts the performance and flexibility of code on either side of the call. Often there is tension between performance and flexibility.
The current tp_call calling convention is sufficiently flexible to cover all cases, but its performance is poor. The poor performance is largely a result of having to create intermediate tuples, and possibly intermediate dicts, during the call. This is mitigated in CPython by including special-case code to speed up calls to Python and builtin functions. Unfortunately, this means that other callables such as classes and third-party extension objects are called using the slower, more general tp_call calling convention.
This PEP proposes that the calling convention used internally for Python and builtin functions is generalized and published so that all calls can benefit from better performance. The new proposed calling convention is not fully general, but covers the large majority of calls. It is designed to remove the overhead of temporary object creation and multiple indirections.
Another source of inefficiency in the tp_call convention is that it has one function pointer per class, rather than per object. This is inefficient for calls to classes as several intermediate objects need to be created. For a class cls, at least one intermediate object is created for each call in the sequence type.__call__, cls.__new__, cls.__init__.
Calls are made through a function pointer taking the following parameters:
- PyObject *callable: The called object
- PyObject *const *args: A vector of arguments
- Py_ssize_t nargs: The number of arguments plus the optional flag PY_VECTORCALL_ARGUMENTS_OFFSET (see below)
- PyObject *kwnames: Either NULL or a tuple with the names of the keyword arguments
This is implemented by the function pointer type:

    typedef PyObject *(*vectorcallfunc)(PyObject *callable, PyObject *const *args, Py_ssize_t nargs, PyObject *kwnames);
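The relationship between args, nargs and kwnames can be illustrated with a small model that builds without Python.h. It is only a sketch: plain int values stand in for PyObject pointers, a NULL-terminated array of C strings stands in for the kwnames tuple, and model_find_keyword is a hypothetical helper. It also assumes the layout documented for released CPython, which this text does not spell out: the keyword values are stored in args directly after the positional arguments, in the same order as the names in kwnames.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Model only: `args` holds the positional values followed by the keyword
 * values, and `kwnames` (NULL, or a NULL-terminated array of names) labels
 * the trailing values in order. Returns a pointer to the value passed for
 * `name`, or NULL if that keyword was not supplied.
 */
static const int *
model_find_keyword(const int *args, size_t nargs,
                   const char *const *kwnames, const char *name)
{
    assert(args != NULL && name != NULL);
    if (kwnames == NULL) {
        return NULL;                    /* call without keyword arguments */
    }
    for (size_t i = 0; kwnames[i] != NULL; i++) {
        if (strcmp(kwnames[i], name) == 0) {
            return &args[nargs + i];    /* i-th value after the positionals */
        }
    }
    return NULL;
}
```

In this model, a call equivalent to f(1, 2, x=30, y=40) passes the vector {1, 2, 30, 40} with nargs equal to 2 and the names {"x", "y"}; no tuple or dict has to be built to deliver the arguments.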
The unused slot printfunc tp_print is replaced with tp_vectorcall_offset. It has the type Py_ssize_t. A new tp_flags flag is added, Py_TPFLAGS_HAVE_VECTORCALL, which must be set for any class that uses the vectorcall protocol.
If Py_TPFLAGS_HAVE_VECTORCALL is set, then tp_vectorcall_offset must be a positive integer. It is the offset into the object of the vectorcall function pointer of type vectorcallfunc. This pointer may be NULL, in which case the behavior is the same as if Py_TPFLAGS_HAVE_VECTORCALL was not set.
The tp_print slot is reused as the tp_vectorcall_offset slot to make it easier for external projects to backport the vectorcall protocol to earlier Python versions. In particular, the Cython project has shown interest in doing that (see https://mail.python.org/pipermail/python-dev/2018-June/153927.html).
One additional type flag is specified: Py_TPFLAGS_METHOD_DESCRIPTOR.
Py_TPFLAGS_METHOD_DESCRIPTOR should be set if the callable uses the descriptor protocol to create a bound method-like object. This is used by the interpreter to avoid creating temporary objects when calling methods (see _PyObject_GetMethod and the LOAD_METHOD/CALL_METHOD opcodes).
Concretely, if Py_TPFLAGS_METHOD_DESCRIPTOR is set for type(func), then:
- func.__get__(obj, cls)(*args, **kwds) (with obj not None) must be equivalent to func(obj, *args, **kwds).
- func.__get__(None, cls)(*args, **kwds) must be equivalent to func(*args, **kwds).
There are no restrictions on the object func.__get__(obj, cls). The latter is not required to implement the vectorcall protocol.
The call takes the form (*(vectorcallfunc *)(((char *)o) + offset))(o, args, n, kwnames) where offset is Py_TYPE(o)->tp_vectorcall_offset. That is, the function pointer is read from the object at the given offset and then called. The caller is responsible for creating the kwnames tuple and ensuring that there are no duplicates in it.
n is the number of positional arguments plus possibly the PY_VECTORCALL_ARGUMENTS_OFFSET flag.
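The lookup through the offset can be sketched with stand-in types that build without Python.h. Everything named model_* or adder_* below is hypothetical; int arguments replace PyObject pointers and kwnames is left out for brevity. The point of the sketch is that the type stores only an offset, while the function pointer itself lives in each instance and is loaded from there before the call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct model_object model_object;
typedef int (*model_vectorcallfunc)(model_object *callable,
                                    const int *args, size_t nargs);

/* Stand-in for the type object: it records where instances keep the pointer. */
typedef struct model_type {
    ptrdiff_t vectorcall_offset;   /* plays the role of tp_vectorcall_offset */
} model_type;

/* Stand-in for the common object header. */
struct model_object {
    const model_type *type;
};

/* A callable instance: the header, some state, and its own function pointer. */
typedef struct {
    model_object base;
    int bias;
    model_vectorcallfunc vectorcall;
} model_adder;

static int
adder_call(model_object *callable, const int *args, size_t nargs)
{
    const model_adder *self = (const model_adder *)callable;
    int total = self->bias;
    for (size_t i = 0; i < nargs; i++) {
        total += args[i];
    }
    return total;
}

static const model_type adder_type = { offsetof(model_adder, vectorcall) };

/* Caller side: load the pointer stored at `offset` inside the object, then call it. */
static int
model_call(model_object *o, const int *args, size_t nargs)
{
    ptrdiff_t offset = o->type->vectorcall_offset;
    model_vectorcallfunc func;

    assert(offset > 0);            /* the offset must be a positive integer */
    memcpy(&func, (char *)o + offset, sizeof(func));
    assert(func != NULL);          /* a real caller would fall back to tp_call */
    return func(o, args, nargs);
}
```

Because the pointer is per instance, two objects of the same type may carry different functions, which is what allows each C function object to dispatch directly to its own implementation.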
The flag PY_VECTORCALL_ARGUMENTS_OFFSET should be added to n if the callee is allowed to temporarily change args[-1]. In other words, this can be used if args points to argument 1 in the allocated vector. The callee must restore the value of args[-1] before returning.
Whenever they can do so cheaply (without allocation), callers are encouraged to use PY_VECTORCALL_ARGUMENTS_OFFSET. Doing so will allow callables such as bound methods to make their onward calls cheaply. The bytecode interpreter already allocates space on the stack for the callable, so it can use this trick at no additional cost.
See "Using PY_VECTORCALL_ARGUMENTS_OFFSET in callee" in the references for an example of how PY_VECTORCALL_ARGUMENTS_OFFSET is used by a callee to avoid allocation.
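The following self-contained model sketches how a bound-method-like callee might exploit the flag. It is not the CPython implementation: int values replace PyObject pointers, the MODEL_* macros and function names are hypothetical, and the bit chosen for the flag is an assumption. When the flag is present, the callee borrows args[-1] to prepend self and restores the slot afterwards; otherwise it has to allocate a copy of the vector.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Assumed encoding for this sketch: the flag is the top bit of the count. */
#define MODEL_ARGUMENTS_OFFSET ((size_t)1 << (8 * sizeof(size_t) - 1))
#define MODEL_NARGS(n) ((n) & ~MODEL_ARGUMENTS_OFFSET)

typedef int (*model_func)(const int *args, size_t nargsf);

/* Underlying function: sums every argument, "self" included. */
static int
sum_all(const int *args, size_t nargsf)
{
    size_t n = MODEL_NARGS(nargsf);
    int total = 0;
    for (size_t i = 0; i < n; i++) {
        total += args[i];
    }
    return total;
}

/* Bound-method-like callee: calls func(self, *args). */
static int
bound_call(model_func func, int self, int *args, size_t nargsf)
{
    size_t n = MODEL_NARGS(nargsf);
    assert(func != NULL && args != NULL);

    if (nargsf & MODEL_ARGUMENTS_OFFSET) {
        /* Fast path: the caller lets us overwrite args[-1] temporarily. */
        int saved = args[-1];
        args[-1] = self;
        int result = func(args - 1, n + 1);
        args[-1] = saved;              /* must be restored before returning */
        return result;
    }

    /* Slow path: build a new vector with self in front. */
    int *copy = malloc((n + 1) * sizeof(*copy));
    if (copy == NULL) {
        return -1;                     /* error signalling simplified for the model */
    }
    copy[0] = self;
    memcpy(copy + 1, args, n * sizeof(*copy));
    int result = func(copy, n + 1);
    free(copy);
    return result;
}
```

Note that the onward call passes n + 1 without the flag, because the callee cannot promise that the slot before its own borrowed slot is writable.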
For getting the actual number of arguments from the parameter n, the macro PyVectorcall_NARGS(n) must be used. This allows for future changes or extensions.
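As an illustration of why the macro matters, the sketch below models the encoding with stand-in names so that it builds without Python.h. The position of the flag bit and the use of size_t for the encoded value are assumptions of this sketch; the text above only fixes the relationship nargs & ~PY_VECTORCALL_ARGUMENTS_OFFSET.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: the flag occupies the top bit, leaving the low bits for the count. */
#define MODEL_ARGUMENTS_OFFSET ((size_t)1 << (8 * sizeof(size_t) - 1))

/* Models PyVectorcall_NARGS: strip the flag to recover the argument count. */
static ptrdiff_t
model_nargs(size_t n)
{
    return (ptrdiff_t)(n & ~MODEL_ARGUMENTS_OFFSET);
}

/* Nonzero if the caller permits temporary modification of args[-1]. */
static int
model_may_borrow_slot(size_t n)
{
    return (n & MODEL_ARGUMENTS_OFFSET) != 0;
}
```

A callee that used the raw value as a loop bound would read far past the end of the vector whenever the flag is set, which is why the count must always be obtained through the macro.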
The following functions or macros are added to the C API:
- PyObject *PyObject_Vectorcall(PyObject *obj, PyObject *const *args, Py_ssize_t nargs, PyObject *keywords): Calls obj with the given arguments. Note that nargs may include the flag PY_VECTORCALL_ARGUMENTS_OFFSET. The actual number of positional arguments is given by PyVectorcall_NARGS(nargs). The argument keywords is a tuple of keyword names or NULL. An empty tuple has the same effect as passing NULL. This uses either the vectorcall protocol or tp_call internally; if neither is supported, an exception is raised.
- PyObject *PyVectorcall_Call(PyObject *obj, PyObject *tuple, PyObject *dict): Call the object (which must support vectorcall) with the old *args and **kwargs calling convention. This is mostly meant to be put in the tp_call slot.
- Py_ssize_t PyVectorcall_NARGS(Py_ssize_t nargs): Given a vectorcall nargs argument, return the actual number of arguments. Currently equivalent to nargs & ~PY_VECTORCALL_ARGUMENTS_OFFSET.
Extension types inherit the type flag Py_TPFLAGS_HAVE_VECTORCALL and the value tp_vectorcall_offset from the base class, provided that they implement tp_call the same way as the base class. Additionally, the flag Py_TPFLAGS_METHOD_DESCRIPTOR is inherited if tp_descr_get is implemented the same way as the base class.
Heap types never inherit the vectorcall protocol because that would not be safe (heap types can be changed dynamically). This restriction may be lifted in the future, but that would require special-casing __call__ in type.__setattr__.
The function, builtin_function_or_method, method_descriptor, method, wrapper_descriptor, method-wrapper classes will use the vectorcall protocol (not all of these will be changed in the initial implementation).
For builtin_function_or_method and method_descriptor (which use the PyMethodDef data structure), one could implement a specific vectorcall wrapper for every existing calling convention. Whether or not it is worth doing that remains to be seen.
For a class cls, creating a new instance using cls(xxx) requires multiple calls. At least one intermediate object is created for each call in the sequence type.__call__, cls.__new__, cls.__init__. So it makes a lot of sense to use vectorcall for calling classes. This really means implementing the vectorcall protocol for type. Some of the most commonly used classes will use this protocol, probably range, list, str, and type.
Argument Clinic automatically generates wrapper functions around lower-level callables, providing safe unboxing of primitive types and other safety checks. Argument Clinic could be extended to generate wrapper objects conforming to the new vectorcall protocol. This will allow execution to flow from the caller to the Argument Clinic generated wrapper and thence to the hand-written code with only a single indirection.
To enable call performance on a par with Python functions and built-in functions, third-party callables should include a vectorcallfunc function pointer, set tp_vectorcall_offset to the correct value and add the Py_TPFLAGS_HAVE_VECTORCALL flag. Any class that does this must implement the tp_call function and make sure its behaviour is consistent with the vectorcallfunc function. Setting tp_call to PyVectorcall_Call is sufficient.
This PEP should not have much impact on the performance of existing code (neither in the positive nor the negative sense). It is mainly meant to allow efficient new code to be written, not to make existing code faster.
Nevertheless, this PEP optimizes for METH_FASTCALL functions. Performance of functions using METH_VARARGS will become slightly worse.
PEP 590 is close to what was proposed in bpo-29259. The main difference is that this PEP stores the function pointer in the instance rather than in the class. This makes more sense for implementing functions in C, where every instance corresponds to a different C function. It also allows optimizing type.__call__, which is not possible with bpo-29259.
Both PEP 576 and PEP 580 are designed to enable third-party objects to be both expressive and performant (on a par with CPython objects). The purpose of this PEP is to provide a uniform way to call objects in the CPython ecosystem that is both expressive and as performant as possible.
This PEP is broader in scope than PEP 576 and uses variable rather than fixed offset function-pointers. The underlying calling convention is similar. Because PEP 576 only allows a fixed offset for the function pointer, it would not allow the improvements to any objects with constraints on their layout.
PEP 580 proposes a major change to the PyMethodDef protocol used to define builtin functions. This PEP provides a more general and simpler mechanism in the form of a new calling convention. This PEP also extends the PyMethodDef protocol, but merely to formalise existing conventions.
A longer, six-argument form combining both the vector and optional tuple and dictionary arguments was considered. However, it was found that the code to convert between it and the old tp_call form was overly cumbersome and inefficient. Also, since only 4 arguments are passed in registers on x64 Windows, the two extra arguments would have non-negligible costs.
Removing any special cases and making all calls use the tp_call form was also considered. However, unless a much more efficient way was found to create and destroy tuples, and to a lesser extent dictionaries, then it would be too slow.
Thanks to Victor Stinner for developing the original "fastcall" calling convention internally to CPython. This PEP codifies and extends his work.
- Add tp_fastcall to PyTypeObject: support FASTCALL calling convention for all callable objects, https://bugs.python.org/issue29259
- tp_call/PyObject_Call calling convention, https://docs.python.org/3/c-api/typeobj.html#c.PyTypeObject.tp_call
- Using PY_VECTORCALL_ARGUMENTS_OFFSET in callee, https://github.com/markshannon/cpython/blob/vectorcall-minimal/Objects/classobject.c#L53
- Argument Clinic, https://docs.python.org/3/howto/clinic.html
A minimal implementation can be found at https://github.com/markshannon/cpython/tree/vectorcall-minimal
This document has been placed in the public domain.