[Python-Dev] PEP 580/590 discussion

Sat Apr 27 07:32:54 EDT 2019

Hi Petr,

On 24/04/2019 11:24 pm, Petr Viktorin wrote:
> So, I spent another day pondering the PEPs.
> 
> I love PEP 590's simplicity and PEP 580's extensibility. As I hinted 
> before, I hope they can be combined, and I believe we can achieve 
> that by having PEP 590's (o+offset) point not just to a function pointer, 
> but to a {function pointer; flags} struct with flags defined for two 
> optimizations:
> - "Method-like", i.e. compatible with LOAD_METHOD/CALL_METHOD.
> - "Argument offsetting request", allowing PEP 590's 
> PY_VECTORCALL_ARGUMENTS_OFFSET optimization.

A big problem with adding another field to the structure is that it 
prevents classes from implementing vectorcall.
A 30% reduction in the time to create ranges, small lists and sets and 
to call type(x) is easily worth a single tp_flag, IMO.

As an aside, there are currently over 10 spare flags. As long as we don't 
consume more than one a year, we have over a decade to make tp_flags a 
uint64_t. It already consumes 64 bits on any 64-bit machine, due to the 
struct layout.
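
For anyone who wants to check the layout claim, here is a rough sketch 
using ctypes. The two-field struct is only a stand-in invented for 
illustration (an unsigned long flags word followed directly by a pointer, 
as tp_flags is followed by tp_doc); it is not the real PyTypeObject.

```python
import ctypes


class FlagsThenPointer(ctypes.Structure):
    # Hypothetical stand-in: an unsigned long flags word followed by a pointer.
    _fields_ = [
        ("tp_flags", ctypes.c_ulong),
        ("tp_doc", ctypes.c_char_p),
    ]


pointer_size = ctypes.sizeof(ctypes.c_void_p)
flags_size = ctypes.sizeof(ctypes.c_ulong)

# Offset of the field after tp_flags: the space the flags slot really occupies.
next_field_offset = FlagsThenPointer.tp_doc.offset
padding = next_field_offset - flags_size

print("pointer size:      ", pointer_size)
print("unsigned long size:", flags_size)
print("flags slot size:   ", next_field_offset)
print("padding after flags:", padding)
```

On a platform where unsigned long is narrower than a pointer, the 
difference shows up as padding; either way the slot is pointer-sized.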

As I've said before, PEP 590 is universal and capable of supporting an 
implementation of PEP 580 on top of it. Therefore, adding any flags or 
fields from PEP 580 to PEP 590 will not increase its capability.
Since any extra fields will require at least as many memory accesses as 
before, it will not improve performance and, by restricting layout, may 
decrease it.

> 
> This would mean one basic call signature (today's METH_FASTCALL | 
> METH_KEYWORDS), with individual optimizations available if both the 
> caller and callee support them.
> 

That would prevent the code having access to the callable object. That 
access is a fundamental part of both PEP 580 and PEP 590 and the key 
motivating factor for both.
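
To make the point concrete, here is a small Python model of the two 
signatures. All the names (Adder, bias, unpack, vectorcall_style, 
fastcall_style) are invented for illustration, and this is a sketch of the 
idea rather than the C API. The vectorcall-style function is handed the 
callable itself, so it can reach per-instance data; the fastcall-style 
function only sees whatever "self" it was given.

```python
class Adder:
    """Hypothetical callable object carrying per-instance data."""

    def __init__(self, bias):
        self.bias = bias


def unpack(args, nargs, kwnames):
    """Split the flat argument array into positionals and keywords."""
    positional = tuple(args[:nargs])
    keywords = dict(zip(kwnames, args[nargs:]))
    return positional, keywords


def vectorcall_style(callable_obj, args, nargs, kwnames):
    # The callable is the first parameter, so its own data is reachable.
    positional, keywords = unpack(args, nargs, kwnames)
    return callable_obj.bias + sum(positional) + sum(keywords.values())


def fastcall_style(self_obj, args, nargs, kwnames):
    # Only "self" arrives here; the callable object itself is not passed.
    positional, keywords = unpack(args, nargs, kwnames)
    return sum(positional) + sum(keywords.values())


adder = Adder(bias=100)
flat_args = [1, 2, 3]  # two positionals, then the value for keyword "extra"
with_callable = vectorcall_style(adder, flat_args, 2, ("extra",))
without_callable = fastcall_style(None, flat_args, 2, ("extra",))
print(with_callable, without_callable)
```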

> 
> 
> In case you want to know my thoughts or details, let me indulge in some 
> detailed comparisons and commentary that led to this.
> I also give a more detailed proposal below.
> Keep in mind I wrote this before I distilled it to the paragraph above, 
> and though the distillation is written as a diff to PEP 590, I still 
> think of this as merging both PEPs.
> 
> 
> PEP 580 tries hard to work with existing call conventions (like METH_O, 
> METH_VARARGS), making them fast.
> PEP 590 just defines a new convention. Basically, any callable that 
> wants performance improvements must switch to METH_VECTORCALL (fastcall).
> I believe PEP 590's approach is OK. To stay as performant as possible, C 
> extension authors will need to adapt their code regularly. If they 
> don't, no harm -- the code will still work as before, and will still be 
> about as fast as it was before.

As I see it, authors of C extensions have five options with PEP 590.
Option 4, do nothing, is the recommended option :)

1. Use the PyMethodDef protocol; it will work exactly the same as 
before. It's already fairly quick in most cases.
2. Use Cython and let Cython take care of handling the vectorcall interface.
3. Use Argument Clinic, and let Argument Clinic take care of handling 
the vectorcall interface.
4. Do nothing. This is the same as 1-3 above depending on what you were 
already doing.
5. Implement the vectorcall call directly. This might be a bit quicker 
than the above, but probably not enough to be worth it, unless you are 
implementing numpy or something like that.

> In exchange for this, Python (and Cython, etc.) can focus on optimizing 
> one calling convention, rather than a variety, each with its own 
> advantages and drawbacks.
> 
> Extending PEP 580 to support a new calling convention will involve 
> defining a new CCALL_* constant, and adding to existing dispatch code.
> Extending PEP 590 to support a new calling convention will most likely 
> require a new type flag, and either changing the vectorcall semantics or 
> adding a new pointer.
> To be a bit more concrete, I think of possible extensions to PEP 590 as 
> things like:
> - Accepting a kwarg dict directly, without copying the items to 
> tuple/array (as in PEP 580's CCALL_VARARGS|CCALL_KEYWORDS)
> - Prepending more than one positional argument, or appending positional 
> arguments
> - When an optimization like LOAD_METHOD/CALL_METHOD turns out to no 
> longer be relevant, removing it to simplify/speed up code.
> I expect we'll later find out that something along these lines might 
> improve performance. PEP 590 would make it hard to experiment.
> 
> I mentally split PEP 590 into two pieces: formalizing fastcall, plus one 
> major "extension" -- making bound methods fast.

Not just bound methods, any callable that adds an extra argument before 
dispatching to another callable. This includes builtin-methods, classes 
and a few others.
Setting the Py_TPFLAGS_METHOD_DESCRIPTOR flag states the behaviour of 
the object when used as a descriptor. It is up to the implementation to 
use that information however it likes.
If LOAD_METHOD/CALL_METHOD gets replaced, then the new implementation 
can still use this information.
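
A quick illustration in plain Python, using an invented Greeter class. 
Looking a method up on an instance goes through the function's __get__ and 
builds a bound method, whereas a caller that knows the function behaves as 
a method descriptor can skip that object and pass the instance as the 
first argument. As far as I understand it, that is the shortcut 
LOAD_METHOD/CALL_METHOD takes; the sketch only shows that the two routes 
are equivalent.

```python
class Greeter:
    """Hypothetical class used only for this illustration."""

    def greet(self, name):
        return f"hello {name}"


g = Greeter()

# The plain function stored on the class; it acts as a method descriptor.
func = type(g).__dict__["greet"]

# Ordinary attribute lookup: __get__ builds a temporary bound method.
bound = func.__get__(g, Greeter)
via_bound = bound("world")

# The shortcut: no bound method, the instance is just the first argument.
via_direct = func(g, "world")

print(via_bound, "|", via_direct)
```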

> When seen this way, this "extension" is quite heavy: it adds an 
> additional type flag, Py_TPFLAGS_METHOD_DESCRIPTOR, and uses a bit in 
> the "Py_ssize_t nargs" argument as additional flag. Both type flags and 
> nargs bits are very limited resources. If I was sure vectorcall is the 
> final best implementation we'll have, I'd go and approve it – but I 
> think we still need room for experimentation, in the form of more such 
> extensions.
> PEP 580, with its collection of per-instance data and flags, is 
> definitely more extensible. What I don't like about it is that it has 
> the extensions built-in; mandatory for all callers/callees.
> 
> PEP 580 adds a common data struct to callable instances. Currently these 
> are all data bound methods want to use (cc_flags, cc_func, cc_parent, 
> cr_self). Various flags are consulted in order to deliver the needed 
> info to the underlying function.
> PEP 590 lets the callable object store data it needs independently. It 
> provides a clever mechanism for pre-allocating space for bound methods' 
> prepended "self" argument, so data can be provided cheaply, though it's 
> still done by the callable itself.
> Callables that would need to e.g. prepend more than one argument won't 
> be able to use this mechanism, but y'all convinced me that is not worth 
> optimizing for.
> 
> PEP 580's goal seems to be that making a callable behave like a Python 
> function/method is just a matter of the right set of flags. Jeroen 
> called this "complexity in the protocol".
> PEP 590, on the other hand, leaves much to individual callable types. 
> This is "complexity in the users of the protocol".
> I now don't see a problem with PEP 590's approach. Not all users will 
> need the complexity. We need to give CPython and Cython the tools to 
> make implementing "def"-like functions possible (and fast), but if other 
> extensions need to match the behavior of Python functions, they should 
> just use Cython. Emulating Python functions is a special-enough use case 
> that it doesn't justify complicating the protocol, and the same goes for 
> implementing Python's built-in functions (with all their historical 
> baggage).
> 
> 
> 
> My more full proposal for a compromise between PEP 580 and 590 would go 
> something like below.
> 
> The type flag (Py_TPFLAGS_HAVE_VECTORCALL/Py_TPFLAGS_HAVE_CCALL) and 
> offset (tp_vectorcall_offset/tp_ccalloffset; in tp_print's place) stay.
> 
> The offset identifies a per-instance structure with two fields:
> - Function pointer (with the vectorcall signature)
> - Flags
> Storing any other per-instance data (like PEP 580's cr_self/cc_parent) 
> is the responsibility of each callable type.
> 
> Two flags are defined initially:
> 1. "Method-like" (like Py_TPFLAGS_METHOD_DESCRIPTOR in PEP 580, or 
> non-NULL cr_self in PEP 580). Having the flag here instead of a type 
> flag will prevent tp_call-only callables from taking advantage of 
> LOAD_METHOD/CALL_METHOD optimisation, but I think that's OK.
> 
> 2. Request to reserve space for one argument before the args array, as 
> in PEP 590's argument offsetting. If the flag is missing, nargs may not 
> include PY_VECTORCALL_ARGUMENTS_OFFSET. A mechanism incompatible with 
> offsetting may use the bit for another purpose.
> 
> Both flags may be simply ignored by the caller (or not be set by the 
> callee in the first place), reverting to a more straightforward (but 
> less performant) code path. This should also be the case for any flags 
> added in the future.
> Note how without these flags, the protocol (and its documentation) will 
> be extremely simple.
> This mechanism would work with my examples of possible future extensions:
> - "kwarg dict": A flag would enable the `kwnames` argument to be a dict 
> instead of a tuple.
> - prepending/appending several positional arguments: The callable's 
> request for how much space to allocate would be stored right after the 
> {func; flags} struct. As in argument offsetting, a bit in nargs would indicate 
> that the request was honored. (If this was made incompatible with 
> one-arg offsetting, it could reuse the bit.)
> - removing an optimization: CPython would simply stop using an 
> optimization (but not remove the flag). Extensions could continue to 
> use the optimization between themselves.

This seems a lot more complex than the caller setting a bit to tell the 
callee whether it has allocated extra space.
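
For comparison, the caller-sets-a-bit scheme can be modelled in a few lines 
of Python. This is only a sketch under stated assumptions: the flag is 
taken to be the top bit of a 64-bit nargs, the argument array is a list 
plus a start index, and every name here (ARGUMENTS_OFFSET, nargs_of, 
plain_function, bound_method_call) is made up for the model rather than 
taken from the C API.

```python
# Model of the flag: the top bit of nargs (a 64-bit size_t is assumed here).
ARGUMENTS_OFFSET = 1 << 63


def nargs_of(nargsf):
    """Strip the flag bit to recover the real positional count."""
    return nargsf & ~ARGUMENTS_OFFSET


def plain_function(buf, start, nargsf):
    """A callee that simply returns the positional arguments it was given."""
    n = nargs_of(nargsf)
    return tuple(buf[start:start + n])


def bound_method_call(self_obj, func, buf, start, nargsf):
    """Prepend self_obj and forward the call to func."""
    n = nargs_of(nargsf)
    if nargsf & ARGUMENTS_OFFSET:
        # The caller reserved buf[start - 1]: borrow it, call, then restore.
        saved = buf[start - 1]
        buf[start - 1] = self_obj
        try:
            return func(buf, start - 1, n + 1)
        finally:
            buf[start - 1] = saved
    # No spare slot was promised: fall back to copying into a fresh buffer.
    copy = [self_obj] + buf[start:start + n]
    return func(copy, 0, n + 1)


buf = [None, "a", "b"]  # slot 0 is the space the caller reserved
fast = bound_method_call("self", plain_function, buf, 1, 2 | ARGUMENTS_OFFSET)
slow = bound_method_call("self", plain_function, ["a", "b"], 0, 2)
print(fast, slow, buf)
```

The callee needs one bit test to choose between the in-place path and the 
copying path, and the caller's buffer is left as it was found.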

> 
> As in PEP 590, any class that uses this mechanism shall not be usable as 
> a base class. This will simplify implementation and tests, but hopefully 
> the limitation will be removed in the future. (Maybe even in the initial 
> implementation.)
> 
> The METH_VECTORCALL (aka CCALL_FASTCALL|CCALL_KEYWORDS) calling 
> convention is added to the public API. The other calling conventions 
> (PEP 580's CCALL_O, CCALL_NOARGS, CCALL_VARARGS, CCALL_KEYWORDS, 
> CCALL_FASTCALL, CCALL_DEFARG) as well as argument type checking 
> (CCALL_OBJCLASS) and self slicing (CCALL_SELFARG) are left up to the 
> callable.
> 
> No equivalent of PEP 580's restrictions on the __name__ attribute. In my 
> opinion, the PyEval_GetFuncName function should just be deprecated in 
> favor of getting the __name__ attribute and checking if it's a string. 
> It would be possible to add a public helper that returns a proper 
> reference, but that doesn't seem worth it. Either way, I consider this 
> out of scope of this PEP.
> 
> No equivalent of PEP 580's PyCCall_GenericGetParent and 
> PyCCall_GenericGetQualname either -- again, if needed, they should be 
> retrieved as normal attributes. As I see it, the operation doesn't need 
> to be particularly fast.
> 
> No equivalent of PEP 580's PyCCall_Call, and no support for dict in 
> PyCCall_FastCall's kwds argument. To be fast, extensions should avoid 
> passing kwargs in a dict. Let's see how far that takes us. (FWIW, this 
> also avoids subtle issues with dict mutability.)
> 
> Profiling stays as in PEP 580: only exact function types generate the 
> events.
> 
> As in PEP 580, PyCFunction_GetFlags and PyCFunction_GET_FLAGS are 
> deprecated.
> 
> As in PEP 580, nothing is added to the stable ABI.
> 
> 
> Does that sound reasonable?