[Python-ideas] Faster PyArg_ParseTupleAndKeywords kwargs

dw+python-ideas at hmmz.org
Fri May 23 21:22:20 CEST 2014


Early on while working on py-lmdb I noticed that a huge proportion of
runtime was being lost to PyArg_ParseTupleAndKeywords, so I
subsequently wrote a specialization for this extension module.

In the current code[0], parse_args() is much faster than
ParseTupleAndKeywords, responsible for a doubling of performance in
several of the library's faster code paths (e.g.
Cursor.put(append=True)). Ever since adding the rewrite I've wanted to
go back and either remove it or at least reduce the amount of custom
code, but it seems there really isn't a better approach to fast argument
parsing using the bare Python C API at the moment.

    [0] https://github.com/dw/py-lmdb/blob/master/lmdb/cpython.c#L833

In the append=True path, parse_args() yields a method that can complete
1.1m insertions/sec on my crappy Core 2 laptop, compared to 592k/sec
using the same method rewritten with PyArg_ParseTupleAndKeywords.

Looking to other 'fast' projects for precedent, and studying Cython's
output in particular, it seems that Cython completely ignores the
standard APIs and expends a huge amount of .text on almost every
imaginable C performance trick to speed up parsing (actually Cython's
output is a sheer marvel of trickery, and worth studying). So it's clear
the standard APIs are somewhat non-ideal, and those concerned with
performance are taking other approaches.


ParseTupleAndKeywords is competitive for positional arguments (1.2m/sec
vs 1.5m/sec for "Cursor.put(k, v)"), but things go south when a kwarg
dict is provided.

The primary goal of parse_args() was to avoid the continuous temporary
allocations and hashing done by PyArg_ParseTupleAndKeywords by way of
PyDict_GetItemString(), which invokes PyString_FromString() internally,
triggering an allocation, strlen(), and memcpy() for each possible
kwarg on every function call.
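
Roughly, the stock path degenerates into something like the following on
every call (a simplified sketch of the pattern described above, not the
actual CPython code; kwargs is the dict passed to the method and kwnames
stands in for the kwlist):

    static const char *kwnames[] = {"key", "value", "append", NULL};
    int i;

    for(i = 0; kwnames[i] != NULL; i++) {
        /* Internally: PyString_FromString() -> malloc + strlen +
           memcpy, hash the fresh string, probe the dict, DECREF --
           repeated for every possible kwarg, supplied or not. */
        PyObject *value = PyDict_GetItemString(kwargs, kwnames[i]);
        if(value != NULL) {
            /* convert and store this argument... */
        }
    }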

The rewrite has been hacked on over time, and honestly I'm not sure
which bits are responsible for the speed improvement and which are
totally redundant. The tricks are (a consolidated sketch follows the
list):

 * Intern keyword arg strings once at startup, avoiding the temporary
   PyString creation and also causing their hash() to be cached across
   calls. This uses an incredibly ugly pair of enum/const char *[]
   static globals.[3]

    [3] https://github.com/dw/py-lmdb/blob/master/lmdb/cpython.c#L79

 * Use a per-function 'static const' array of structures to describe the
   expected set of arguments. Since these arrays are built at compile
   time, they cannot directly reference the runtime-generated interned
   PyStrings, thus the use of an enum.

   A nice side effect of the array's contents being purely small
   integers is that each array element is small and thus quite
   cache-efficient. In the current code array elements are 4 bytes each.

 * Avoid use of variable-length argument lists. I'm not sure if this
   helps at all, but certainly it simplifies the parsing code and makes
   the call sites much more compact.

   Instead of a va_arg list of destination pointers, parsed output is
   written to a per-function structure[1][2], whose member offsets are
   encoded into the above argspec array at compile time.

        [1] https://github.com/dw/py-lmdb/blob/master/lmdb/cpython.c#L1265
        [2] https://github.com/dw/py-lmdb/blob/master/lmdb/cpython.c#L704

   This might hurt the compiler's ability to optimize the placement of
   what were previously small stack variables (e.g. I'm not sure if it
   prevents the compiler making more use of registers). In any case the
   overall result is much faster than before.

And most recently, giving a further 20% boost to append=True:

 * Cache a dict that maps interned kwarg -> argspec array offset,
   allowing the per-call kwarg dict to be iterated, and causing only one
   hash lookup per supplied kwarg. Prior to the cache, presence of
   kwargs would cause one hash lookup per argspec entry (e.g.
   potentially 15 lookups instead of 1 or 2).
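
Pulling the above together, here is a consolidated sketch of how the
pieces fit (Python 2 C API; all names are illustrative stand-ins, not
the actual py-lmdb code, and error handling is abbreviated):

    #include <stddef.h>  /* offsetof */
    #include "Python.h"

    /* Keyword names, interned once at module init time. */
    enum arg_id { ARG_KEY, ARG_VALUE, ARG_APPEND, ARG_COUNT };
    enum arg_type { ARG_OBJ, ARG_BOOL };
    static const char *arg_names[] = {"key", "value", "append"};
    static PyObject *interned[ARG_COUNT];
    static PyObject *name_to_index;  /* interned name -> spec index */

    static int init_interned(void)
    {
        int i;
        if(! (name_to_index = PyDict_New()))
            return -1;
        for(i = 0; i < ARG_COUNT; i++) {
            PyObject *idx = PyInt_FromLong(i);
            interned[i] = PyString_InternFromString(arg_names[i]);
            if(! (idx && interned[i] &&
                  ! PyDict_SetItem(name_to_index, interned[i], idx))) {
                Py_XDECREF(idx);
                return -1;
            }
            Py_DECREF(idx);
        }
        return 0;
    }

    /* Per-function compile-time argspec: all members are small
       integers, so each element packs into 4 bytes and stays
       cache-friendly. */
    struct argspec {
        unsigned char arg;      /* index into interned[] */
        unsigned char type;     /* ARG_OBJ, ARG_BOOL, ... */
        unsigned short offset;  /* offsetof() into the output struct */
    };

    /* Parsed output lands in a plain struct rather than a va_arg
       list of destination pointers. */
    struct put_args {
        PyObject *key;
        PyObject *value;
        int append;
    };

    static const struct argspec put_spec[] = {
        {ARG_KEY,    ARG_OBJ,  offsetof(struct put_args, key)},
        {ARG_VALUE,  ARG_OBJ,  offsetof(struct put_args, value)},
        {ARG_APPEND, ARG_BOOL, offsetof(struct put_args, append)},
    };

With name_to_index in place, the parser iterates the caller's kwargs
dict and pays a single cached-hash lookup per supplied keyword, rather
than one PyDict_GetItemString() per possible keyword.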


It's obvious this approach isn't generally useful, and looking at the
CPython source we can see the interning trick is already known, and
presumably not exposed in the CPython API because the method is quite
ugly. Still, it seems there is room to improve the public API to include
something like this interning trick, and that's what this mail is about.



My initial thought is for a horribly macro-heavy API like:

    PyObject *my_func(PyObject *self, PyObject *args, PyObject *kwargs)
    {
        Py_ssize_t foo;
        const char *some_buf;
        PyObject *list;

        PY_BEGIN_ARGS
            PY_ARG("foo", PY_ARG_SSIZE_T, NULL, PY_ARG_REQUIRED),
            PY_ARG("some_buf", PY_ARG_BUFFER, NULL, PY_ARG_REQUIRED),
            PY_ARG("list", PY_ARG_OBJECT, &PyList_Type, 0)
        PY_END_ARGS

        if(! PY_PARSE_ARGS(args, kwargs, &foo, &some_buf, &list)) {
            return NULL;
        }

        /* do stuff */
    }

Where:

    struct py_arg_info;  /* Opaque */

    struct py_arg_spec {
        const char *name;
        enum { ... } type;
        PyTypeObject *type_obj;
        int options;
    };

    #define PY_BEGIN_ARGS                                           \
        static struct py_arg_info *_py_arg_info;                    \
        if(! _py_arg_info) {                                        \
            static const struct py_arg_spec _py_args[] = {

    #define PY_END_ARGS                                             \
            };                                                      \
            _Py_InitArgInfo(&_py_arg_info, _py_args,                \
                            sizeof _py_args / sizeof _py_args[0]);  \
        }

    #define PY_ARG(name, type, type2, opts) {name, type, type2, opts}

    #define PY_PARSE_ARGS(a, k, ...) \
        _Py_ParseArgsFromInfo(&_py_arg_info, a, k, __VA_ARGS__)


Here, an implementation-internal py_arg_info structure is built up on
first function invocation, producing the cached mapping of argument
keywords to array indices and storing a reference to the py_arg_spec
array, or some version of it that has been internally transformed into a
more useful format.
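
For illustration only, the first-call initialization might look
something like this (a sketch using the invented names from above, with
error handling omitted):

    struct py_arg_info {
        const struct py_arg_spec *specs; /* the static argspec array */
        Py_ssize_t nspecs;
        PyObject **interned;         /* interned keyword names */
        PyObject *name_to_index;     /* interned name -> spec index */
    };

    static void
    _Py_InitArgInfo(struct py_arg_info **out,
                    const struct py_arg_spec *specs, Py_ssize_t n)
    {
        Py_ssize_t i;
        struct py_arg_info *info = PyMem_Malloc(sizeof *info);

        info->specs = specs;
        info->nspecs = n;
        info->interned = PyMem_Malloc(n * sizeof(PyObject *));
        info->name_to_index = PyDict_New();
        for(i = 0; i < n; i++) {
            PyObject *idx = PyInt_FromLong(i);
            info->interned[i] = PyString_InternFromString(specs[i].name);
            PyDict_SetItem(info->name_to_index, info->interned[i], idx);
            Py_DECREF(idx);
        }
        *out = info;  /* later calls see non-NULL and skip all this */
    }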

You may notice this depends on variadic macros, which breaks at least
Visual Studio, so at the very least that part is broken.

The above also doesn't deal with all the cases supported by the existing
PyArg_ routines, such as setting the function name and a custom error
message, or unpacking tuples (is this still even supported in Python 3?).

Another approach might be to use a PyArg_ParseTupleAndKeywords-alike
API, so that something like this would be possible:

    static PyObject *
    my_method(PyObject *self, PyObject *args, PyObject *kwds)
    {
        Py_ssize_t foo;
        const char *some_buf;
        Py_ssize_t some_buf_size;
        PyObject *list;

        static PyArgInfo arg_info;
        static char *keywords[] = {
            "foo", "some_buf", "list", NULL
        };

        if(! PyArg_FastParse(&arg_info, args, kwds, "ns#|O!", keywords,
                             &foo, &some_buf, &some_buf_size,
                             &PyList_Type, &list)) {
            return NULL;
        }

        /* do stuff */
    }


In that case the API is very familiar, and PyArg_FastParse() builds the
cache on first invocation itself, but the supplied va_list is full of
noise that needs to be carefully skipped somehow. The work involved in
doing the skipping might introduce complexity that slows things down all
over again.
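
To make the skipping concrete: even with the keyword cache in place, a
hypothetical PyArg_FastParse() would still need to walk the format
string on every call just to keep the va_list aligned, consuming the
right number of slots per format code. A fragment handling only the
codes from the example above:

    /* Inside PyArg_FastParse(info, args, kwds, format, keywords, ...) */
    va_list va;
    const char *f;

    va_start(va, keywords);
    for(f = format; *f; f++) {
        switch(*f) {
        case '|':                /* optional marker: consumes no slot */
            break;
        case 'n':                /* Py_ssize_t */
            (void) va_arg(va, Py_ssize_t *);
            break;
        case 's':
            (void) va_arg(va, const char **);
            if(f[1] == '#') {    /* s# consumes a second pointer */
                f++;
                (void) va_arg(va, Py_ssize_t *);
            }
            break;
        case 'O':
            if(f[1] == '!') {    /* O! consumes a type object first */
                f++;
                (void) va_arg(va, PyTypeObject *);
            }
            (void) va_arg(va, PyObject **);
            break;
        }
    }
    va_end(va);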

Any thoughts on a better API? Is there a need here? I'm obviously not
the first to notice PyArg_ParseTupleAndKeywords is slow, and so I wonder
how many people have sighed and brushed off the fact that their module
is slower than it could be.


David

