Undefined behaviour in C [was Re: The Cost of Dynamism]

Sat Mar 26 09:22:51 EDT 2016

On Sat, Mar 26, 2016 at 11:21 PM, Steven D'Aprano <steve at pearwood.info> wrote:
> In plain English, if the programmer had an intention for the code, and it
> was valid C syntax, it's not hard to conclude that the code has some
> meaning. Even if that meaning isn't quite what the programmer expected.
> Compilers are well known for only doing what you tell them to do, not what
> you want them to do. But in the case of C and C++ they don't even do what
> you tell them to do.
>

Does this Python code have meaning?

x = 5
while x < 10:
    print(x)
    ++x

It's a fairly direct translation of perfectly valid C code, and it's
syntactically valid Python. But in Python, ++x is just two unary plus
operators applied to x, so x never changes and the loop prints 5
forever. When the C spec talks about accidentally doing what you
intended, that would be to have the last line here increment x. But
that's never a requirement; compilers/interpreters are not
mindreaders.
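
To make that concrete, here's a small sketch of how Python actually
reads the expression. The names is_double_unary_plus and seen are
just for illustration, and the x += 1 loop is my guess at what the
programmer wanted, not something the interpreter could have known.

```python
import ast

x = 5
++x  # two unary plus operators; the result is computed and thrown away
print(x)  # x is unchanged

# Inspect the parse tree to confirm how Python reads the expression.
outer = ast.parse("++x", mode="eval").body
is_double_unary_plus = (
    isinstance(outer, ast.UnaryOp)
    and isinstance(outer.op, ast.UAdd)
    and isinstance(outer.operand, ast.UnaryOp)
    and isinstance(outer.operand.op, ast.UAdd)
)
print(is_double_unary_plus)

# What was presumably intended: an explicit augmented assignment.
seen = []
while x < 10:
    seen.append(x)
    x += 1
print(seen)
```

The code means exactly what the language says it means, which is not
the same as what the C-trained reader expects.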

The main reason signed integer overflow in C is undefined is that
the C int sits somewhere between "fixed size two's complement signed
integer" and "integer with plenty of room". The standard only sets a
minimum width, so a C compiler is generally free to use a larger
integer than you're expecting, in which case the overflow you had in
mind never happens. That's (part of[1]) why overflow of signed
integers is undefined - it'd be too costly to emulate a smaller
integer. So tell me... what happens in CPython if you incref an object
more times than the native integer will permit? Are you bothered by
this possibility, or do you simply assume that nobody will ever do
that? Does C's definition of undefined behaviour mean that this code
can be optimized away, thus achieving nothing?

typedef struct _object {
    _PyObject_HEAD_EXTRA
    Py_ssize_t ob_refcnt;
    struct _typeobject *ob_type;
} PyObject;

#define Py_INCREF(op) (                         \
    _Py_INC_REFTOTAL  _Py_REF_DEBUG_COMMA       \
    ((PyObject *)(op))->ob_refcnt++)

_Py_INC_REFTOTAL and _Py_REF_DEBUG_COMMA both expand to nothing in
non-debug builds. The meat of this is simply a ++ increment of a
Py_ssize_t integer, which is defined in pyport.h as a signed integer.
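
You can poke at this from Python itself. The following is a rough
sketch that assumes CPython; the variable names are mine. It uses the
struct module's "n" format code (native ssize_t) to find the width on
the current build, and sys.getrefcount to watch the count move when a
second reference is created.

```python
import struct
import sys

# Width of the platform's ssize_t, which Py_ssize_t is modelled on.
ssize_bytes = struct.calcsize("n")
ssize_bits = ssize_bytes * 8

# Largest value a signed integer of that width can hold.
largest = 2 ** (ssize_bits - 1) - 1
print(ssize_bits, largest, sys.maxsize)

# Creating another reference bumps the refcount (CPython-specific).
obj = object()
before = sys.getrefcount(obj)
alias = obj
after = sys.getrefcount(obj)
print(before, after)
```

The exact numbers depend on the platform and the build, which is
rather the point.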

Of course this won't be optimized away, though. The only part that's
undefined is "what exactly happens if you overflow an integer?". And
you shouldn't be doing that at all; if you do, it's a bug, one way or
another. The compiler cannot be expected to magically know what you
intended to happen, so it's allowed to assume that this isn't what
your code meant to do. If you care about capping refcounts, you need
to do that yourself, somehow - don't depend on the compiler, because
you can't even know exactly how wide that refcount is.
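
As a sketch of what "do that yourself" could look like, here's the
check modelled in Python rather than C. This is not what CPython's
Py_INCREF does; saturating_incref and REFCNT_MAX are hypothetical
names, and I'm assuming that sticking at the maximum is the behaviour
you'd want. The important part is that the test happens before the
increment, so the overflow never occurs.

```python
import sys

# sys.maxsize is the largest value a Py_ssize_t can hold on this build.
REFCNT_MAX = sys.maxsize


def saturating_incref(refcnt):
    """Hypothetical helper: increment, but stick at the maximum."""
    if refcnt >= REFCNT_MAX:
        return REFCNT_MAX  # already at the cap; do not go past it
    return refcnt + 1


print(saturating_incref(1))
print(saturating_incref(REFCNT_MAX) == REFCNT_MAX)
```

In C the same shape applies: compare against the limit first, then
increment, rather than incrementing and hoping the result tells you
something.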

ChrisA

[1] The other part being that C didn't want to mandate two's
complement, although I'm pretty sure that could be changed today
without breaking any modern architectures.