Undefined behaviour in C [was Re: The Cost of Dynamism]

Sat Mar 26 08:21:53 EDT 2016

On Sat, 26 Mar 2016 01:59 pm, Paul Rubin wrote:

> Steven D'Aprano <steve at pearwood.info> writes:
>> Culturally, C compiler writers have a preference for using undefined
>> behaviour to allow optimizations, even if it means changing the semantics
>> of your code.
> 
> If your code has UB then by definition it has no semantics to change.
> Code with UB has no meaning.

Ah, a language lawyer, huh? :-P

By the rules of the C standard, you're right, but those rules make use of a
rather specialised definition of "no meaning" or "meaningless". I'm using
the ordinary English sense. For example, would you consider that this
isolated C code is "meaningless"?

int i = n + 1;

I'm not talking about type errors where n is not an int. We can assume that
n is also an int. I bet that you know exactly what that line of code means.
But according to the standard, it's "meaningless" whenever n happens to be
INT_MAX, since the addition then overflows, and signed int overflow is
Undefined Behaviour.
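
To make the overflow case concrete, here is a rough sketch of an increment
that never invokes UB. The helper name checked_increment and its
return-a-flag convention are my own invention for illustration, not anything
from the standard or the FAQ:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: stores n + 1 in *out and returns true, or
   returns false and leaves *out untouched when n + 1 would overflow. */
bool checked_increment(int n, int *out)
{
    assert(out != NULL);
    if (n == INT_MAX) {
        return false;   /* n + 1 would overflow: refuse rather than invoke UB */
    }
    *out = n + 1;       /* well defined for every other int value */
    return true;
}
```

The test has to happen before the addition. Checking the result afterwards
(say, "if (n + 1 < n)") relies on the very overflow that is undefined, so the
compiler is entitled to assume it never happens and drop the check.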

Even the C FAQ (as quoted by John Regehr) implies that code which is defined
as "meaningless" may have meaning in the ordinary English sense:

    [quote]
    Anything at all can happen; the Standard imposes no requirements.
    The program may fail to compile, or it may execute incorrectly
    (either crashing or silently generating incorrect results), or it
    may fortuitously do EXACTLY WHAT THE PROGRAMMER INTENDED.

[Emphasis added.]

http://blog.regehr.org/archives/213

If the code is "meaningless", how can we say that it does what the
programmer intended?

In plain English, if the programmer had an intention for the code, and it
was valid C syntax, it's not hard to conclude that the code has some
meaning, even if that meaning isn't quite what the programmer expected.
Compilers are well known for only doing what you tell them to do, not what
you want them to do. But in the case of C and C++ they don't even do what
you tell them to do.

When I talk about changing the semantics of the code you write, I'm using a
plain English sense of "meaning". Start with a simple-minded, non-
optimizing C compiler -- what Raymond Chen refers to as a "classical
compiler". For example:

#include <stdbool.h>

int table[4];
bool exists_in_table(int v)
{
    /* Deliberate bug: i <= 4 reads table[4], one past the end. */
    for (int i = 0; i <= 4; i++) {
        if (table[i] == v) return true;
    }
    return false;
}

There's an out-of-bounds error there, but as Chen puts it, a classical
compiler would mindlessly generate code that reads past the end of the
array. A bug, but a predictable one: you can reason about it, and the
effect will be dependent on whatever arbitrary value happens to be in that
memory location. A better compiler would generate an error and refuse to
compile code for it. Either way, in plain English, the meaning is obvious:

* Create an array of four ints, naming it "table".

* Declare a function named "exists_in_table", which takes an int "v" as
argument and returns a bool.

* This function iterates over i = 0 to 4 inclusive, returning true if the
i-th item of table equals the given v, and false if none of those items
equals the given v.

I don't believe for a second that you can't read that code well enough to
infer the intended meaning of it. Even I can read C well enough to do that.
Yet according to the C standard, that perfectly understandable code snippet
is deemed to be gibberish, and instead of returning true or false, the
compiler is permitted to erase your hard disk, or turn off your life-
support, if it so chooses. And as Raymond Chen describes, a post-classical
compiler will probably optimize that function to one which always returns
true.
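
The optimizer's reasoning, roughly: reading table[4] is undefined, so it may
assume the loop never gets that far, and the only way to leave the loop
earlier is to return true. For comparison, here is a sketch of the version
that stays inside the array and so keeps its plain-English meaning. The
TABLE_LEN macro and the check_exists_in_table function are my own additions
for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

int table[4];

/* Element count derived from the array itself, so the loop bound
   cannot drift out of step with the declaration. */
#define TABLE_LEN (sizeof table / sizeof table[0])

bool exists_in_table(int v)
{
    /* i < TABLE_LEN visits indexes 0..3 only: no out-of-bounds read. */
    for (size_t i = 0; i < TABLE_LEN; i++) {
        if (table[i] == v) return true;
    }
    return false;
}

/* Hypothetical self-check: fills the table and probes it. */
void check_exists_in_table(void)
{
    table[0] = 10; table[1] = 20; table[2] = 30; table[3] = 40;
    assert(exists_in_table(30));    /* present */
    assert(!exists_in_table(99));   /* absent: must return false */
}
```

With the bound fixed there is no undefined behaviour left for the compiler to
exploit, so the "return false" path has to survive optimization.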

As far as I know, there is no other language apart from C and C++ that takes
such a cavalier approach.

I cannot emphasise enough that the treatment of "undefined behaviour" is
intentional by the C standards committee. Given the absolutely catastrophic
effect it has had on the reliability, safety and security of code written
in C, in my opinion the C standard borders on professional negligence.
Programming in C becomes a battle to defeat the compiler and force it to do
what you tell it to do, all because the C standard was written by a bunch
of people whose number one priority was being able to make their benchmarks
look good.

Imagine a bridge builder who discovers a tiny, technical ambiguity or error
in the blueprints for a bridge. On one page, the documentation states that
there should be four rivets per metre in the supporting beams, but on
another page, it is described as five rivets per metre. What should the
builder do?

- ask for clarification and get the blueprints and documentation corrected?

- play it safe and use five rivets?

- declare that the entire set of blueprints is therefore meaningless, and so
he is free to optimize the bridge and reduce costs by using steel of a
cheaper grade, half the thickness, and one rivet per metre?

When the bridge collapses under the load of normal traffic, killing hundreds
of people, what comfort should we take from the fact that the builder was
able to optimize it so that it was half the weight, a quarter of the cost,
and finished ahead of schedule, compared to a bridge that would have
actually done the job it was designed for?

-- 
Steven