Here's a solution to undefined behavior: let's define all undefined behavior to be the platform's behavior! Sounds attractive huh? Well, one thing's for sure: it brings many mixed feelings very quickly. In this short video excerpt, you can see this idea being obvious to one senior developer, while getting “laughed at” by another, all in less than a minute.
In this article, I would like to give serious consideration to this idea and see where it takes us. For starters, I have two main goals: (1) to illuminate what this idea would entail, and (2) to explain why some principles from programming languages that are considered almost sacred may not apply to languages like C.
Then, we'll make a leap towards a vision for compiler transformations I've had for the longest time: decoupled compiler transformation APIs. This idea has been in a state of hibernation for some years now, but it seems to be gaining some popularity. Better yet, it may be a practical solution to undefined behavior!
So, grab a seat, and get ready for all the kind hate comments from all around the Internet.
Before getting to the meat of the argument, it would be good if we all had a good idea of what undefined behavior is. There's a lot of content on this topic online, but unfortunately it's hard to find content that sticks to the facts and doesn't turn into an ideological statement for or against undefined behavior.
That said, here are some resources that are almost clean. First, here's a video by John Regehr in which he does a decent job of stating what's true and what isn't. I would also recommend another article I've written on undefined behavior in which I discuss both cases where undefined behavior helps optimizations and cases where it doesn't.
Assuming a good understanding of the matter, let me briefly discuss two core facts of undefined behavior.
For better or worse, according to the C standard, if a program gets into a state that is exposed to undefined behavior, then it's a wrong program; and it's the programmer's fault. This is similar to when you try to compile a program that has a syntax error, like if you write int a = @, or a type error, like: int a = "test";.
Given that, you may be wondering why the compiler doesn't tell us we have an error. This is more complicated than it seems, but in short, the compiler can rarely be certain (statically) that your program has undefined behavior. In practice, programs get to undefined behavior for some inputs, but not others. In other words, your code, which is all the compiler has, is usually not enough to determine with certainty that the program will get to undefined behavior.
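To make this concrete, here is a minimal sketch (my own illustration, not code from any of the resources above) of a program whose undefined behavior depends on its input:

    #include <stdio.h>

    int main(int argc, char **argv) {
        int x = 42;
        int *p = NULL;
        if (argc > 1)
            p = &x;
        /* This dereference is a NULL pointer dereference, i.e., undefined
         * behavior, only when the program is run without arguments. Looking
         * at the code alone, the compiler cannot tell which case will happen
         * at run time. */
        printf("%d\n", *p);
        return 0;
    }

At best the compiler can warn about this; it cannot soundly reject the program.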
That said, optimizing compilers are often too aggressive with undefined behavior, but that's a topic of a different article.
For example, the C standard says that if you load from a NULL pointer, the behavior of the program is undefined. This means that all possible behaviors are allowed. A compiler is allowed to choose one of these behaviors and implement that. In other words, this is basically a subset relation: as long as the behaviors of the program (that the compiler generates) form a subset of the behaviors the C standard allows, we're good. In the case of undefined behavior, the set of allowed behaviors is...err...the universal set, meaning, all behaviors. So, a set with a single behavior is definitely a subset of that.
To get more concrete, suppose the compiler chooses to define the behavior of dereferencing an int* pointing to NULL as yielding the value 0. The problem is that to implement this behavior, the compiler needs to add an if to every int* dereference in the program. If it doesn't, and, e.g., you dereference a NULL pointer, your program will crash (on a conventional platform). Because this doesn't follow the compiler's newly defined behavior (of C programs), it's an incorrect program. Of course, in this case it's the compiler's fault, not the programmer's. At the same time, adding ifs everywhere introduces overheads, which C compilers don't want to introduce.
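To see the cost, here is a rough sketch (mine, purely hypothetical; no real compiler does this) of what every int* dereference would effectively have to become if the compiler defined the NULL case to yield 0:

    /* Hypothetical lowering of "x = *p;" under a compiler that defines
     * "dereferencing a NULL int* yields 0": */
    int deref_with_defined_null(const int *p) {
        if (p != NULL)
            return *p;   /* the normal load */
        return 0;        /* the newly defined behavior for NULL */
    }

That branch on every dereference is exactly the overhead C compilers refuse to pay by default.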
In short, if you define certain behaviors, implementing these behaviors brings extra work. Of course, some compilers expose flags to turn bounds checking on and off. Why don't they expose flags for other cases, like checking pointer dereferences? I don't know.
Now it's time to get to the argument.
One problem with this argument is that, under it, a program does not have meaning in and of itself; instead, its meaning depends on who executes the program. To put it simply, think of an algorithm. An algorithm describes what happens independently of who performs it. So, the Euclidean Algorithm describes how to find the Greatest Common Divisor (GCD) of two numbers, independently of where these two numbers are stored and of who (and how) performs operations like division. The “processor” could be you and the “memory” could be a piece of paper, or the processor and memory could be the ones in your computer. In other words, the memory is an abstraction, which can be implemented in multiple concrete ways.
This is a fundamental idea in computer science. Computations have meaning in and of themselves; they are acts independent of the actors who perform them. Programming languages, similarly, have abstract semantics, meaning, the semantics describe what the behavior of a program should be, but not how this behavior is achieved.
By now you might already be yelling at me to take my PL hat off and bring this back to reality. I will, I will, I promise, but let me first say that there’s a practical point to all this “abstract semantics” stuff: platform independence. A program does the same thing no matter the platform (i.e., who executes it). So, even though we probably don’t care about whether the program works the same when you execute it with pen and paper (sorry!), we do care about it working the same if an x86 and an ARM processor execute it.
Let me now (finally) take my PL hat off and say that while all this sounds fun, it’s kind of irrelevant when we are talking about languages like C and C++. Why? Because they have undefined behavior! UB already makes a program not have meaning in and of itself. If your program exposes UB, then it might do a different thing not only between different platforms but even between different executions. So, the original argument definitely does not make matters worse. On the contrary, it brings value! Because hey, my program may be theoretically as undefined as it was before, but in practice, if I know on which platform I will run it, then I know exactly the semantics of the program. This is not what happens now.
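As a quick illustration (my own example; actual results vary by compiler, flags, and platform), consider signed integer overflow, which the standard leaves undefined:

    #include <stdio.h>
    #include <limits.h>

    int main(void) {
        int x = INT_MAX;
        /* Signed overflow is undefined behavior. One compiler may wrap x + 1
         * around to INT_MIN and print "wrapped"; another (or the same one at
         * a higher optimization level) may assume overflow never happens and
         * delete the branch entirely, printing "not wrapped". */
        if (x + 1 < x)
            printf("wrapped\n");
        else
            printf("not wrapped\n");
        return 0;
    }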
Of course I now need to take a minute and address the defenders of the C/C++ standard who heard me say that UB makes your program not have meaning in and of itself, and who want to complain. Because if your program exhibits UB, it’s your fault, not the standard’s! As we discussed, UB is a programming error. They told us not to do something, we did it, so we are to blame. Ok, except this is not a productive argument at all. If my programming language has hundreds of hidden edge cases, even though theoretically “I told my users not to do it” and “after all, nobody forced them to use my language anyway”, still that doesn’t make me a great language designer. To close off this tangent, even though this is a technically valid counter-argument, it is one that does not care for the user.
So, let’s now imagine that we try to realize this argument. Before we get to a concrete example, I want to be sure that we know the setup, which is the following. The C standard does not change. This is not a proposal aimed towards the standard people. Who is it aimed at? The compiler people. We tell them “define everything according to the target platform”.
But how exactly will they proceed? Usually, when you ask people how they expect to apply the argument, they say something like: “on NULL pointer dereference, there is a load from the memory location 0x0. The platform (e.g., x86 running on Linux) already defines (very precisely) what happens when you try to load from an unmapped location (in this case, 0x0)”. That is true, but there's a catch.
The catch is that we assumed we have a load to begin with. In other words, when we have a C program with x = *p, we assumed that somewhere in the assembly (that the compiler outputs) there is a load that corresponds exactly to this pointer dereference. But this is not necessarily the case! Sometimes for a good reason. For example, the compiler may be able to infer (correctly) that, say, *p is 7 at this program point. So, it rewrites the code to x = 7. Nobody guarantees that your code’s trip through the compiler won’t eliminate this load.
But there's a more fundamental problem: there is no such thing as a “load” in a C program. The concept of a “load” does not exist in the C standard. Only the (abstract) concept of a pointer dereference exists. The load is merely a way that the compiler implements the (abstract) behavior of a pointer dereference. And it is exactly this abstract behavior that gives freedom to the compiler to optimize the code however it wants as long as it implements that abstract behavior. For example, if we do *p = 7, and then x = *p, we expect x to get the value 7. But as I said, that doesn't mean a load needs to happen. As long as x gets the value 7 somehow, we're good.
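Here is a small example (my own sketch; the exact output of course depends on the compiler and flags) of that load simply disappearing:

    int f(void) {
        int v;
        int *p = &v;
        *p = 7;        /* abstract behavior: the object p points to now holds 7 */
        int x = *p;    /* abstract behavior: x gets the value stored there */
        return x;      /* abstract behavior: f returns 7 */
    }

    /* A typical optimizing compiler is free to compile f to something like
     * the following (hypothetical x86-64 output):
     *
     *     mov eax, 7
     *     ret
     *
     * No store and no load survive; only the abstract behavior ("f returns 7")
     * is preserved. */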
So, here's what needs to happen to implement this argument: we need a different C standard for each platform; at least one for every processor architecture. In these standards, all the abstract behaviors need to be translated to concrete behaviors. For example, the standard for the x86 target needs to say that a pointer dereference (in an r-value) generates a mov with an M64 operand..., etc.
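To give a flavor of the level of detail required, here is a made-up excerpt (emphatically not from any real document) of the kind of rule such an x86-specific standard would have to contain:

    /* Imaginary excerpt from a hypothetical "C standard for x86-64"
     * (nothing like this exists; it only illustrates the idea):
     *
     *   "An r-value pointer dereference shall be translated to exactly one
     *    mov instruction whose source is a memory operand, and that load
     *    shall not be elided, duplicated, or reordered."
     */
    int deref(const int *p) {
        return *p;   /* mandated lowering: mov eax, DWORD PTR [rdi] */
    }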
In practice, each of these standards is a specification of a high-level assembler. You get to decide how viable such an approach is. My 2 cents are that, some time ago, I put considerable effort into trying to define such a high-level assembler. And I failed. Maybe I need to write a different article about that, but here are some notes.
Do you want your compiler to be able to turn a loop into a constant computation (e.g., the optimization I describe here)? Well, it won't be able to in a high-level assembler. Because the x86-specific C standard will say something like “A loop needs to generate at least one basic block, with a back-edge... blah blah”. In other words, you need a specification of how a while loop gets translated, which will basically force a loop to exist in the first place.
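The classic example (my own illustration; whether a given compiler actually does this varies) is a counting loop that an optimizer can replace with a closed-form expression:

    /* Sums 0 + 1 + ... + (n - 1) with a loop. */
    unsigned sum_below(unsigned n) {
        unsigned s = 0;
        for (unsigned i = 0; i < n; i++)
            s += i;
        return s;
    }

    /* An optimizing compiler today may legally emit no loop at all and
     * compute the closed form instead, roughly as if we had written
     *
     *     return n * (n - 1) / 2;
     *
     * Under an x86-specific standard that says "a loop generates at least
     * one basic block with a back-edge", this rewrite is off the table. */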
Do you want the compiler to vectorize your code? It probably won't be able to, because, for example, a *p will need to generate a scalar node, not a vectorized one. The only alternative is to start introducing unspecified behavior, meaning, C constructs will have a range of possibilities on what they correspond to.
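For instance (again my own sketch; whether this actually vectorizes depends on the compiler, target, and flags), take a simple array sum:

    /* Abstractly, each a[i] is just a pointer dereference. */
    float sum(const float *a, int n) {
        float s = 0.0f;
        for (int i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* With something like -O3 -ffast-math, a compiler may load and add
     * several elements per iteration through SIMD registers. If the
     * platform-specific standard says every float* dereference must become
     * one scalar load, that transformation is ruled out. */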
Do you want the compiler to do all the usual boring transformations? Like common-subexpression elimination, turning a multiplication into a shift, an add into an LEA, etc.? It similarly won't be able to, because, e.g., a + will need to turn into an add instruction.
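A tiny example (mine; the assembly shown is hypothetical, though it matches what mainstream compilers typically produce):

    int f(int a, int b) {
        int t = (a + b) * 8;    /* multiplication by a power of two */
        return t + (a + b);     /* (a + b) is a common subexpression */
    }

    /* An optimizing compiler may compute (a + b) once (CSE), notice that the
     * whole function is just (a + b) * 9, and emit it with two LEAs and no
     * multiply (hypothetical x86-64):
     *
     *     lea eax, [rdi + rsi]        ; a + b, computed once
     *     lea eax, [rax + rax*8]      ; (a + b) * 9
     *
     * A platform-specific standard that maps every "+" to an add and every
     * "*" to an imul forbids all of this. */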
I will propose an extension to the argument that I think makes it more practical. At least, it should provide an environment more useful to the heavy C users (e.g., those who care about high-performance software) than the current situation.
The main issue with the original argument is that defining how C is translated directly to x86 does not allow the compiler to perform any transformations. One way to mitigate that is to make the optimizations the programmer's responsibility. In this case, the modified C standard would define how to translate C to an unoptimized IR, e.g., unoptimized LLVM IR. Note that, because of the reasons I explained in another article, LLVM IR is not target-independent, which means we would still need a different standard for each target.
But, this would allow us to get predictable compiler output (no undefined behavior enters the picture because there's no abstract behavior). At the same time, this IR can be further optimized. The programmer can choose to apply only well-defined transformations over the code, which would still allow them to get predictable code at the end.
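To make this more tangible, here is roughly what that fixed, unoptimized translation might look like for a trivial function (a sketch; the exact IR clang emits at -O0 differs in details across versions):

    int add(int a, int b) {
        return a + b;
    }

    /* A fixed, "no surprises" translation to unoptimized LLVM IR would look
     * roughly like this: every local gets a stack slot, every use is an
     * explicit load, every + is an explicit add (sketch, details vary):
     *
     *     define i32 @add(i32 %a, i32 %b) {
     *       %a.addr = alloca i32
     *       %b.addr = alloca i32
     *       store i32 %a, ptr %a.addr
     *       store i32 %b, ptr %b.addr
     *       %0 = load i32, ptr %a.addr
     *       %1 = load i32, ptr %b.addr
     *       %sum = add i32 %0, %1
     *       ret i32 %sum
     *     }
     *
     * The programmer, not the compiler, then decides which well-defined
     * transformations to run on top of this. */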
In particular, high-performance programmers don't care much about the heavy compiler optimizations. For example, for most of them, auto-vectorization is unreliable anyway, so they vectorize the code themselves. What they do care about, though, is all the small, menial, repetitive transformations, e.g., common-subexpression elimination, dead-code elimination, and instruction combining. Heavy compiler optimizations can be quite unpredictable, which defeats the purpose, but these simple transformations can be well-defined, easily understandable, and predictable.
However, the interfaces modern compilers provide are terrible if you want to realize this vision. More specifically, transformations are coupled with cost models, and they are buried inside passes with no well-specified and flexible APIs. As a user, or even a compiler expert, it's very hard to apply specific optimizations to your code.
In my BSc thesis, I articulated a vision towards a different design of optimizing compilers. In this design, transformations are decoupled from cost models and are exposed through granular and well-defined APIs. This, in turn, allows programmers to essentially build their own custom mini-compilers. These mini-compilers leverage the reusable APIs, and apply the transformations the programmers care about. As a concrete example, the thesis presents an API for the loop distribution transformation. That API was used back then in a loop-optimization framework we were building at NEC.
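For reference, loop distribution itself is easy to show on a toy example (my own; this is the transformation, not the API from the thesis): a loop containing two independent statements is split into two loops, which can, for instance, let one of them be vectorized:

    /* Before: one loop doing two independent things (assume a, b, c do not alias). */
    void before(float *a, float *b, const float *c, int n) {
        for (int i = 1; i < n; i++) {
            a[i] = c[i] * 2.0f;       /* independent across iterations */
            b[i] = b[i - 1] + c[i];   /* recurrence on b */
        }
    }

    /* After loop distribution: the independent statement gets its own
     * (trivially vectorizable) loop; the recurrence stays in a scalar loop. */
    void after(float *a, float *b, const float *c, int n) {
        for (int i = 1; i < n; i++)
            a[i] = c[i] * 2.0f;
        for (int i = 1; i < n; i++)
            b[i] = b[i - 1] + c[i];
    }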
But, at that time at least, this idea didn't seem to catch on. To my surprise, however, I recently came across the MLIR Transform Dialect. I'm not sure this is the end goal, but in my opinion, it's definitely a step in the right direction.