Recently, a friend wrote to me:
You know what would be interesting? If there was a post showing what optimizations actually do to your code.
I immediately thought "Sure, I know a thousand articles and videos on that". But I quickly realized that almost all sources assume knowledge of compiler lingo, internals, IRs, etc. The problem is: people who use compilers (like my friend) usually don't care about any of that. They don't care about LLVM IR, or what a φ node is, or why some pass is named loop rotation. Rather, their priorities are, in order of decreasing importance, the following: (1) the what, (2) the why, (3) the how.
What. Above all, users want to know what the compiler can and cannot do, i.e., what they need to optimize themselves and what the compiler can do on its own.
Why. After that, they may be interested in knowing why the compiler optimized their code the way it did (e.g., why it chose one instruction over another).
How. This concerns how the compiler achieves things, i.e., the compiler technology. But this is the least of people's priorities, and usually people care about that to satisfy their curiosity, not to do their job.
In this article, I'll explain compiler optimizations through a series of examples. For the most part, I will be focusing on the what, only talking about the why and how when I think it's relevant or particularly interesting. To avoid making this article super long, I created an Appendix with a bunch of long details, including some of the whys and the hows.
Quite importantly, in the examples I will use C and the occasional x86-64 assembly (when C is too high level to express a concept). I will not use LLVM IR, and any talk about compiler internals and lingo will be found only in the how sections.
As a final note, ideally I would use realistic examples for everything, but the trade-off is that these are sometimes harder to understand. So, it's usually a judgement call on a case-by-case basis.
We'll start simple, with the following function:
int times2(int n) {
return n*2;
}
Generally speaking, compilers can optimize arithmetic, logical, and other kinds
of scalar operations pretty effectively.1 Even in this simple example, where
it might seem the compiler can't do much, it generated the following with -O1
(godbolt):
times2(int):
lea eax, [rdi+rdi]
ret
The compiler effectively turned our function to:
int times2(int n) {
return n+n;
}
In the Appendix I have a discussion on why the compiler decided to generate this out of all things.
Let's now see a related, realistic example that will give us a sense of the compiler's limits when manipulating arithmetic operations, inspired by this post. Here's the original code (godbolt), slightly simplified by me:
typedef struct Float2 {
float x, y;
} Float2;
Float2 ProjectSphere(float x, float z, float r,
float t, float ResultScale) {
float tz = t*z;
float rx = r*x;
float tx = t*x;
float rz = r*z;
float A = (tz + rx);
float B = (tz - rx);
float Min = (tx - rz) / A;
float Max = (tx + rz) / B;
Float2 Res = {Min*ResultScale, Max*ResultScale};
return Res;
}
Let me detail the changes I made from the original code. First, I deleted the
sqrt
computation to reduce the example; in other words, we don't need it to
make the point. Second, I moved common subexpressions (i.e., subexpressions that
appear multiple times) to their own variables, because I think this makes the
code much more readable. We will return to that when we talk about common
subexpression elimination.
Now, in the post, they present an optimization on this code to avoid the two divisions, because divisions are generally slow on any kind of hardware. The optimization uses the math fact that x/y = (x*z) / (y*z) (obviously assuming z != 0). We have two values, one of which we want to divide by A and the other by B. The optimization is that we'll divide both by A * B, and then multiply one with A (to leave B in the denominator) and the other with B (to leave A in the denominator). So:
Float2 ProjectSphere(float x, float z, float r,
float t, float ResultScale) {
float tz = t*z;
float rx = r*x;
float tx = t*x;
float rz = r*z;
float A = (tz + rx);
float B = (tz - rx);
// Was: Nothing
ResultScale /= (A * B);
// Was: float Min = (tx - rz) / A;
float Min = (tx - rz) * B;
// Was: float Max = (tx + rz) / B;
float Max = (tx + rz) * A;
Float2 Res = {Min*ResultScale, Max*ResultScale};
return Res;
}
The question we ask is: can the compiler do this optimization on its own? It seems plausible that it can, because the optimization is a bit wild but not too wild. So, let's fire our -O3 across compilers.

GCC gives us two divisions, so that's not great. Clang is a tad smarter, vectorizing the code a bit. Basically, it created one vector, X = {(tx - rz), (tx + rz)}, and another vector, Y = {A, B}, and then it did X / Y. So, with one division instruction, we do two divisions.
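To make that concrete, here's a rough scalar-C sketch of what Clang's vectorized output computes (it uses the Float2 struct from above; the names X, Y, and Q are mine, and the real output keeps them in SSE registers rather than arrays):

Float2 ProjectSphere_clang_sketch(float x, float z, float r,
                                  float t, float ResultScale) {
    float A = t*z + r*x;
    float B = t*z - r*x;
    // Pack the two numerators and the two denominators together...
    float X[2] = { t*x - r*z, t*x + r*z };
    float Y[2] = { A, B };
    // ...so that a single vector division produces both quotients.
    float Q[2] = { X[0] / Y[0], X[1] / Y[1] };
    Float2 Res = { Q[0] * ResultScale, Q[1] * ResultScale };
    return Res;
}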
ICX is a peculiar case. It managed to generate code with no divisions at all... sort of. Essentially, it generated this:
 1 | Float2 ProjectSphere(float x, float z, float r,
 2 |                      float t, float RS) {
 3 |     float tz = t*z;
 4 |     float rx = r*x;
 5 |     float tx = t*x;
 6 |     float rz = r*z;
 7 |
 8 |     float A = t*z + r*x;
 9 |     float B = t*z - r*x;
10 |     float v = {A, B, A, B};
11 |     float rec_v = {1/A, 1/B, 1/A, 1/B};
12 |
13 |     float Min_s = tx - rz;
14 |     float Max_s = tx + rz;
15 |     float MinMax = {Min_s, Max_s, Min_s, Max_s};
16 |     float RS_v = {RS, RS, RS, RS};
17 |
18 |     // Focus on only the first two elements from here on.
19 |     // Anyway these are what we return.
20 |     float MinMaxRS = MinMax*RS_v;  // = {Min_s*RS, Max_s*RS}
21 |     float Div = MinMaxRS*rec_v;    // = {Min_s*RS/A, Max_s*RS/B}
22 |     float Clear = v*Div;           // = {Min_s*RS, Max_s*RS}
23 |     float res = MinMaxRS - Clear;  // = {0, 0}
24 |     res = res*rec_v;               // = {0, 0}
25 |     res = res + Div;               // = {Min_s*RS/A, Max_s*RS/B}
26 |     return res;
27 | }
In the Appendix I have the same C version annotated with the original assembly.
So, ICX basically computed the reciprocal, which is kind of like a division, and then multiplied by that. This is a good trade-off because rcpps is generally much faster than divss and divps.
But the compiler seems to be doing something stupid. On line 21 (the computation of Div), we seem to already have the result! Why don't we return there, instead of doing the rest of the nonsense only to get the same result on line 25? Well, remember that we're using floating-point numbers. So, the 0s on line 24 are not really zeroes. In short, the reason ICX does the rest is to reduce the error of the computation. In the Appendix I show some C code that calls both versions and checks the error.
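For a flavor of what such a check looks like, here's a minimal sketch (the Appendix has the code I actually used). It assumes the two versions above are compiled, without fast math, under the placeholder names ProjectSphere_orig and ProjectSphere_nodiv, and it tracks the worst relative error over random inputs:

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

// Placeholder declarations; Float2 is the struct defined earlier.
Float2 ProjectSphere_orig(float x, float z, float r, float t, float s);
Float2 ProjectSphere_nodiv(float x, float z, float r, float t, float s);

// Random float in (0, 1].
static float frand(void) {
    return ((float)rand() + 1.0f) / ((float)RAND_MAX + 1.0f);
}

int main(void) {
    float max_err = 0.0f;
    for (int i = 0; i < 1000000; i++) {
        float x = frand(), z = frand(), r = frand(), t = frand(), s = frand();
        Float2 a = ProjectSphere_orig(x, z, r, t, s);
        Float2 b = ProjectSphere_nodiv(x, z, r, t, s);
        // Relative error of the Min component; Max is checked the same way.
        float err = fabsf((b.x - a.x) / a.x);
        if (err > max_err) max_err = err;
    }
    printf("max relative error: %g\n", max_err);
    return 0;
}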
Even more interestingly, the only reason ICX was able to do that is that, contrary to GCC and Clang, its default floating-point model is not the precise one. In other words, the generated code doesn't necessarily produce exactly the result that the original version would. This is true, by the way, for the hand-written optimization too: it does not respect the floating-point semantics of the original code.
For GCC and Clang, on the other hand, we have to explicitly tell them that they can use "fast math", i.e., do optimizations that may introduce some error. We do that with the flag -ffast-math. GCC wasn't able to do much with that, but Clang did! It generated basically the same version as ICX.
The moral of the story is that compilers can do pretty crazy stuff with your code, but you have to tell them what you do and don't care about.
Let's return to the ProjectSphere example. The original code was:
Float2 ProjectSphere(float x, float z, float r, float t, float ResultScale) {
float Min = (t * x - r * z) / (t * z + r * x);
float Max = (t * x + r * z) / (t * z - r * x);
Float2 Res = {Min * ResultScale, Max * ResultScale};
return Res;
}
And the code I wrote was:
Float2 ProjectSphere(float x, float z, float r,
                     float t, float scale) {
    float tz = t*z;
    float rx = r*x;
    float tx = t*x;
    float rz = r*z;

    float A = (tz + rx);
    float B = (tz - rx);

    float Min = (tx - rz) / A;
    float Max = (tx + rz) / B;

    Float2 Res = {Min*scale, Max*scale};
    return Res;
}
In my code I performed a (compiler) optimization called "common subexpression
elimination" (CSE), and the name is pretty suggestive of what it does. We
basically take expressions that appear multiple times, like t*z
in the
original code, and put them into variables. This is supposedly beneficial
because we avoid re-computing the same stuff. In practice, there are caveats.
CSE generally increases the register pressure, which I explain in great detail
in the
Appendix.
The conventional wisdom argues that we don't have to do CSE ourselves, because
the compiler will do it. In fact, it goes a step further. It shouldn't even
matter if we do it or not! The compiler should be able to see through the
variables. In other words, to the compiler, variables like tz
should be the
same as a macro, i.e., tz
and t*z
are just basically the same thing, so it
doesn't matter how we write the code. But how true are these claims?
Using Clang -O1, the two versions give us basically the same code (same instructions but not in the same order), and the same is true for GCC under -O1. With -O2 and -O3, things change a bit for GCC. In both you get the same code, except my version gets one more movdqa before the ret, so it's slightly slower. For Clang, the differences between the two are wilder with -O2. In this case, my version is slightly better because it uses two subss while the original uses two shufps (shuffles).2
The moral of the story is that saying "it doesn't matter how you write your arithmetic expressions, the compiler will take care of it" is just plainly not accurate. Even worse, there's no golden rule for preferring one version over another. At the same time, though, you really should care about it only if you care about the last bit of performance; yes, the compiler will do the same thing for 90% of your code or so no matter how you write it.
One way to understand what the compiler does when you turn optimizations on is to see what it does not do when you keep them off. Let's consider again our simple times2, but slightly modified so that it uses one local variable:
int times2(int n) {
int x = n*2;
return x;
}
Let's see what Clang generates with no optimizations (godbolt).
times2:
push rbp
mov rbp, rsp
mov dword ptr [rbp - 4], edi
mov eax, dword ptr [rbp - 4]
shl eax
mov dword ptr [rbp - 8], eax
mov eax, dword ptr [rbp - 8]
pop rbp
ret
Without understanding much about the code, it should be relatively clear that rbp is useless. We save its value by pushing it onto the stack. Then, we copy rsp's value to rbp, and, after we do everything using rbp, we just pop rbp's value back from the stack. So, rbp just holds rsp's value all along. Thus, to make our life easier, I'll omit rbp altogether by using -fomit-frame-pointer (godbolt).

If you don't know what a frame pointer is, rest assured it doesn't matter for this article. But, you can always read up more online. Now, the code we get using just rsp is:
times2:
; Save the argument `n`, whose value is in `edi`, to the stack.
; The next available position is `rsp - 4`.
mov dword ptr [rsp - 4], edi
; Load `n` into a register to do operations with it
mov eax, dword ptr [rsp - 4]
; n << 1
shl eax
; `x` is also assigned a stack position. The next available is
; `rsp - 8`. Save the result of the computation above there.
mov dword ptr [rsp - 8], eax
; Load `x` into `eax` because that's what we return.
mov eax, dword ptr [rsp - 8]
ret
I explain the details in the assembly, but there's one general idea you should keep in mind: the compiler follows a very repetitive, naive, and simple methodology in unoptimized builds. It goes roughly as follows. First, every variable is associated one-to-one with a position in the stack (so, every variable gets one position in the stack and that position is associated with only that one variable). This includes all local temporary variables, as well as function arguments. So, at the beginning of the function, we just copy all the function arguments into the stack. Second, whenever we want to use a variable's value, we load it from the stack and into a register, partly because we can't perform computations directly using stack locations; most instructions work with registers. And of course, every assignment to a variable is a store to the stack location of the variable.
Now, this may sound incredibly stupid, because we definitely have enough unused registers not to have to use the stack at all in this case. And frankly, it is. But, the way to think about unoptimized builds is that the compiler should not have to think at all. The method described above produces correct code without any thinking, no matter how many variables we have, whereas the fancier method of only using registers would have to at least do some thinking like: "I will use registers as long as this is possible, and if I run out, I'll start using the stack". But applying this thinking is itself a compiler optimization: register allocation (i.e., deciding how to allocate registers to hold values). And if you want that, you should turn optimizations on. There are different levels to how hard a register allocator works to get values into registers, and in some compilers, lower optimization levels use simpler register allocation strategies.
Also, the method described above makes a debugger's life easier. Because each variable is associated with a stack location and that stack location is not used for anything else, if at any point during the execution of a function you want to see the value of a variable, you just look up the value in that location. In optimized builds, the value of a variable is not associated with any one location or any one register. Rather, it usually moves around. So, the compiler has to communicate all these intermediate changes and the debugger has to keep track of them. Of course, again, for simple functions like the one we saw, the compiler could have one register for each variable such that each variable is associated with only one register.3 But, the tooling we use is not optimized for such "middle-ground" cases. Rather, the mantra is that you either have completely stupid debug/unoptimized builds or "whatever happens" optimized builds.
Finally, note that Clang used a left shift for the multiplication, even in an unoptimized build. GCC, for its part, generates something like our earlier version, where the multiplication is replaced by an addition (godbolt).
For inlining, I'm going to shamelessly plug a ~6-minute excerpt from a talk I gave at the compiler meetup a couple of years ago. It's one of the few talks of mine that I still think are decent, so here you go. In the first 6 minutes I explain the benefits and drawbacks of inlining, as well as some caveats (e.g., when we can inline). Then, I go more into how it happens inside the compiler.
The one thing I'd like to add here that is not in the video is an implementation of inlining in LLVM by Chris Lattner, from 20 years ago.
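Since that content lives in a video, here's a tiny, hand-made illustration of what inlining does (the functions are my own toy example, not from the talk):

static int square(int x) { return x * x; }

int sum_of_squares(int a, int b) {
    return square(a) + square(b);
}

// After inlining square into its caller, the compiler effectively works
// with this instead: the call overhead is gone, and both multiplications
// are exposed to the rest of the optimizer.
int sum_of_squares_inlined(int a, int b) {
    return a*a + b*b;
}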
Let's say we have this function:
void init_scaled(int *array, int n, int scale) {
for (int i = 0; i < n; i++) {
array[i] = i * scale;
}
}
One way to avoid the multiplication inside the loop is to write the code like this:
void init_scaled(int *array, int n, int scale) {
int tmp = scale;
for (int i = 0; i < n; i++) {
array[i] = tmp;
tmp += scale;
}
}
So, we replaced a multiplication with an addition, which is a win given that in
general integer additions are faster than integer multiplications. This
optimization falls under the general category of strength-reduction
optimizations, in which we replace a piece of code with an equivalent one that
is not as "strong", meaning it uses cheaper instructions. Clang (and most
compilers) will do this optimization even in -O1
(godbolt).
A fancier form of strength reduction4 is loop-invariant code motion. Here's a similar example to the one above:
void scale_array(float *array, int n, float multiplier) {
const float pi = 3.14159;
for (int i = 0; i < n; i++) {
float scale = pi * multiplier;
array[i] = array[i] * scale;
}
}
You may notice that the value of scale
is the same across all loop iterations.
So, we can hoist it out of the loop so that the multiplication happens only once:
void scale_array(float *array, int n, float multiplier) {
const float pi = 3.14159;
float scale = pi * multiplier;
for (int i = 0; i < n; i++) {
array[i] = array[i] * scale;
}
}
Clang (and most compilers) will do that (godbolt) even in -O1. Here's the shortened and annotated assembly:
.LCPI0_0:
; This is pi
.long 0x40490fd0
scale_array:
...
; scale = multiplier * pi
mulss xmm0, dword ptr [rip + .LCPI0_0]
...
.LBB0_2:
; float tmp = array[i];
movss xmm1, dword ptr [rdi + 4*rcx]
; float tmp2 = tmp * scale
mulss xmm1, xmm0
; array[i] = tmp2;
movss dword ptr [rdi + 4*rcx], xmm1
...
Perhaps surprisingly, most compilers have multiple subsystems that evaluate code at compile time. The most obvious case is all the machinery used to compute constexpr.5
But, there are some non-obvious compile-time subsystems. Take, for example, the usual Collatz sequence (godbolt):
static int collatz(int n) {
int its = 0;
while (n != 1) {
its++;
if (n % 2 == 0) {
n /= 2;
} else {
n = 3 * n + 1;
}
}
return its;
}
int main(void) {
return collatz(6);
}
Let's see the assembly:
main:
mov eax, 8
ret
First, I intentionally marked the collatz function static so that the compiler leaves only the main function. More on that in the dead-code elimination section.
Now, the compiler basically generated this:
int main(void) {
return 8;
}
In other words, it executed the function at compile time, and replaced the call with the return value of collatz. Note that, as I wrote in a previous article, the compiler can sometimes turn a loop into a constant computation, if the loop has a closed form. But the Collatz sequence famously does not. So, the compiler necessarily ran the loop at compile time.
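For contrast, here's the kind of loop that does have a closed form. As far as I can tell, Clang at -O2 removes this loop entirely and computes something equivalent to n*(n-1)/2 directly:

int sum_upto(int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        sum += i;
    }
    // With -O2, the generated code contains no loop: the compiler
    // recognizes the closed form of the sum (via scalar evolution).
    return sum;
}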
The compiler, in general, has no way of knowing whether the loop will ever finish (especially for the Collatz sequence). So, you may be wondering: can the compiler get stuck in an infinite loop? Generally speaking, no, because the compiler has limits on the number of iterations. For example, when we pass in the number 27, which takes 111 iterations to complete, the compiler generates this (godbolt):
main:
mov ecx, 27
xor eax, eax
.LBB0_1:
inc eax
mov edx, ecx
sar edx
test cl, 1
lea ecx, [rcx + 2*rcx + 1]
cmove ecx, edx
cmp ecx, 1
jne .LBB0_1
ret
So, the loop is definitely there (notice the jne back to .LBB0_1). At the time of writing, the max number of iterations of the subsystem used here is 100.6 We can increase it to 111 by passing the command-line argument -mllvm --scalar-evolution-max-iterations=111 (godbolt):
main:
mov eax, 111
ret
Why the argument has this name is a different story. You may also want to check my article on scalar evolution.
Notice that the compilation is now pretty slow (noticeably slow, in fact, which is massive for only 111 iterations), which indicates that the compile-time execution systems inside compilers are pretty slow. But, they're not designed for performance anyway. That said, last time I checked, the Jai programming language was generating bytecode for compile-time computations, so theoretically it should be faster.
Finally, not all compilers will compute loops at compile time. For example, GCC won't do it for the collatz example (godbolt), nor will ICX (godbolt).
"Dead code" is basically useless/unused/redundant and generally unnecessary
code. Probably what dead-code elimination (DCE) does the most is delete unused
code included by #include
directives. For example, if I add an #include
to the times2
example (godbolt),
you'll notice the code doesn't change and that's because the compiler threw
away everything. Well, that's really for two reasons; the main one is that most
code in .h
files is not executable. For example, a struct
is not executable.
It's just there to inform the compiler how to generate code that uses structs.
But in C++ .h
files you can also have function definitions e.g., inside classes.
These will be thrown away too if not used.
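As a tiny illustration, here's a hypothetical helper that nobody calls; with optimizations on, no code is emitted for it:

// Imagine this helper lives in some header, say util.h, that we include:
static inline int clamp(int x, int lo, int hi) {
    return x < lo ? lo : (x > hi ? hi : x);
}

// times2 never calls clamp, so the compiler emits no code for clamp and
// the output for times2 is unchanged.
int times2(int n) {
    return n*2;
}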
But here's a more subtle case. Above, we saw the collatz function (godbolt), which we marked static. That means the function is local to this module and cannot be (and thus is not) used by any other module. Within this module, it is called in only one place, but this call got inlined. So, the function is now unnecessary and the compiler removed it.
When most people hear about dead-code elimination for the first time, they ask: "well, all the code I write does something, so how does the compiler find useless code?" Above, we saw one example. And in general, most dead code is generated because of compiler optimizations.
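Here's a sketch of how that happens, using my own toy functions: after inlining and constant propagation, the factor == 0 check below is provably false, so the whole branch becomes dead code that the compiler removes.

static int scale(int x, int factor) {
    if (factor == 0) {
        return 0;
    }
    return x * factor;
}

int times_two(int n) {
    // After inlining scale with factor = 2, the `factor == 0` branch can
    // never be taken, so it is eliminated; what's left is just n * 2.
    return scale(n, 2);
}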
Dead-code elimination can be surprising. For example, let's return to the
collatz
example above. This time, however, instead of returning the number of
iterations, we will return n
(godbolt):
static int collatz(int n) {
int its = 0;
while (n != 1) {
its++;
if (n % 2 == 0) {
n /= 2;
} else {
n = 3 * n + 1;
}
}
////////////////////////////////////////
// THIS IS WHERE THE CHANGE HAPPENED ///
////////////////////////////////////////
return n;
}
int main(void) {
return collatz(27);
}
Let's take a look at the assembly:
main:
mov eax, 1
ret
The compiler basically turned the code to:
int main(void) {
return 1;
}
But how? Note that in this case, we haven't passed the argument that increases the limit of the compile-time execution engine. So, how did the compiler eliminate the loop? Notice the loop condition: while (n != 1). The compiler knows that when the loop finishes, n must be 1, otherwise it wouldn't have finished. Thus, the compiler just returns 1.
But how does the compiler know the loop will finish?7 From the C standard... In summary,
infinite loops without side-effects (e.g., printing, or writing to volatile
pointers) are undefined behavior. The compiler always assumes undefined behavior
doesn't happen, and thus it assumes that the loop will finish.
This is one of the many great examples of why you should not think of C as a high-level assembler (i.e., as just a way to write assembly easily). Undefined behavior is at the top of the list of reasons. There are cases in which the compiler will do unexpected things to your code.
2. subss (see) has much better throughput than shufps (see).

3. x's value will be in edx. Still, if you want to return x, you need to copy edx's value into eax, so at that point both eax and edx have x's value. But this doesn't matter for our purposes, because if you ask the debugger for x's value, it can just look up edx without caring where else x's value might be.