Recently, a friend wrote to me:
You know what would be interesting? If there was a post showing what optimizations actually do to your code.
I immediately thought "Sure, I know a thousand articles and videos on that". But I quickly realized that almost all sources assume knowledge of compiler lingo, internals, IRs, etc. The problem is: people who use compilers (like my friend) usually don't care about any of that. They don't care about LLVM IR, or what a φ node is, or why some pass is named loop rotation. Rather, their priorities are, in order of decreasing importance, the following: (1) the what, (2) the why, (3) the how.
What. Above all, users want to know what the compiler can and cannot do, i.e., what they need to optimize themselves and what the compiler can do on its own.
Why. After that, they may be interested in knowing why the compiler optimized their code the way it did (e.g., why it chose one instruction over another).
How. This concerns how the compiler achieves things, i.e., the compiler technology. But this is the least of people's priorities, and usually people care about that to satisfy their curiosity, not to do their job.
In this article, I'll explain compiler optimizations through a series of examples. For the most part, I will be focusing on the what, only talking about the why and how when I think it's relevant or particularly interesting. To avoid making this article super long, I created an Appendix with a bunch of long details, including some of the whys and the hows.
Quite importantly, in the examples I will use C and the occasional x86-64 assembly (when C is too high level to express a concept). I will not use LLVM IR, and any talk about compiler internals and lingo will be found only in the how sections.
As a final note, ideally I would use realistic examples for everything, but the trade-off is that these are sometimes harder to understand. So, it's usually a judgement call on a case-by-case basis.
We'll start simple, with the following function:
int times2(int n) {
return n*2;
}
Generally speaking, compilers can optimize arithmetic, logical, and other kinds
of scalar operations pretty effectively.1 Even in this simple example, where
it might seem the compiler can't do much, it generated the following with -O1
(godbolt):
times2(int):
lea eax, [rdi+rdi]
ret
The compiler effectively turned our function to:
int times2(int n) {
return n+n;
}
In the Appendix I have a discussion on why the compiler decided to generate this out of all things.
Let's now see a related, realistic example that will give us a sense of the compiler's limits when manipulating arithmetic operations, inspired by this post. Here's the original code (godbolt), slightly simplified by me:
typedef struct Float2 {
float x, y;
} Float2;
Float2 ProjectSphere(float x, float z, float r,
float t, float ResultScale) {
float tz = t*z;
float rx = r*x;
float tx = t*x;
float rz = r*z;
float A = (tz + rx);
float B = (tz - rx);
float Min = (tx - rz) / A;
float Max = (tx + rz) / B;
Float2 Res = {Min*ResultScale, Max*ResultScale};
return Res;
}
Let me detail the changes I made from the original code. First, I deleted the
sqrt
computation to reduce the example; in other words, we don't need it to
make the point. Second, I moved common subexpressions (i.e., subexpressions that
appear multiple times) to their own variables, because I think this makes the
code much more readable. We will return to that when we talk about common
subexpression elimination.
Now, in the post, they present an optimization on this code to avoid the two divisions, because divisions are generally slow on any kind of hardware. The optimization uses the math fact that x/y = (x*z) / (y*z) (obviously assuming z != 0). We have two values, one of which we want to divide by A and the other by B. The optimization is that we'll divide both by A * B, and then multiply one with A (to leave B in the denominator) and the other with B (to leave A in the denominator). So:
Float2 ProjectSphere(float x, float z, float r,
float t, float ResultScale) {
float tz = t*z;
float rx = r*x;
float tx = t*x;
float rz = r*z;
float A = (tz + rx);
float B = (tz - rx);
// Was: Nothing
ResultScale /= (A * B);
// Was: float Min = (tx - rz) / A;
float Min = (tx - rz) * B;
// Was: float Max = (tx + rz) / B;
float Max = (tx + rz) * A;
Float2 Res = {Min*ResultScale, Max*ResultScale};
return Res;
}
The question we ask is: can the compiler do this optimization on its own? It seems plausible that it can, because the optimization is a bit wild but not too wild. So, let's fire our -O3 across compilers.

GCC gives us two divisions, so that's not great. Clang is a tad smarter, vectorizing the code a bit. Basically, it created one vector, X = {(tx - rz), (tx + rz)}, and another vector, Y = {A, B}, and then it did X / Y. So, with one division instruction, we do two divisions.
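To make that concrete, here's a rough scalar-C sketch of what Clang's vectorized output computes (it uses the Float2 struct from above; the names X, Y, and Q are mine, and the real output keeps them in SSE registers rather than arrays):

Float2 ProjectSphere_clang_sketch(float x, float z, float r,
                                  float t, float ResultScale) {
    float A = t*z + r*x;
    float B = t*z - r*x;
    // Pack the two numerators and the two denominators together...
    float X[2] = { t*x - r*z, t*x + r*z };
    float Y[2] = { A, B };
    // ...so that a single vector division produces both quotients.
    float Q[2] = { X[0] / Y[0], X[1] / Y[1] };
    Float2 Res = { Q[0] * ResultScale, Q[1] * ResultScale };
    return Res;
}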
ICX is a peculiar case. It managed to generate code with no divisions at all... sort of. Essentially, it generated this:
 1 | Float2 ProjectSphere(float x, float z, float r,
 2 |                      float t, float RS) {
 3 |     float tz = t*z;
 4 |     float rx = r*x;
 5 |     float tx = t*x;
 6 |     float rz = r*z;
 7 |
 8 |     float A = t*z + r*x;
 9 |     float B = t*z - r*x;
10 |     float v = {A, B, A, B};
11 |     float rec_v = {1/A, 1/B, 1/A, 1/B};
12 |
13 |     float Min_s = tx - rz;
14 |     float Max_s = tx + rz;
15 |     float MinMax = {Min_s, Max_s, Min_s, Max_s};
16 |     float RS_v = {RS, RS, RS, RS};
17 |
18 |     // Focus on only the first two elements from here on.
19 |     // Anyway these are what we return.
20 |     float MinMaxRS = MinMax*RS_v;  // = {Min_s*RS, Max_s*RS}
21 |     float Div = MinMaxRS*rec_v;    // = {Min_s*RS/A, Max_s*RS/B}
22 |     float Clear = v*Div;           // = {Min_s*RS, Max_s*RS}
23 |     float res = MinMaxRS - Clear;  // = {0, 0}
24 |     res = res*rec_v;               // = {0, 0}
25 |     res = res + Div;               // = {Min_s*RS/A, Max_s*RS/B}
26 |     return res;
27 | }
In the Appendix I have the same C version annotated with the original assembly.
So, ICX basically computed the reciprocal, which is kind of like a division, and then multiplied by that. This is a good trade-off because rcpps is generally much faster than divss and divps.
But the compiler seems to be doing something stupid. On line 21 (the computation of Div), we seem to already have the result! Why don't we return there, instead of doing the rest of the nonsense only to get the same result on line 25? Well, remember that we're using floating-point numbers. So, the 0s on line 24 are not really zeroes. In short, the reason ICX does the rest is to reduce the error of the computation. In the Appendix I show some C code that calls both versions and checks the error.
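For a flavor of what such a check looks like, here's a minimal sketch (the Appendix has the code I actually used). It assumes the two versions above are compiled, without fast math, under the placeholder names ProjectSphere_orig and ProjectSphere_nodiv, and it tracks the worst relative error over random inputs:

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

// Placeholder declarations; Float2 is the struct defined earlier.
Float2 ProjectSphere_orig(float x, float z, float r, float t, float s);
Float2 ProjectSphere_nodiv(float x, float z, float r, float t, float s);

// Random float in (0, 1].
static float frand(void) {
    return ((float)rand() + 1.0f) / ((float)RAND_MAX + 1.0f);
}

int main(void) {
    float max_err = 0.0f;
    for (int i = 0; i < 1000000; i++) {
        float x = frand(), z = frand(), r = frand(), t = frand(), s = frand();
        Float2 a = ProjectSphere_orig(x, z, r, t, s);
        Float2 b = ProjectSphere_nodiv(x, z, r, t, s);
        // Relative error of the Min component; Max is checked the same way.
        float err = fabsf((b.x - a.x) / a.x);
        if (err > max_err) max_err = err;
    }
    printf("max relative error: %g\n", max_err);
    return 0;
}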
Even more interestingly, the only reason ICX was able to do that is that, contrary to GCC and Clang, its default floating-point model is not the precise one. In other words, the generated code doesn't necessarily produce exactly the result that the original version would. This is true, by the way, for the hand-written optimization too: it does not respect the floating-point semantics of the original code.
For GCC and Clang, on the other hand, we have to explicitly tell them that they can use "fast math", i.e., do optimizations that may introduce some error. We do that with the flag -ffast-math. GCC wasn't able to do much with that, but Clang did! It generated basically the same version as ICX.
The moral of the story is that compilers can do pretty crazy stuff with your code, but you have to tell them what you do and don't care about.
Let's return to the ProjectSphere example. The original code was:
Float2 ProjectSphere(float x, float z, float r, float t, float ResultScale) {
float Min = (t * x - r * z) / (t * z + r * x);
float Max = (t * x + r * z) / (t * z - r * x);
Float2 Res = {Min * ResultScale, Max * ResultScale};
return Res;
}
And the code I wrote was:
Float2 ProjectSphere(float x, float z, float r,
                     float t, float scale) {
    float tz = t*z;
    float rx = r*x;
    float tx = t*x;
    float rz = r*z;

    float A = (tz + rx);
    float B = (tz - rx);

    float Min = (tx - rz) / A;
    float Max = (tx + rz) / B;

    Float2 Res = {Min*scale, Max*scale};
    return Res;
}
In my code I performed a (compiler) optimization called "common subexpression
elimination" (CSE), and the name is pretty suggestive of what it does. We
basically take expressions that appear multiple times, like t*z
in the
original code, and put them into variables. This is supposedly beneficial
because we avoid re-computing the same stuff. In practice, there are caveats.
CSE generally increases the register pressure, which I explain in great detail
in the
Appendix.
The conventional wisdom argues that we don't have to do CSE ourselves, because
the compiler will do it. In fact, it goes a step further. It shouldn't even
matter if we do it or not! The compiler should be able to see through the
variables. In other words, to the compiler, variables like tz
should be the
same as a macro, i.e., tz
and t*z
are just basically the same thing, so it
doesn't matter how we write the code. But how true are these claims?
Using Clang -O1, the two versions give us basically the same code (same instructions but not in the same order), and the same is true for GCC under -O1. With -O2 and -O3, things change a bit for GCC. In both you get the same code, except my version gets one more movdqa before the ret, so it's slightly slower. For Clang, the differences between the two are wilder with -O2. In this case, my version is slightly better because it uses two subss while the original uses two shufps (shuffles).2
The moral of the story is that saying "it doesn't matter how you write your arithmetic expressions, the compiler will take care of it" is just plainly not accurate. Even worse, there's no golden rule for preferring one version over another. At the same time, though, you really should care about it only if you care about the last bit of performance; yes, the compiler will do the same thing for 90% of your code or so no matter how you write it.
One way to understand what the compiler does when you turn optimizations on is to see what it does not do when you keep them off. Let's consider again our simple times2, but slightly modified so that it uses one local variable:
int times2(int n) {
int x = n*2;
return x;
}
Let's see what Clang generates with no optimizations (godbolt).
times2:
push rbp
mov rbp, rsp
mov dword ptr [rbp - 4], edi
mov eax, dword ptr [rbp - 4]
shl eax
mov dword ptr [rbp - 8], eax
mov eax, dword ptr [rbp - 8]
pop rbp
ret
Without understanding much about the code, it should be relatively clear that rbp is useless. We save its value by pushing it onto the stack. Then, we copy rsp's value to rbp, and, after we do everything using rbp, we just pop rbp's value back from the stack. So, rbp just holds rsp's value all along. Thus, to make our life easier, I'll omit rbp altogether by using -fomit-frame-pointer (godbolt).

If you don't know what a frame pointer is, rest assured it doesn't matter for this article. But, you can always read up more online. Now, the code we get using just rsp is:
times2:
; Save the argument `n`, whose value is in `edi`, to the stack.
; The next available position is `rsp - 4`.
mov dword ptr [rsp - 4], edi
; Load `n` into a register to do operations with it
mov eax, dword ptr [rsp - 4]
; n << 1
shl eax
; `x` is also assigned a stack position. The next available is
; `rsp - 8`. Save the result of the computation above there.
mov dword ptr [rsp - 8], eax
; Load `x` into `eax` because that's what we return.
mov eax, dword ptr [rsp - 8]
ret
I explain the details in the assembly, but there's one general idea you should keep in mind: the compiler follows a very repetitive, naive, and simple methodology in unoptimized builds. It goes roughly as follows. First, every variable is associated one-to-one with a position in the stack (so, every variable gets one position in the stack and that position is associated with only that one variable). This includes all local temporary variables, as well as function arguments. So, at the beginning of the function, we just copy all the function arguments into the stack. Second, whenever we want to use a variable's value, we load it from the stack and into a register, partly because we can't perform computations directly using stack locations; most instructions work with registers. And of course, every assignment to a variable is a store to the stack location of the variable.
Now, this may sound incredibly stupid, because we definitely have enough unused registers not to have to use the stack at all in this case. And frankly, it is. But, the way to think about unoptimized builds is that the compiler should not have to think at all. The method described above produces correct code without any thinking, no matter how many variables we have, whereas the fancier method of only using registers would have to at least do some thinking like: "I will use registers as long as this is possible, and if I run out, I'll start using the stack". But applying this thinking is itself a compiler optimization: register allocation (i.e., deciding how to allocate registers to hold values). And if you want that, you should turn optimizations on. There are different levels to how hard a register allocator works to get values into registers, and in some compilers, lower optimization levels use simpler register allocation strategies.
Also, the method described above makes a debugger's life easier. Because each variable is associated with a stack location and that stack location is not used for anything else, if at any point during the execution of a function you want to see the value of a variable, you just look up the value in that location. In optimized builds, the value of a variable is not associated with any one location or any one register. Rather, it usually moves around. So, the compiler has to communicate all these intermediate changes and the debugger has to keep track of them. Of course, again, for simple functions like the one we saw, the compiler could have one register for each variable such that each variable is associated with only one register.3 But, the tooling we use is not optimized for such "middle-ground" cases. Rather, the mantra is that you either have completely stupid debug/unoptimized builds or "whatever happens" optimized builds.
Finally, note that Clang used a left shift for the multiplication, even in an unoptimized build. GCC, for its part, generates something like our earlier version, where the multiplication is replaced by an addition (godbolt).
For inlining, I'm going to shamelessly plug a ~6-minute excerpt from a talk I gave at the compiler meetup a couple of years ago. It's one of the few talks of mine that I still think are decent, so here you go. In the first 6 minutes I explain the benefits and drawbacks of inlining, as well as some caveats (e.g., when we can inline). Then, I go more into how it happens inside the compiler.
The one thing I'd like to add here that is not in the video is an implementation of inlining in LLVM by Chris Lattner, from 20 years ago.
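Since that content lives in a video, here's a tiny, hand-made illustration of what inlining does (the functions are my own toy example, not from the talk):

static int square(int x) { return x * x; }

int sum_of_squares(int a, int b) {
    return square(a) + square(b);
}

// After inlining square into its caller, the compiler effectively works
// with this instead: the call overhead is gone, and both multiplications
// are exposed to the rest of the optimizer.
int sum_of_squares_inlined(int a, int b) {
    return a*a + b*b;
}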
Let's say we have this function:
void init_scaled(int *array, int n, int scale) {
for (int i = 0; i < n; i++) {
array[i] = i * scale;
}
}
One way to avoid the multiplication inside the loop is to write the code like this:
void init_scaled(int *array, int n, int scale) {
int tmp = scale;
for (int i = 0; i < n; i++) {
array[i] = tmp;
tmp += scale;
}
}
So, we replaced a multiplication with an addition, which is a win given that in
general integer additions are faster than integer multiplications. This
optimization falls under the general category of strength-reduction
optimizations, in which we replace a piece of code with an equivalent one that
is not as "strong", meaning it uses cheaper instructions. Clang (and most
compilers) will do this optimization even in -O1
(godbolt).
A fancier form of strength reduction4 is loop-invariant code motion. Here's a similar example to the one above:
void scale_array(float *array, int n, float multiplier) {
const float pi = 3.14159;
for (int i = 0; i < n; i++) {
float scale = pi * multiplier;
array[i] = array[i] * scale;
}
}
You may notice that the value of scale
is the same across all loop iterations.
So, we can hoist it out of the loop so that the multiplication happens only once:
void scale_array(float *array, int n, float multiplier) {
const float pi = 3.14159;
float scale = pi * multiplier;
for (int i = 0; i < n; i++) {
array[i] = array[i] * scale;
}
}
Clang (and most compilers) will do that (godbolt) even in -O1. Here's the shortened and annotated assembly:
.LCPI0_0:
; This is pi
.long 0x40490fd0
scale_array:
...
; scale = multiplier * pi
mulss xmm0, dword ptr [rip + .LCPI0_0]
...
.LBB0_2:
; float tmp = array[i];
movss xmm1, dword ptr [rdi + 4*rcx]
; float tmp2 = tmp * scale
mulss xmm1, xmm0
; array[i] = tmp2;
movss dword ptr [rdi + 4*rcx], xmm1
...
Perhaps surprisingly, most compilers have multiple subsystems that evaluate code at compile time. The most obvious case is all the machinery used to compute constexpr.5
But, there are some non-obvious compile-time subsystems. Take, for example, the usual Collatz sequence (godbolt):
static int collatz(int n) {
int its = 0;
while (n != 1) {
its++;
if (n % 2 == 0) {
n /= 2;
} else {
n = 3 * n + 1;
}
}
return its;
}
int main(void) {
return collatz(6);
}
Let's see the assembly:
main:
mov eax, 8
ret
First, I intentionally marked the collatz function static so that the compiler leaves only the main function. More on that in the dead-code elimination section.
Now, the compiler basically generated this:
int main(void) {
return 8;
}
In other words, it executed the function at compile time, and replaced the call with the return value of collatz. Note that, as I wrote in a previous article, the compiler can sometimes turn a loop into a constant computation, if the loop has a closed form. But the Collatz sequence famously does not. So, the compiler necessarily ran the loop at compile time.
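For contrast, here's the kind of loop that does have a closed form. As far as I can tell, Clang at -O2 removes this loop entirely and computes something equivalent to n*(n-1)/2 directly:

int sum_upto(int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        sum += i;
    }
    // With -O2, the generated code contains no loop: the compiler
    // recognizes the closed form of the sum (via scalar evolution).
    return sum;
}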
The compiler, in general, has no way of knowing whether the loop will ever finish (especially for the Collatz sequence). So, you may be wondering: can the compiler get stuck in an infinite loop? Generally speaking, no, because the compiler has limits on the number of iterations. For example, when we pass in the number 27, which takes 111 iterations to complete, the compiler generates this (godbolt):
main:
mov ecx, 27
xor eax, eax
.LBB0_1:
inc eax
mov edx, ecx
sar edx
test cl, 1
lea ecx, [rcx + 2*rcx + 1]
cmove ecx, edx
cmp ecx, 1
jne .LBB0_1
ret
So, the loop is definitely there (notice the jne back to .LBB0_1). At the time of writing, the max number of iterations of the subsystem used here is 100.6 We can increase it to 111 by passing the command-line argument -mllvm --scalar-evolution-max-iterations=111 (godbolt):
main:
mov eax, 111
ret
Why the argument has this name is a different story. You may also want to check my article on scalar evolution.
Notice that the compilation is now pretty slow (noticeably slow, in fact, which is massive for only 111 iterations), which indicates that the compile-time execution systems inside compilers are pretty slow. But, they're not designed for performance anyway. That said, last time I checked, the Jai programming language was generating bytecode for compile-time computations, so theoretically it should be faster.
Finally, not all compilers will compute loops at compile time. For example, GCC won't do it for the collatz example (godbolt), nor will ICX (godbolt).
"Dead code" is basically useless/unused/redundant and generally unnecessary
code. Probably what dead-code elimination (DCE) does the most is delete unused
code included by #include
directives. For example, if I add an #include
to the times2
example (godbolt),
you'll notice the code doesn't change and that's because the compiler threw
away everything. Well, that's really for two reasons; the main one is that most
code in .h
files is not executable. For example, a struct
is not executable.
It's just there to inform the compiler how to generate code that uses structs.
But in C++ .h
files you can also have function definitions e.g., inside classes.
These will be thrown away too if not used.
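As a tiny illustration, here's a hypothetical helper that nobody calls; with optimizations on, no code is emitted for it:

// Imagine this helper lives in some header, say util.h, that we include:
static inline int clamp(int x, int lo, int hi) {
    return x < lo ? lo : (x > hi ? hi : x);
}

// times2 never calls clamp, so the compiler emits no code for clamp and
// the output for times2 is unchanged.
int times2(int n) {
    return n*2;
}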
But here's a more subtle case. Above, we saw the collatz function (godbolt), which we marked static. That means the function is local to this module and cannot be (and thus is not) used by any other module. Within this module, it is called in only one place, but this call got inlined. So, the function is now unnecessary and the compiler removed it.
When most people hear about dead-code elimination for the first time, they ask: "well, all the code I write does something, so how does the compiler find useless code?" Above, we saw one example. And in general, most dead code is generated because of compiler optimizations.
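Here's a sketch of how that happens, using my own toy functions: after inlining and constant propagation, the factor == 0 check below is provably false, so the whole branch becomes dead code that the compiler removes.

static int scale(int x, int factor) {
    if (factor == 0) {
        return 0;
    }
    return x * factor;
}

int times_two(int n) {
    // After inlining scale with factor = 2, the `factor == 0` branch can
    // never be taken, so it is eliminated; what's left is just n * 2.
    return scale(n, 2);
}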
Dead-code elimination can be surprising. For example, let's return to the
collatz
example above. This time, however, instead of returning the number of
iterations, we will return n
(godbolt):
static int collatz(int n) {
int its = 0;
while (n != 1) {
its++;
if (n % 2 == 0) {
n /= 2;
} else {
n = 3 * n + 1;
}
}
////////////////////////////////////////
// THIS IS WHERE THE CHANGE HAPPENED ///
////////////////////////////////////////
return n;
}
int main(void) {
return collatz(27);
}
Let's take a look at the assembly:
main:
mov eax, 1
ret
The compiler basically turned the code to:
int main(void) {
return 1;
}
But how? Note that in this case, we haven't passed the argument that increases the limit of the compile-time execution engine. So, how did the compiler eliminate the loop? Notice the loop condition: while (n != 1). The compiler knows that when the loop finishes, n must be 1, otherwise it wouldn't have finished. Thus, the compiler just returns 1.
But how does the compiler know the loop will finish?7 From the C standard... In summary,
infinite loops without side-effects (e.g., printing, or writing to volatile
pointers) are undefined behavior. The compiler always assumes undefined behavior
doesn't happen, and thus it assumes that the loop will finish.
This is one of the many great examples of why you should not think of C as a high-level assembler (i.e., as just a way to write assembly easily). Undefined behavior is at the top of the list of reasons. There are cases in which the compiler will do unexpected things to your code.
2. subss (see) has much better throughput than shufps (see).

3. x's value will be in edx. Still, if you want to return x, you need to copy edx's value into eax, so at that point both eax and edx have x's value. But this doesn't matter for our purposes, because if you ask the debugger for x's value, it can just look up edx without caring where else x's value might be.