Compilers have always been surrounded by an air of mystery and magic. This has led many of us to believe that they do things (well) that they don't, or that they don't do things (well) that they do.1
This article is in some ways a follow-up to a previous one on compiler optimization. Here, I've collected some misconceptions I've come across over the years (many of them mine), doing my best to clear up any confusion. As a heads-up, this article is only about large-scale, mainstream, general-purpose compilers like LLVM, GCC, and ICX. Some of the statements don't apply, e.g., to domain-specific compilers2 or small-to-medium-scale compilers.3 Please do let me know if you'd like to see an article on other classes of compilers.
Now, because this article got a little long, as per a friend's suggestion, here's a table of contents:
First, optimal according to which metric? For example, one metric is code size. But usually the metric compilers target is running time (i.e., latency of a single run of the program). In any case, though, “optimization” is a misnomer. Compilers don't even try to find the optimal program. Instead, they first lower the program in a very simple manner, and then they just try to improve that. “Improving” is much weaker than “optimizing”, but it is accurate.
But why don't we get the optimal program? Well, it's very hard and slow. For
code size there may be some hope of producing optimal programs in a reasonable
time some day because code size as a metric has a bunch of nice properties. For
example, we can measure code size extremely accurately and reliably; it's
basically just the file size of the generated program. Also, different
measurements give you the same exact value (i.e., if you measure the size now
and after a minute, you'll get the same value). Finally, code size has optimal substructure. UPDATE: Actually, I seem to have proved myself wrong here. See this comment. Hopefully, I will find some time to revisit this.
To see what this means in practice, let's say you have a function A whose size you want to minimize (i.e., find the minimum size). Now, let's say
you split the function in two parts and you find their minimums independently.
Optimal substructure tells us that if you just merge these two, you'll
get the minimum version of the whole function. In other words, combining
individual minimums gives you a global minimum.
Running time, on the other hand, has none of that. First, you can't measure it accurately. Different runs of the program will have slightly different running times. Worse yet, the factors that impact the running time are many and subtle (e.g., a popular paper [1, 2] showed that something seemingly innocent like program layout can bias benchmark results by over 40%). Moreover, improving the running time doesn't have optimal substructure. For example, you may have the optimal version of two loops individually, but the global minimum may require fusing these two loops. Finally, “we have absolutely no proper models of the machines we're compiling for”. This is something recent academic work discovered the hard way. goSLP generates globally optimized SLP-vectorized code. But optimized according to what? According to hardware models. Because these models are bad4, the authors discovered that the generated programs were not only not optimal, but they could be slower than what LLVM generates by default!
In most modern compilers, including GCC, LLVM, and ICX, conditional branches can
have weights. These are basically probabilities of whether the branch will be
taken or not. Such weights are added either by profile-guided optimization
(PGO), or by the programmer using intrinsics such as __builtin_expect:
if (__builtin_expect(x == 0, 1)) { // The branch is likely to be taken
}
The misconception here is that, for x86 architectures, the compiler generates branch-hint instructions for the CPU's branch predictor. Until pretty recently, that was simply not the case. Even though Intel introduced branch hints in the Pentium 4, they found that in practice there was no performance benefit and the hints just increased code size.5 You can still include branch hints in your assembly, but the CPU simply ignores them.
So, what are these hints/weights used for? They are hints to the compiler's “branch predictor” which it uses primarily to place code blocks effectively. For example, if a branch is likely, the compiler places its target block directly below the current block so that execution falls through to it, to increase instruction cache locality.
But, Intel recently introduced an architecture, Redwood Cove, in which branch hints are relevant again.6 So, the compiler will theoretically generate these hints for this architecture.
However, it's hard to see this in practice. I couldn't produce any example where you feed C code to GCC, LLVM, or ICX and, with some combination of flags, get back x86 assembly that contains branch hints. I don't even know how to specifically target Redwood Cove. But, we can make the compiler generate the hints. We take this LLVM IR code:7
; Source C code (roughly):
;
; void foo(int x, int *p) {
;     if (__builtin_expect(x == 0, 1)) {
;         *p = x;
;     }
; }
define void @foo(i32 %x, ptr %p) {
entry:
  %cmp = icmp eq i32 %x, 0
  br i1 %cmp, label %if.then, label %if.end, !prof !0

if.then:                                          ; preds = %entry
  store i32 %x, ptr %p
  br label %if.end

if.end:                                           ; preds = %if.then, %entry
  ret void
}
!0 = !{!"branch_weights", i32 49, i32 51}
and pass it to llc8 with:
llc < test.ll -mtriple=x86_64 -mattr=+branch-hint -enable-branch-hint
which gives us (godbolt):
foo:                                    # @foo
        test    edi, edi
        ds                              # <--- This is the branch hint
        jne     .LBB0_2
        mov     dword ptr [rsi], edi
.LBB0_2:                                # %if.end
        ret
Line 3 has the branch hint.
The story may be different for other architectures, but I am mostly familiar with x86.
In 2019, which is not that long ago in compiler years, Emery Berger gave a talk
at the Strange Loop Conference. In that talk, he showed evidence for the
following stark statement (based on the Stabilizer paper): there's no statistical difference between -O2 and -O3 in Clang!9
That's also what I've found in my experience, especially with Clang. GCC's -O2 is less aggressive than Clang's, so there is some difference there, but not a big one. Also, -O3 ignores the code size almost completely, and that can get you into trouble with the instruction cache.
The moral of the story is: this is not a rule. It's one of those cases where it pays to benchmark.
Knowing which paths are hot in JavaScript bytecode is not enough to produce optimized code; you also need to know the types. And you only know the types at runtime. This is the main reason why JIT compilers compile code at runtime.
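To make the type problem concrete, here's a small C++ sketch (not taken from any real JavaScript engine) of what a dynamically typed add has to do, and what a JIT can specialize it to once it has observed the operand types:
#include <string>
#include <variant>

// A dynamically typed value: either an integer or a string.
using Value = std::variant<int, std::string>;

Value add(const Value &a, const Value &b) {
  // Fast path: if both operands are ints, this is what a JIT specializes the
  // call site to: a single machine add guarded by a cheap type check.
  if (std::holds_alternative<int>(a) && std::holds_alternative<int>(b))
    return std::get<int>(a) + std::get<int>(b);
  // Slow, generic path: coerce both sides to strings and concatenate.
  auto asString = [](const Value &v) {
    return std::holds_alternative<int>(v) ? std::to_string(std::get<int>(v))
                                          : std::get<std::string>(v);
  };
  return asString(a) + asString(b);
}
Without knowing the runtime types, the compiler has to emit the generic path every time; once the types are observed, it can emit the fast path plus a guard.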
This depends a lot on the case. Yeah, probably for C/C++ pipelines an interpreter won't be super useful anywhere, and good luck making one anyway. But there are cases where an interpreter can be useful. One interesting case is that of WebAssembly.
Recently, Ben Titzer published a paper titled A Fast In-Place Interpreter for WebAssembly. In this paper, Titzer tells us a story about WebAssembly implementations.
You see, for non-native languages, you usually develop an interpreter first, and then a (JIT) compiler. This is because an interpreter is usually easier to develop and use, while the compiler is something you need for performance. Interestingly, however, in the WebAssembly ecosystem JIT compilers were developed first. As far as I can understand, that was due to three factors, discussed next.
Designed for Compilation. WebAssembly was designed for compilation, and that's because performance was a core goal, not an afterthought. Performance, in this case, means both performance of the produced code and efficiency of compilation itself. For example, WebAssembly can be type-checked in a single pass.
Convincing. Even though WebAssembly had “political” power from the beginning10, the designers still wanted to convince folks and give evidence that WebAssembly is indeed a viable option for users who care about performance. And you can't give evidence for that without a compiler.
Not Ideal for Interpretation. The design for compilation brought about some features that made it hard to implement an interpreter. Most notably, structured control flow. In the paper, Titzer describes the pain points very well.
Nevertheless, folks eventually realized that interpreters had a bunch of advantages. Titzer gives a nice account of six advantages in Section 1.1 of the paper. The gist of the matter is that while you can't usually beat a compiled binary's performance, there are other practical reasons to have an interpreter.
LLVM has the most popular middle-end, but that is not the case for its back-end. I wrote a whole article on the topic.
Almost certainly not!11 Yes, compilers will optimize for instruction cache locality; we saw that earlier. There is also a long history of compiler research on cache optimizations12; essentially none of it has made it into mainstream compilers. Polly is the most serious case I know of a data-locality optimization framework13 becoming part of a big mainstream compiler. Still, it is not turned on by default in LLVM.
Mike Acton's CppCon talk Data-Oriented Design and C++ has gathered some 690 thousand views, in no small part because he drove the point home that a C++ compiler won't optimize cache locality for you. This talk basically started14 the data-oriented design movement, whose main goal is to write code that makes good use of the cache.
But, you may be asking why the compiler won't do it for you. Unfortunately, it
cannot. First, optimizing for data locality requires mass-scale intrusive
changes to the program. For example, let's say you have an open-addressing hash
table. More specifically, here's roughly what the data layout of LLVM's DenseMap looks like:
template <typename KeyT, typename ValueT>
class DenseMap {
struct BucketT { KeyT Key; ValueT Value; };
unsigned NumBuckets;
BucketT *Buckets; // a single array; each bucket holds a key and a value
...
So, we have a single array in which every entry has the key and the value paired together. But, during lookups, we only care about the keys. The way the data is structured, we keep loading the values into the cache, wasting cache space and cycles. One way to improve that is to split this into two arrays: one for the keys and one for the values.
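For illustration, a hypothetical split layout (a sketch, not LLVM's actual code) could look like this:
// Keys and values live in separate arrays, so lookups that only compare keys
// don't drag values into the cache.
template <typename KeyT, typename ValueT>
class SplitDenseMap {
  unsigned NumBuckets;
  KeyT *Keys;      // scanned during lookups; stays hot in the cache
  ValueT *Values;  // Values[i] pairs with Keys[i]; touched only on a hit
};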
To keep the code's behavior the same, we would have to change every place in which this original array is used: the allocations, where we get a key, where we set a value, etc. In C/C++, this is very hard to do reliably. But most importantly, the compiler is plainly not allowed to do it: generally, a C/C++ compiler is simply not allowed to re-order struct fields.15
Last but not least, even if the compiler were allowed to do it, the abstractions in C/C++ programs carry too little information for the compiler to identify such opportunities.16 In simple terms, the compiler has too imprecise a view of a program's memory accesses to have any good idea of which layout optimizations would be beneficial.
Mainstream compilers like Clang and GCC are very slow, so it's more accurate to think of -O0 as giving you debuggable and predictable code, rather than as compiling your code fast.
Generally speaking, yes, -O0 compiles faster than, say, -O2, although I don't have a large-scale experiment to back this up. But for example, compiling a Release build of zstd takes ~34s, while a Debug build takes ~11s. Something similar is true for one of my projects, which uses a unity build and has about 5K lines of code. However, even saying that -O0 is always faster than -O2 is not true. As an example, on my desktop, it takes ~25min to compile a Debug version of LLVM, but ~23min to compile a Release version.17
Point being, if you really want fast compilation, you need to somehow bypass the standard compilation pipeline, especially if you're starting from C++. One option is to just use a smaller compiler, as these are usually faster (e.g., TinyCC). Another option is to at least skip some of the compilation stages. One way to do that is to directly generate LLVM IR, something database researchers have observed too. In a popular 201118 query compilation paper19, the authors say:
[Compiling LLVM IR] usually requires only a few milliseconds for query compilation, while C or C++ compilers would need seconds
And query compilation is one use case where fast compilation matters since it's online.
It is easy to see why the quote is true if the compiler we use is Clang, because Clang needs to go through LLVM IR anyway. Also, C/C++ has only a textual format, whereas LLVM IR also has a binary format (bitcode), which speeds up processing.
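To give a flavor of what generating LLVM IR directly looks like, here's a minimal sketch using LLVM's C++ IRBuilder API (the module and function here are made up; a real query compiler would emit code for its query plan instead):
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Module.h"
#include <memory>

// Build a module containing `i32 add(i32, i32)` directly as LLVM IR,
// skipping the C/C++ frontend entirely.
std::unique_ptr<llvm::Module> buildAddModule(llvm::LLVMContext &Ctx) {
  auto M = std::make_unique<llvm::Module>("query", Ctx);
  llvm::IRBuilder<> B(Ctx);

  auto *I32 = B.getInt32Ty();
  auto *FnTy = llvm::FunctionType::get(I32, {I32, I32}, /*isVarArg=*/false);
  auto *Fn = llvm::Function::Create(FnTy, llvm::Function::ExternalLinkage,
                                    "add", M.get());

  auto *Entry = llvm::BasicBlock::Create(Ctx, "entry", Fn);
  B.SetInsertPoint(Entry);
  B.CreateRet(B.CreateAdd(Fn->getArg(0), Fn->getArg(1)));
  return M;
}
The resulting module can then be handed straight to LLVM's optimizer and JIT, with no parsing, preprocessing, or template instantiation in sight.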
Not really... C++ templates are slow to compile because C++'s compilation model is terrible.20 Templates, of course, increase the compilation complexity, but not in a way that makes them impractical.
A case in point is Dlang's standard library, Phobos. On my not-particularly-fast laptop, compiling Phobos in an optimized build takes about 12 seconds, even though it has more than 250 thousand lines of code full of templates.21 Note that DMD, D's main compiler, doesn't do anything particularly fancy. Of course, Walter Bright—the mastermind behind D and DMD—is one of the best compiler engineers ever and he definitely knows more than a thing or two about how to make a compiler run fast. But DMD doesn't do anything crazy; it's not even multi-threaded.
Separate compilation is the idea that you compile a separate object file for each source file.22 The reasoning for why this is beneficial is that if you change one source file, you only have to re-compile that one file and "just" link it with the other object files; you don't have to re-compile the whole project.
This sounds attractive and has become some kind of sacred creed in computer science education,23 and from there has crept into production environments. The problem is that in practice, linking is incredibly slow. Unfortunately, I have not studied linking deeply enough to comment on whether the problem is inherent or not.24 But little does that matter; reality is what it is.
So, for many projects, separate compilation is not worth it. Instead, a unity build—where you just include everything into a single file—is usually a much better option. Let me give you some examples. First, I have confirmed that it's not worth it for any of the C/C++ projects I have written; that is, up to about 20K lines of code in a single project. For example, minijava-cpp is a MiniJava compiler written in C++ which has some pretty heavy code in it. It compiles in ~0.7s on my laptop, while compiling and linking all the separate files takes much longer. In short, I have never come across a case in my projects where separate compilation is better than a unity build.
Beyond personal experience, Ubisoft has also
reported using unity builds to improve
compilation times of its games.25 SQLite, too, compiles everything into a single .c file (the “amalgamation”).
Beyond compilation times, unity builds have other advantages. For example, you get whole-program optimization without needing to apply link-time optimization (see below). Also, everything is compiled only once, something that contributes a lot both to faster compilation speeds and nicer error logs. With C++'s compilation model (see above), if you include the same file in different translation units (or files for simplicity), it will get compiled once for every unit. If you now add on top that templates generate many versions of the same thing, you have the same code being re-compiled over and over.
In conclusion, I cannot make the statement that separate compilation is never useful, because I don't have the evidence to support that. But, I have never found it to be a win in my projects, and that seems to be the case for some big projects too. So, give unity builds a try.
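In case you haven't seen one, here's a minimal sketch of a unity build (the file names are hypothetical):
// unity.cpp: the only translation unit we compile, e.g. with `c++ -O2 unity.cpp`.
// Every .cpp below is compiled exactly once, and the compiler sees the whole
// program at once, which gives whole-program optimization without LTO.
#include "lexer.cpp"
#include "parser.cpp"
#include "codegen.cpp"
#include "main.cpp"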
Here's a story. Two years ago, I was auditing Vikram Adve's26 course “Advanced Compiler Construction”. At some point during the class, Vikram posed the million-dollar question: why does link-time optimization happen at link-time? This may sound like a tautology, but in reality the question is: why does whole-program optimization happen at link-time?
This is a very reasonable question because no one actually cares about link-time optimization; people care about whole-program optimization. And linkers are generally not optimizers; that is not their job. In fact, theoretically it makes much more sense for whole-program optimization to happen in the middle-end. First, the middle-end's job is optimization. Second, the middle-end operates on an IR made for optimization, contrary to the assembly that object files normally contain. Furthermore, nothing prevents the compiler from having a global view of the program. It can just translate all the input C/C++ files to LLVM IR, and then it has a view over the whole program. Then, Dear God, why does it happen at link-time? I should say, no one knew the answer. Thankfully, Dear God illuminated us.
Vikram explained (well, I'm paraphrasing) that the answer lies in the sad reality of C/C++ build systems. In huge projects, good luck finding all the source C/C++ files and figuring out what calls what. The C++ tooling is just not made for that. On the other hand, a linker necessarily has to be able to find all the object files. So, the tooling is such that a linker can gather all the code for a project. Of course, object files normally contain assembly, and no one wants to optimize that. So, the compiler just sticks, e.g., the LLVM IR into the object file, so that the compiler can get it back at link time.
Avoiding a call instruction (sequence) is beneficial, but even that benefit comes partly from instruction cache locality, not because the instruction itself is that slow. Most importantly, though, inlining's biggest benefit (by far) is that it's an enabling transformation, meaning it enables other optimizations. I have a more detailed explanation here, but let me highlight one frequently overlooked effect of inlining.
Inlining enables interprocedural optimization without actually needing any interprocedural compiler techniques (apart from inlining itself). Interprocedural optimization techniques are those that involve more than one procedure (i.e., function); they optimize across functions. Interprocedural optimizations are difficult to implement exactly because they involve multiple functions. Optimizations are usually difficult to extend even beyond a small set of basic blocks or a single loop, let alone beyond function boundaries. Inlining, however, in effect allows interprocedural optimization to happen. For example, suppose you have the following code:
#include <limits.h>

int sat_div(int num, int denom) {
if (denom == 0) {
return (num > 0) ? INT_MAX : INT_MIN;
}
return num / denom;
}
int foo(int a) {
int b = sat_div(a, 3);
return b;
}
After optimization, the code for foo
looks like (godbolt):
// ... sat_div is the same
int foo(int a) {
// The generated code for this looks confusing because
// the compiler has turned a division into a multiplication.
return a / 3;
}
So, the compiler performed constant propagation by propagating the constant 3 to denom; it eliminated the dead if, as the condition is false in this case; and it turned the division into a multiplication. But how did it do it? Well,
not through interprocedural versions of constant propagation27, dead-code elimination, or instruction
combining. No, the effect happened because the compiler first inlined the call
to sat_div
(godbolt), which then enabled
the intraprocedural28 versions of these
transformations. In other words, inlining brought everything into a single
function, which allowed the conventional transformations to work. This is why inlining is considered by many
(e.g., Chandler Carruth) the single most
important optimization in optimizing compilers.29
Does that mean that inlining is the only approach to interprocedural optimization? Well, on the one hand, there is a long literature on interprocedural optimizations, and some of it has made it into production compilers.30 On the other hand, for most mainstream compilers, almost all the interprocedural work is done by inlining. This is also why virtually all recent research on compiler cost models related to interprocedural optimization targets inlining.31
I'm talking about this thing (cppreference):
#ifndef EXAMPLE_H
#define EXAMPLE_H
// function included in multiple source files must be inline
inline int sum(int a, int b) {
return a + b;
}
#endif
Most of the documentation on this keyword is devoted to its role in linking. And indeed, this is where it matters semantically. But the question I'm here to answer is: Does it have to do anything with the inlining optimization?
This was supposed to be an easy one. After all, Chandler Carruth told
us that “the inline
keyword in C++ has
possibly nothing to do with optimization”.32 But the story is not so simple. On the one hand, cppreference tells us
that:
The original intent of the inline keyword was to serve as an indicator to the optimizer that inline substitution of a function is preferred over function call.
and if we look at the latest C++ standard draft33, it says34:
The inline specifier indicates to the implementation that inline substitution of the function body at the point of call is to be preferred to the usual function call mechanism. An implementation is not required to perform this inline substitution at the point of call.
On the other hand, cppreference also tells us:
[...] the meaning of the keyword inline for functions came to mean "multiple definitions are permitted" rather than "inlining is preferred" since C++98, that meaning was extended to variables.
With all these conflicting statements by credible sources35, it's reasonable to expect some confusion. So, what is the truth? As with many things in C++, the “truth” is whatever the implementation has chosen. Since I am most familiar with LLVM, we'll take a look at that.
So, LLVM does pay attention to the inline
keyword, and has been doing so since at least 2010. LLVM IR has the inlinehint
36 attribute which Clang will add to your function
definition if you add the inline
keyword.37 For example, in this snippet, Clang adds the inlinehint attribute to int_run. According to the LLVM IR reference on inlinehint:
This attribute indicates that the source code contained a hint that inlining this function is desirable (such as the “inline” keyword in C/C++). It is just a hint; it imposes no requirements on the inliner.
Indeed, the optimizer increases the inline threshold for functions that have the inlinehint
attribute.38 But the change is relatively small. So, yes, the inline keyword has something to do with optimization, and yes, the optimizer will care if you add it; just not too much (sorry!). If you really want a function to be inlined, you should add the always_inline attribute, which is a totally different story: the compiler will always inline such functions, no matter what, unless it's impossible.
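For illustration, here's a minimal sketch contrasting the two (GCC/Clang attribute syntax; the function names are made up):
// `inline` merely nudges the inliner (Clang lowers it to the inlinehint
// attribute and slightly raises the inlining threshold).
inline int inline_me(int a, int b) { return a + b; }

// `always_inline` forces inlining at every call site, unless inlining is
// impossible (e.g., recursion).
__attribute__((always_inline)) inline int really_inline_me(int a, int b) {
  return a + b;
}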
Look, I love LLVM!39 In many ways, it's pretty educational. But it's a monstrosity! That's because LLVM is applied to so many use cases. LLVM is used from embedded systems to Google's data centers, from JIT compilers to aggressive high-performance scenarios, and from Zig to Rust to <name the next new language>. So, unavoidably, if you study LLVM, you'll come across a lot of edge and special cases it tries to handle which are of little educational value. You will also see a bunch of advanced algorithms. For example, the standard Lengauer-Tarjan Algorithm is already hard to understand. LLVM uses an even more advanced algorithm that was introduced recently.
So, while I definitely suggest that you take a serious look at LLVM internals at some point, I don't suggest that you start there. Just to be clear: this is if you are interested in learning about compiler development. If you are a user, LLVM is probably the best choice. So, to learn compilers, as I suggested in a Reddit comment, you can take a look at the Go compiler. It's a production compiler, with some very serious people behind it, but it's not the monstrosity that LLVM is. I also learned a lot from the folks at LDC and DMD.
Nope, undefined behavior can also disable optimizations. I wrote an article on this.
Generally speaking, a compiler is always allowed to define undefined behavior.40 For example, the compiler is allowed to define a dereference of an int* pointer that is NULL to evaluate to 0. But this has performance repercussions, as it would, e.g., require that the compiler insert a check before every pointer dereference.
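To make this concrete, here's a purely illustrative sketch (not what any compiler actually emits) of the kind of check that would have to guard every load through a pointer if a NULL dereference were defined to evaluate to 0:
// Hypothetical lowering of `int v = *p;` under that definition.
int checked_load(const int *p) {
  if (p == nullptr)
    return 0;      // the "defined" behavior
  return *p;       // the normal load
}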
A more subtle argument is that the compiler can define all undefined behavior to be the platform's behavior. This would kill two birds with one stone. First, the compiler wouldn't need to add any extra code because the behavior it would have chosen would be what the target would do anyway. As such, there would be no extra overheads. Second, suddenly as a programmer you would know exactly what would happen. While this argument sounds great, it is truly not that simple. I wrote an article that examines this suggestion seriously and in detail.
The recent explosion in Artificial Intelligence (AI) has not left compilers untouched. In many cases, for good reason. For example, cost models are one plausible application of machine learning. For simplicity, here's how a cost model works: it is given a space of possible solutions and it chooses one of them. For example, let's say you have a loop that can be tiled by a factor of 16, 32, or 64. You may, then, have a cost model that is asked to choose among four options: don't tile at all, or tile by any of the three factors.
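As a purely illustrative sketch (not any real compiler's cost model), a hand-written model for this decision might look like:
#include <cstddef>
#include <initializer_list>

// Pick a tile factor by estimating whether the tile's working set fits in an
// assumed 32 KiB L1 data cache. All numbers are made up for illustration; a
// real cost model is far more involved.
int pickTileFactor(std::size_t bytesPerElement) {
  const std::size_t L1Bytes = 32 * 1024;              // assumed L1d capacity
  for (int tile : {64, 32, 16}) {                     // the candidate factors
    std::size_t workingSet =
        static_cast<std::size_t>(tile) * tile * bytesPerElement;
    if (workingSet <= L1Bytes)
      return tile;                                    // largest factor that fits
  }
  return 1;                                           // don't tile at all
}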
For the longest time, cost models have been written and maintained by hand. However, hand-tuned cost models are often tedious to make, hard to scale (especially as the number of architectures increases), and, anyway, hardware vendors tell us so little about their architectures that hand-tuned cost models are (just like machine-learning-based ones) black magic; they are neither interpretable nor intuitive. In other words, a great case for machine learning and neural networks, because they can automate the process. Indeed, a lot of research on machine-learning cost models has appeared in the last decade, and some of it has even made it into production. For example, Google recently adopted a neural-network cost model for its XLA compiler.
But there's a catch. Cost models are a good application of neural networks because you don't need any strict guarantees. In particular, the way to think about it is that cost models are given a set of correct programs to choose from. As such, no matter which choice the cost model makes, the generated code will be correct!41 So, even though neural networks currently have this “minor” issue that we have no idea how they work and they come with no guarantees, this is fine, e.g., for cost models. However, it is not fine for code generation!42 This used to be obvious, but my recent impression is that it's not anymore. For example, we recently got a comment from one of the reviewers of Dias which basically asked “can't you just use an LLM?” No! Please allow me to take a moment to explain why using LLMs for code generation is not (at least currently) ok at all.
Here's the question this boils down to: Is it ok if the code generated by the compiler is correct, say, 99% of the time? We need to clarify this a bit. Does that mean that out of 100 generated programs, 99 are fully correct and one is near-total garbage? Obviously not; that's not realistic. A more realistic version would be something like this: out of 100 lines of code, the compiler generates correct code for 99 of them. I know this is still pretty simplistic, but I think it's good enough for this argument. So, is that ok?
I hope that your answer is no. From a compiler developer's standpoint, this is absolute garbage. Basically, such a compiler is unusable. At best, it is some kind of research artifact that helps you explore an idea. But forget production. It's not even ok for debugging. To see why, consider a small project with, say, 5000 lines of code. With a 99% correctness rate, this means that in every compilation, 50 lines of code are incorrect. Fifty! And the worst part is: you don't know which, and they can be different with every code change. You have probably had the experience of tracking down a bug in a single line of code, which can be both frustrating and time-consuming. Imagine how it is to debug 50 changing lines of code! Now, imagine moving this to a large-scale project, with possibly millions of lines of code.43 No, thanks.
A final note: Even if LLMs were producing correct code, they are currently extremely slow compared to compilers. I know, I said compilers are not that fast in previous sections. But, compared to LLMs, mainstream compilers are blazingly fast. So, even though LLMs have a promising future, and already a bright present, there are some things they currently don't solve. One of them is (online) code generation.