A few months ago I published some articles that got the attention of people getting into compilers. Some of these folks reached out to ask me about how they can get started with compilers. Now, here's the problem: I don't think there's a single resource that introduces folks to compilers the way I think it should be done.
“Do you mean to tell me that I can't just grab the dragon book and be done?” Yes, that's what I'm telling you. Now you may ask who the he^l I am to have an opinion above Aho's. Fair, but folks asked me for my take, and this article is the answer. Because I don't think anyone can take a quantitative approach to the topic, it will necessarily be mostly my opinion, albeit backed by what I hope are sound arguments.
So, are you ready for all the “calm” Reddit comments? Yes? Wait. Are you really? Ok, let's go then.
So you really want the easy-and-short answer... Ok, here we go. If you were to pick one resource, it depends on how deeply you want to go. For a decent depth, I would choose Nora Sandler's Writing a C Compiler article series. It teaches you the basics of a front-end (lexing, parsing) and a back-end (code generation for x86), all in one.
She went on to publish a book too, which has much more content! Choose whichever suits you. But (!), my suggestion is to not write the compiler in OCaml (or Python) or any other high-level language, as Nora suggests. The problem is that languages like OCaml can make your life too easy. That's good if you're an experienced developer who wants a quick prototype, but not so good for learning purposes. In particular, there's a lot of processing a compiler needs to do, and by writing, say, a C compiler in C, you really get to know all the work involved (implementing the necessary data structures, writing the algorithms in detail, doing some performance optimizations, etc.).
If you really want to go deep, then I think there's one book you want: the one on the LCC compiler. LCC is a full ANSI C retargetable compiler which comes with the source code and—most importantly—a whole freaking book which explains almost all of it. This must have been an enormous effort, but it paid off. This is basically the only book that explains a real compiler in depth. That includes many things you wouldn't see in an educational compiler, e.g., custom memory allocators, string interning, and cool hash tables.
Does that mean the book has no drawbacks? No! It has serious ones. For one, it's quite dated. For example, no one cares about reading source files in chunks anymore, and it doesn't deal with SSA at all. Also, a lot of the code is ugly. Examples include the AST simplifications, or the trees that become DAGs in the back-end. The biggest source of ugliness, however, is that it is a single-pass compiler. Yep: lexing, parsing, type-checking, and codegen, all smooshed together, one after the other. Finally, it's just not easy to follow, partly because it's a real thing. In short, you'll sweat.
Nevertheless, I do think the book makes up for all these drawbacks. It's the only one-stop shop to learn how to write a real compiler with everything involved. It includes lexing, parsing, type checking, mild optimizations, and code generation, for a real source language, with real intermediate representations, and real (and multiple!) targets.
Lexing is when we break the input into tokens, i.e., the smallest bits the rest of the compiler cares about (e.g., a C compiler generally doesn't care about spaces so these are not tokens). Nora Sandler does a good job at introducing you to lexing, so I'd take a look. Another less principled resource is Handmade Hero's Implementing Introspection video. Does it matter it's less principled? No. As with everything in compilers, you learn by doing. This video is a good introduction because Casey creates a lexer that just does the job. That's what you want when you're starting out. You can do (and probably come up with) more principled patterns when you need them.
First things first: Skip LR parsing. If you don't know what that is, good. Few compilers use it in practice for reasons that I explain here. So, you really want to focus on nailing down top-down recursive-descent (RD) parsers.
There's no better way to learn how to do recursive-descent parsing than writing an RD parser. But there's a catch. RD parsing can feel overwhelming if you've never seen it before, even though once you see it, you can't unsee it. And who do we turn to for that? Nora Sandler again! Not only does she do a good job at introducing you practically to RD parsing, but also she gently teaches you about all the terminology around parsing like ASTs and grammars.
Type checking is usually the biggest component of a front-end, and the most interesting. Nora Sandler does type checking in the book. From a cursory look, it seems good. Type checking is something that is sort of an art; generally, it cannot be automated (while parsing can). This means that, as with everything in compiler development, there's not much of a point in reading/watching someone else do it for long. So, here's what I recommend: read Nora's Part II for a simple language, and then do it on your own! That said, if you do want to watch someone else do it for long, I recommend Per Vognsen's Bitwise Series. More specifically, days 6-9 are about type checking.
But again, I think it's better to do it directly for your own language. A great language to write a type-checker for is MiniJava. It has a bunch of interesting stuff (besides basics like assignments, ifs, loops, etc.): classes, methods, overriding, inheritance, and so on. I'd give it a try. In fact, I gave it a pretty hard try in minijava-cpp. There's something in this for you. You see, there's really no better way to check you've done it correctly than throwing a large and diverse set of test cases at it. Well, minijava-cpp has a decent test suite. So, you can ignore my code but use my test suite.
By “back-end” I mean generating assembly code and doing target-dependent optimizations. Again, my suggestion here is Nora Sandler's articles or book. In this case, however, there's a deeper reason. Many people believe that it's very hard to write a back-end. Even in production compilers like LLVM, with millions of lines of code in other components, the back-end is considered the most esoteric piece. However, just generating correct code is not hard (it's writing optimizations and supporting multiple architectures that are hard). But, the only way to convince yourself about that is to generate x86 code yourself. Nora does a good job at motivating you towards that.
After you do that enough, you'll be able to generate x86 (or MIPS) code for most C features. At that point, there are two ways to go: either you increase the complexity of the source language or you generate more optimized code. Well, why not both? I would suggest generating code for MiniJava (and having fun with the virtual tables).
Regarding generating more optimized code, in a modern back-end there are basically two main components: register allocation and instruction selection. Conveniently, Engineering a Compiler has two whole chapters devoted to these topics, one for each.1 I personally like instruction selection more than register allocation. If you're like me, I'd definitely take a look at Instruction Selection: Principles, Methods, and Applications, which is basically someone's multi-year effort of reading everything anyone has ever written about instruction selection. That said, the book is not an unstructured lump of everything ever. It is nicely structured and decently written.
The middle-end2 is conventionally thought of as the component for target-independent optimizations. Here, I'll deviate slightly from my previous suggestions. In everything I said before, I urged you to build your own infrastructure. You can do that for the middle-end too, and whether you should depends on what you care most about: compiler architecture or compiler algorithms. For the former, build your own infrastructure; otherwise, use someone else's in the beginning.
At this point, most people would recommend getting involved in LLVM. But, while LLVM has many benefits, it's a massive piece of software. Once you get the hang of the infrastructure, it really makes many things easy. But it may be scary at first, and you really don't need all this infrastructure to code your first couple of algorithms. So, what I recommend instead is the Go compiler. The infrastructure is super easy to understand; if you run the code and start modifying it, you'll get the hang of it pretty quickly.
So, the question now is: which algorithms to code first? For this, we turn again to the book Engineering a Compiler. I suggest coding the following algorithms from the book: Dead-code elimination (what the book calls “eliminating useless code”), inlining,3 local value numbering,4 SSA construction,5 and dominator tree construction.6
Then, I suggest moving to LLVM. I wish I had a single resource with a good introduction to LLVM, but I don't. So, in the following I'll try to give you a guide with excerpts from multiple resources.
First, let's learn about LLVM IR. It is an assembly-like IR, with instructions, registers (an unlimited number of them!), and functions. Your first step is to get some intuition of how we get from C to LLVM IR. You don't have to learn the details; you just need a sense of it. And the fastest way to get this intuition is to start giving small examples to Clang and seeing how it translates them to LLVM IR (godbolt helps a lot).
Once you get a vague sense, I suggest this LLVM IR Tutorial. But, here are some notes:

First, in LLVM IR all type conversions are explicit. In C, when you write char c = 'a'; int b = c; there's an implicit conversion from c's type to b's type. Also, when you do 1 + 2.0, there's an implicit conversion of 1 from int to double. Such implicit conversions don't exist in LLVM IR; that's why you have to say explicitly that you want to extend an integer from, e.g., i1 (i.e., boolean) to i32 (i.e., 32-bit integer).

Second, the tutorial mentions opt. For now, you can think of opt as the middle-end optimizer. It takes in IR and it spits out IR. If you give it IR and you tell it nothing, it will spit it out as is. But you can also pass flags like -verify. You can also specify passes, as you will soon learn!

Third, instead of assigning a variable like a to a virtual register, you can save it in memory. Allocate some space on the stack for the variable, and every time you want to assign a value to the variable, store the value to this space. Every time you want to use/read it, read from this space. This is actually what Clang does by default (godbolt). And by seeing this example, I don't even have to explain a lot about what the instructions alloca, load, and store do. The first allocates space on the stack (in this case, enough space to keep an i32) and returns a pointer to its beginning. The second loads from a pointer, and the third stores a value to the location a pointer points to. By the way, the arguments I gave to Clang in the Godbolt snippet are not magic. -emit-llvm tells Clang to print LLVM IR instead of compiling to an executable. -S says to print it in human-readable format (because LLVM IR can be exported to, and re-loaded from, a binary format too). The third removes debug information because it just clutters the code. Finally, for now you can ignore most of the stuff you don't understand, e.g., what dso_local is, what #0 is, and what an attribute is.

After you gain some familiarity with LLVM IR, the next step is to become familiar with the LLVM infrastructure. The hardest problem here (maybe to your surprise) is setting things up.
In my experience, the best way to get familiar is to compile a Debug version of LLVM and start stepping into the code. Warning: A Debug version of LLVM takes a lot of space, on the order of 100GB. Now, here's how to get a Debug version. First, download LLVM (e.g., from GitHub):
git clone https://github.com/llvm/llvm-project
Then, create a different folder like llvm-dbg-build
so that your current
directory looks like this:
├── llvm-dbg-build
└── llvm-project
Now, you want to install ninja
, cmake
—and while you're at it—I
would also install lld
to make our life easier. After that, you go to
llvm-dbg-build
and you run a CMake script like this:
cmake -G Ninja ../llvm-project/llvm \
-DLLVM_ENABLE_PROJECTS='clang;compiler-rt;clang-tools-extra' \
-DLLVM_TARGETS_TO_BUILD='X86' \
-DCMAKE_BUILD_TYPE='Debug' \
-DLLVM_USE_LINKER=lld \
-DLLVM_OPTIMIZED_TABLEGEN=ON
Here's what these arguments do:

-G Ninja tells CMake to use ninja as the build system. ninja is generally better than make, and in my experience with LLVM, faster.

-DLLVM_ENABLE_PROJECTS indicates which projects to build. LLVM has many sub-projects (e.g., one is lldb). Here I'm going for a relatively minimal build because most are not relevant. In particular, I'm building clang because a Debug version of Clang gives us more readable LLVM IR, and you can also step through its code if you need to. The other two are kind of irrelevant.

-DLLVM_TARGETS_TO_BUILD indicates the targets. We really don't want to build for all targets because the build will be massive. Assuming you have an x86 machine, you're good with just X86.

-DLLVM_USE_LINKER=lld tells the build to link with lld, because it links faster than, e.g., ld.

You can run this and then run:
cmake --build .
or use ninja
directly, e.g.:
ninja -j8
Now you'll have to wait for some time... One thing to keep in mind is that when you're working on LLVM, i.e., during iteration, you'll probably be interested in only one tool, like opt. You'll get faster builds if you just build that, e.g.:
ninja opt -j8
After it finishes, you can check that you can make Clang generate LLVM IR with something like:
.../llvm-dbg-build/bin/clang test.c -o test.ll -emit-llvm -S
At this point, many people would suggest writing your own LLVM pass! And indeed, there are many related tutorials. The problem is that writing your own pass from scratch involves a lot of irrelevant crap just to register the pass. So, what I recommend is to modify an existing pass.
So, the way I'd start is from ADCE. This file implements the dead-code elimination algorithm described in the book Engineering a Compiler, which I mentioned above. So, assuming you know the algorithm, you can first get familiar with the infrastructure by reading how LLVM implements it (the implementation is relatively simple, which is evident from the small number of lines of code). Then, you can start modifying it accordingly depending on what you want to do. To me this is a much better way to get familiar than dealing with pass registration.
This is a relatively short article (unlike the amount of time I sat thinking to write it) because when it comes to compiler engineering, as I said multiple times, you learn by doing! So, why don't you get your keyboard rolling?7