May 7, 2025

Background

Gather around kids! Today we'll explore the following intriguing question: what happens if we inline everything?

If you're not familiar with inlining, you should know that it is considered by many (including important figures in the compiler community like Chandler Carruth) to be the single most important optimization in optimizing compilers. For a little more info on how it works, here's a shameless plug of a presentation we gave with a bunch of folks at the LLVM Developers' Meeting on interprocedural optimization. I talked about inlining, and watching the first ~6 minutes should give you a good idea.1

In this video I mention that inlining has drawbacks. In other words, yes, inlining is supposed to be the super duper tool in the compiler's toolkit, but you're supposed to use it with care! Or should you?

What are the drawbacks anyway? In the video I mention basically three: (a) code duplication, (b) code-size explosion, and (c) increased register-allocator pressure. (a) can increase compile times2 because the same code is compiled multiple times. (b)3 can create large executables, which both take more space on disk and take longer to load. Moreover, it can theoretically lead to instruction cache misses as you jump through all this code. The problem with (c) is that it can obviously lead to unnecessary register spills.

But today we're going against frugality. The attitude we'll take is simple: we only care about runtime performance. This means “I have a lot of disk space, and a lot of time to spare on compile times; let's make our code as fast as possible!” In other words, if we ignore other constraints and we inline everything, will the runtime performance be any worse (as (b) and (c) would imply)? To the best of my knowledge, nobody has answered this (using experimental evidence) up to now. So, let's find out.


Can we inline everything?

Well, no. There are calls that LLVM simply cannot inline;4 I mention some cases in the video. In the following experiments, most of the code used will be available to the compiler (basically except for standard-library code), and there won't be many virtual functions or function pointers (definitely not in the code we'll write). So, the biggest obstacle to inlining will be recursion (which will also be rare). In short, the vast majority of functions will be inlinable.

I should note here that LLVM can inline some simple recursive calls, particularly calls to tail-recursive functions, but only because the tail-recursive call elimination pass removes the recursion first (see this). So, for example, it can inline a simple factorial function:

int fac(int x) {
  if (x == 1) {
    return 1;
  }
  return fac(x - 1) * x;
}

but it cannot inline a recursive fibonacci function:

int fib(int x) {
  if (x == 1) {
    return 1;
  }
  if (x == 2) {
    return 1;
  }
  return fib(x - 1) + fib(x - 2);
}

Making LLVM inline everything (else)

This should have been easy, but it's not. In principle, one should only need to change the inlining profitability analysis, i.e., the part of LLVM that decides when inlining is beneficial. If that were the case, folks could change the profitability analysis without having to understand the internals of inlining. For example, a machine-learning researcher who is not a compiler expert could swap in their own profitability analysis. In our case, we would just change the profitability analysis to always say yes!

Unfortunately, however, things are not so simple in modern compilers. It turns out I wrote a whole thesis on this problem. Today, the transformation and profitability analysis code is tightly coupled, such that one cannot change the one without understanding the other. Nevertheless, inlining in LLVM is one of the good cases, as we have a better separation than in other passes.5 So, let me briefly explain the course of an inlining decision today. We'll use the latest LLVM release, which, as of writing this article, is 20.1.0.

The inlining pass starts at Inliner.cpp. This is a Call-Graph Strongly-Connected-Component (CGSCC) pass. To get a good idea of what this means, I suggest watching this excerpt by Chandler Carruth, or watching the latter half of my video. But you may be relieved to know that it's not that relevant to this article; you can just think that the Inliner visits each function separately. If you want the TL;DR, here it is: we take the call graph,6 we split it into strongly connected components (SCCs), and then we visit each SCC separately.7
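
If a concrete picture of an SCC helps, here's a hedged sketch (is_even and is_odd are made-up names): two mutually recursive functions form a cycle in the call graph, and therefore a single SCC.

/* Made-up example: is_even() and is_odd() call each other, so the call graph
   has the edges is_even -> is_odd and is_odd -> is_even. That cycle makes the
   two functions a single strongly connected component. */
int is_odd(unsigned n);
int is_even(unsigned n) { return n == 0 ? 1 : is_odd(n - 1); }
int is_odd(unsigned n)  { return n == 0 ? 0 : is_even(n - 1); }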

As I mentioned above, most of the complexity in inlining comes from the profitability analysis. To decide if it's profitable to inline a function call, LLVM uses what it calls an InlineAdvisor. So, eventually we ask for its advice here.

...
std::unique_ptr<InlineAdvice> Advice =
    Advisor.getAdvice(*CB, OnlyMandatory);
...

If you follow the call, you'll end up here.

std::unique_ptr<InlineAdvice> InlineAdvisor::getAdvice(CallBase &CB,
                                                       bool MandatoryOnly) {
...

Now, if we're not using any fancy inline advisor (e.g., for size), we'll end up here.

std::optional<llvm::InlineCost> static getDefaultInlineAdvice(
    CallBase &CB, FunctionAnalysisManager &FAM, const InlineParams &Params) {
...

As is obvious from the code, what we need to focus on is this call to shouldInline().

...
return llvm::shouldInline(
    CB, CalleeTTI, GetInlineCost, ORE,
    Params.EnableDeferral.value_or(EnableInlineDeferral));

But what is actually important is this lambda here.

auto GetInlineCost = [&](CallBase &CB) {
  bool RemarksEnabled =
      Callee.getContext().getDiagHandlerPtr()->isMissedOptRemarkEnabled(
          DEBUG_TYPE);
  return getInlineCost(CB, Params, CalleeTTI, GetAssumptionCache, GetTLI,
                        GetBFI, PSI, RemarksEnabled ? &ORE : nullptr);
};

If you follow the calls, you will eventually reach the place in the code that concerns us: getInlineCost(). In particular, this line concerns us.

...
InlineResult ShouldInline = CA.analyze();
...

Now, if you look at the rest of the code inside getInlineCost(), you'll see that we return either a success or a failure. So, you would think that making LLVM inline everything would be as simple as: only honor the user's decision, but otherwise add a return InlineCost::getAlways("Decided by our majesty") here. Let's do this and see what happens.
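
To make that concrete, here's a hedged sketch of what the naive change might look like around the CA.analyze() call shown above (the exact surrounding code in InlineCost.cpp differs between LLVM versions):

// Sketch only: by this point getInlineCost() has already honored the
// attribute-based "user decision", so we stop consulting the cost model
// and force inlining for everything else.
InlineResult ShouldInline = CA.analyze();
(void)ShouldInline; // ignore whatever the analysis concluded
return InlineCost::getAlways("Decided by our majesty");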

To test this, I have written some simple image processing code in C, using the stb_image* family of libraries (which is available in the public domain). You can find more information on the code by expanding the following section.

Image Processing Code

This is a simple image processing pipeline that loads JPEG images, resizes them, and saves them as PNG. The images I used for testing are the following simple images, available without a copyright: Photo by M Venter, Photo by Lukas Kloeppel, Photo by eberhard grossgasteiger

#define STB_IMAGE_IMPLEMENTATION
#define STB_IMAGE_RESIZE_IMPLEMENTATION
#define STB_IMAGE_WRITE_IMPLEMENTATION

#include "stb_image.h"
#include "stb_image_resize2.h"
#include "stb_image_write.h"

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define NUM_IMAGES 3
#define MAX_RES_HEIGHT 5000
#define MAX_RES_WIDTH 5000
#define MAX_CHANNELS 3

void load_resize_write(const char *in, const char *out, uint8_t *resized_image) {
  int width, height, channels;

  uint8_t *input_image = stbi_load(in, &width, &height, &channels, 0);
  assert(input_image);
  assert(channels <= MAX_CHANNELS);

  int new_width = width / 1.3;
  int new_height = height / 1.3;

  assert(new_width < MAX_RES_WIDTH);
  assert(new_height < MAX_RES_HEIGHT);

  stbir_resize_uint8_linear(input_image, width, height, 0,
      resized_image, new_width, new_height, 0,
      (stbir_pixel_layout)channels);

  assert(stbi_write_png(out, new_width, new_height, channels, resized_image, new_width * channels));

  stbi_image_free(input_image);
}

int main() {
  uint8_t *resized_image = malloc(MAX_RES_WIDTH * MAX_RES_HEIGHT * MAX_CHANNELS);
  assert(resized_image);
  char in[256];
  char out[256];

  for (int i = 0; i < NUM_IMAGES; ++i) {
    sprintf(in, "./in/input%d.jpg", i);
    sprintf(out, "./out/out%d.png", i);
    load_resize_write(in, out, resized_image);
  }

  return 0;
}


Let's try compiling our file with:

time <path to custom clang build>/bin/clang resize.c -O3 -Rpass=inline -Rpass-missed=inline

To enable the inliner, we need to enable the optimizations with a flag like -O3. We also ask LLVM to tell us when it successfully inlines a function with -Rpass=inline, and when it did not inline a function with -Rpass-missed=inline. If we run this, Clang crashes. Here's the output I got.

There are a couple of things to observe here. First, the compilation time is huge! On my Mac Studio it took about 39s, which is insanely long (note that I'm using an optimized LLVM build); soon we'll compile the file with vanilla Clang to see the difference. Second, almost all functions were inlined; there are only 22 lines in the output from -Rpass-missed. We'll return to the missed functions soon, but for now let's focus on the functions that did get inlined.

A bunch of functions were inlined because the source code had the __always_inline__ attribute (which is different from inline; see my previous article for an in-depth dive). Interestingly, these functions are not part of stb_image* itself, but rather they're intrinsics. For example, on my Mac I use the built-in Clang include files, which include this:

__ai __attribute__((target("neon"))) float32x4_t vreinterpretq_f32_u8(uint8x16_t __p0) {
  float32x4_t __ret;
  __ret = (float32x4_t)(__p0);
  return __ret;
}

The __ai macro does the job in this case, and it's defined as:

#define __ai static __inline__ __attribute__((__always_inline__, __nodebug__))
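
For reference, the same combination in ordinary user code would look something like this (a hedged sketch; add3 is a made-up function, not something from stb_image* or the intrinsics headers):

/* Made-up example mirroring the __ai macro above: ask Clang to always inline
   this function, regardless of what the cost model would decide. */
static inline __attribute__((__always_inline__, __nodebug__))
int add3(int a, int b, int c) {
  return a + b + c;
}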

If we remove all the “always inline” lines, all the other lines that tell us that a function got inlined include “Decided by our majesty”, which (a) verifies that we hit our modified code, and (b) confirms that we really do inline almost everything.

But what about the missed functions? All 22 cases give us the same message: “noinline call site attribute”. If you want to learn more about attributes in LLVM, expand the section below.

LLVM Attributes

Attributes in LLVM can be attached to all sorts of entities, like instructions, functions, etc. An attribute communicates some piece of information, e.g., that a function should not be inlined, or that a pointer does not alias with any other pointer (LLVM calls this noalias internally; it's the counterpart of C's restrict). LLVM makes heavy use of attributes, as you can see in the manual. Even compiling a small piece of code will give you LLVM IR with many attributes.

The noinline attribute can be added to many places, and when it's added to a function definition, it tells the optimizer not to inline any call to this function. Let's see this in practice. Here's a simple example. As you can see, even in -O1, the function square() gets inlined. But now let's manually generate the LLVM IR code and add the noinline attribute to see what happens. Here's the godbolt example. But I should take a pause here and explain what all these command-line arguments are.

Interlude: Getting optimizable LLVM IR from Clang

-emit-llvm tells Clang to emit LLVM IR, but by default it emits it in binary format (i.e., .bc). This is great if you don't need a human to be able to read the IR, because the binary format has better backwards compatibility, smaller size, and it's faster to load. But, because we do want to read it, we also add -S, which outputs the text version. -g0 tells Clang not to output any debug information, because it just clutters the output. The rest is the interesting part.

Remember, we want unoptimized code, because we want to pass it through the optimizer ourselves, after we add the attribute. But we want it to be optimizable, which in this case means that the optimizer should not ignore it. If you don't add the other command-line arguments, you will get unoptimizable code. Take a look at this. Both functions have the #0 attribute group, which includes optnone. This is an attribute which tells the optimizer not to touch a function at all, and it's great if that's what you want. Here's what happens if we try to optimize this LLVM IR as is: godbolt. As you can see, even -O3 does nothing.

One way to deal with this is to just remove optnone manually: godbolt. This leaves noinline, which is what we want.

In general, however, we don't want to have to change the code manually. Most people deal with that by using -O0, but also -Xclang -disable-O0-optnone, which tells Clang not to add optnone: godbolt. The problem with that, however, is that the functions still get the noinline attribute by default (as we saw above).8 Usually this is not what you want, because you do want the inliner to run. What I used—-O1 -Xclang -disable-llvm-passes—uses -O1 to avoid adding optnone and noinline, but then tells Clang not to run any of the optimization passes, which is what we want.
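
Putting the flags together, the full invocation would look something like this (the file names are placeholders):

clang -O1 -Xclang -disable-llvm-passes -emit-llvm -S -g0 square.c -o square.ll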

--- End of the interlude...

Whether we add noinline ourselves, or get LLVM IR that has it already, the effect is that square() does not get inlined, as you can see: godbolt. One last thing to mention here is that we can add noinline to a specific call site, which is what is relevant to the original problem we faced. Here's an example which has two places where square() is called. Now, I can add the attribute to the call inside bar(): godbolt. As you can see, the call inside bar() did not get inlined, whereas the call inside foo() did.


Assuming you now know about LLVM attributes, one thing to mention is that we can add the noinline attribute in LLVM at the C level using the __attribute__((noinline)) GCC extension, which Clang supports. As you can see in this snippet, square() gets noinline, whereas foo() doesn't. However, I don't know of a way to add it to a specific call site (although, if you really need this, you can create a helper function that just calls the function you don't want to inline, and you add the attribute to the helper).
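
Here's a hedged sketch of that helper-function trick (square_noinline is a made-up name):

int square(int x) { return x * x; }

/* Made-up helper: call sites that should not get square()'s body inlined call
   square_noinline() instead; the noinline attribute on the helper keeps an
   actual call in the generated code, while other callers of square() can
   still be inlined as usual. */
__attribute__((noinline))
static int square_noinline(int x) { return square(x); }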

In any case, the pressing question is: who added this attribute? In our compilation we have enabled optimizations, so clearly it wasn't because we were in -O0. Also, remember the message: “noinline call site attribute”. Since there's no way to add it to a call site manually, and no part of our code includes it at the function level, this must mean that some part of the LLVM compilation pipeline added it automatically. There are at least two questions here. First, which part added it? Second, why didn't we inline the call anyway? Aren't we supposed to inline everything?

The second question can be answered easily. You see, this message appears here and here. The fact that it appears in two places is a bit confusing, so let's make sure we know which place we hit, e.g., by adding a suffix (I just added “1” and “2” to the two places). After running with the new binary, we can see that all cases stem from the second place. This is expected, because the first place checks whether alwaysinline appears together with noinline. Since stb_image* is serious code, we wouldn't expect it to have added alwaysinline to functions that are not inlinable. In short, all the non-inlined call sites are due to a noinline attribute that some part of the pipeline added automatically.

But still, why didn't we inline the call anyway? Remember that we added our code here, to respect the user's decision. As you can see, the user decision is based on getAttributeBasedInliningDecision(), which includes the two places we mentioned above, which check for noinline. That's why we did not inline these calls (even though clearly in this case it wasn't the user's—i.e., the programmer's—decision).

The first question is a little subtler: which part added it, and why? Of course, LLVM will not add the noinline attribute willy-nilly. If it added it, it must mean the call cannot be inlined. So let's look at which functions could not be inlined. If we look at the remark messages, we see that all 22 cases involve the stbi__out_gif_code() function. If we take a look at the function's code, we'll see the problem: it's recursive. Check here. So, it's a reasonable hypothesis that this is why LLVM added the attribute. But it would be good to confirm it.

In short, the answer is yes; the noinline attribute is added here. The exact reason is a bit complicated, and since we're hitting sort of an edge case in the LLVM inliner, I will skip the full explanation. But thinking that it's because the function is recursive is not far from the truth.


The Clang Crash

But remember, Clang crashed! We need to figure out why. On my Mac Studio I can't access the .crash file. But thankfully the error message tells us enough to figure out what the problem is: we have inlined some function that uses variadic arguments. There are two functions in our source code that use variadic arguments: this and this. Adding __attribute__((noinline)) fixes the problem (and now the compile time jumps to about 50s).
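
The fix looks roughly like this (a hedged sketch; log_error is a made-up function, not one of the two stb_image* functions):

#include <stdarg.h>
#include <stdio.h>

/* Made-up example: the va_start() below is exactly what makes a function
   non-inlinable (as we'll see shortly with isInlineViable), so we mark it
   noinline to keep the always-inline build from trying to inline it. */
__attribute__((noinline))
static void log_error(const char *fmt, ...) {
  va_list args;
  va_start(args, fmt);
  vfprintf(stderr, fmt, args);
  va_end(args);
}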

But the bigger problem here is that even though we tried our best to only deal with the profitability analysis, we did break the transformation, which highlights the problem I talked about earlier. Thankfully, however, there is an easy fix. Hopefully you remember the user decision we mentioned earlier. In that function, if the user uses alwaysinline, but the function also has noinline, we issue the first “noinline call site attribute” remark. However, right after that, if it doesn't have the noinline attribute, we check whether the function can be inlined!

So, LLVM makes our lives easier by providing isInlineViable, which checks if a function can be inlined. And we can see here that this function returns false if the function in question uses va_start(). Now, we can modify the code that inlines everything inlinable to:

if (isInlineViable(*Callee).isSuccess()) {
  return InlineCost::getAlways("Decided by our majesty");
}

And now, without changing anything in our original code, we can compile all of it! Perhaps surprisingly, the executable also runs without any errors, and the resized and saved images look fine. We'll have more to say about that executable and its running time, but for now we need to make sure that our modified Clang can compile complicated stuff.


The true crash test: Compiling the LLVM test suite

The LLVM test suite includes a bunch of applications and benchmarks; it is a publicly available crash test for compilers in general, and LLVM in particular. I used the following CMake configuration to configure it (barring Mac-specific stuff):

cmake ../test-suite \
  -DCMAKE_C_COMPILER=<path to custom build>/bin/clang \
  -DCMAKE_CXX_COMPILER=<path to custom build>/bin/clang++ \
  -C../test-suite/cmake/caches/O3.cmake

You can optionally add the following to confirm that our modifications take place, but it will clutter the output:

-DCMAKE_C_FLAGS="-Rpass=inline" \
-DCMAKE_CXX_FLAGS="-Rpass=inline"

I usually build SingleSource/Benchmarks and MultiSource/Benchmarks, which in this case will be useful later too. In my setup some benchmarks don't compile, but they don't compile with my vanilla Clang build either, so I don't consider it a problem. The reason they fail is that I didn't build compiler-rt in my LLVM builds, which is a pain to deal with on a Mac. Here is the full list of benchmarks that didn't compile in my setup, for both builds:

SingleSource/Benchmarks/CoyoteBench/fftbench
MultiSource/Benchmarks/DOE-ProxyApps-C/RSBench

I also disabled the following benchmarks:

MultiSource/Benchmarks/tramp3d-v4/
MultiSource/Benchmarks/PAQ8p

because they took way too long to compile with the always-inline build.

On comparing Clang builds

Do not compare your custom-built Clang with your system's built-in Clang, because the built-in Clang may have been built differently (e.g., without assertions, or with compiler-rt). This, in turn, will lead to inaccurate comparisons. The easiest way to avoid that is to just download the LLVM source code twice, and use two different build directories, one for each source directory. Of course you can use a single source directory by creating a vanilla build before you make any changes to the code, but this is riskier. With a separate source tree you simply never touch it, and that's it: even if you accidentally hit recompile on the vanilla version, you know nothing has changed.

In any case, you should use the same CMake configuration on both. For the following experiments I used:

cmake ../llvm -DLLVM_ENABLE_PROJECTS='clang;clang-tools-extra' -DLLVM_TARGETS_TO_BUILD=host -DCMAKE_BUILD_TYPE='Release' -DLLVM_ENABLE_ASSERTIONS=ON -G Ninja

Building with assertions is fine in terms of performance in this case, because the inlining overhead will (strongly) dominate the comparisons.

For similar reasons, it will make your life easier to have 2 test-suite builds, although in this case it's better to have one source directory because you want any changes you make to apply to both.


Other than that, everything compiles successfully. You can try compiling some applications from MultiSource/Applications, but note that if you try to compile sqlite3 it will take forever.


How bad (or not) is it to inline everything?

Now it's time to answer the most important question: is there any harm in inlining everything, as compiler developers would have us believe?


Compilation times and executable size

First, even though originally we said we didn't care about these metrics, I think it's still interesting to explore them. Compiling the image processing pipeline takes a whopping ~50 seconds if we use the always-inline build (with -O3). In contrast, it takes only ~2 seconds if we use the vanilla clang -O3. This is about a 25x slowdown!

In terms of executable size, vanilla Clang gives us an executable that is about 304 kilobytes, whereas always-inline Clang gives us one that is 3.4 megabytes, or about 10x larger.

In short, things are definitely not looking great in terms of these metrics. One thing to keep in mind is that both compilation time and executable size increase exponentially as the source-code size increases. This is simply because if, say, A() calls B() which calls C(), inlining C() into B() and B() into A() will compile C()'s code twice: once when we compile B()'s body, and once when we compile A()'s body. Also, if X() is called in Y() and Z() and we inline it in both, again its code will be compiled twice. You can see how this can blow up very quickly.
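
A tiny illustration with made-up functions:

/* With everything inlined, C's body is compiled once inside B, and again
   (through the inlined copy of B) inside A: two copies from one definition. */
static int C(int x) { return x + 1; }
static int B(int x) { return C(x) * 2; }  /* first copy of C's body */
static int A(int x) { return B(x) - 3; }  /* second copy of C's body, via B */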

This is why compiling a large application can still be viable if it involves many small compilation units, but it can be out of the question if it contains few but large compilation units. For example, MultiSource/JM contains about 65k LoC and MultiSource/sqlite3 contains about 53k LoC. But these 65k LoC are distributed across many compilation units in JM, while sqlite3 has a single compilation unit (comprised of sqlite3.c, and the .h it includes). As such, compiling JM takes about 25s, whereas compiling sqlite3 takes... well, I got tired of waiting after 4 hours had passed.


Runtime performance

Finally we come to the most important question: do we have any performance degradation?

Let's start from the image processing pipeline. In this case, the answer seems to be a clear no. Both versions with -O3 run in about 2.9s. Using -O2 doesn't make a difference between the two builds either.9 Using -O1 doesn't make a difference between the two builds either, but it does make a difference compared to -O2 and -O3: both versions now run in about 3.3s.

Now let's compare the test suite. This is what I used to run the benchmarks (once for each build):

<path to build>/bin/llvm-lit -sv -j 1 -o results.json ./SingleSource/Benchmarks/ ./MultiSource/Benchmarks/

In terms of total time for all the benchmarks, the vanilla build took about 258 seconds, whereas the always-inline build took about 260 seconds. This is practically no difference, especially across 309 tests. We can use the two .json files and the following Python code to see if there was any significant difference in any particular benchmark.

Python code to compare results

import pandas as pd
import json
from scipy.stats import gmean

def read_json(f):
  with open(f, 'r') as fp:
    d = json.load(fp)
  return d['tests']

def process_tests(tests):
  names = []
  times = []
  for t in tests:
    name = t['name']
    name = name[len("test-suite :: "):]
    name = name[:-len('.test')]
    name = name.replace('MultiSource/Benchmarks', 'MS')
    name = name.replace('SingleSource/Benchmarks', 'SS')
    names.append(name)
    times.append(t['metrics']['exec_time'])
  ### END FOR ###
  return names, times

van = read_json('results_vanilla.json')
inl = read_json('results_inline.json')

names_van, times_van = process_tests(van)
names_inl, times_inl = process_tests(inl)
assert names_van == names_inl

ser_van = pd.Series(times_van, index=names_van)
ser_inl = pd.Series(times_inl, index=names_van)

ser_rel1 = ser_van / ser_inl
ser_rel2 = ser_inl / ser_van
ser_abs = ser_van - ser_inl
df = pd.DataFrame(data={'rel1': ser_rel1, 'rel2': ser_rel2, 'abs': ser_abs, 'van': ser_van}, index=ser_rel1.index)
print('--- 50 fastest benchmarks with always-inline')
print(df.nlargest(50, columns=['rel1']))
print()
print('--- 50 fastest benchmarks with vanilla')
print(df.nlargest(50, columns=['rel2']))
print('-----')
m_perc = (gmean(ser_rel1) - 1) * 100
print(f'Geomean relative speedup of inline vs vanilla: {m_perc:.3}%')


You can expand the collapsible below to see the detailed results:

Detailed results

--- 50 fastest benchmarks with always-inline
                                                        rel1      rel2     abs     van
MS/Prolangs-C++/fsm/fsm                             4.166667  0.240000  0.0019  0.0025
MS/Prolangs-C/unix-tbl/unix-tbl                     3.375000  0.296296  0.0019  0.0027
MS/Prolangs-C++/family/family                       3.000000  0.333333  0.0012  0.0018
SS/Stanford/FloatMM                                 2.904555  0.344287  0.0878  0.1339
MS/Prolangs-C++/deriv2/deriv2                       2.714286  0.368421  0.0012  0.0019
MS/Prolangs-C/simulator/simulator                   2.714286  0.368421  0.0012  0.0019
MS/Prolangs-C/assembler/assembler                   2.666667  0.375000  0.0015  0.0024
MS/Prolangs-C/unix-smail/unix-smail                 2.566667  0.389610  0.0047  0.0077
MS/Prolangs-C++/trees/trees                         2.250000  0.444444  0.0010  0.0018
MS/Prolangs-C/allroots/allroots                     2.166667  0.461538  0.0007  0.0013
MS/Prolangs-C/agrep/agrep                           2.111111  0.473684  0.0050  0.0095
MS/Prolangs-C++/simul/simul                         1.933333  0.517241  0.0014  0.0029
MS/Prolangs-C++/garage/garage                       1.812500  0.551724  0.0013  0.0029
SS/Polybench/linear-algebra/solvers/durbin/durbin   1.800000  0.555556  0.0028  0.0063
MS/MiBench/security-sha/security-sha                1.660000  0.602410  0.0033  0.0083
MS/Prolangs-C/gnugo/gnugo                           1.639269  0.610028  0.0140  0.0359
MS/Prolangs-C/bison/mybison                         1.586207  0.630435  0.0017  0.0046
MS/BitBench/uudecode/uudecode                       1.581395  0.632353  0.0025  0.0068
MS/Prolangs-C++/vcirc/vcirc                         1.571429  0.636364  0.0004  0.0011
MS/mediabench/g721/g721encode/encode                1.533019  0.652308  0.0113  0.0325
MS/TSVC/Symbolics-flt/Symbolics-flt                 1.528159  0.654382  0.1116  0.3229
MS/Prolangs-C++/office/office                       1.500000  0.666667  0.0004  0.0012
MS/mediabench/gsm/toast/toast                       1.500000  0.666667  0.0030  0.0090
SS/Shootout-C++/Shootout-C++-moments                1.469231  0.680628  0.0061  0.0191
MS/MiBench/security-rijndael/security-rijndael      1.469072  0.680702  0.0091  0.0285
SS/McGill/exptree                                   1.400000  0.714286  0.0002  0.0007
MS/Prolangs-C++/employ/employ                       1.393443  0.717647  0.0024  0.0085
MS/mediabench/adpcm/rawcaudio/rawcaudio             1.384615  0.722222  0.0005  0.0018
MS/Prolangs-C/compiler/compiler                     1.380952  0.724138  0.0008  0.0029
MS/MiBench/consumer-jpeg/consumer-jpeg              1.357143  0.736842  0.0005  0.0019
MS/Prolangs-C/cdecl/cdecl                           1.346154  0.742857  0.0009  0.0035
SS/Shootout-C++/EH/Shootout-C++-except              1.320276  0.757417  0.0556  0.2292
SS/BenchmarkGame/n-body                             1.295739  0.771760  0.0354  0.1551
SS/Stanford/Oscar                                   1.285714  0.777778  0.0002  0.0009
SS/Misc-C++-EH/spirit                               1.282140  0.779946  0.2706  1.2297
MS/FreeBench/mason/mason                            1.266839  0.789366  0.0103  0.0489
MS/McCat/08-main/main                               1.253731  0.797619  0.0017  0.0084
MS/MiBench/office-stringsearch/office-stringsearch  1.250000  0.800000  0.0002  0.0010
SS/Misc/flops-8                                     1.233927  0.810421  0.0302  0.1593
SS/Polybench/linear-algebra/blas/gesummv/gesummv    1.200000  0.833333  0.0001  0.0006
SS/Shootout-C++/Shootout-C++-hello                  1.200000  0.833333  0.0001  0.0006
MS/Prolangs-C++/ocean/ocean                         1.185714  0.843373  0.0026  0.0166
MS/DOE-ProxyApps-C/SimpleMOC/SimpleMOC              1.181476  0.846399  0.0578  0.3763
MS/McCat/17-bintr/bintr                             1.174089  0.851724  0.0043  0.0290
SS/Misc/himenobmtxpa                                1.159236  0.862637  0.0175  0.1274
SS/Stanford/Queens                                  1.157895  0.863636  0.0006  0.0044
MS/TSVC/CrossingThresholds-flt/CrossingThreshol...  1.117769  0.894640  0.0684  0.6492
MS/BitBench/uuencode/uuencode                       1.114754  0.897059  0.0007  0.0068
MS/MiBench/consumer-typeset/consumer-typeset        1.107527  0.902913  0.0040  0.0412
SS/Stanford/Bubblesort                              1.103896  0.905882  0.0008  0.0085

--- 50 fastest benchmarks with vanilla
                                                        rel1      rel2     abs     van
SS/Shootout-C++/Shootout-C++-wordfreq               0.206897  4.833333 -0.0023  0.0006
SS/Shootout-C++/Shootout-C++-wc                     0.250000  4.000000 -0.0015  0.0005
MS/MiBench/security-blowfish/security-blowfish      0.625000  1.600000 -0.0003  0.0005
SS/Shootout/Shootout-ackermann                      0.646617  1.546512 -0.0047  0.0086
SS/Stanford/Quicksort                               0.675926  1.479452 -0.0070  0.0146
MS/McCat/05-eks/eks                                 0.692308  1.444444 -0.0004  0.0009
SS/Stanford/Puzzle                                  0.711575  1.405333 -0.0152  0.0375
MS/Prolangs-C++/objects/objects                     0.714286  1.400000 -0.0002  0.0005
MS/mediabench/jpeg/jpeg-6a/cjpeg                    0.725490  1.378378 -0.0014  0.0037
MS/Prolangs-C++/city/city                           0.731707  1.366667 -0.0011  0.0030
SS/Stanford/Perm                                    0.743902  1.344262 -0.0042  0.0122
SS/Stanford/Towers                                  0.750000  1.333333 -0.0028  0.0084
SS/Shootout-C++/Shootout-C++-strcat                 0.755968  1.322807 -0.0092  0.0285
MS/MiBench/network-patricia/network-patricia        0.780992  1.280423 -0.0053  0.0189
SS/Polybench/stencils/jacobi-1d/jacobi-1d           0.785714  1.272727 -0.0003  0.0011
MS/McCat/03-testtrie/testtrie                       0.796610  1.255319 -0.0012  0.0047
MS/Prolangs-C++/NP/np                               0.800000  1.250000 -0.0001  0.0004
MS/Prolangs-C/fixoutput/fixoutput                   0.812500  1.230769 -0.0003  0.0013
SS/Stanford/Treesort                                0.816327  1.225000 -0.0054  0.0240
MS/mediabench/mpeg2/mpeg2dec/mpeg2decode            0.827586  1.208333 -0.0015  0.0072
SS/Shootout-C++/Shootout-C++-objinst                0.833333  1.200000 -0.0001  0.0005
SS/Shootout-C++/Shootout-C++-spellcheck             0.833333  1.200000 -0.0001  0.0005
MS/MiBench/office-ispell/office-ispell              0.857143  1.166667 -0.0001  0.0006
MS/Prolangs-C++/deriv1/deriv1                       0.857143  1.166667 -0.0001  0.0006
MS/Prolangs-C++/shapes/shapes                       0.857143  1.166667 -0.0001  0.0006
SS/Shootout-C++/Shootout-C++-sumcol                 0.857143  1.166667 -0.0001  0.0006
SS/Stanford/IntMM                                   0.857143  1.166667 -0.0001  0.0006
MS/McCat/18-imp/imp                                 0.864130  1.157233 -0.0025  0.0159
MS/MallocBench/gs/gs                                0.866197  1.154472 -0.0019  0.0123
MS/FreeBench/neural/neural                          0.874126  1.144000 -0.0054  0.0375
SS/Shootout-C++/Shootout-C++-reversefile            0.875000  1.142857 -0.0001  0.0007
SS/Polybench/linear-algebra/blas/symm/symm          0.888889  1.125000 -0.0001  0.0008
SS/Dhrystone/fldry                                  0.890285  1.123236 -0.0227  0.1842
MS/Prolangs-C/football/football                     0.894737  1.117647 -0.0002  0.0017
SS/Shootout/Shootout-lists                          0.899421  1.111826 -0.1963  1.7554
MS/Olden/treeadd/treeadd                            0.905782  1.104019 -0.0044  0.0423
SS/CoyoteBench/almabench                            0.915684  1.092080 -0.0872  0.9470
MS/TSVC/Reductions-flt/Reductions-flt               0.928026  1.077556 -0.1179  1.5202
MS/Olden/voronoi/voronoi                            0.930070  1.075188 -0.0050  0.0665
MS/llubenchmark/llu                                 0.935111  1.069392 -0.0993  1.4310
SS/Shootout-C++/Shootout-C++-hash                   0.937424  1.066753 -0.0103  0.1543
MS/Ptrdist/ks/ks                                    0.941482  1.062155 -0.0124  0.1995
MS/Olden/perimeter/perimeter                        0.941704  1.061905 -0.0026  0.0420
SS/Shootout-C++/Shootout-C++-ary2                   0.943396  1.060000 -0.0003  0.0050
SS/Shootout-C++/Shootout-C++-lists1                 0.958273  1.043544 -0.0029  0.0666
MS/TSVC/StatementReordering-dbl/StatementReorde...  0.958845  1.042921 -0.0479  1.1160
SS/Shootout/Shootout-heapsort                       0.959572  1.042132 -0.0370  0.8782
SS/Polybench/linear-algebra/kernels/bicg/bicg       0.961538  1.040000 -0.0003  0.0075
SS/Shootout-C++/Shootout-C++-fibo                   0.964357  1.036960 -0.0267  0.7224
SS/Polybench/linear-algebra/blas/gemver/gemver      0.966102  1.035088 -0.0004  0.0114
-----
Geomean relative speedup of inline vs vanilla: 4.66%


What can we make of all that? Well, the differences are too small to make any strong claims in this setup (see the future work below). But if anything, inlining everything makes things faster; at any rate, the results give the always-inline version a geomean speedup of 4.66%.


Conclusions and future work

From the experimental evidence above, we can definitely not conclude that inlining everything generates slower executables. We can definitely conclude, though, that compilation times and executable sizes are inflated when we inline everything. Personally, this is what I make of these experiments: don't be afraid of __attribute__((always_inline)). And I say this because important figures in the compiler community urge you not to use it at all! One prominent such figure is Chandler Carruth. In this video he says:

[T]here are special implementation tricks, like the always_inline attribute, that GCC supports, and ICC supports, and Microsoft [i.e., MSVC] supports—everyone has one of these attributes, right? I would strongly encourage to not use that attribute—ever, okay? If you're using that attribute, you have found a bug in the inliner of your optimizer. If you're going to use the attribute, I would ask: please, please, file a bug against the optimizer first, and then add the attribute with a little comment that says: “by the way, we should remove this as soon as they fix this bug over here that's hurting our performance”. It should always be based on performance measurements, and you should always file a bug with the complaint to the optimizer about “why did you not actually inline this?”.

I'm not sure I'm totally in line (pun intended) with this argument. Yes, I agree that your decisions should be based on performance measurements. And I agree that you should file a bug to the optimizer. But to say that you should never use it is a completely different story. Software like LLVM is slow-moving, and you can't wait for everything to be fixed, especially as we move more and more towards machine-learning-based tools, in which fixing special cases is harder than adding an if. So, I think the relaxed statement above—that you should use it and file a bug—makes more sense in practice.

Whether you should add a comment... I'm not sure. People have things to worry about other than whether LLVM fixed this or that inlining bug. Frankly, who cares? And I say this as a compiler enthusiast; most people aren't. All of this is all the more true when you consider that adding an always_inline here and there is not the end of the world, as our experiments showed; that's especially true if you use it pointedly and based on performance measurements, rather than adding it everywhere as we did.


Do you want to share your thoughts? Please shoot me an email.

Don't want to miss any updates? You can follow this RSS feed or sign up for my newsletter:



Footnotes

  1. As I mention in this video, inlining is a pretty simple transformation, and here you can find an implementation of inlining by Chris Lattner the Great from about 20 years ago, in about 200 lines of code. Unfortunately, today even the transformation itself has grown out of proportion; look for example at InlineFunction.cpp.
  2. Spoiler alert: it does, as we'll see shortly.
  3. Which can be a side effect of (a), but it can also appear on its own, even if each function is inlined only once.
  4. Most of which are fundamentally not inlinable; at least without intrusive changes like using a custom stack.
  5. Part of the reason is that, as inlining is one of the most important transformations, people have tried to create different profitability analyses over the last couple of years. A popular example is MLGO. Still, however, these folks needed to mess with the internals of the compiler.
  6. That is, a graph where every node is a function in our program, and there's an edge A→B if function A calls function B.
  7. Why does LLVM split it into SCCs? Because an SCC is basically a cycle (which is caused by (potentially transitive) recursion), and cycles require special treatment. You see, the idea in inlining is that if A calls B (that is, you have an edge A→B in the call graph), then you may potentially inline B into A. This is also why it makes sense to visit the call graph bottom-up (i.e., starting from the leaves in the call graph, the functions that don't call any other functions), as I explain in the video (although other compilers use a top-down approach). So, in this case, we first visit B, inline anything we want to inline into B, and then potentially inline the “new” B—that is, the new B now has all the code of the inlined functions in its body—into A. But say we have a cycle—let's say A→B→C→A; how will we proceed? As we said, you'd usually start from the “bottom”—the leaves. But here there are no leaves! So, you see the problem! If you try to force it, you will end up in a cycle of inlining, i.e., inlining A into C, and C into B, and B into A, and now again A into C, and so on... This gives you an idea of why it's hard to inline recursive calls.
  8. This happens because by default, if Clang adds optnone, it also adds noinline. Here's the relevant code.
  9. -O2 also doesn't make a difference compared to -O3 in runtime; see my previous article. But it does make a difference in compilation times, especially for the always-inline build, which takes about 45s to compile at -O2.