Gather around kids! Today we'll explore the following intriguing question: what happens if we inline everything?
If you're not familiar with inlining, you should know that it is considered by many (including important figures in the compiler community like Chandler Carruth) to be the single most important optimization in optimizing compilers. For a little more info on how it works, here's a shameless plug of a presentation a bunch of folks and I gave at the LLVM Developers' Meeting on interprocedural optimization. I talked about inlining, and watching the first ~6 minutes should give you a good idea.1
In this video I mention that inlining has drawbacks. In other words, yes, inlining is supposed to be the super duper tool in the compiler's toolkit, but you're supposed to use it with care! Or should you?
What are the drawbacks anyway? In the video I mention basically three: (a) code duplication, (b) code-size explosion, and (c) increased register-allocator pressure. (a) can increase compile times2 because the same code is compiled multiple times. (b)3 can create large executables, which both take more space on disk and take longer to load. Moreover, it can theoretically lead to instruction-cache misses as you jump through all this code. The problem with (c) is that it can obviously lead to unnecessary register spills.
But today we're going against frugality. The attitude we'll take is simple: we only care about runtime performance. This means “I have a lot of disk space, and a lot of time to spare on compile times; let's make our code as fast as possible!” In other words, if we ignore other constraints and we inline everything, will the runtime performance be any worse (as (b) and (c) would imply)? To the best of my knowledge, nobody has answered this (using experimental evidence) up to now. So, let's find out.
Can we literally inline everything, though? Well, no. There are calls that LLVM simply cannot inline;4 I mention some cases in the video. In the following experiments, most of the code used will be available to the compiler (basically everything except standard-library code), and there won't be many virtual functions or function pointers (definitely not in the code we'll write). So, the biggest obstacle to inlining will be recursion (which will also be rare). In short, the vast majority of functions will be inlinable.
I should note here that LLVM can inline some simple recursive calls, specifically calls to tail-recursive functions, but that's only because the tail-recursive-call elimination pass removed the recursion first (see this). So, for example, it can inline a simple factorial function:
int fac(int x) {
  if (x == 1) {
    return 1;
  }
  return fac(x - 1) * x;
}
but it cannot inline a recursive fibonacci function:
int fib(int x) {
  if (x == 1) {
    return 1;
  }
  if (x == 2) {
    return 1;
  }
  return fib(x - 1) + fib(x - 2);
}
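To get some intuition for why fac() becomes inlinable, here's roughly what it looks like after LLVM eliminates the recursion (a hand-written sketch of mine, not actual compiler output):

int fac_loop(int x) {
  // The recursive accumulation fac(x - 1) * x has been turned into a
  // plain loop, so there is no recursive call left to block the inliner.
  int acc = 1;
  while (x != 1) {
    acc *= x;
    x = x - 1;
  }
  return acc;
}

For fib() there is no such rewrite, so the recursive calls stay, and the inliner has to give up.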
This should have been easy, but it's not. In principle, it should be easy to change the inlining profitability analysis, i.e., the part of LLVM that decides when inlining is beneficial. If that were the case, folks could change the profitability analysis without having to understand the internals of the inlining transformation. For example, it should be easy for a machine-learning researcher who is not a compiler expert to change LLVM's inlining profitability analysis. In our case, we would just change the profitability analysis to always say yes!
Unfortunately, however, things are not so simple in modern compilers; it turns out I wrote a whole thesis on this problem. Today, the transformation and the profitability analysis are tightly coupled, such that one cannot change one without understanding the other. Nevertheless, inlining in LLVM is one of the good cases, as we have a better separation than in other passes.5 So, let me briefly explain the course of an inlining decision today. We'll use the latest LLVM release, which, as of writing this article, is 20.1.0.
The inlining pass starts at Inliner.cpp. This is a Call-Graph Strongly-Connected-Component (CGSCC) pass. To get a good idea of what this means, I suggest watching this excerpt by Chandler Carruth, or the latter half of my video. But you may be relieved to know that it's not that relevant to this article; you can just think of the inliner as visiting each function separately. If you want the TL;DR, here it is: we take the call graph,6 we split it into strongly connected components (SCCs), and then we visit each SCC separately.7
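As a tiny illustration (my own example, not taken from the pass), consider the call graph of the following program: the mutually recursive pair {is_even, is_odd} forms a single SCC, which the inliner visits as one unit before it visits main:

bool is_odd(unsigned n);

// is_even and is_odd call each other, so in the call graph they form one
// strongly connected component; main forms an SCC on its own.
bool is_even(unsigned n) { return n == 0 ? true : is_odd(n - 1); }
bool is_odd(unsigned n) { return n == 0 ? false : is_even(n - 1); }

int main() { return is_even(10) ? 0 : 1; }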
As I mentioned above, most of the complexity in inlining comes from the profitability analysis. To decide if it's profitable to inline a function call, LLVM uses what it calls an InlineAdvisor. So, eventually we ask for its advice here.
...
std::unique_ptr<InlineAdvice> Advice =
    Advisor.getAdvice(*CB, OnlyMandatory);
...
If you follow the call, you'll end up here.
std::unique_ptr<InlineAdvice> InlineAdvisor::getAdvice(CallBase &CB,
                                                       bool MandatoryOnly) {
...
Now, if we're not using any fancy inline advisor (e.g., for size), we'll end up here.
std::optional<llvm::InlineCost> static getDefaultInlineAdvice(
    CallBase &CB, FunctionAnalysisManager &FAM, const InlineParams &Params) {
...
As is obvious from the code, what we need to focus on is this call to shouldInline().
...
return llvm::shouldInline(
    CB, CalleeTTI, GetInlineCost, ORE,
    Params.EnableDeferral.value_or(EnableInlineDeferral));
But what is actually important is this lambda here:
auto GetInlineCost = [&](CallBase &CB) {
  bool RemarksEnabled =
      Callee.getContext().getDiagHandlerPtr()->isMissedOptRemarkEnabled(
          DEBUG_TYPE);
  return getInlineCost(CB, Params, CalleeTTI, GetAssumptionCache, GetTLI,
                       GetBFI, PSI, RemarksEnabled ? &ORE : nullptr);
};
If you follow the calls, you will eventually reach the place in the code that concerns us: getInlineCost(). In particular, we care about this line:
...
InlineResult ShouldInline = CA.analyze();
...
Now, if you look at the rest of the code inside getInlineCost(), you'll see that we return either a success or a failure. So, you would think that making LLVM inline everything would be as simple as: keep only the handling of the user's decision, but otherwise add a return InlineCost::getAlways("Decided by our majesty") here.
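Concretely, the modified tail of getInlineCost() would look something like this (a sketch; I'm paraphrasing the surrounding LLVM 20.x code, and only the last return is our addition):

// First honor attribute-based (user) decisions, as the original code does:
// alwaysinline and noinline short-circuit the cost model.
auto UserDecision =
    llvm::getAttributeBasedInliningDecision(Call, Callee, CalleeTTI, GetTLI);
if (UserDecision) {
  if (UserDecision->isSuccess())
    return llvm::InlineCost::getAlways("always inline attribute");
  return llvm::InlineCost::getNever(UserDecision->getFailureReason());
}

// Our change: skip the cost model (i.e., CA.analyze()) and always say yes.
return llvm::InlineCost::getAlways("Decided by our majesty");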
Let's do this and see what happens.
To test this, I have written some simple image processing code in C, using the stb_image* family of libraries (which are available in the public domain). You can find more information on the code by expanding the following section.
This is a simple image processing pipeline that loads JPEG images, resizes them, and saves them as PNG. The images I used for testing are the following simple images, available without a copyright: Photo by M Venter, Photo by Lukas Kloeppel, Photo by eberhard grossgasteiger
#define STB_IMAGE_IMPLEMENTATION
#define STB_IMAGE_RESIZE_IMPLEMENTATION
#define STB_IMAGE_WRITE_IMPLEMENTATION
#include "stb_image.h"
#include "stb_image_resize2.h"
#include "stb_image_write.h"

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define NUM_IMAGES 3
#define MAX_RES_HEIGHT 5000
#define MAX_RES_WIDTH 5000
#define MAX_CHANNELS 3

void load_resize_write(const char *in, const char *out, uint8_t *resized_image) {
  int width, height, channels;
  uint8_t *input_image = stbi_load(in, &width, &height, &channels, 0);
  assert(input_image);
  assert(channels <= MAX_CHANNELS);
  int new_width = width / 1.3;
  int new_height = height / 1.3;
  assert(new_width < MAX_RES_WIDTH);
  assert(new_height < MAX_RES_HEIGHT);
  stbir_resize_uint8_linear(input_image, width, height, 0,
                            resized_image, new_width, new_height, 0,
                            (stbir_pixel_layout)channels);
  assert(stbi_write_png(out, new_width, new_height, channels, resized_image,
                        new_width * channels));
  stbi_image_free(input_image);
}

int main() {
  uint8_t *resized_image = malloc(MAX_RES_WIDTH * MAX_RES_HEIGHT * MAX_CHANNELS);
  assert(resized_image);
  char in[256];
  char out[256];
  for (int i = 0; i < NUM_IMAGES; ++i) {
    sprintf(in, "./in/input%d.jpg", i);
    sprintf(out, "./out/out%d.png", i);
    load_resize_write(in, out, resized_image);
  }
  return 0;
}
Let's try compiling our file with:
time <path to custom clang build>/bin/clang resize.c -O3 -Rpass=inline
To enable the inliner, we need to enable optimizations with a flag like -O3. We also ask LLVM to tell us when it successfully inlines a function with -Rpass=inline, and when it does not inline a function with -Rpass-missed=inline. If we run this, Clang crashes. Here's the output I got.
There are a couple of things to observe here. First, the compilation time is huge! On my Mac Studio it took about 39s, which is insanely long (note that I'm using an optimized LLVM build); soon we'll compile the file with vanilla Clang to see the difference. Second, almost all functions were inlined; there are only 22 lines in the output from -Rpass-missed. We'll return to the missed functions soon, but for now let's focus on the functions that did get inlined.
A bunch of functions were inlined because the source code had the __always_inline__ attribute (which is different from inline; see my previous article for an in-depth dive). Interestingly, these functions are not part of stb_image* itself; rather, they're intrinsics. For example, on my Mac I use the built-in Clang include files, which include this:
__ai __attribute__((target("neon"))) float32x4_t vreinterpretq_f32_u8(uint8x16_t __p0) {
  float32x4_t __ret;
  __ret = (float32x4_t)(__p0);
  return __ret;
}
The __ai does the job in this case, and it's defined as:
#define __ai static __inline__ __attribute__((__always_inline__, __nodebug__))
If we remove all the “always inline” lines, all the remaining lines that tell us a function got inlined include “Decided by our majesty”, which (a) verifies that we hit our modified code, and (b) shows that we actually do inline almost everything.
But what about the missed functions? All 22 cases give us the same message: “noinline call site attribute”. If you want to learn more about attributes in LLVM, expand the section below.
Attributes in LLVM can be attached to all sorts of entities, like instructions, functions, etc. An attribute communicates some piece of information, e.g., that a function should not be inlined, or that a pointer does not alias any other pointer (LLVM internally calls this noalias; it's the counterpart of C's restrict). LLVM makes heavy use of attributes, as you can see in the manual. Even compiling a small piece of code will give you LLVM IR with many attributes.
The noinline attribute can be added in many places, and when it's added to a function definition, it tells the optimizer not to inline any call to this function. Let's see this in practice. Here's a simple example. As you can see, even at -O1, the function square() gets inlined. But now let's manually generate the LLVM IR code and add the noinline attribute to see what happens. Here's the godbolt example. But I should take a pause here and explain what all these command-line arguments are.
Interlude: Getting optimizable LLVM IR from Clang
-emit-llvm tells Clang to emit LLVM IR, but by default it emits it in binary format (i.e., .bc). This is great if you don't need a human to be able to read the IR, because it has better backwards compatibility, smaller size, and it's faster to load. But, because we do want to read it, we also add -S, which outputs the text version. -g0 tells Clang not to output any debug information, because it just clutters the output. The rest is the interesting part.

Remember, we want unoptimized code, because we want to pass it through the optimizer ourselves, after we add the attribute. But we want it to be optimizable, which in this case means that the optimizer should not ignore it. If you don't add the other command-line arguments, you will get unoptimizable code. Take a look at this. Both functions have the #0 attributes, which include optnone. This is an attribute which tells the optimizer not to touch a function at all, and it's great if that's what you want. Here's what happens if we try to optimize this LLVM IR as is: godbolt. As you can see, even -O3 does nothing.
One way to deal with this is to just remove optnone manually: godbolt. This leaves noinline, which is what we want.
In general, however, we don't want to have to change the code manually. Most people deal with that by using -O0 together with -Xclang -disable-O0-optnone, which tells the Clang driver not to add optnone: godbolt. The problem with that, however, is that the functions still get the noinline attribute by default (as we saw above).8 Usually this is not what you want, because you do want the inliner to run. What I used—-O1 -Xclang -disable-llvm-passes—uses -O1 to avoid adding optnone and noinline, but then tells the driver not to run any passes, which is what we want.
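Putting the interlude together, the full invocation looks something like this (square.c is a hypothetical filename):

clang -O1 -Xclang -disable-llvm-passes -emit-llvm -S -g0 square.c -o square.ll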
--- End of the interlude...
Whether we add noinline ourselves, or get LLVM IR that already has it, the effect is that square() does not get inlined, as you can see: godbolt. One last thing to mention here is that we can add noinline to a specific call site, which is what is relevant to the original problem we faced. Here's an example which has two places where square() is called. Now, I can add the attribute to the call inside bar(): godbolt. As you can see, the call inside bar() did not get inlined, whereas the call inside foo() did.
Assuming you now know about LLVM attributes, one thing to mention is that we can add the noinline attribute at the C level using the __attribute__((noinline)) GCC extension, which Clang supports. As you can see in this snippet, square() gets noinline, whereas foo() doesn't. However, I don't know of a way to add it to a specific call site (although, if you really need this, you can create a helper function that just calls the function you don't want to inline, and add the attribute to the helper, as in the sketch below).
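Here's what that workaround might look like (a sketch; the square_noinline helper is a hypothetical name of mine):

static int square(int x) { return x * x; }

// Hypothetical wrapper: calls routed through this helper stay out-of-line,
// because the wrapper itself can never be inlined. (square() may still be
// inlined *into* the wrapper, which is fine.)
__attribute__((noinline)) static int square_noinline(int x) {
  return square(x);
}

int foo(int x) { return square(x); }          // call may be inlined
int bar(int x) { return square_noinline(x); } // call stays a real call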
In any case, the pressing question is: who added this attribute? In our compilation we have enabled optimizations, so clearly it wasn't because we were in -O0. Also, remember the message: “noinline call site attribute”. Since there's no way to add it to a call site manually, and no part of our code includes it at the function level, this must mean that some part of the LLVM compilation pipeline added it automatically. There are at least two questions here. First, which part added it? Second, why didn't we inline the call anyway? Aren't we supposed to inline everything?
The second question can be answered easily. You see, this message appears here and here. The fact that it appears in two places is a bit confusing, so let's make sure we know which place we're hitting, e.g., by adding a suffix (I just added “1” and “2” to the two places). After running with the new binary, we can see that all cases stem from the second place. This is expected, because the first place checks whether alwaysinline appears along with noinline. Since stb_image* is serious code, we wouldn't expect it to have added alwaysinline to functions that are not inlinable. In short, all the non-inlined call sites were due to noinline being added automatically by some part of the pipeline.
But still, why didn't we inline the call anyway? Remember that we added our code here, to respect the user's decision. As you can see, the user decision is based on getAttributeBasedInliningDecision(), which includes the two places we mentioned above, which check for noinline. That's why we did not inline these calls (even though clearly in this case it wasn't the user's—i.e., the programmer's—decision).
The first question is a little subtler: which part added it, and why? Of course, LLVM will not add the noinline attribute willy-nilly; if it added it, it must mean that the call cannot be inlined. So, let's take a look at which functions could not be inlined. The remark messages show that all 22 cases involve the stbi__out_gif_code() function. If we take a look at the function's code, we'll see the problem: it's recursive. Check here. So, it's a reasonable hypothesis that this is why LLVM added the attribute. But it would be good to confirm it.
In short, the answer is yes; the noinline attribute is added here. The exact reason is a bit complicated, and since we're hitting sort of an edge case in the LLVM inliner, I will skip the full explanation. But thinking that it's because the function is recursive is not far from the truth.
But remember, Clang crashed! We need to figure out why. On my Mac Studio I can't access the .crash file, but thankfully the error message tells us enough to figure out what the problem is: we have inlined some function that uses variadic arguments. There are two functions in our source code that use variadic arguments: this and this. Adding __attribute__((noinline)) to them fixes the problem (and now the compile time jumps to about 50s).
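To see why such functions are off-limits for the inliner, here's the shape of the code in question (a sketch of mine; the actual stb functions differ): va_start() must refer to the variadic argument list of the function it textually lives in, so pasting the body into a caller breaks it.

#include <cstdarg>
#include <cstdio>

// A variadic function whose body uses va_start(). Inlining this body into
// a caller is invalid, because va_start(ap, fmt) is tied to *this*
// function's own argument list.
static void log_err(const char *fmt, ...) {
  va_list ap;
  va_start(ap, fmt);
  std::vfprintf(stderr, fmt, ap);
  va_end(ap);
}

int main() {
  log_err("error %d: %s\n", 42, "oops");
}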
But the bigger problem here is that even though we tried our best to only touch the profitability analysis, we did break the transformation, which highlights the problem I talked about earlier. Thankfully, however, there is an easy fix. Hopefully you remember the user decision we mentioned earlier. In that function, if the user uses alwaysinline but the function also has noinline, we issue the first “noinline call site attribute” remark. However, right after that, if it doesn't have the noinline attribute, we check whether the function can be inlined at all!
So, LLVM makes our lives easier by providing isInlineViable(), which checks if a function can be inlined. And we can see here that this function returns false if the function in question uses va_start(). Now, we can modify our code so that it only inlines what's viable:
if (isInlineViable(*Callee).isSuccess()) {
  return InlineCost::getAlways("Decided by our majesty");
}
And now, without changing anything in our original code, we can compile all of it! Perhaps surprisingly, the executable also runs without any errors, and the resized and saved images look fine. We'll have more to say about that executable and its running time, but for now we need to make sure that our code can compile complicated stuff.
The LLVM test suite includes a bunch of applications and benchmarks; it is a publicly available crash test for compilers in general, and LLVM in particular. I used the following CMake command to configure it (barring Mac-specific stuff):
cmake ../test-suite \
-DCMAKE_C_COMPILER=<path to custom build>/bin/clang \
-DCMAKE_CXX_COMPILER=<path to custom build>/bin/clang++ \
-C../test-suite/cmake/caches/O3.cmake
You can optionally add the following to confirm that our modifications take place, but it will clutter the output:
-DCMAKE_C_FLAGS="-Rpass=inline" \
-DCMAKE_CXX_FLAGS="-Rpass=inline"
I usually build SingleSource/Benchmarks and MultiSource/Benchmarks, which in this case will be useful later too. In my setup some benchmarks don't compile, but they don't compile with my vanilla Clang build either, so I don't consider it a problem. The reason they fail is that I didn't build compiler-rt in my LLVM builds, which is a pain to deal with on a Mac. Here is the full list of benchmarks that didn't compile in my setup, for both builds:
SingleSource/Benchmarks/CoyoteBench/fftbench
MultiSource/Benchmarks/DOE-ProxyApps-C/RSBench
I also disabled the following benchmarks:
MultiSource/Benchmarks/tramp3d-v4/
MultiSource/Benchmarks/PAQ8p
because they took way too long to compile with the always-inline build.
Do not compare your custom-built Clang with the built-in Clang, because the built-in Clang may have been built differently (e.g., without assertions, or with compiler-rt). This, in turn, will lead to inaccurate comparisons. The easiest way to avoid that is to just download the LLVM source code twice, and use two different build directories, one for each source directory. Of course you can use a single source directory by creating a vanilla build before you make any changes to the code, but this is more risky. With a separate source you just never touch that source and that's it; even if you accidentally hit recompile in the vanilla version, you know nothing's changed.
In any case, you should use the same CMake configuration on both. For the following experiments I used:
cmake ../llvm -DLLVM_ENABLE_PROJECTS='clang;clang-tools-extra' -DLLVM_TARGETS_TO_BUILD=host -DCMAKE_BUILD_TYPE='Release' -DLLVM_ENABLE_ASSERTIONS=ON -G Ninja
Building with assertions is fine in terms of performance in this case, because the inlining overhead will (strongly) dominate the comparisons.
For similar reasons, it will make your life easier to have 2 test-suite builds, although in this case it's better to have one source directory because you want any changes you make to apply to both.
Other than that, everything compiles successfully. You can try compiling some applications from MultiSource/Applications, but note that if you try to compile sqlite3, it will take forever.
Now it's time to answer the most important question: is there any harm in inlining everything (as compiler developers would have you believe)?
First, even though originally we said we didn't care about these metrics, I think it's still interesting to explore them. Compiling the image processing pipeline takes a whopping ~50 seconds with the always-inline build (at -O3). In contrast, it takes only ~2 seconds with vanilla clang -O3. This is about a 25x slowdown!
In terms of executable size, vanilla Clang gives us an executable that is about 304 kilobytes, whereas always-inline Clang gives us one that is 3.4 megabytes, or about 10x larger.
In short, things are definitely not looking great in terms of these metrics. One thing to keep in mind is that both compilation time and executable size increase exponentially as the source-code size increases. This is simply because if, say, A() calls B(), which calls C(), inlining C() into B() and B() into A() will compile C()'s code twice: once when we compile B()'s body, and once when we compile A()'s body. Also, if X() is called in Y() and Z() and we inline it in both places, again its code will be compiled twice. You can see how this can blow up very quickly.
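A minimal sketch of that blow-up (my own toy example):

// Before inlining: each body is compiled exactly once.
void C() { /* lots of code */ }
void B() { C(); /* more code */ }
void A() { B(); /* more code */ }

// After inlining everything, conceptually:
//   B() contains a copy of C()'s body, and
//   A() contains copies of B()'s and (transitively) C()'s bodies,
// so C()'s code is compiled once per transitive caller.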
This is why compiling a large application can still be viable if it involves many small compilation units, but it can be out of the question if it contains few but large compilation units. For example, MultiSource/JM contains about 65k LoC and MultiSource/sqlite3 contains about 53k LoC. But JM's 65k LoC are distributed across many compilation units, while sqlite3 has a single compilation unit (comprising sqlite3.c and the .h file it includes). As such, compiling JM takes about 25s, whereas compiling sqlite3 takes... well, I got tired of waiting after 4 hours had passed.
Finally we come to the most important question: do we have any performance degradation?
Let's start with the image processing pipeline. In this case, the answer seems to be a clear no. Both versions with -O3 run in about 2.9s. Using -O2 doesn't make a difference between the two builds either.9 Using -O1 does not make a difference between the two compilations either (but it does make a difference compared with -O2 and -O3, as both builds now run in about 3.3s).
Now let's compare the test suite. This is what I used to run the benchmarks (once for each build):
<path to build>/bin/llvm-lit -sv -j 1 -o results.json ./SingleSource/Benchmarks/ ./MultiSource/Benchmarks/
In terms of total time for all the benchmarks, the vanilla build took about 258 seconds, whereas the always-inline build took about 260 seconds. This is practically no difference, especially across 309 tests. We can use the two .json files and the following Python code to see if there was any significant difference in any particular benchmark.
import pandas as pd
import json
from scipy.stats import gmean

def read_json(f):
    with open(f, 'r') as fp:
        d = json.load(fp)
    return d['tests']

def process_tests(tests):
    # Extract (benchmark name, execution time) pairs, shortening the names.
    names = []
    times = []
    for t in tests:
        name = t['name']
        name = name[len("test-suite :: "):]
        name = name[:-len('.test')]
        name = name.replace('MultiSource/Benchmarks', 'MS')
        name = name.replace('SingleSource/Benchmarks', 'SS')
        names.append(name)
        times.append(t['metrics']['exec_time'])
    ### END FOR ###
    return names, times

van = read_json('results_vanilla.json')
inl = read_json('results_inline.json')
names_van, times_van = process_tests(van)
names_inl, times_inl = process_tests(inl)
assert names_van == names_inl

ser_van = pd.Series(times_van, index=names_van)
ser_inl = pd.Series(times_inl, index=names_van)
# rel1 > 1 means always-inline was faster; rel2 > 1 means vanilla was faster.
ser_rel1 = ser_van / ser_inl
ser_rel2 = ser_inl / ser_van
ser_abs = ser_van - ser_inl
df = pd.DataFrame(data={'rel1': ser_rel1, 'rel2': ser_rel2, 'abs': ser_abs, 'van': ser_van},
                  index=ser_rel1.index)

print('--- 50 fastest benchmarks with always-inline')
print(df.nlargest(50, columns=['rel1']))
print()
print('--- 50 fastest benchmarks with vanilla')
print(df.nlargest(50, columns=['rel2']))
print('-----')
m_perc = (gmean(ser_rel1) - 1) * 100
print(f'Geomean relative speedup of inline vs vanilla: {m_perc:.3}%')
You can expand the collapsible below to see the detailed results:
--- 50 fastest benchmarks with always-inline
rel1 rel2 abs van
MS/Prolangs-C++/fsm/fsm 4.166667 0.240000 0.0019 0.0025
MS/Prolangs-C/unix-tbl/unix-tbl 3.375000 0.296296 0.0019 0.0027
MS/Prolangs-C++/family/family 3.000000 0.333333 0.0012 0.0018
SS/Stanford/FloatMM 2.904555 0.344287 0.0878 0.1339
MS/Prolangs-C++/deriv2/deriv2 2.714286 0.368421 0.0012 0.0019
MS/Prolangs-C/simulator/simulator 2.714286 0.368421 0.0012 0.0019
MS/Prolangs-C/assembler/assembler 2.666667 0.375000 0.0015 0.0024
MS/Prolangs-C/unix-smail/unix-smail 2.566667 0.389610 0.0047 0.0077
MS/Prolangs-C++/trees/trees 2.250000 0.444444 0.0010 0.0018
MS/Prolangs-C/allroots/allroots 2.166667 0.461538 0.0007 0.0013
MS/Prolangs-C/agrep/agrep 2.111111 0.473684 0.0050 0.0095
MS/Prolangs-C++/simul/simul 1.933333 0.517241 0.0014 0.0029
MS/Prolangs-C++/garage/garage 1.812500 0.551724 0.0013 0.0029
SS/Polybench/linear-algebra/solvers/durbin/durbin 1.800000 0.555556 0.0028 0.0063
MS/MiBench/security-sha/security-sha 1.660000 0.602410 0.0033 0.0083
MS/Prolangs-C/gnugo/gnugo 1.639269 0.610028 0.0140 0.0359
MS/Prolangs-C/bison/mybison 1.586207 0.630435 0.0017 0.0046
MS/BitBench/uudecode/uudecode 1.581395 0.632353 0.0025 0.0068
MS/Prolangs-C++/vcirc/vcirc 1.571429 0.636364 0.0004 0.0011
MS/mediabench/g721/g721encode/encode 1.533019 0.652308 0.0113 0.0325
MS/TSVC/Symbolics-flt/Symbolics-flt 1.528159 0.654382 0.1116 0.3229
MS/Prolangs-C++/office/office 1.500000 0.666667 0.0004 0.0012
MS/mediabench/gsm/toast/toast 1.500000 0.666667 0.0030 0.0090
SS/Shootout-C++/Shootout-C++-moments 1.469231 0.680628 0.0061 0.0191
MS/MiBench/security-rijndael/security-rijndael 1.469072 0.680702 0.0091 0.0285
SS/McGill/exptree 1.400000 0.714286 0.0002 0.0007
MS/Prolangs-C++/employ/employ 1.393443 0.717647 0.0024 0.0085
MS/mediabench/adpcm/rawcaudio/rawcaudio 1.384615 0.722222 0.0005 0.0018
MS/Prolangs-C/compiler/compiler 1.380952 0.724138 0.0008 0.0029
MS/MiBench/consumer-jpeg/consumer-jpeg 1.357143 0.736842 0.0005 0.0019
MS/Prolangs-C/cdecl/cdecl 1.346154 0.742857 0.0009 0.0035
SS/Shootout-C++/EH/Shootout-C++-except 1.320276 0.757417 0.0556 0.2292
SS/BenchmarkGame/n-body 1.295739 0.771760 0.0354 0.1551
SS/Stanford/Oscar 1.285714 0.777778 0.0002 0.0009
SS/Misc-C++-EH/spirit 1.282140 0.779946 0.2706 1.2297
MS/FreeBench/mason/mason 1.266839 0.789366 0.0103 0.0489
MS/McCat/08-main/main 1.253731 0.797619 0.0017 0.0084
MS/MiBench/office-stringsearch/office-stringsearch 1.250000 0.800000 0.0002 0.0010
SS/Misc/flops-8 1.233927 0.810421 0.0302 0.1593
SS/Polybench/linear-algebra/blas/gesummv/gesummv 1.200000 0.833333 0.0001 0.0006
SS/Shootout-C++/Shootout-C++-hello 1.200000 0.833333 0.0001 0.0006
MS/Prolangs-C++/ocean/ocean 1.185714 0.843373 0.0026 0.0166
MS/DOE-ProxyApps-C/SimpleMOC/SimpleMOC 1.181476 0.846399 0.0578 0.3763
MS/McCat/17-bintr/bintr 1.174089 0.851724 0.0043 0.0290
SS/Misc/himenobmtxpa 1.159236 0.862637 0.0175 0.1274
SS/Stanford/Queens 1.157895 0.863636 0.0006 0.0044
MS/TSVC/CrossingThresholds-flt/CrossingThreshol... 1.117769 0.894640 0.0684 0.6492
MS/BitBench/uuencode/uuencode 1.114754 0.897059 0.0007 0.0068
MS/MiBench/consumer-typeset/consumer-typeset 1.107527 0.902913 0.0040 0.0412
SS/Stanford/Bubblesort 1.103896 0.905882 0.0008 0.0085
--- 50 fastest benchmarks with vanilla
rel1 rel2 abs van
SS/Shootout-C++/Shootout-C++-wordfreq 0.206897 4.833333 -0.0023 0.0006
SS/Shootout-C++/Shootout-C++-wc 0.250000 4.000000 -0.0015 0.0005
MS/MiBench/security-blowfish/security-blowfish 0.625000 1.600000 -0.0003 0.0005
SS/Shootout/Shootout-ackermann 0.646617 1.546512 -0.0047 0.0086
SS/Stanford/Quicksort 0.675926 1.479452 -0.0070 0.0146
MS/McCat/05-eks/eks 0.692308 1.444444 -0.0004 0.0009
SS/Stanford/Puzzle 0.711575 1.405333 -0.0152 0.0375
MS/Prolangs-C++/objects/objects 0.714286 1.400000 -0.0002 0.0005
MS/mediabench/jpeg/jpeg-6a/cjpeg 0.725490 1.378378 -0.0014 0.0037
MS/Prolangs-C++/city/city 0.731707 1.366667 -0.0011 0.0030
SS/Stanford/Perm 0.743902 1.344262 -0.0042 0.0122
SS/Stanford/Towers 0.750000 1.333333 -0.0028 0.0084
SS/Shootout-C++/Shootout-C++-strcat 0.755968 1.322807 -0.0092 0.0285
MS/MiBench/network-patricia/network-patricia 0.780992 1.280423 -0.0053 0.0189
SS/Polybench/stencils/jacobi-1d/jacobi-1d 0.785714 1.272727 -0.0003 0.0011
MS/McCat/03-testtrie/testtrie 0.796610 1.255319 -0.0012 0.0047
MS/Prolangs-C++/NP/np 0.800000 1.250000 -0.0001 0.0004
MS/Prolangs-C/fixoutput/fixoutput 0.812500 1.230769 -0.0003 0.0013
SS/Stanford/Treesort 0.816327 1.225000 -0.0054 0.0240
MS/mediabench/mpeg2/mpeg2dec/mpeg2decode 0.827586 1.208333 -0.0015 0.0072
SS/Shootout-C++/Shootout-C++-objinst 0.833333 1.200000 -0.0001 0.0005
SS/Shootout-C++/Shootout-C++-spellcheck 0.833333 1.200000 -0.0001 0.0005
MS/MiBench/office-ispell/office-ispell 0.857143 1.166667 -0.0001 0.0006
MS/Prolangs-C++/deriv1/deriv1 0.857143 1.166667 -0.0001 0.0006
MS/Prolangs-C++/shapes/shapes 0.857143 1.166667 -0.0001 0.0006
SS/Shootout-C++/Shootout-C++-sumcol 0.857143 1.166667 -0.0001 0.0006
SS/Stanford/IntMM 0.857143 1.166667 -0.0001 0.0006
MS/McCat/18-imp/imp 0.864130 1.157233 -0.0025 0.0159
MS/MallocBench/gs/gs 0.866197 1.154472 -0.0019 0.0123
MS/FreeBench/neural/neural 0.874126 1.144000 -0.0054 0.0375
SS/Shootout-C++/Shootout-C++-reversefile 0.875000 1.142857 -0.0001 0.0007
SS/Polybench/linear-algebra/blas/symm/symm 0.888889 1.125000 -0.0001 0.0008
SS/Dhrystone/fldry 0.890285 1.123236 -0.0227 0.1842
MS/Prolangs-C/football/football 0.894737 1.117647 -0.0002 0.0017
SS/Shootout/Shootout-lists 0.899421 1.111826 -0.1963 1.7554
MS/Olden/treeadd/treeadd 0.905782 1.104019 -0.0044 0.0423
SS/CoyoteBench/almabench 0.915684 1.092080 -0.0872 0.9470
MS/TSVC/Reductions-flt/Reductions-flt 0.928026 1.077556 -0.1179 1.5202
MS/Olden/voronoi/voronoi 0.930070 1.075188 -0.0050 0.0665
MS/llubenchmark/llu 0.935111 1.069392 -0.0993 1.4310
SS/Shootout-C++/Shootout-C++-hash 0.937424 1.066753 -0.0103 0.1543
MS/Ptrdist/ks/ks 0.941482 1.062155 -0.0124 0.1995
MS/Olden/perimeter/perimeter 0.941704 1.061905 -0.0026 0.0420
SS/Shootout-C++/Shootout-C++-ary2 0.943396 1.060000 -0.0003 0.0050
SS/Shootout-C++/Shootout-C++-lists1 0.958273 1.043544 -0.0029 0.0666
MS/TSVC/StatementReordering-dbl/StatementReorde... 0.958845 1.042921 -0.0479 1.1160
SS/Shootout/Shootout-heapsort 0.959572 1.042132 -0.0370 0.8782
SS/Polybench/linear-algebra/kernels/bicg/bicg 0.961538 1.040000 -0.0003 0.0075
SS/Shootout-C++/Shootout-C++-fibo 0.964357 1.036960 -0.0267 0.7224
SS/Polybench/linear-algebra/blas/gemver/gemver 0.966102 1.035088 -0.0004 0.0114
-----
Geomean relative speedup of inline vs vanilla: 4.66%
What can we make of all that? Well, the differences are too small to make any strong claim in this setup (see the future work below). But if anything, inlining everything makes things faster; in any case, the results tell us that the always-inline version has a geomean speedup of 4.66%.
From the experimental evidence above, we can definitely not conclude that inlining everything generates slower executables. We can definitely conclude, though, that compilation times and executable sizes are inflated when we inline everything. Personally, this is what I make of these experiments: don't be afraid of __attribute__((always_inline)). And I say this because important figures in the compiler community urge you not to use it at all! One prominent such figure is Chandler Carruth. In this video he says:
[T]here are special implementation tricks, like the always_inline attribute, that GCC supports, and ICC supports, and Microsoft [i.e., MSVC] supports—everyone has one of these attributes, right? I would strongly encourage to not use that attribute—ever, okay? If you're using that attribute, you have found a bug in the inliner of your optimizer. If you're going to use the attribute, I would ask: please, please, file a bug against the optimizer first, and then add the attribute with a little comment that says: “by the way, we should remove this as soon as they fix this bug over here that's hurting our performance”. It should always be based on performance measurements, and you should always file a bug with the complaint to the optimizer about “why did you not actually inline this?”.
I'm not sure I'm totally in line (pun intended) with this argument. Yes, I agree that your decisions should be based on performance measurements. And I agree that you should file a bug against the optimizer. But to say that you should never use the attribute is a completely different story. Software like LLVM is slow-moving, and you can't wait for everything to be fixed, especially as we move more and more towards machine-learning-based tools, in which fixing special cases is harder than adding an if. So, I think the relaxed statement above—that you should use it and file a bug—makes more sense in practice.
Whether you should add a comment... I'm not sure. People have things to worry about other than whether LLVM fixed this or that inlining bug. Frankly, who cares? And I say this as a compiler enthusiast; most people aren't. All of this is all the more true when you consider that adding an always_inline here and there is not the end of the world, as our experiments showed; that's especially true if you use it pointedly and based on performance measurements, rather than adding it everywhere as we did.
6. We have an edge A→B if function A calls function B.↩

7. If A calls B (that is, you have an edge A→B in the call graph), then you may potentially inline B into A. This is also why it makes sense to visit the call graph bottom-up (i.e., starting from the leaves in the call graph, the functions that don't call any other functions), as I explain in the video (although other compilers use a top-down approach). So, in this case, we first visit B, inline anything we want to inline into B, and then potentially inline the “new” B—that is, the new B now has all the code of the inlined functions in its body—into A. But say we have a cycle—let's say A→B→C→A; how will we proceed? As we said, you'd usually start from the “bottom”—the leaves. But here there are no leaves! So, you see the problem! If you try to force it, you will end up in a cycle of inlining, i.e., inlining A into C, and C into B, and B into A, and now again A into C, and so on... This gives you an idea of why it's hard to inline recursive calls.↩

8. At -O0, along with optnone, Clang also adds noinline. Here's the relevant code.↩

9. This is expected, as -O3 is very similar to -O2; see my previous article. But it does make a difference in compilation times, especially for the always-inline build, which takes about 45s to compile.↩