A big thanks to Diomidis Spinellis who read both parts in this series, caught *many* mistakes, and provided valuable feedback!
This is the successor to a previous
article where we learned how to simulate Unix v7 on PDP-11 with OpenSIMH. In
this article we will use the C compilers that shipped with Unix v7—namely
Dennis Ritchie’s compiler (cc) and the Portable C Compiler
(pcc) by Steven C. Johnson—to explore how C was back then,
before it became ANSI C. Theoretically you don’t need to go into the troubles of
the previous article to follow this article. You can follow along as long as you
have a running copy of Ritchie’s compiler. But I find it much more reliable to
use Unix v7 on PDP-11.
Even though, as today, variables could have names
arbitrarily long (meaning you wouldn’t get a compiler error if you had e.g., a
20-char variable name), only the first 8 characters were considered by the
compiler. To see that in practice, create a
.c file with:
int main(void) {
int our_variable_one=111;
int our_variable_two=222;
return 0;
}
Now let’s try to compile it with cc:
# cc compiler.c
compiler.c:3: our_vari redeclared
compiler.c:3: Declaration syntax
pcc has the same behavior:
# pcc compiler.c
compiler.c
"compiler.c", line 3: redeclaration of our_vari
Later, when we look at the source code of the compiler, we’ll see exactly why we get this behavior in cc.
As you might expect, this is how the compiler created the object files too,
which means it manifested as a linking problem too. For example, let’s create
link1.c:
int global_variable_one=111;
int ret1() {
return global_variable_one;
}
Then link2.c:
int global_variable_two=222;
int ret2() {
return global_variable_two;
}
And finally main.c:
int ret1();
int ret2();
int main() {
int a = ret1() + ret2();
}
Now we do:
# cc link1.c -c
# cc link2.c -c
# cc main.c -c
These three run fine, but then if we try to link them...
# cc link1.o link2.o main.o
_global_: ld:link2.o: Multiply defined
But that’s not a problem caused by the linker. Let’s see the object file:
# nm link1.o
000020 t L1
000006 t L2
000014 t L3
000022 D _global_
000000 T _ret1
U cret
U csv
000000 t ~ret1
There are a bunch of interesting things here. First of all, look how small the
object file is! This will make more sense if we see the assembly, which we can
do with adb (in the last section I will present an easier way).
# adb link1.o
_ret1,10?i
_ret1: jsr r5,_ret1
br L1
L2: mov _global_,r0
br L3
L3: jmp _ret1
L1: br L2
jmp *-(pc)
text address not found
We launch link1.o in adb, and then we tell the tool to print 10 instructions
starting from _ret1 label. First, _ret1 corresponds to ret1(). It was a
convention back then (that is still true in different forms) for the C compiler
to prepend C symbols with _ to distinguish them from non-C symbols, like csv
(which we will talk about in a bit). The first line is a jump to _ret1, which
looks like infinite recursion. But the reason is that in place of _ret1 is
an undefined relocation. In other words, the instruction is really like this:
jsr r5,<undefined>
because _ret1 has not gotten an address yet. What will go to its place is
csv, which stands for “C save (registers)”, i.e., it sets up the stack. In any
case, we could spend a lot of time talking about linkers, but let’s get back to
the problem at hand. As you can see, the symbol was truncated by the compiler
and that’s how it made it to the object file. It’s not the linker’s problem.
I wanted to test if the linker has this problem, and I tried to write raw
assemby and and then assemble it into an object file:
cat > ret1.s << 'EOF'
.data
.globl _global_variable_one
_global_variable_one:
.word 111
.text
.globl _ret1
_ret1:
mov _global_variable_one, r0
rts pc / return r0
EOF
and we assemble it:
as ret1.s
This produces a.out. Now let’s see the symbols:
# nm a.out
u .word
000006 D _global_
000000 T _ret1
So apparently it is also a problem in the assembler. I think the only alternative if we want to make sure our object files have the full identifier is to write raw bytes. That is quite a pain, so it is left as an exercise to the reader...
In the very old C you could initialize variables without the = operator. Here’s
init.c:
#include <stdio.h>
int main() {
int a 43;
printf("%d\n", a);
}
Then the following works just fine:
# cc init.c
# ./a.out
43
pcc gives us a warning but it still works:
# pcc init.c
init.c
"init.c", line 4: warning: old-fashioned initialization: use =
# ./a.out
43
There are many keywords in ANSI C that have been introduced back then. One of
them is void. Here’s void.c:
void test() { return; }
int main() {
test();
}
Compilation fails:
# cc void.c
void.c:1: External definition syntax
const and volatile were also not available. Here’s const.c:
int main() {
const int a = 2;
}
And again, compilation fails:
# cc const.c
const.c:2: const undefined; func. main
const.c:2: Expression syntax
const.c:3: a undefined
Similarly for volatile:
# cc volatile.c
volatile.c:2: volatile undefined; func. main
volatile.c:2: Expression syntax
volatile.c:4: a undefined
Fields of one struct could be accessed from a variable of type of
another struct without a cast. Here’s an example in sct.c:
#include <stdio.h>
struct A {
int x;
int y;
};
struct B {
int w;
int z;
};
int main() {
struct A a;
struct B *bp;
a.x = 1;
a.y = 2;
bp = &a;
printf("%d\n", bp->x);
}
This compiles and runs fine:
# cc sct.c
# ./a.out
1
We did not have to add a cast to assign &a to bp. More surprisingly,
even though the compiler knows that bp points to struct B and struct B
does not have a field x, it does not complain. This is because struct fields
in K&R C were global, not tied to a specific struct declaration. What was
saved was only the offset from the struct they were declared to so that you
can access them correctly. This is not an edge case; as we will see, it is
used all over Dennis Ritchie’s C compiler.
K&R C allowed what we now call “old-style function definition”. Here’s what it looks like:
#include <stdio.h>
int test(x) int x; { return x + 2; }
int main() {
float x = 1.2;
int y = test(x);
printf("%d\n", y);
}
We provide the name of parameters (x) but not its type. The type is provided
between ) and {, where nothing goes today. But if the type is added there,
it is not type-checked, which is why this program compiles with not even a
warning by both cc and pcc. Instead, this should be thought of as a
cast. That is, we cast whatever comes in to int. The value we get in both
cases is: (a) bogus, and (b) the same across the two compilers.
# cc fdef.c
# ./a.out
16539
# pcc fdef.c
fdef.c
# ./a.out
16539
Now there are two things that are crazy. The first is that all the previous examples do not compile today. However, this example not only does it still compiles today, but it also does so with no warning at all even if you ask for extra warnings!
> gcc test.c -o test -Wall -Wextra && ./test
3
> clang test.c -o test -Wall -Wextra && ./test
3
The only difference is that you get a value you would expect.
The second thing that is crazy is that K&R C in Unix v7 simply did not support the function definitions we know today.
int test(int a, int b) { return a + b; }
int main() {
int a = test(1 , 2);
}
This does not compile with either compiler.
# cc fdef2.c
fdef2.c:1: Declaration syntax
fdef2.c:1: Declaration syntax
fdef2.c:1: a undefined; func. test
fdef2.c:1: b undefined; func. test
# pcc fdef2.c
fdef2.c
"fdef2.c", line 1: syntax error
"fdef2.c", line 1: warning: old-fashioned initialization: use =
"fdef2.c", line 1: warning: undeclared initializer name a
"fdef2.c", line 1: unknown size
"fdef2.c", line 1: cannot recover from earlier errors: goodbye!
The moment we’ve all been waiting for arrived: we are going to take a sneak peek
at the internals of Dennis Richie’s compiler. The source code lives under
/usr/src/cmd/c and the directory looks like this:
c0.h
c00.c
c01.c
c02.c
c03.c
c04.c
c05.c
c1.h
c10.c
c11.c
c12.c
c13.c
c2.h
c20.c
c21.c
cvopt.c
makefile
table.s
The compiler has 3 main passes: c0, c1, and c2, and a special pass, cvopt. c1 is
what we would call today the “front end”. It does lexing, parsing, and semantic
analysis. c1 is what we would call today the “back end”, as it generates
assembly from an AST, while also performing some AST optimizations. c2 is a
low-level optimizer that works directly on the assembly. Finally, cvopt is also
an optimization pass, but doesn’t perfrorm standard compiler optimizations; it’s
for compressing the output assembly. The files of each pass (e.g.,
c00.c-c05.c) don’t seem to have any specific ordering.
Let’s first review the issue of the 8-character length for identifiers. What
concerns us is lexing, which is where identifiers should be recognized. Lexing
happens in the function symbol() in c00.c. If we ignore some logic dealing
with peeking characters, basically this function first calls getchar() from
the C standard library to get the next character. It also has a character table,
ctab, defined in c05.c, which maps characters to syntactical categories. It
looks like this:
char ctab[] {
EOFC, INSERT, UNKN, UNKN, UNKN, UNKN, UNKN, UNKN,
UNKN, SPACE, NEWLN, SPACE, SPACE, UNKN, UNKN, UNKN,
...
SPACE, EXCLA, DQUOTE, SHARP, UNKN, MOD, AND, SQUOTE,
LPARN, RPARN, TIMES, PLUS, COMMA, MINUS, PERIOD, DIVIDE,
...
DIGIT, DIGIT, COLON, SEMI, LESS, ASSIGN, GREAT, QUEST,
UNKN, LETTER, LETTER, LETTER, LETTER, LETTER, LETTER, LETTER,
...
LETTER, LETTER, LETTER, LBRACE, OR, RBRACE, COMPL, UNKN
};
A character table is standard practice in front ends, but this is a pretty
hardcore way to make one, meaning, by hand. You have to make sure you map each
character in the ASCII table correctly, and if you miss one, then it’s all
wrong. But it wasn’t the only case in old compilers. LCC did the
same.
You don’t have to do it this way, though. You can do something like
this.
Note something subtle in Ritchie’s table: the underscore maps to LETTER.
symbol() then switches on the lookup (still inside symbol()):
switch(ctab[c]) {
...
We are concerned with the case LETTER:
case LETTER:
sp = symbuf;
while(ctab[c]==LETTER || ctab[c]==DIGIT) {
if (sp<symbuf+NCPS)
*sp++ = c;
c = getchar();
}
while(sp<symbuf+NCPS)
*sp++ = '\0';
mossym = 0;
if (mosflg) {
mossym = FMOS;
mosflg = 0;
}
peekc = c;
if ((c=lookup())==KEYW && cval==SIZEOF)
c = SIZEOF;
return(c);
The goal of this code is to save the characters of an identifier into symbuf,
which is just a buffer of characters (defined in c0.h):
char symbuf[NCPS+2];
We keep going as long as we have a letter or a digit and then ignoring this
mos-related logic,1 we look up the identifier to see if it’s a keyword. There’s a
special case for sizeof, I think because it’s the only keyword that can appear
arbitrarily inside expressions. So if it is sizeof we return that, otherwise
we return what lookup() returned. If lookup() can’t find the identifier in
keywords, it will return NAME, which means this code will return NAME for
valid identifiers, which is what we want.
But wait a minute... What is this NCPS? Apparently the code saves in
symbuf (through sp) only the first NCPS characters. As you might
expect, this is the Number of Characters Per Symbol. It’s defined in c0.h:
#define NCPS 8 /* # chars per symbol */
And this is why our identifiers are truncated to the first 8 characters. This is also why old C code used short names for variables, function names, etc.
Surprisingly there is quite a bit of optimization performed by cc. Here we’ll
take a look at a two cases.
In c12.c we can find the following:
constant:
if (tree->tr1->op==CON && tree->tr2->op==CON) {
const(op, &tree->tr1->value, tree->tr2->value);
return(tree->tr1);
}
The compiler deals with trees internally. As it was common in old compilers, a
tree is really a tree node. In this case it has two children, tr1 (the LHS)
and tr2 (the RHS). If both are constant, we go to const() to optimize it. In
the second argument we pass a pointer because we will modify it to have the
folded constant. Let’s take a look at a few cases in the body of the function.
Generally, it is what you would expect:
case PLUS:
*vp =+ v;
return;
case TIMES:
*vp =* v;
return;
case AND:
*vp =& v;
return;
However, note that it does not contain any floating-point constant folding
(unlike modern compilers; see this). Let’s
see that in practice. We first create flt.c:
float foo() {
double a = 1.2 + 2.4;
}
I use doubles because K&R C did not have float literals (e.g., 1.2f), so all
floating-point literals were by default 64-bit doubles.
Now we want to see the assembly code. Instead of following our previous method
with adb, we will simply ask the compiler to generate assembly code. To do
that, we need to understand better how compiler the driver works. This is
/usr/src/cmd/cc.c (which calls the passes). The main()
of this file has the following near the beginning:
while(++i < argc) {
if(*argv[i] == '-') switch (argv[i][1]) {
default:
goto passa;
case 'S':
sflag++;
cflag++;
So, if we pass -S, then sflag becomes 1 and also -c is assumed. The
other interesting part of this program is this:
if (oflag) {
av[0] = "c2";
av[1] = tmp5;
av[2] = tmp3;
av[3] = 0;
if (callsys(pass2, av)) {
unlink(tmp3);
tmp3 = assource = tmp5;
}
else
unlink(tmp5);
}
if (sflag)
continue;
assemble:
av[0] = "as";
av[1] = "-u";
av[2] = "-o";
...
if (callsys("/bin/as", av) > 1) {
A couple of things happen here. First, c2 is not called unless we pass the
-O option. Second, if we pass -S we don’t proceed to assemble the assembly
file into an object file. And this is what we want because that way we can both:
(a) inspect the assembly, and (b) call c2 manually to see what it does. We’re
ready to compile the file:
# cc flt.c -S
# cat flt.s
...
L2:.data
L10000:40231;114631;114631;114632
.text
.data
L10001:40431;114631;114631;114632
.text
movf L10000,r0
addf L10001,r0
movf r0,-16(r5)
...
.globl fltused
...
I have isolated the interesting parts. Most importantly, in
the data segment we have the two FP numbers, and in the text segment (i.e., the
body of the function) we have an explicit addf instruction (which
was not folded during compilation). We can run c2 on its own, but it gives us
basically the same thing:
# /lib/c2 flt.s
.data
L10000:40231;114631;114631;114632
L10001:40431;114631;114631;114632
...
movf L10000,r0
addf L10001,r0
movf r0,-16(r5)
jmp cret
.globl fltused
Two things are subtle here. The first is the floating-point encoding. It is not the standard IEEE-754 floating-point format we have today, because that had not even been established yet (that happened in 1985)! It uses the PDP-11 floating-point format.
The other thing you probably noticed is that in both
snippets I left the fltused declaration there. The reason is that
this symbol is still relevant today for the same reason it was back then. In
c10.c we find the following:
/*
* If any floating-point instructions
* were used, generate a reference that
* pulls in the floating-point part of printf.
*/
if (nfloat)
printf(".globl fltused\n");
Back then it was a linking optimization, i.e., the linker
pulled parts of the standard library that dealt with floating-point numbers only
if fltused was defined. Today it is only relevant on Windows, whose
linker expects to find the symbol, something one has to deal with if they want
to create freestanding executables.
This is how LLVM names this kind of optimization, which is why I’m using this
name (we could also use the term “peephole optimizations”, but it is more
general). I am talking about e.g., turning a*2 to a<<1. Here’s an excerpt
from c11.c:
ispow2(atree)
{
register int d;
register struct tnode *tree;
tree = atree;
if (!isfloat(tree) && tree->tr2->op==CON) {
d = tree->tr2->value;
if (d>1 && (d&(d-1))==0)
return(d);
}
return(0);
}
pow2(atree)
struct tnode *atree;
{
register int d, i;
register struct tnode *tree;
tree = atree;
if (d = ispow2(tree)) {
for (i=0; (d=>>1)!=0; i++);
tree->tr2->value = i;
switch (tree->op) {
case TIMES:
tree->op = LSHIFT;
break;
case DIVIDE:
tree->op = ULSH;
tree->tr2->value = -i;
break;
There are a couple of interesting things here. First, note that in both
functions tree is a pointer to struct tnode. Here’s its declaration in
c1.h:
struct tnode {
int op;
int type;
int degree;
struct tnode *tr1, *tr2;
};
You may notice that it does not have a
value field, yet both functions access one using
tree->tr2 (which is also a struct tnode pointer). This
because we have already checked that tree->tr2 is a constant, and
there is a struct tnode that has an int value field.
We explained why this is possible earlier.
Other than that, ispow2() checks whether the RHS is a constant with a value
that is a power of 2. If that is the case, pow2() handles many different
cases. Here I have two: multiplication to left shift and division to right
shift. The way it happens is with a for loop that keeps incrementing i while
there are still set bits in d. Note that the loop body is empty! It is just
;. Since d is a power of 2, i.e. 2^e, there is only one set bit and this
loop stores the exponent e in i. Today we wouldn’t use a for loop to do that
because there is an instruction that counts trailing zeroes in almost all
instruction sets (e.g.,
tzcnt
in x86). But PDP-11 did not have such an instruction. Note that today the
compiler will optimize the code accordingly only if d is unsigned
(example) or if
you tell it (in a non-portable way) that d >= 0 (example).
Finally, note the =>> operator. Because of it,
the same code cannot be
parsed today. However, back then the compound assignment operators like
+=, >>=, etc. could be written either way. That is,
=>> was equivalent to >>=. This was discontinued in
ANSI C.
This concludes this small 2-part series. Hopefully you had fun with PDP-11, Unix v7, and K&R C. I hope it inspired you to go ahead and figure out more on your own!
symbol() says “mosflg means that the next symbol, if an
identifier, is a member of structure or a structure tag”. It’s a global variable
that is set in a few places, like strdec() in c03.c which parses struct
declarations.↩