Chapter 2 Integer Arithmetic
Where is this going: parse expressions like 3+4 or 5*6 and generate machine code.
Along the way:
- Programs as data, assigning meaning, changing representation
- Deriving a compiler from an interpreter
- Assembly primer, ISA as an abstraction over hardware
2.1 Representing Programs as Data
Programs are just data. Like other data, we can represent programs in different ways: sequences of characters (e.g. source code) binary formats (e.g. machine code), various forms of trees and graphs.
How to choose? Depending on purpose. Here: pick “the most natural” as a starting point.
2.1.1 Abstract Syntax
Abstract syntax in BNF notation:
<exp> ::= <num> + <num>
| <num> - <num>
| <num> * <num>
| <num> / <num>
| <num> % <num>
2.1.2 Programs as Data
Abstract syntax as Scala data structure:
abstract class Exp
case class Plus(x: Int, y: Int) extends Exp
case class Minus(x: Int, y: Int) extends Exp
case class Times(x: Int, y: Int) extends Exp
case class Div(x: Int, y: Int) extends Exp
case class Mod(x: Int, y: Int) extends Exp
2.1.3 Writing an Interpreter
What makes data a program is a function that assigns meaning – an interpreter:
type Val = Int
def eval(e: Exp): Val = e match {
case Plus(x,y) => x + y
case Minus(x,y) => x - y
...
}
Example:
eval(Plus(3,4)) // => 7
2.1.4 Our First Compiler
Instead of directly interpreting programs, we can translate them to another language. Let’s build our first (simple) compiler that translates expressions to source code in another language (Scala or C):
type Code = String
def trans(e: Exp): Code = e match {
case Plus(x,y) => s"$x + $y"
case Minus(x,y) => s"$x - $y"
...
}
Example:
trans(Plus(3,4)) // => "3+4"
Note: Compilers and interpreters are fundamentally linked (specialization, Futamura)
We can paste the generated code into a .c
file, run it through a
C compiler, and the run the executable from the shell, but of
course we’d like to automate this process so that we can just
write:
run(trans(Plus(3,4))) // => 7
This function run
is defined as follows:
def run(e: Code): Int = {
val code =
s"""
#include <stdio.h>
int miniscala_main() {
return ${e};
}
int main(int argc, char** argv) {
printf("%d\\n", miniscala_main());
return 0;
}
"""
// write code to a file
val out = new PrintStream("driver.c")
out.println(code)
out.close()
// compile using GCC
val compRes = Seq("gcc","driver.c","-o","driver").!
if (compRes != 0)
sys.error("GCC compilation failed")
// run it and return result
val runRes = Seq("./driver").!!
runRes.trim.toInt
}
2.1.5 Correctness of our Compiler
We can verify experimentally that compiling and then running the code
produces an equivalent output to interpreting the original program.
For all programs p
:
eval(p) == run(trans(p))
Note: run
is doing multiple steps in one. We could also write
run(trans(p))
as shell(gcc(trans(p)))
to make the C compilation
step explicit.
2.2 Compiling to Machine Code
Our first compiler is basically the identity transform. It does not lower the level of abstraction. Compiling to C is often a good strategy, but we want the real deal: x86-64 assembly.
2.2.1 Architecture Refresher
Machine language. Assembly language (ISA) is an abstraction. Processor is an interpreter for the given ISA. MuOps. ILP. Pipelining. Register renaming. (Don’t talk about memory hierarchy yet, leave for next chapter).
2.2.2 Notional Machine
It’s useful to build a mental model how a processor works (a notional machine). What we need to know right now is that a CPU has some internal state, and it executes commands that modify that state. For the moment we can assume that the internal state is a single memory slot (a register), often called the accumulator. CPU instructions modify this accumulator.
We can simulate CPU execution of 3+4 in Scala:
var ax: Int // accumulator register
ax = 3
ax += 4
Exercise: adapt our interpreter and to-source compiler to this model.
But now we need to figure out how assembly language works for real.
2.2.3 Figuring out Assembly Language
To figure out how assembly language works, it is often a good idea to look at output generated by an existing compiler. For example, we can take the following code:
int miniscala_main() {
return 7;
}
If we put it into a file driver.c
we can have GCC emit the generated
assembly into a file driver.s
by running:
gcc -S driver.c
The output will be quite large and inscrutable, but we can get something much more readable with two additional flags:
gcc -fno-asynchronous-unwind-tables --omit-frame-pointer -S driver.c
The file driver.s
should look roughly like this:
.text
.globl _miniscala_main
_miniscala_main:
movq $7, %rax
ret
Note: int vs long, movl vs movq
Note: we won’t get useful output for return (3+4)
because GCC will
optimize 3+4
into constant 7
right away. How can we modify the
program to break this optimization? Hint: try int x = 3; return x+4
.
Hint: Try the same with higher optimizations flags (-O1
to -O3
).
Note: on Mac, gcc
actually calls Clang (the LLVM-based C/C++
toolchain) per default. If you actually want GCC, you need to
install the GNU toolchain via Homebrew.
We do not have to understand everything right now.
2.2.4 Assembly Language Primer
See the following cheat sheet: https://cs.brown.edu/courses/cs033/docs/guides/x64_cheatsheet.pdf
Arithmetic instructions perform their operations in-place:
addq $4, %rax
This command adds the value 4 to register ax
. The $
in
$4
means that the literal 4 is to be read as a decimal number.
Without the dollar sign, it would be read as hexadecimal
(no difference here, but consider $10
vs. 10
). The q in
addq
and the r in rax
denotes that this is a 64-bit operation.
If we wanted to operate on 32-bit values we would write:
addl $4, %eax
Move instructions are similar but just move values around without changing them (literal to register, register to memory, memory to register, …).
Confusingly, there are two ways to write x86 instructions in textual form: Intel syntax vs AT&T or Unix syntax. We use Unix syntax since that is what GCC, LLVM, and many other tools expect.
2.2.5 Compiling to x86-64 Assembly
type Code = String
def trans(e: Exp): Code = e match {
case Plus(x,y) => s" movq $$$x, %rax\n addq $$$y, %rax"
case Minus(x,y) => s" movq $$$x, %rax\n subq $$$y, %rax"
case Times(x,y) => s" movq $$$x, %rax\n imulq $$$y, %rax" // rdx also set?
case Div(x,y) => s" movq $$$x, %rax\n" +
s" movq $$$y, %rbx\n" +
s" cqto # sign extend rax to rdx:rax\n" +
s" idivq %rbx"
case Mod(x,y) => s" movq $$$x, %rax\n" +
s" movq $$$y, %rbx\n" +
s" cqto # sign extend rax to rdx:rax\n" +
s" idivq %rbx\n" +
s" movq %rdx, %rax"
}
The triple $$$x
is necessary for Scala’s string interpolation
to emit $4
and so on. Another way to write this would be ${"$"+x}
.
Note that division and modulo are special: idivq
always
expects the divisor argument in %rax
and leaves the results
in %rax
and %rdx
. The sign extension is also necessary.
2.2.6 Compiling and Running x86-64
Use a little piece of C driver code to return program result to
the caller. We could also use the program exit value (int main()
)
but this only goes from 0..255. On MacOS it’s _miniscala_main
,
on Linux it’s miniscala_main
without the underscore.
def run(e: String): Int = {
val code =
s"""
.text
.globl _miniscala_main
_miniscala_main:
$e
ret
"""
// write code to a file
val out = new java.io.PrintStream("driver.s")
out.println(code)
out.close()
// compile using GCC
val compRes = Seq("gcc","runtime.o","driver.s","-o","driver").!
if (compRes != 0)
Reporter.abort("GCC compilation failed")
// run it and return result
val runRes = Seq("./driver").!!
runRes.trim.toInt
}
We assume that we have the same main
routine written
in C as before in a file runtime.c
:
#include <stdio.h>
int miniscala_main();
int main(int argc, char** argv) {
printf("%d\n", miniscala_main());
return 0;
}
This needs to be precompiled into runtime.o
using:
gcc -c runtime.c -o runtime.o
Of course this step can be automated from Scala as well.
2.3 Parsing
With this we can transform our canonical language definition to target code. Now we extend our compiler in the other direction to take textual input.
2.3.1 Utilities
Extend the standard iterator interface with a lookahead facility.
class Reader[T](input: Iterator[T], eof: T) {
var peek = if (input.hasNext) input.next() else eof
def hasNext: Boolean = peek != eof
def hasNext(f: T => Boolean) = f(peek)
def next() = try peek finally peek = if (input.hasNext) input.next() else eof
}
We add some functions for error reporting. These are very basic and can be extended later, e.g. to allow multiple errors per file and to show source file/line info for errors.
def error(s: String): Unit = println(s"Error: $s.") // report an error
def abort(s: String): Nothing = {error(s); throw new Exception(s)} // report error and halt
def expected(s: String): Nothing = abort(s"$s expected")
2.3.2 Source Code as Stream of Characters
Reading a single-digit number:
val in: Reader[Char] // peek, hasNext(), next()
def isDigit(c: Char) = '0' <= c && c <= '9'
def getNum(): Int = {
if (in.hasNext(isDigit)) (in.next() - '0')
else expected("Number")
}
def parseTerm(): Exp = Lit(getNum())
Reading a multi-digit number:
def getNum(): Int = {
if (in.hasNext(isDigit)) {
var n = 0
while (in.hasNext(isDigit)) {
n = 10 * n + (in.next() - '0')
}
num
} else expected("Number")
}
2.3.3 Parsing Expressions
def isOperator(c: Char) = c == '+' || c == '-' || ...
def parseExpression(): Tree = {
val left = parseTerm
if (in.hasNext(isOperator)) {
in.next() match {
case '+' => Plus(left, parseTerm())
case '-' => Minus(left, parseTerm())
...
}
} else expected("Operator")
}
Now we can parse single operations: “1+2”, “3-5”, etc.
2.3.4 End-Of-File
It is important to check that after a successful parse no input is left.
def parse(s: String): Exp = {
in = new Reader(s.iterator, '\u0000')
val res = parseExpression()
if (in.hasNext)
expected("End of file")
}
Our error messaged are already quite useful. Examples:
parse("3+4") // => Plus(3,4)
parse("3") // => "Error: Operator expected."
parse("3+") // => "Error: Number expected."
parse("3+4+") // => "Error: End of file expected."
parse("3+4+5") // => "Error: End of file expected."
2.4 Error Handling
What could go wrong?
2.4.1 Parser: Literal Out of Bounds
Done in later chapter.
2.4.2 Runtime: Divide by Zero
We don’t attempt to handle this kind of error!!
Think about why.
But consider a runtime signal handler such as: https://rosettacode.org/wiki/Detect_division_by_zero#Library:_POSIX
2.5 Where are we?
In just one lecture, we have built an end-to-end compiler, from simple arithmetic expressions to native x86-64 code.
Over the next lectures, we will add language features such as nested expressions, variables, control flow, functions, etc.
We will keep the pace high, and have a fully functional compiler for a quite substantial language in no time.
We can parse single operations: “1+2”, “3-5”, etc. Wouldn’t it be nice to have “1+(3-4)”? Let add nesting!