Chapter 2 Integer Arithmetic

Where is this going: parse expressions like 3+4 or 5*6 and generate machine code.

Along the way:

Programs as data, assigning meaning, changing representation
Deriving a compiler from an interpreter
Assembly primer, ISA as an abstraction over hardware

2.1 Representing Programs as Data

Programs are just data. Like other data, we can represent programs in different ways: sequences of characters (e.g. source code) binary formats (e.g. machine code), various forms of trees and graphs.

How to choose? Depending on purpose. Here: pick “the most natural” as a starting point.

2.1.1 Abstract Syntax

Abstract syntax in BNF notation:

<exp> ::= <num> + <num>
       |  <num> - <num>
       |  <num> * <num>
       |  <num> / <num>
       |  <num> % <num>

2.1.2 Programs as Data

Abstract syntax as Scala data structure:

abstract class Exp
case class Plus(x: Int, y: Int) extends Exp
case class Minus(x: Int, y: Int) extends Exp
case class Times(x: Int, y: Int) extends Exp
case class Div(x: Int, y: Int) extends Exp
case class Mod(x: Int, y: Int) extends Exp

2.1.3 Writing an Interpreter

What makes data a program is a function that assigns meaning – an interpreter:

type Val = Int

def eval(e: Exp): Val = e match {
  case Plus(x,y)  => x + y
  case Minus(x,y) => x - y
  ...
}

Example:

eval(Plus(3,4)) // => 7

2.1.4 Our First Compiler

Instead of directly interpreting programs, we can translate them to another language. Let’s build our first (simple) compiler that translates expressions to source code in another language (Scala or C):

type Code = String

def trans(e: Exp): Code = e match {
  case Plus(x,y)  => s"$x + $y"
  case Minus(x,y) => s"$x - $y"
  ...
}

Example:

trans(Plus(3,4)) // => "3+4"

Note: Compilers and interpreters are fundamentally linked (specialization, Futamura)

We can paste the generated code into a .c file, run it through a C compiler, and the run the executable from the shell, but of course we’d like to automate this process so that we can just write:

run(trans(Plus(3,4))) // => 7

This function run is defined as follows:

def run(e: Code): Int = {
  val code =
s"""
#include <stdio.h>
int miniscala_main() {
  return ${e};
}
int main(int argc, char** argv) {
  printf("%d\\n", miniscala_main());
  return 0;
}
"""
  // write code to a file
  val out = new PrintStream("driver.c")
  out.println(code)
  out.close()

  // compile using GCC
  val compRes = Seq("gcc","driver.c","-o","driver").!
  if (compRes != 0)
    sys.error("GCC compilation failed")

  // run it and return result
  val runRes = Seq("./driver").!!
  runRes.trim.toInt
}

2.1.5 Correctness of our Compiler

We can verify experimentally that compiling and then running the code produces an equivalent output to interpreting the original program. For all programs p:

eval(p) == run(trans(p))

Note: run is doing multiple steps in one. We could also write run(trans(p)) as shell(gcc(trans(p))) to make the C compilation step explicit.

2.2 Compiling to Machine Code

Our first compiler is basically the identity transform. It does not lower the level of abstraction. Compiling to C is often a good strategy, but we want the real deal: x86-64 assembly.

2.2.1 Architecture Refresher

Machine language. Assembly language (ISA) is an abstraction. Processor is an interpreter for the given ISA. MuOps. ILP. Pipelining. Register renaming. (Don’t talk about memory hierarchy yet, leave for next chapter).

2.2.2 Notional Machine

It’s useful to build a mental model how a processor works (a notional machine). What we need to know right now is that a CPU has some internal state, and it executes commands that modify that state. For the moment we can assume that the internal state is a single memory slot (a register), often called the accumulator. CPU instructions modify this accumulator.

We can simulate CPU execution of 3+4 in Scala:

var ax: Int // accumulator register
ax = 3
ax += 4

Exercise: adapt our interpreter and to-source compiler to this model.

But now we need to figure out how assembly language works for real.

2.2.3 Figuring out Assembly Language

To figure out how assembly language works, it is often a good idea to look at output generated by an existing compiler. For example, we can take the following code:

int miniscala_main() {
  return 7;
}

If we put it into a file driver.c we can have GCC emit the generated assembly into a file driver.s by running:

gcc -S driver.c

The output will be quite large and inscrutable, but we can get something much more readable with two additional flags:

gcc -fno-asynchronous-unwind-tables --omit-frame-pointer -S driver.c

The file driver.s should look roughly like this:

  .text
  .globl _miniscala_main
_miniscala_main:
  movq $7, %rax
  ret

Note: int vs long, movl vs movq

Note: we won’t get useful output for return (3+4) because GCC will optimize 3+4 into constant 7 right away. How can we modify the program to break this optimization? Hint: try int x = 3; return x+4. Hint: Try the same with higher optimizations flags (-O1 to -O3).

Note: on Mac, gcc actually calls Clang (the LLVM-based C/C++ toolchain) per default. If you actually want GCC, you need to install the GNU toolchain via Homebrew.

We do not have to understand everything right now.

2.2.4 Assembly Language Primer

See the following cheat sheet: https://cs.brown.edu/courses/cs033/docs/guides/x64_cheatsheet.pdf

Arithmetic instructions perform their operations in-place:

addq $4, %rax

This command adds the value 4 to register ax. The $ in $4 means that the literal 4 is to be read as a decimal number. Without the dollar sign, it would be read as hexadecimal (no difference here, but consider $10 vs. 10). The q in addq and the r in rax denotes that this is a 64-bit operation. If we wanted to operate on 32-bit values we would write:

addl $4, %eax

Move instructions are similar but just move values around without changing them (literal to register, register to memory, memory to register, …).

Confusingly, there are two ways to write x86 instructions in textual form: Intel syntax vs AT&T or Unix syntax. We use Unix syntax since that is what GCC, LLVM, and many other tools expect.

2.2.5 Compiling to x86-64 Assembly

type Code = String

def trans(e: Exp): Code = e match {
  case Plus(x,y)  => s"  movq $$$x, %rax\n  addq $$$y, %rax"
  case Minus(x,y) => s"  movq $$$x, %rax\n  subq $$$y, %rax"
  case Times(x,y) => s"  movq $$$x, %rax\n  imulq $$$y, %rax" // rdx also set?
  case Div(x,y)   => s"  movq $$$x, %rax\n" +
                     s"  movq $$$y, %rbx\n" +
                     s"  cqto  # sign extend rax to rdx:rax\n" +
                     s"  idivq %rbx"
  case Mod(x,y)   => s"  movq $$$x, %rax\n" +
                     s"  movq $$$y, %rbx\n" +
                     s"  cqto  # sign extend rax to rdx:rax\n" +
                     s"  idivq %rbx\n" +
                     s"  movq %rdx, %rax"
}

The triple $$$x is necessary for Scala’s string interpolation to emit $4 and so on. Another way to write this would be ${"$"+x}.

Note that division and modulo are special: idivq always expects the divisor argument in %rax and leaves the results in %rax and %rdx. The sign extension is also necessary.

2.2.6 Compiling and Running x86-64

Use a little piece of C driver code to return program result to the caller. We could also use the program exit value (int main()) but this only goes from 0..255. On MacOS it’s _miniscala_main, on Linux it’s miniscala_main without the underscore.

def run(e: String): Int = {
  val code =
s"""
  .text
  .globl _miniscala_main
_miniscala_main:
$e
  ret
"""

  // write code to a file
  val out = new java.io.PrintStream("driver.s")
  out.println(code)
  out.close()

  // compile using GCC
  val compRes = Seq("gcc","runtime.o","driver.s","-o","driver").!
  if (compRes != 0)
    Reporter.abort("GCC compilation failed")

  // run it and return result
  val runRes = Seq("./driver").!!
  runRes.trim.toInt
}

We assume that we have the same main routine written in C as before in a file runtime.c:

#include <stdio.h>

int miniscala_main();

int main(int argc, char** argv) {
  printf("%d\n", miniscala_main());
  return 0;
}

This needs to be precompiled into runtime.o using:

gcc -c runtime.c -o runtime.o

Of course this step can be automated from Scala as well.

2.3 Parsing

With this we can transform our canonical language definition to target code. Now we extend our compiler in the other direction to take textual input.

2.3.1 Utilities

Extend the standard iterator interface with a lookahead facility.

class Reader[T](input: Iterator[T], eof: T) {
   var peek = if (input.hasNext) input.next() else eof
   def hasNext: Boolean = peek != eof
   def hasNext(f: T => Boolean) = f(peek)
   def next() = try peek finally peek = if (input.hasNext) input.next() else eof
}

We add some functions for error reporting. These are very basic and can be extended later, e.g. to allow multiple errors per file and to show source file/line info for errors.

def error(s: String): Unit = println(s"Error: $s.")    // report an error
def abort(s: String): Nothing = {error(s); throw new Exception(s)}     // report error and halt
def expected(s: String): Nothing = abort(s"$s expected")

2.3.2 Source Code as Stream of Characters

Reading a single-digit number:

val in: Reader[Char] // peek, hasNext(), next()

def isDigit(c: Char) = '0' <= c && c <= '9'

def getNum(): Int = {
  if (in.hasNext(isDigit)) (in.next() - '0')
  else expected("Number")
}

def parseTerm(): Exp = Lit(getNum())

Reading a multi-digit number:

def getNum(): Int = {
  if (in.hasNext(isDigit)) {
    var n = 0
    while (in.hasNext(isDigit)) {
      n = 10 * n + (in.next() - '0')
    }
    num
  } else expected("Number")
}

2.3.3 Parsing Expressions

def isOperator(c: Char) = c == '+' || c == '-' || ...

def parseExpression(): Tree = {
  val left = parseTerm
  if (in.hasNext(isOperator)) {
    in.next() match {
      case '+' => Plus(left, parseTerm())
      case '-' => Minus(left, parseTerm())
      ...
    }
  } else expected("Operator")
}

Now we can parse single operations: “1+2”, “3-5”, etc.

2.3.4 End-Of-File

It is important to check that after a successful parse no input is left.

def parse(s: String): Exp = {
  in = new Reader(s.iterator, '\u0000')
  val res = parseExpression()
  if (in.hasNext)
    expected("End of file")
}

Our error messaged are already quite useful. Examples:

parse("3+4")   // => Plus(3,4)
parse("3")     // => "Error: Operator expected."
parse("3+")    // => "Error: Number expected."
parse("3+4+")  // => "Error: End of file expected."
parse("3+4+5") // => "Error: End of file expected."

2.4 Error Handling

What could go wrong?

2.4.1 Parser: Literal Out of Bounds

Done in later chapter.

2.4.2 Runtime: Divide by Zero

We don’t attempt to handle this kind of error!!

Think about why.

But consider a runtime signal handler such as: https://rosettacode.org/wiki/Detect_division_by_zero#Library:_POSIX

2.5 Where are we?

In just one lecture, we have built an end-to-end compiler, from simple arithmetic expressions to native x86-64 code.

Over the next lectures, we will add language features such as nested expressions, variables, control flow, functions, etc.

We will keep the pace high, and have a fully functional compiler for a quite substantial language in no time.

We can parse single operations: “1+2”, “3-5”, etc. Wouldn’t it be nice to have “1+(3-4)”? Let add nesting!