Chapter 4 Variables

In the last chapter, we’ve seen how to compile nested expressions like (3+4)*5 by using temporary storage for intermediate expresssions. On a high level this is not so different from rewriting the program ourselves as val x = 3+4; x * 5.

So let’s support this next, and add support for variables.

4.1 Language Definition

4.1.1 Syntax

Concrete syntax grammar in BNF:

<op>    ::= '+' | '-' | '*' | '/'
<atom>  ::= <num>
          | <ident>
          | '(' <simp> ')'
          | '{' <exp> '}'
<simp>  ::= <atom> [<op> <atom>]*
<exp>   ::= <simp>
          | 'val' <ident> '=' <simp>';' <exp>

Example:

val x = 10;
x*2

Abstract syntax:

<exp> ::= <num>
        | <ident>
        | <exp> <op> <exp>
        | val <ident> = <exp>; <exp>

Scala Definition:

abstract class Exp
case class Lit(x: Int) extends Exp
case class Prim(op: String, xs: List[Exp]) extends Exp
case class Var(x: String) extends Exp
case class Let(x: String, rhs: Exp, body: Exp) extends Exp

4.1.2 Direct Interpreter and Compiler

type Val = Int

def eval(exp: Exp): Val = exp match {
  // previous cases
  // ...
  case Let(n, v, b) =>
    val ev = eval(v)
    val eb = eval(b)
    ???
  case Var(name) => ???
}

Problem: we’re missing context information.

What can we do?

Idea 1: Substitution (inefficient, and wouldn’t work for compiler)

type Val = Int

def eval(exp: Exp): Val = exp match {
  // previous cases
  // ...
  case Let(n, v, b) =>
    val ev = eval(v)
    eval(subst(b,n,ev)) // Var(name) --> Lit(ev)
  case Var(name) =>
    error(s"Unbound variable $name")
}

Example:

eval(Let("x", Lit(10), Prim("*", Var("x"), Lit(2)))) // => ???

Idea 2: Add an environment data structure! Make some information about the evaluation context, i.e., the surrounding program explicit.

type Val = Int

def eval(exp: Exp)(implicit env: Map[String,Val]): Val = exp match {
  // previous cases
  // ...
  case Let(x, v, b) => eval(b)(env + (x -> eval(v)))
  case Var(x)       => env(x)
}

Example:

eval(Let("x", Lit(10), Prim("*", Var("x"), Lit(2)))) // => 20

Compiler:

def block(s: Code) = if (!s.contains("\n")) s else
  "{\n      " + s.replace("\n","\n  ") + "\n    }"

def trans(e: Exp)(implicit env: Map[String,Code]): Code = e match {
  case Lit(x)          => x.toString
  case Prim(op, xs)    => transPrim(op, xs.map(trans))
  case Var(x)          => env(x)
  case Let(x,rhs,body) =>
    s"val $x = ${block(trans(rhs))}" + NL +
    s"${trans(body)(env + (x -> x))}"
}

Note: could have a first-cut compiler with dynamic environment.

4.2 Code Generation

4.2.1 Stack-Based Interpreter and Compiler

Interpreter:

type Val = Int
val memory = new Array[Int](MEM_SIZE)

def eval(exp: Exp, sp: Int)(env: Map[String,Int]): Unit = exp match {
  // previous cases
  // ...
  case Let(n, v, b) =>
    eval(v, sp)(env)
    eval(b, sp + 1)(env + n -> sp)
    memory(sp) = memory(sp + 1)
  case Var(n) =>
    memory(sp) = memory(env(n))
}

Example:

eval(Let("x", Lit(10), Prim("*", Ref("x"), Lit(2)))) // => 20

Compiler:

def trans(exp: Exp, sp: Int)(env: Map[String,Int]): Unit = exp match {
  case Let(n, v, b) =>
    trans(v, sp)(env); trans(b, sp + 1)(env + n -> sp)
    println(s"memory($sp) = memory(${sp + 1})")
  case Var(n) =>
    println(s"memory($sp) = memory(${env(n)})")
}

Example:

trans(Let("x", Lit(10), Prim("*", Ref("x"), Lit(2)))) // x = 10; x*2

memory(0) = 10         // Lit(10), Map()
memory(1) = memory(0)  // Var("x"), Map("x" -> 0)
memory(2) = 2          // Lit(2), Map("x" -> 0)
memory(1) *= memory(2) // Prim("*", ...), Map("x" -> 0)
memory(0) = memory(1)  // Let("x", ...), Map()

4.2.2 Stack-Based Compiler Targeting x86-64 Registers

First straightforward impl:

def trans(e: Exp)(implicit env: Map[String,Int]): Unit = e match {
  case Lit(x)              => grow(); emitln(s"movq $$$x, ${mem(sp-1)} # ${sp-1}");
  case Var(x)              => grow(); emitln(s"movq ${mem(env(x))}, ${mem(sp-1)} # ${sp-1}");
  case Prim(op,List(x,y)) => trans(x); trans(y); shrink(); transPrim(op)
  case Let(x,rhs,body) =>
    trans(rhs)
    trans(body)(env + (x -> (sp-1)))
    shrink()
    emitln(s"movq ${mem(sp)}, ${mem(sp-1)} # drop ${sp}")
}

Example:

trans(Let("x", Lit(10), Prim("*", Ref("x"), Lit(2)))) // x = 10; x*2

movq $10, %rbx
movq %rbx, %rcx
movq $2, %rdi
imulq %rdi, %rcx
movq %rcx, %rbx

Works fine. But now let’s consider the interaction with registers and spilling. Example:

val x1 = 1
val x2 = 2
val x3 = 3
val x4 = 4
val x5 = 5
val x6 = 6
val x7 = 7
val x8 = 8
9

With only three available registers this yields:

movq $1, %rax # 0
movq $2, %rdx # 1
movq $3, %rcx # 2
push %rax # evict 0
movq $4, %rax # 3
push %rdx # evict 1
movq $5, %rdx # 4
push %rcx # evict 2
movq $6, %rcx # 5
push %rax # evict 3
movq $7, %rax # 6
push %rdx # evict 4
movq $8, %rdx # 7
push %rcx # evict 5
movq $9, %rcx # 8
movq %rcx, %rdx # drop 8
movq %rdx, %rax # drop 7
pop %rcx # reload 5
movq %rax, %rcx # drop 6
pop %rdx # reload 4
movq %rcx, %rdx # drop 5
pop %rax # reload 3
movq %rdx, %rax # drop 4
pop %rcx # reload 2
movq %rax, %rcx # drop 3
pop %rdx # reload 1
movq %rcx, %rdx # drop 2
pop %rax # reload 0
movq %rdx, %rax # drop 1

Problem is that all the reloading at the end is unnecessary. We’re exiting the scope so the variables can’t be used anymore. Fix: pass along reset information.

def shrink(dst: Int) = {
  if (offset != dst) emitln(s"addq $$${(offset-dst)*8}, %rsp # reset stack ${offset} to $dst")
  sp = dst; offset = dst
}

// put result in location 'dst' and drop everything to the right of it
def trans1(e: Exp, dst: Int)(implicit env: Map[String,Int]): Unit = e match {
  case Let(x,rhs,body) =>
    trans(rhs)
    trans1(body,dst)(env + (x -> (sp-1)))
  case Lit(x) =>
    shrink(dst); emitln(s"movq $$$x, ${mem(dst)}")
  case Var(x) =>
    shrink(dst); if (mem(env(x)) != mem(dst)) emitln(s"movq ${mem(env(x))}, ${mem(dst)}")
  case _ =>
    trans(e); val src = sp-1
    shrink(dst); if (mem(src) != mem(dst)) emitln(s"movq ${mem(src)}, ${mem(dst)} # drop $src to $dst")
}

// put result on the next free stack slot
def trans(e: Exp)(implicit env: Map[String,Int]): Unit = e match {
  case Lit(x)              => grow(); emitln(s"movq $$$x, ${mem(sp-1)} # ${sp-1}");
  case Var(x)              => grow(); emitln(s"movq ${mem(env(x))}, ${mem(sp-1)} # ${sp-1}");
  case Prim("+",List(x,y)) => trans(x); trans(y); shrink(); transPrim(op)
  case Let(x,rhs,body) =>
    trans(rhs)
    trans1(body,sp-1)(env + (x -> (sp-1)))
}

Now the example val x1 = 1; ...; 9 looks much better. All the reloads are removed and the final result is directly stored in the target register %rax:

movq $1, %rax # 0
movq $2, %rdx # 1
movq $3, %rcx # 2
push %rax # evict 0
movq $4, %rax # 3
push %rdx # evict 1
movq $5, %rdx # 4
push %rcx # evict 2
movq $6, %rcx # 5
push %rax # evict 3
movq $7, %rax # 6
push %rdx # evict 4
movq $8, %rdx # 7
addq $40, %rsp # reset stack 5 to 0
movq $9, %rax

Note of course that the variables are never used so the code could be further optimized. However this requires somewhat different techniques that we’ll cover at a later point.

4.3 Parsing

Example:

val x = 10;
x*2

Can we parse this expression? Not with the strategy we were using.

Todo: show code?

We need to distinguish between keywords and identifiers. Consider:

valxyx

We need to look at the first four characters to figure out that this isn’t the keyword val. Then we need to backtrack. Doable, but lookahead gets large (length of longest keyword?). Better to read input only once and make a decision right away whether this is a keyword or an identifier.

Let’s introduce tokenization. Idea: don’t have parser operate directly on sequence of characters as input but first group characters into a sequence of words.

Some other benefits: easy to skip whitespace, easy to add comments.

4.3.1 Tokenization

What classes of tokens do we have?

  • Numbers: [0-9]+
  • Identifiers: [a-zA-Z][a-zA-Z0-9]*
  • Keywords: val
  • Operators: +, -, …
  • Delimiters: =, ;

And we ignore whitespace: ' ', '\n', '\r', '\t'

We also want comments (now or later?).

abstract class Token
case object EOF extends Token
case class Number(x: Int) extends Token
case class Ident(x: String) extends Token
case class Keyword(x: String) extends Token
case class Delim(x: Char) extends Token

4.3.1.1 Scanner

class Scanner(in: Reader[Char]) extends Reader[Token] {
  // reader interface
  var peek  = getToken()
  def hasNext: Boolean = peek != EOF
  def hasNext(f: Token => Boolean) = f(peek)
  def next() = try peek finally { peek = getToken() }

  // character classes
  def isAlpha(c: Char) =
    ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z')
  def isDigit(c: Char) = '0' <= c && c <= '9'
  def isAlphaNum(c: Char) = isAlpha(c) || isDigit(c)
  val isWhiteSpace = Set(' ','\t','\n','\r')
  val isOperator   = Set('+','-','*','/')
  val isDelim      = Set('{','}','(',')',';','=')
  val isKeyword    = Set("val")

  // core tokenization logic
  def skipWhiteSpace() =
    while (in.hasNext(isWhiteSpace)) in.next() // skip white space

  def getRawToken(): Token = {
    if (in.hasNext(isAlpha)) {
      getName()
    } else if (in.hasNext(isOperator)) {
      getOperator()
    } else if (in.hasNext(isDigit)) {
      getNum()
    } else if (in.hasNext(isDelim)) {
      Delim(in.next())
    } else if (!in.hasNext) {
      EOF
    } else
      abort(s"Unexpected character: ${in.peek}")
  }

  def getToken(): Token = {
    skipWhiteSpace()
    getRawToken()
  }

  // individual token classes
  def getName() = {
    val buf = new StringBuilder
    while (in.hasNext(isAlphaNum))
      buf += in.next()
    val s = buf.toString
    if (isKeyword(s)) Keyword(s) else Ident(s)
  }

  def getNum() = ...
  def getOperator() = ...
}

4.3.1.2 Parser

class Parser(in: Reader[Token]) {

  // convenience over scanner methods
  def follows(c: Char) =
    if (in.hasNext(_ == Delim(c))) { in.next(); true }
    else false
  def follows(s: String) =
    if (in.hasNext(_ == Keyword(s))) { in.next(); true }
    else false
  def require(c: Char) =
    if (in.hasNext(_ == Delim(c))) in.next()
    else expected(s"'$c'")
  def require(s: String) =
    if (in.hasNext(_ == Keyword(s))) in.next()
    else expected(s"'$s'")

  // token classes
  def isName(x: Token) = x match {
    case Ident(x) => true
    case _ => false
  }
  def isNum(x: Token) = x match {
    case Number(x) => true
    case _ => false
  }
  def isInfixOp(min: Int)(x: Token) = x match {
    case Ident(x) => prec(x) >= min
    case _ => false
  }

  // grammar productions
  def name(): String = {
    if (!in.hasNext(isName)) expected("Name")
    val Ident(x) = in.next(); x
  }

  // e.g.: 45 | x | (...) | {...}
  def atom(): Exp = {
    in.peek match {
      case Number(x)  => in.next(); Lit(x)
      case Ident(x)   => in.next(); Var(x)
      case Delim('(') => in.next(); val x = simpl(); require(')'); x
      case Delim('{') => in.next(); val x = expr(); require('}'); x
      case _ => expected("Atom"); EmptyExp
    }
  }

  // e.g.: x + 4 * 7
  def simpl(): Exp = binop(0)
  def binop(min: Int): Exp = {
    var res = atom()
    while (in.hasNext(isInfixOp(min))) {
      val op = name()
      val nextMin = prec(op) + assoc(op) // + 1 for left assoc
      val rhs = binop(nextMin)
      res = Prim(op, res, rhs)
    }
    res
  }

  // val x: T = e; e | e
  def expr(): Exp = {
    in.peek match {
      case Keyword("val") =>
        require("val")
        val nm = name()
        require('=')
        val rhs = simpl()
        require(';')
        val body = expr()
        Let(nm,tpe,rhs,body)
      case _ =>
        simpl()
    }
  }

  def parse() = {
    val s = expr();
    if (in.hasNext) expected("End of file");  // eof
    s
  }
}

4.4 Error Cases

Refer to a variable that isn’t defined. Interpreter will error at runtime. Compiler will currently crash during codegen. See next chapter.

4.5 Where are we?

We added (immutable) variables to our language and we added tokenization to our parser, so that it will process the input source as sequences of words instead of characters. This was necessary to support keywords like “val” without backtracking.