Skip to content

Language and IR Overview

vsmenon edited this page Mar 31, 2013 · 14 revisions

Start

This section describes the high-level language, the start compiler, and the start runtime. The start compiler converts the source language into a bytecode-like intermediate form. The start runtime directly executes this form. A single unified binary, by default, compiles source and directly executes it.

The Start Language

We first give a rough description, some examples, and then a more formal description.

  • Arithmetic operations are: +, -, *, /, %, <, <=, ==, !=, >=, >
  • Has only one integer data type int with a size of 8 bytes.
  • Has a builtin fixed-size List datatype.
  • Supports user-defined class datatypes with fields.
  • Has a dynamic type that can represent either int, List, or class types.
  • Has a new operator to create List and class instances on the heap.
  • Supports top-level functions and recursion.
  • There are only two namespaces/scopes: global and local variables.
  • Functions return void (i.e., nothing). The way to communicate the result of a function is to use global variables.
  • Has no pointers.
  • Has no inheritance.
  • Has no instance methods.
  • Does not support floating point.
  • Does not have gotos. Has structured programming constructs, if and while
  • The compiler supports only a single source file.

Start Intermediate Format

The start compiler produces bytecode-like format. Before taking a look at the specification of the format, let us look at a simple program.

class Point {
  int x;
  int y;
}

int size;

void show(List points) {
  int i;
  
  i = 0;
  while (i < size) {
    WriteLong(points[i].y);
    i = i + 1;
  }
  WriteLine();
}

void main() {
  List points;
  int i;
  Point p;
  
  size = 10;
  points = new List(size);
  i = 0;
  while (i < size) {
    p = new Point();
    p.x = i;
    p.y = i*i;
    points[i] = p;
    i = i + 1;
  }
  
  show(points);
}

and its corresponding representation

type Point: x#8:int y#16:int
method show@2: points#16:List i#-8:int
method main@23: points#-8:List i#-16:int p#-24:Point
global size#32760:int
instr 1: nop
instr 2: enter 8
instr 3: move 0 i#-8
instr 4: add size_base#32760 GP :int*
instr 5: load (4) :int
instr 6: cmplt i#-8 (5) :bool
instr 7: blbc (6) [21]
instr 8: checknull points#16 :List
instr 9: checkbounds (8) i#-8
instr 10: add (8) 16 :dynamic*
instr 11: mul i#-8 8 :int
instr 12: add (10) (11) :dynamic*
instr 13: load (12) :dynamic
instr 14: checknull (13) :dynamic
instr 15: lddynamic (14) y_offset#? :dynamic
instr 16: unbox (15) :int
instr 17: write (16)
instr 18: add i#-8 1 :int
instr 19: move (18) i#-8
instr 20: br [4]
instr 21: wrl
instr 22: ret 8
instr 23: entrypc
instr 24: enter 24
instr 25: add size_base#32760 GP :int*
instr 26: store 10 (25)
instr 27: add size_base#32760 GP :int*
instr 28: load (27) :int
instr 29: newlist (28) :List
instr 30: move (29) points#-8
instr 31: move 0 i#-16
instr 32: add size_base#32760 GP :int*
instr 33: load (32) :int
instr 34: cmplt i#-16 (33) :bool
instr 35: blbc (34) [54]
instr 36: new Point_type#24 :Point
instr 37: move (36) p#-24
instr 38: checknull p#-24 :Point
instr 39: add (38) x_offset#8 :int*
instr 40: store i#-16 (39)
instr 41: checknull p#-24 :Point
instr 42: add (41) y_offset#16 :int*
instr 43: mul i#-16 i#-16 :int
instr 44: store (43) (42)
instr 45: checknull points#-8 :List
instr 46: checkbounds (45) i#-16
instr 47: add (45) 16 :dynamic*
instr 48: mul i#-16 8 :int
instr 49: add (47) (48) :dynamic*
instr 50: store p#-24 (49)
instr 51: add i#-16 1 :int
instr 52: move (51) i#-16
instr 53: br [32]
instr 54: param points#-8
instr 55: call [2]
instr 56: ret 0
instr 57: nop

The 3-address format uses a simple instruction set, and assumes an infinite number of virtual registers. The operands to these instructions may be one of the following:

  • GP (global pointer): A pointer to the beginning of the global address space.
  • FP (frame pointer): A pointer to the beginning of the stack frame of the current function.
  • Constants: For example, 24
  • Address offsets: E.g., size_base#32760 represents the starting address of global variable size as an offset relative to the GP. The name makes the output easier to ready by a human. What really matters is the offset 32760.
  • Field offsets: E.g., y_offset#16 represents the offset of a field (in this case, y) of a class. Similar to the example above, what matters is the offset 16.
  • Local variables: E.g., p#-16 represents a local variable and its offset within the stack frame (i.e., relative FP).
  • Register name: E.g., (11) stands for virtual register r11. This is the register implicitly written to by instruction 13.
  • Code label: E.g., [4] represents a code location to jump to (in this example, instruction 4).

The instruction set contains the following instructions:

  • Arithmetic instructions: The opcodes add, sub, mul, div, mod, neg, cmpeq, cmple, and cmplt perform arithmetic operations on their operand(s) and write the result to a virtual register. For example:

    instr 42: add (41) y_offset#16 :int*

stands for

r42 = r41 + 16;
  • Branch instructions: The opcodes are br, blbc, blbs, and call. The opcode br stands for unconditional branch. The opcodes blbc and blbs denote branch if the register is cleared or set, respectively. The call is an unconditional jump-and-link instruction used for function calls. For example,

    instr 7: blbc (6) [21]

stands for

if (r6 == 0) goto instr_21;
  • Data-movement instructions: load and store instructions read from and write to memory. The move instruction copies to local variables. For example,

    instr 27: add size_base#32760 GP :int* instr 28: load (27) :int

computes the address of size in memory and loads its value

r27 = 32760 + GP;
r28 = *(r27);
  • I/O instructions: write and wrl are opcodes that are used for output. The opcode write prints an integer to stdout, and wrl prints a newline to stdout.

  • The param (used in conjunction with the call instruction) instruction pushes its operand onto the stack (it is used by a function that will be called in the future). For example

    instr 54: param points#-8 instr 55: call [2]

represents

function_2(points);

or alternatively, using the address of points:

function_2(*(FP - 8));
  • The enter instruction denotes the beginning of a function. Its operand specifies the amount of memory in bytes to be allocated on the stack frame for local variables of that function.

  • The ret instruction denotes a function return. Its operand specifies the amount of memory for formal parameters that needs to be removed (popped) from the stack.

The following function

void foo(int p1, int p2, int p3) {
  int x;
  int y;
}

is compiled into

instr 1: nop
instr 2: enter 16
instr 3: ret 24
instr 4: nop

which denotes 16 bytes of storage for local variables x and y and 24 bytes for the formal parameters p1, p2, and p3 that must be popped on a return.

  • The entrypc instruction denotes the beginning of the 'main' function.

  • The opcode nop does nothing.

Function Call ABI

The addresses/offsets for the global variables, local variables, and formal parameters are calculated using the following Application Binary Interface (ABI). To help you understand the various offsets, we describe how the stack frame is organized.

We lay out classes and lists out from lower to higher addresses. Since our only basic data type occupies 8 bytes, all integers, classes and lists are 8-byte aligned.

We lay out global variables starting at address 32768 and downwards towards zero. This organization implies that the maximum space for global variables is 32768 bytes. In the above example, the only global variables is size. Its starting address, represented as and offset from GP is 32760.

The stack grows from higher to lower addresses. The stack frame for a function looks like:

+---------------------+
|     low addresses   |
+---------------------+ <--- Bottom of frame
|        ...          |
+---------------------+
|      3rd local      |
+---------------------+
|      2nd local      |  
+---------------------+
|      1st local      |
+---------------------+ <--- FP
|       Prev FP       |
+---------------------+
|      Return Addr.   |
+---------------------+
|      2nd param      |
+---------------------+
|      1st param      |
+---------------------+ <--- Top of frame
|        ...          |
|---------------------|
|     high addresses  |
|---------------------|

Accordingly, in show in the example above, i and points have the following offsets from FP:

i : -8
points : 16

Start EBNF Grammar Specification

We provide a partial EBNF grammar specification to help you write Start programs.

The EBNF grammar for this language is given below.

Factor = Designator | Number | '(' Expression ')'.

Term = Factor {('*' | '/' | '&') Factor}.

SimpleExpr = ['+' | '-'] Term {('+' | '-') Term}.

EqualityExpr = SimpleExpr [('<' | '<=' | '>' | '>=') SimpleExpr].

Expression = EqualityExpr [('==' | '!=') EqualityExpr].

ConstExpression = Expression.

FieldList = VariableDeclaration {VariableDeclaration}.

StructType = 'struct' Ident ['{' FieldList '}'].

Type = Ident | StructType.

IdentArray = Ident {'[' ConstExpression ']'}.

IdentList = IdentArray {',' IdentArray}.

VariableDeclaration = Type IdentList ';'.

ConstantDeclaration = 'const' Type Ident '=' ConstExpression ';'.

Designator = Ident {('.' Ident) | ('[' Expression ']')}.

Assignment = Designator '=' Expression ';'.

ExpList = Expression {',' Expression}.

ProcedureCall = Ident '(' [ExpList] ')' ';'.

IfStatement = 'if' '(' Expression ')' '{' StatementSequence '}' 
              ['else' '{' StatementSequence '}'].

WhileStatement = 'while' '(' Expression ')' '{' StatementSequence '}'.

Statement = [Assignment | ProcedureCall | IfStatement | WhileStatement].

StatementSequence = {Statement}.

FPSection = Type IdentArray.

FormalParameters = FPSection {',' FPSection}.

ProcedureHeading = Ident '(' [FormalParameters] ')'.

ProcedureBody = {ConstDeclaration | VariableDeclaration}
                StatementSequence.

ProcedureDeclaration = 'void' ProcedureHeading '{' ProcedureBody '}'.

Program = {ConstantDeclaration | VariableDeclaration}
          ProcedureDeclaration {ProcedureDeclaration}.

Number and Ident are terminal symbols. Program is the start symbol.

Clone this wiki locally