Skip to content

Language and IR Overview

vsmenon edited this page Apr 7, 2013 · 14 revisions

Start

This section describes the high-level language, the Start compiler, and the Start runtime. The Start compiler converts the source language into a bytecode-like intermediate form. The Start runtime directly executes this form. The single unified binary, by default, compiles source and directly executes it.

See the top-level page for instructions on downloading and running Start.

The Start Language

We first give a rough description, some examples, and then a more formal description.

  • Arithmetic operations are: +, -, *, /, %, <, <=, ==, !=, >=, >
  • Has only one integer data type int with a size of 4 bytes.
  • Has a builtin fixed-size List datatype.
  • Supports user-defined class datatypes with fields.
  • Has a dynamic type that can represent either int, List, or class types.
  • Has a new operator to create List and class instances on the heap.
  • Supports top-level functions and recursion.
  • There are only two namespaces/scopes: global and local variables.
  • Functions return void (i.e., nothing). The way to communicate the result of a function is to use global variables.
  • Has no pointers.
  • Has no inheritance.
  • Has no instance methods.
  • Does not support floating point.
  • Does not have gotos. Has structured programming constructs, if and while
  • The compiler supports only a single source file.

Start Intermediate Format

The start compiler produces bytecode-like format. Before taking a look at the specification of the format, let us look at a simple program.

class Point {
  int x;
  int y;
}

int size;

void show(List points) {
  int i;
  
  i = 0;
  while (i < points.length) {
    WriteLong(points[i].y);
    i = i + 1;
  }
  WriteLine();
}

void main() {
  List points;
  int i;
  Point p;
  
  size = 10;
  points = new List(size);
  i = 0;
  while (i < size) {
    p = new Point();
    p.x = i;
    p.y = i*i;
    points[i] = p;
    i = i + 1;
  }
  
  show(points);
}

and its corresponding representation

type Point: x#4:int y#8:int
method show@2: points#8:List i#-4:int
method main@27: points#-4:List i#-8:int p#-12:Point
global size#32764:int
instr 1: nop
instr 2: enter 4
instr 3: move 0 i#-4
instr 4: checknull points#8 :List
instr 5: add (4) length_offset#4 :int*
instr 6: load (5) :int
instr 7: cmplt i#-4 (6) :bool
instr 8: blbc (7) [25]
instr 9: checknull points#8 :List
instr 10: checkbounds (9) i#-4
instr 11: add (9) 8 :dynamic*
instr 12: mul i#-4 4 :int
instr 13: add (11) (12) :dynamic*
instr 14: load (13) :dynamic
instr 15: checknull (14) :dynamic
instr 16: lddynamic (15) y_offset#? :dynamic
instr 17: checknull (16) :dynamic
instr 18: checktype (17) Integer_type#8 :Integer
instr 19: add (18) value_offset#4 :int*
instr 20: load (19) :int
instr 21: write (20)
instr 22: add i#-4 1 :int
instr 23: move (22) i#-4
instr 24: br [4]
instr 25: wrl
instr 26: ret 4
instr 27: entrypc
instr 28: enter 12
instr 29: add size_base#32764 GP :int*
instr 30: store 10 (29)
instr 31: add size_base#32764 GP :int*
instr 32: load (31) :int
instr 33: newlist (32) :List
instr 34: move (33) points#-4
instr 35: move 0 i#-8
instr 36: add size_base#32764 GP :int*
instr 37: load (36) :int
instr 38: cmplt i#-8 (37) :bool
instr 39: blbc (38) [58]
instr 40: new Point_type#12 :Point
instr 41: move (40) p#-12
instr 42: checknull p#-12 :Point
instr 43: add (42) x_offset#4 :int*
instr 44: store i#-8 (43)
instr 45: checknull p#-12 :Point
instr 46: add (45) y_offset#8 :int*
instr 47: mul i#-8 i#-8 :int
instr 48: store (47) (46)
instr 49: checknull points#-4 :List
instr 50: checkbounds (49) i#-8
instr 51: add (49) 8 :dynamic*
instr 52: mul i#-8 4 :int
instr 53: add (51) (52) :dynamic*
instr 54: store p#-12 (53)
instr 55: add i#-8 1 :int
instr 56: move (55) i#-8
instr 57: br [36]
instr 58: param points#-4
instr 59: call [2]
instr 60: ret 0
instr 61: nop

The Start IR uses a simple instruction set. Each instruction has a numerical label, followed by an opcode, zero to three operands, and, if the instruction computes a value, a type corresponding to that computed value.

Operands

The operands to Start instructions may be one of the following:

  • GP (global pointer): A pointer to the beginning of the global address space.
  • FP (frame pointer): A pointer to the beginning of the stack frame of the current function.
  • Constants: For example, 24
  • Address offsets: E.g., size_base#32760 represents the starting address of global variable size as an offset relative to the GP. The name makes the output easier to ready by a human. What really matters is the offset 32760.
  • Static field offsets: E.g., y_offset#8 represents the offset of a field (in this case, y) of a class. Similar to the example above, what matters is the offset 8.
  • Dynamic field offsets: E.g., y_offset#? represents the offset of a field named y. Unlike a static field offset, the offset is unknown. Instead, the name y must be looked up at runtime.
  • Local variables: E.g., p#-12 represents a local variable and its offset within the stack frame (i.e., relative FP).
  • Register name: E.g., (11) stands for virtual register r11. This is the register implicitly written to by instruction 11.
  • Code label: E.g., [4] represents a code location to jump to (in this example, instruction 4).

The instruction set contains the following instructions:

Instruction opcodes

Arithmetic

The add, sub, mul, div, mod, neg, cmpeq, cmple, and cmplt instructions perform arithmetic operations on their operand(s) and write the result to a virtual register.
For example:

instr 46: add (45) y_offset#8 :int*

stands for

r46 = r45 + 8;

Branch

The br, blbc, blbs, and call instructions transfer control. The opcode br stands for unconditional branch. The opcodes blbc and blbs denote branch if the register is cleared or set, respectively. The call is an unconditional jump-and-link instruction used for function calls. For example,

instr 8: blbc (7) [25]

stands for

if (r7 == 0) goto instr_25;

Data-movement

The load and store instructions read from and write to memory. The move instruction copies to local variables. For example,

instr 31: add size_base#32764 GP :int*
instr 32: load (31) :int

computes the address of size in memory and loads its value

r31 = 32764 + GP;
r32 = *(r31);

Memory allocation

The new and newlist instructions allocate user-defined classes and lists respectively on the heap. For example,

instr 40: new Point_type#12 :Point

allocates a new Point on the heap. Note that 12 is the size of a Point object: a 4-byte header, a 4-byte x value, and a 4-byte y value.

Safety checks

The checknull, checktype, and checkbounds instructions enforce type safety checks. checknull enforces that its argument is non-null. checktype enforces that its argument matches the given type. checkbounds enforces that an index is in bounds for a given list. If any of these check opcodes fail, an error is printed and the program is immediately halted. For example

instr 49: checknull points#-4 :List
instr 50: checkbounds (49) i#-8

is essentially equivalent to:

assert(points != NULL);
assert(0 <= i && i < points.length);

Dynamic access

The lddynamic and stdynamic instructions access the fields of dynamically typed variables. For example:

instr 16: lddynamic (15) y_offset#? :dynamic

loads the value at the y_offset. Note that the actually offset is unknown. These operations require runtime introspection and are typically expensive to execute.

Output

The write and wrl instructions are used for output. The opcode write prints an integer to stdout, and wrl prints a newline to stdout.

Stack manipulation

The enter, ret, and param instructions are used to adjust the stack on function calls. The enter instruction denotes the beginning of a function. Its operand specifies the amount of memory in bytes to be allocated on the stack frame for local variables of that function. The ret instruction denotes a function return. Its operand specifies the amount of memory for formal parameters that needs to be removed (popped) from the stack.

The following function

void foo(int p1, int p2, int p3) {
  int x;
  int y;
}

is compiled into

instr 1: nop
instr 2: enter 8
instr 3: ret 12
instr 4: nop

which denotes 8 bytes of storage for local variables x and y and 12 bytes for the formal parameters p1, p2, and p3 that must be popped on a return.

The param (used in conjunction with the call instruction) instruction pushes its operand onto the stack (it is used by a function that will be called in the future). For example

instr 58: param points#-4
instr 59: call [2]

represents

function_2(points);

or alternatively, using the address of points:

function_2(*(FP - 4));

Other

The entrypc and nop instructions have no runtime execution semantics. The entrypc opcode denotes the beginning of the 'main' function.

Types

Start provides builtin and user defined types. The builtin types include both primitive and boxed types.

Primitive (unboxed) types

Primitive types are stored by value, directly in a variable, stack slot, or heap slot. Start only has the following two primitive types:

  • int : A 32-bit integer type. This is the basic type for arithmetic in Start.
  • bool : A boolean type representing true or false. This is the result of comparison instructions and consumed by conditional branches.

Note that (unlike in Dart) primitive types can never be null.

Boxed types

Boxed types are stored by reference. Any variable, stack slot, or heap slot referring to a boxed type either has a null value or points to an object on the heap. Each object on the heap has a boxed type associated with it and implicitly stored at offset 0.

Builtin boxed types include:

  • Integer : A 32-bit integer on the heap.
  • Boolean : A boolean on the heap.
  • List : A fixed-size list type.

Additionally, Start allows user-defined classes. In the IR, user-defined types are listed first. In our example:

type Point: x#4:int y#8:int

Point is a user-defined class with two fields, x at offset 4 and y at offset 8, both of type int.

Dynamic

Start and Start IR also provide a dynamic type. A value with type dynamic can store any boxed type. (Note, the Start language automatically boxes and unboxes values when converting from a primitive to dynamic and vice versa.)

  • dynamic : The dynamic type representing any boxed value.

The elements of a List object are typed as dynamic.

Pointer types

Although the Start language does not provide direct access to pointers, the IR supports them. Pointers are addresses in memory. A pointer type indicates the type of the value stored in the corresponding memory address. E.g, int* is the address of a slot that contains a primitive 32-bit integer. List* is the address of a slot that contains a references to a List.

Note, all pointers in Start are 4-byte aligned (i.e., the last two bits are 0).

Function Call ABI

The addresses/offsets for the global variables, local variables, and formal parameters are calculated using the following Application Binary Interface (ABI). To help you understand the various offsets, we describe how the stack frame is organized.

We lay out classes and lists out from lower to higher addresses. Since our only basic data type occupies 4 bytes, all integers, classes and lists are 4-byte aligned.

We lay out global variables starting at address 32768 and downwards towards zero. This organization implies that the maximum space for global variables is 32768 bytes. In the above example, the only global variables is size. Its starting address, represented as and offset from GP is 32764.

The stack grows from higher to lower addresses. The stack frame for a function looks like:

+---------------------+
|     low addresses   |
+---------------------+ <--- Bottom of frame
|        ...          |
+---------------------+
|      3rd local      |
+---------------------+
|      2nd local      |  
+---------------------+
|      1st local      |
+---------------------+ <--- FP
|       Prev FP       |
+---------------------+
|      Return Addr.   |
+---------------------+
|      2nd param      |
+---------------------+
|      1st param      |
+---------------------+ <--- Top of frame
|        ...          |
|---------------------|
|     high addresses  |
|---------------------|

Accordingly, in show in the example above, i and points have the following offsets from FP:

i : -4
points : 8

Start EBNF Grammar Specification

We provide a partial EBNF grammar specification to help you write Start programs.

The EBNF grammar for this language is given below.

Factor = Designator | Number | '(' Expression ')'.

Term = Factor {('*' | '/' | '%') Factor}.

SimpleExpr = ['+' | '-'] Term {('+' | '-') Term}.

EqualityExpr = SimpleExpr [('<' | '<=' | '>' | '>=') SimpleExpr].

Expression = EqualityExpr [('==' | '!=') EqualityExpr].

ConstExpression = Expression.

FieldList = VariableDeclaration {VariableDeclaration}.

ClassType = 'class' Ident ['{' FieldList '}'].

Type = Ident | ClassType.

IdentList = Ident {',' Ident}.

VariableDeclaration = Type IdentList ';'.

Designator = Ident {('.' Ident) | ('[' Expression ']')}.

Assignment = Designator '=' Expression ';'.

ExpList = Expression {',' Expression}.

ProcedureCall = Ident '(' [ExpList] ')' ';'.

IfStatement = 'if' '(' Expression ')' '{' StatementSequence '}' 
              ['else' '{' StatementSequence '}'].

WhileStatement = 'while' '(' Expression ')' '{' StatementSequence '}'.

Statement = [Assignment | ProcedureCall | IfStatement | WhileStatement].

StatementSequence = {Statement}.

FPSection = Type Ident.

FormalParameters = FPSection {',' FPSection}.

ProcedureHeading = Ident '(' [FormalParameters] ')'.

ProcedureBody = {VariableDeclaration}
                StatementSequence.

ProcedureDeclaration = 'void' ProcedureHeading '{' ProcedureBody '}'.

Program = {ConstantDeclaration | VariableDeclaration}
          ProcedureDeclaration {ProcedureDeclaration}.

Number and Ident are terminal symbols. Program is the start symbol.

Clone this wiki locally